RAG (Retrieval-Augmented Generation)
Build knowledge-augmented AI systems with embeddings and LLMs.
RAG Stack Costs
| Component | Budget Option | Mid-Range | Enterprise |
|---|---|---|---|
| Embeddings | Cohere (free tier) | OpenAI small $0.02/1M tokens | OpenAI large $0.13/1M tokens |
| LLM (input price) | GPT-4o-mini $0.15/1M tokens | GPT-4o $2.50/1M tokens | Claude 3.5 Sonnet $3.00/1M tokens |
| Vector DB | Pinecone (free) | Pinecone $70/mo | Weaviate Cloud $300/mo |
| Storage | $5/mo | $20/mo | $100+/mo |
Embedding + LLM Combinations
| Embedding | LLM | Input cost per 1M tokens | Quality | Best For |
|---|---|---|---|---|
| Cohere | GPT-4o-mini | $0.25 | Good | Budget RAG |
| OpenAI small | GPT-4o-mini | $0.17 | Great | Balanced |
| OpenAI small | GPT-4o | $2.52 | Excellent | Quality |
| Cohere | Claude Sonnet | $3.10 | Excellent | Reasoning |
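The combined input figure in each row is the embedding price plus the LLM input price per 1M tokens. A minimal sketch of that arithmetic follows. The OpenAI and LLM prices come from the stack table; the Cohere paid rate of $0.10 per 1M tokens is an assumption inferred from the Cohere + Claude Sonnet row, and the dictionary and function names are illustrative.

```python
# Per-1M-token prices in USD. The Cohere rate is an assumption (paid tier),
# inferred from the $3.10 Cohere + Claude Sonnet row; the rest are from the stack table.
EMBEDDING_PRICE = {"Cohere": 0.10, "OpenAI small": 0.02, "OpenAI large": 0.13}
LLM_INPUT_PRICE = {"GPT-4o-mini": 0.15, "GPT-4o": 2.50, "Claude Sonnet": 3.00}


def combined_input_price(embedding: str, llm: str) -> float:
    """Return embedding price + LLM input price per 1M tokens, rounded to cents."""
    return round(EMBEDDING_PRICE[embedding] + LLM_INPUT_PRICE[llm], 2)


if __name__ == "__main__":
    for emb, llm in [("OpenAI small", "GPT-4o-mini"), ("Cohere", "Claude Sonnet")]:
        price = combined_input_price(emb, llm)
        print(f"{emb} + {llm}: ${price:.2f} per 1M input tokens")
```

This treats one embedded token and one LLM input token as equally frequent, which is a simplification: in practice documents are embedded once while LLM input is billed on every query.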
Monthly Cost Examples
Small Knowledge Base (10K docs, 100 users)
| Component | Usage | Cost |
|---|---|---|
| Embeddings | 10M tokens | $0.20 |
| LLM (GPT-4o-mini) | 5M input + 1M output | $1.35 |
| Vector DB (Pinecone) | Starter | $0 |
| Storage | 5GB | $5 |
| Total | | ~$7/mo |
Medium Knowledge Base (100K docs, 1K users)
| Component | Usage | Cost |
|---|---|---|
| Embeddings | 100M tokens | $10 |
| LLM (GPT-4o) | 50M input + 10M output | $225 |
| Vector DB | Standard | $70 |
| Storage | 50GB | $10 |
| Total | | ~$315/mo |
Large Enterprise (1M docs, 10K users)
| Component | Usage | Cost |
|---|---|---|
| Embeddings | 1B tokens | $130 |
| LLM (Claude Sonnet) | 500M input + 100M output | $3,000 |
| Vector DB | Enterprise | $600 |
| Storage | 500GB | $50 |
| Total | | ~$3,780/mo |
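The monthly totals are token volume times the per-1M price, plus the flat vector DB and storage fees. The sketch below reproduces the enterprise example under stated assumptions: `MonthlyUsage` and `monthly_cost` are hypothetical names, and the Claude Sonnet output price of $15.00 per 1M tokens is an assumption that does not appear in the stack table.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    """Hypothetical container for one month of RAG usage."""
    embedding_tokens_m: float   # millions of tokens embedded
    llm_input_tokens_m: float   # millions of LLM input tokens
    llm_output_tokens_m: float  # millions of LLM output tokens
    vector_db_usd: float        # flat vector DB fee
    storage_usd: float          # flat storage fee


def monthly_cost(usage: MonthlyUsage, embed_price: float,
                 llm_in_price: float, llm_out_price: float) -> float:
    """Sum token-based and flat costs; prices are USD per 1M tokens."""
    token_cost = (usage.embedding_tokens_m * embed_price
                  + usage.llm_input_tokens_m * llm_in_price
                  + usage.llm_output_tokens_m * llm_out_price)
    return round(token_cost + usage.vector_db_usd + usage.storage_usd, 2)


# Enterprise example: 1B embedding tokens, 500M input, 100M output.
enterprise = MonthlyUsage(1000, 500, 100, vector_db_usd=600, storage_usd=50)

if __name__ == "__main__":
    # Embeddings at $0.13 and LLM input at $3.00 are from the stack table;
    # the $15.00 output price is an assumption.
    print(monthly_cost(enterprise, 0.13, 3.00, 15.00))
```

With these inputs the function returns 3780.0, matching the enterprise total. Swapping in other prices and volumes gives a quick sanity check on the smaller tiers.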
Cost Optimization
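- Hybrid search - Serve exact-match lookups with keyword search so fewer queries need an embedding call
- Chunk optimization - Smaller chunks put fewer tokens into each LLM prompt
- Re-ranking - Re-rank retrieved chunks and pass only the top few to the LLM
- Caching - Cache answers to frequent queries to skip repeat embedding and LLM calls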
- Summary indexes - Pre-compute summaries for common queries
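Of the items above, caching is the simplest to illustrate. The following is a minimal in-memory sketch, not a production design: `QueryCache` and `fake_rag_pipeline` are hypothetical names, and the stand-in pipeline replaces the real embedding lookup and LLM call.

```python
from typing import Callable, Dict


class QueryCache:
    """Caches answers keyed by a normalized query string."""

    def __init__(self, answer_fn: Callable[[str], str]) -> None:
        self._answer_fn = answer_fn  # hypothetical: runs retrieval + LLM call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(query: str) -> str:
        # Collapse case and whitespace so trivially different phrasings share a key.
        return " ".join(query.lower().split())

    def ask(self, query: str) -> str:
        key = self._normalize(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._answer_fn(query)  # only cache misses incur token costs
        self._store[key] = answer
        return answer


def fake_rag_pipeline(query: str) -> str:
    # Stand-in for the real embedding lookup and LLM call.
    return f"answer to: {query}"


if __name__ == "__main__":
    cache = QueryCache(fake_rag_pipeline)
    cache.ask("What is our refund policy?")
    cache.ask("what is our  refund policy?")
    print(f"hits={cache.hits} misses={cache.misses}")
```

A real deployment would likely add expiry so cached answers do not outlive updates to the knowledge base, and might match on embedding similarity instead of exact normalized text.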
Provider Recommendations
| Need | Embedding | LLM | Vector DB |
|---|---|---|---|
| Budget | Cohere (free) | GPT-4o-mini | Pinecone free |
| Quality | OpenAI large | Claude Sonnet | Weaviate |
| Speed | Cohere | Gemini Flash | Qdrant |
| Multilingual | Cohere multilingual | GPT-4o | Pinecone |
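If the recommendations need to be consumed by code, the table can be expressed as a lookup. This is a sketch only: `RECOMMENDED_STACKS` and `recommend_stack` are hypothetical names, and the values are copied from the table as display labels, not as real model or SDK identifiers.

```python
from typing import Dict

# Labels copied from the recommendations table; not API model identifiers.
RECOMMENDED_STACKS: Dict[str, Dict[str, str]] = {
    "budget": {"embedding": "Cohere (free)", "llm": "GPT-4o-mini", "vector_db": "Pinecone free"},
    "quality": {"embedding": "OpenAI large", "llm": "Claude Sonnet", "vector_db": "Weaviate"},
    "speed": {"embedding": "Cohere", "llm": "Gemini Flash", "vector_db": "Qdrant"},
    "multilingual": {"embedding": "Cohere multilingual", "llm": "GPT-4o", "vector_db": "Pinecone"},
}


def recommend_stack(need: str) -> Dict[str, str]:
    """Return the recommended stack for a need, ignoring case and outer whitespace."""
    key = need.strip().lower()
    if key not in RECOMMENDED_STACKS:
        raise ValueError(f"unknown need: {need!r}")
    return RECOMMENDED_STACKS[key]


if __name__ == "__main__":
    print(recommend_stack("Speed"))
```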