RAG (Retrieval Augmented Generation)

Build knowledge-augmented AI systems with embeddings and LLMs.

RAG Stack Costs

| Component | Budget Option | Mid-Range | Enterprise |
| --- | --- | --- | --- |
| Embeddings | Cohere (free tier) | OpenAI small, $0.02/1M | OpenAI large, $0.13/1M |
| LLM | GPT-4o-mini, $0.15/1M | GPT-4o, $2.50/1M | Claude 3.5 Sonnet, $3.00/1M |
| Vector DB | Pinecone (free) | Pinecone, $70/mo | Weaviate Cloud, $300/mo |
| Storage | $5/mo | $20/mo | $100+/mo |

Embedding + LLM Combinations

| Embedding | LLM | Combined input $/1M | Quality | Best For |
| --- | --- | --- | --- | --- |
| Cohere | GPT-4o-mini | $0.25 | Good | Budget RAG |
| OpenAI small | GPT-4o-mini | $0.17 | Great | Balanced |
| OpenAI small | GPT-4o | $2.52 | Excellent | Quality |
| Cohere | Claude Sonnet | $3.10 | Excellent | Reasoning |
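
The combined figure is just the sum of the embedding rate and the LLM input rate. A quick sketch of that arithmetic, using the per-1M rates from the tables on this page (Cohere's paid rate of $0.10/1M is inferred from the combined figures, not stated; treat all of these as illustrative, not current pricing):

```python
# Combined input cost per 1M tokens = embedding rate + LLM input rate.
# Rates are the $/1M figures from the tables on this page; Cohere's
# paid rate ($0.10/1M) is an assumption. Illustrative only.
EMBED_RATES = {"cohere": 0.10, "openai-small": 0.02, "openai-large": 0.13}
LLM_INPUT_RATES = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50, "claude-sonnet": 3.00}

def combined_input_rate(embedding: str, llm: str) -> float:
    """Dollars per 1M input tokens for an embedding + LLM pairing."""
    return round(EMBED_RATES[embedding] + LLM_INPUT_RATES[llm], 2)

print(combined_input_rate("openai-small", "gpt-4o-mini"))  # 0.17
```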

Monthly Cost Examples

Small Knowledge Base (10K docs, 100 users)

| Component | Usage | Cost |
| --- | --- | --- |
| Embeddings (OpenAI small) | 10M tokens | $0.20 |
| LLM (GPT-4o-mini) | 5M input + 1M output | $1.35 |
| Vector DB (Pinecone) | Starter tier | $0 |
| Storage | 5 GB | $5 |
| Total | | ~$7/mo |

Medium Knowledge Base (100K docs, 1K users)

| Component | Usage | Cost |
| --- | --- | --- |
| Embeddings (Cohere) | 100M tokens | $10 |
| LLM (GPT-4o) | 50M input + 10M output | $225 |
| Vector DB (Pinecone) | Standard tier | $70 |
| Storage | 50 GB | $10 |
| Total | | ~$315/mo |

Large Enterprise (1M docs, 10K users)

| Component | Usage | Cost |
| --- | --- | --- |
| Embeddings (OpenAI large) | 1B tokens | $130 |
| LLM (Claude Sonnet) | 500M input + 100M output | $3,000 |
| Vector DB | Enterprise tier | $600 |
| Storage | 500 GB | $50 |
| Total | | ~$3,780/mo |
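
All three examples follow the same arithmetic: token volumes times per-1M rates, plus flat infrastructure fees. A minimal cost model that reproduces it; the rates are the ones used in these tables (Claude Sonnet output at $15/1M is the rate the $3,000 line implies), and the function and parameter names are my own:

```python
# Rough monthly RAG cost model matching the worked examples above.
# All token rates are $/1M tokens; swap in current provider pricing
# before trusting any output.

def monthly_cost(embed_m, llm_in_m, llm_out_m,
                 embed_rate, in_rate, out_rate,
                 vector_db, storage):
    """Token volumes in millions of tokens; fees in $/month."""
    token_spend = (embed_m * embed_rate
                   + llm_in_m * in_rate
                   + llm_out_m * out_rate)
    return round(token_spend + vector_db + storage, 2)

# Large-enterprise example: OpenAI large embeddings ($0.13/1M),
# Claude Sonnet ($3.00/1M input, $15.00/1M output),
# $600 vector DB, $50 storage.
print(monthly_cost(1000, 500, 100, 0.13, 3.00, 15.00, 600, 50))  # 3780.0
```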

Cost Optimization

  1. Hybrid search - Combine keyword (BM25) and vector search so cheap lexical matching handles part of the load
  2. Chunk optimization - Smaller, well-scoped chunks mean fewer input tokens per retrieved passage
  3. Re-ranking - Rerank retrieved chunks and pass only the top few to the LLM
  4. Caching - Cache answers and embeddings for frequent queries
  5. Summary indexes - Pre-compute summaries so common queries skip full retrieval

Provider Recommendations

| Need | Embedding | LLM | Vector DB |
| --- | --- | --- | --- |
| Budget | Cohere (free) | GPT-4o-mini | Pinecone free |
| Quality | OpenAI large | Claude Sonnet | Weaviate |
| Speed | Cohere | Gemini Flash | Qdrant |
| Multilingual | Cohere multilingual | GPT-4o | Pinecone |