Compare pricing for multimodal models that handle vision, audio, and document understanding.

Vision + Language Models

| Vendor | Model | Input /1M tokens | Output /1M tokens | Context | Free Tier |
|---|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | 128k | 100k/mo |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128k | 100k/mo |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200k | Limited |
| Anthropic | Claude 3 Opus | $15.00 | $75.00 | 200k | Limited |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | 2M | 1M/mo |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M | 1M/mo |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | 1M | 1M/mo |
| Meta | Llama 3.2 Vision | $0.90 | $3.60 | 128k | Local |
| Mistral | Pixtral 12B | $0.50 | $1.50 | 128k | Local |
| AWS Bedrock | Claude 3.5 Sonnet | $3.00 | $15.00 | 200k | Via AWS |
| Azure OpenAI | GPT-4o | $2.50 | $10.00 | 128k | $200 credit |
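
To make the per-token rates above easier to compare, the following is a minimal Python sketch that turns them into a per-request cost. The prices are copied from the table; the function name `request_cost` and the example workload (10,000 input tokens, 1,000 output tokens) are illustrative assumptions, not vendor figures.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Pro": (1.25, 5.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one request at the listed per-1M-token rates."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 input tokens and 1,000 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 10_000, 1_000):.5f}")
```

At these rates the hypothetical request costs $0.035 on GPT-4o and about $0.001 on Gemini 1.5 Flash. Image and audio inputs are usually converted to tokens before billing, so the token counts passed in should already include them.
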

Document Understanding

| Vendor | Service | Price | Notes |
|---|---|---|---|
| OpenAI | GPT-4o + Vision | $2.50/1M img | Per image in document |
| Anthropic | Claude 3 | $0.25-$15/1M img | Size dependent |
| Google | Document AI | $0.05-$0.10/page | Form/table extraction |
| AWS Textract | Standard | $0.015-$0.05/page | Document extraction |
| Microsoft | Azure AI Document | $1.50-$3.50/1,000 pages | Form recognition |
| Cohere | Document OCR | $0.10/page | Document understanding |
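
Because the page-based services above quote price ranges, a batch estimate is also a range. The sketch below shows one way to compute it in Python. The rates come from the table; the helper name `batch_cost_range` is hypothetical, and the Azure figure is read as a price per 1,000 pages, which is an assumption about the listed unit.

```python
# (low, high) USD per page, from the table above.
# Assumption: Azure's $1.50-$3.50 is per 1,000 pages, so it is divided by 1,000.
PER_PAGE = {
    "Google Document AI": (0.05, 0.10),
    "AWS Textract": (0.015, 0.05),
    "Azure AI Document": (1.50 / 1000, 3.50 / 1000),
    "Cohere Document OCR": (0.10, 0.10),
}


def batch_cost_range(service, pages):
    """Return (lowest, highest) USD cost for processing `pages` pages."""
    low, high = PER_PAGE[service]
    return pages * low, pages * high


if __name__ == "__main__":
    for service in PER_PAGE:
        low, high = batch_cost_range(service, 1_000)
        print(f"{service}: ${low:.2f} to ${high:.2f} per 1,000 pages")
```

For example, 1,000 pages on AWS Textract falls between $15 and $50 depending on which extraction features are billed.
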

Audio Understanding

| Vendor | Model | Price | Notes |
|---|---|---|---|
| OpenAI | GPT-4o-audio | $0.03-$0.12/min | Audio + text |
| AssemblyAI | Unity | $0.05-$0.15/min | Real-time |
| Deepgram | Nova-2 Audio | $0.0043-$0.15/min | Speech recognition |
| Google | Speech-to-Text + LLM | $0.10-$0.30/min | Analysis |

Image Generation + Understanding

| Vendor | Model | Input | Output | Notes |
|---|---|---|---|---|
| OpenAI | DALL-E 3 + Caption | $0.04-$0.12 | $0.04-$0.12 | Per image |
| Google | Imagen 3 | $0.02-$0.15 | N/A | Generation only |
| Runway | Gen-3 + Analysis | $0.05-$0.15 | Per sec | + Text analysis |

All-in-One Multimodal

| Vendor | Model | Capabilities | Input Price | Free Tier |
|---|---|---|---|---|
| OpenAI | GPT-4o | Text, Vision, Audio, PDF | $2.50/1M | 100k |
| Google | Gemini 1.5 Pro | Text, Vision, Audio | $1.25/1M | 1M |
| Anthropic | Claude 3.5 Sonnet | Text, Vision, PDF | $3.00/1M | Limited |
| Meta | Llama 3.2 | Text, Vision | $0.90/1M | Local |

Pricing by Modality

| Provider | Low | Medium | High |
|---|---|---|---|
| GPT-4o | $0.00085 | $0.00425 | $0.0085 |
| Claude 3.5 | $0.00075 | $0.003 | $0.0075 |
| Gemini 1.5 Pro | $0.00125 | $0.00375 | $0.00625 |

| Provider | Price /min |
|---|---|
| GPT-4o-audio | $0.03-$0.12 |
| AssemblyAI Unity | $0.05-$0.15 |
| Deepgram | $0.0043-$0.15 |

Feature Comparison

| Model | Vision | Audio | PDF | Tool Use | Code |
|---|---|---|---|---|---|
| GPT-4o | Yes | Yes | Yes | Yes | Yes |
| Gemini 1.5 Pro | Yes | Yes | Yes | Yes | Yes |
| Claude 3.5 Sonnet | Yes | Via API | Yes | Yes | Yes |
| Llama 3.2 Vision | Yes | No | No | Via fine-tune | Yes |
| Mistral Pixtral | Yes | No | No | Yes | Yes |
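
The matrix above can be used to shortlist models by required capability. The following Python sketch encodes it as sets and filters on them. The function name `models_supporting` and the feature keys are illustrative, and qualified entries ("Via API", "Via fine-tune") are treated as not built in, which is a simplifying assumption rather than something the table states.

```python
# Built-in capabilities per model, taken from the "Yes" cells of the table.
# Assumption: "Via API" and "Via fine-tune" are treated as not built in.
FEATURES = {
    "GPT-4o": {"vision", "audio", "pdf", "tool_use", "code"},
    "Gemini 1.5 Pro": {"vision", "audio", "pdf", "tool_use", "code"},
    "Claude 3.5 Sonnet": {"vision", "pdf", "tool_use", "code"},
    "Llama 3.2 Vision": {"vision", "code"},
    "Mistral Pixtral": {"vision", "tool_use", "code"},
}


def models_supporting(*required):
    """Return the models whose feature set includes every required feature."""
    needed = set(required)
    return [model for model, feats in FEATURES.items() if needed <= feats]


if __name__ == "__main__":
    print(models_supporting("vision", "pdf"))
    print(models_supporting("audio"))
```

Requiring both vision and PDF support leaves GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet; requiring audio leaves only the first two under the assumption above.
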

Use Case Recommendations

| Need | Recommended |
|---|---|
| Document Q&A | Claude 3.5 Sonnet, GPT-4o |
| Image understanding | GPT-4o, Claude 3.5 |
| Video understanding | Gemini 1.5 Pro |
| Real-time audio | GPT-4o-audio |
| Cost efficiency | Gemini Flash, GPT-4o-mini |
| Local deployment | Llama 3.2 Vision, Pixtral |
| Mixed modalities | Gemini 1.5 Pro |

Free Tier Summary

| Service | Free Offering |
|---|---|
| Gemini 1.5 Flash | 1M tokens/mo |
| GPT-4o | 100k tokens/mo |
| Claude 3.5 | Limited |
| Llama 3.2 Vision | Local (free) |
| Pixtral | Local (free) |
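
A free allowance only matters if monthly usage is close to it. The sketch below estimates what remains billable once the allowance is used up. It assumes the allowance is deducted from input tokens and that the remainder is billed at the listed input rate; the source does not say how each vendor applies its free tier, and the function name `monthly_input_cost` is hypothetical.

```python
# Monthly free allowances (tokens) and input prices (USD per 1M tokens),
# both taken from the tables in this document.
FREE_TOKENS_PER_MONTH = {"Gemini 1.5 Flash": 1_000_000, "GPT-4o": 100_000}
INPUT_PRICE_PER_1M = {"Gemini 1.5 Flash": 0.075, "GPT-4o": 2.50}


def monthly_input_cost(model, tokens_used):
    """Return the USD cost of input tokens beyond the free monthly allowance.

    Assumption: the allowance is subtracted first and the rest is billed
    at the list input rate.
    """
    billable = max(0, tokens_used - FREE_TOKENS_PER_MONTH[model])
    return billable * INPUT_PRICE_PER_1M[model] / 1_000_000


if __name__ == "__main__":
    print(monthly_input_cost("Gemini 1.5 Flash", 500_000))
    print(monthly_input_cost("GPT-4o", 2_100_000))
```

Under these assumptions, 500,000 tokens on Gemini 1.5 Flash stays within the allowance, while 2.1M tokens on GPT-4o leaves 2M billable tokens, or $5.00.
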