
Multimodal AI

Compare pricing for models that handle multiple modalities including vision, audio, and document understanding.

Vision + Language Models

Vendor	Model	Input ($/1M tokens)	Output ($/1M tokens)	Context (tokens)	Free Tier
OpenAI	GPT-4o	$2.50	$10.00	128k	100k/mo
OpenAI	GPT-4o-mini	$0.15	$0.60	128k	100k/mo
Anthropic	Claude 3.5 Sonnet	$3.00	$15.00	200k	Limited
Anthropic	Claude 3 Opus	$15.00	$75.00	200k	Limited
Google	Gemini 1.5 Pro	$1.25	$5.00	2M	1M/mo
Google	Gemini 1.5 Flash	$0.075	$0.30	1M	1M/mo
Google	Gemini 2.0 Flash	$0.10	$0.40	1M	1M/mo
Meta	Llama 3.2 Vision	$0.90	$3.60	128k	Local
Mistral	Pixtral 12B	$0.50	$1.50	128k	Local
AWS Bedrock	Claude 3.5 Sonnet	$3.00	$15.00	200k	Via AWS
Azure OpenAI	GPT-4o	$2.50	$10.00	128k	$200 credit
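
The per-token prices above turn into a per-request cost with simple arithmetic: multiply each token count by its rate and divide by one million. The following is a minimal sketch, assuming the table prices are in USD per 1M tokens and that input and output tokens are billed separately. The `PRICES` dictionary and the `request_cost` helper are illustrative names, not part of any vendor SDK, and only a few rows are copied in.

```python
# USD per 1M tokens as (input, output), copied from the table above.
# Assumption: input and output tokens are billed independently.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Compare the same workload (10k tokens in, 2k tokens out) across models.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 10_000, 2_000):.4f}")
```

The estimate ignores free-tier allowances, caching discounts, and image or audio surcharges, so treat it as a rough upper-level comparison rather than a bill.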

Document Understanding

Vendor	Service	Price	Notes
OpenAI	GPT-4o + Vision	$2.50/1M img	Per image in document
Anthropic	Claude 3	$0.25-$15/1M img	Size dependent
Google	Document AI	$0.05-$0.10/page	Form/table extraction
AWS Textract	Standard	$0.015-$0.05/page	Document extraction
Microsoft	Azure AI Document	$1.50-$3.50/1000 pages	Form recognition
Cohere	Document OCR	$0.10/page	Document understanding
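
For the services priced per page, a job's cost can be bracketed by multiplying the page count by the low and high ends of the listed range. This is a rough sketch under that assumption; the `PER_PAGE` dictionary and `page_cost_range` function are hypothetical names, and only the rows quoted per page are included.

```python
# USD per page as (low, high), copied from the per-page rows of the table above.
PER_PAGE = {
    "Google Document AI": (0.05, 0.10),
    "AWS Textract": (0.015, 0.05),
    "Cohere Document OCR": (0.10, 0.10),
}


def page_cost_range(service: str, pages: int) -> tuple[float, float]:
    """Return the (low, high) USD estimate for processing `pages` pages."""
    low, high = PER_PAGE[service]
    return (low * pages, high * pages)


if __name__ == "__main__":
    # Bracket the cost of a 1,000-page job for each per-page service.
    for name in PER_PAGE:
        low, high = page_cost_range(name, 1_000)
        print(f"{name}: ${low:.2f} to ${high:.2f}")
```

Where a service falls within its range depends on which features are enabled, which the table does not break down.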

Audio Understanding

Vendor	Model	Price	Notes
OpenAI	GPT-4o-audio	$0.03-$0.12/min	Audio + text
AssemblyAI	Unity	$0.05-$0.15/min	Real-time
Deepgram	Nova-2 Audio	$0.0043-$0.15/min	Speech recognition
Google	Speech-to-Text + LLM	$0.10-$0.30/min	Analysis

Image Generation + Understanding

Vendor	Model	Input	Output	Notes
OpenAI	DALL-E 3 + Caption	$0.04-$0.12	$0.04-$0.12	Per image
Google	Imagen 3	$0.02-$0.15	N/A	Generation only
Runway	Gen-3 + Analysis	$0.05-$0.15	Per sec	+ Text analysis

All-in-One Multimodal

Vendor	Model	Capabilities	Input Price	Free Tier
OpenAI	GPT-4o	Text, Vision, Audio, PDF	$2.50/1M	100k
Google	Gemini 1.5 Pro	Text, Vision, Audio	$1.25/1M	1M
Anthropic	Claude 3.5 Sonnet	Text, Vision, PDF	$3.00/1M	Limited
Meta	Llama 3.2	Text, Vision	$0.90/1M	Local

Pricing by Modality

Image Input (per image)

Provider	Low	Medium	High
GPT-4o	$0.00085	$0.00425	$0.0085
Claude 3.5	$0.00075	$0.003	$0.0075
Gemini 1.5 Pro	$0.00125	$0.00375	$0.00625
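
Per-image prices add up quickly at volume, so it can help to compute a batch cost and pick the cheapest provider for a given tier. The sketch below assumes the Low, Medium, and High columns are pricing tiers quoted in USD per image; `IMAGE_PRICES`, `image_batch_cost`, and `cheapest_provider` are illustrative names only.

```python
# USD per image by tier, copied from the table above.
IMAGE_PRICES = {
    "GPT-4o": {"low": 0.00085, "medium": 0.00425, "high": 0.0085},
    "Claude 3.5": {"low": 0.00075, "medium": 0.003, "high": 0.0075},
    "Gemini 1.5 Pro": {"low": 0.00125, "medium": 0.00375, "high": 0.00625},
}


def image_batch_cost(provider: str, tier: str, n_images: int) -> float:
    """Estimate the USD cost of sending `n_images` images at one tier."""
    return IMAGE_PRICES[provider][tier] * n_images


def cheapest_provider(tier: str) -> str:
    """Return the provider with the lowest listed per-image price for a tier."""
    return min(IMAGE_PRICES, key=lambda p: IMAGE_PRICES[p][tier])


if __name__ == "__main__":
    for tier in ("low", "medium", "high"):
        name = cheapest_provider(tier)
        cost = image_batch_cost(name, tier, 100_000)
        print(f"{tier}: {name} at ${cost:.2f} per 100k images")
```

Note that the cheapest provider changes with the tier, so the comparison is worth rerunning for the image sizes actually in use.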

Audio Input (per minute)

Provider	Price
GPT-4o-audio	$0.03-$0.12
AssemblyAI Unity	$0.05-$0.15
Deepgram	$0.0043-$0.15

Feature Comparison

Model	Vision	Audio	PDF	Tool Use	Code
GPT-4o	Yes	Yes	Yes	Yes	Yes
Gemini 1.5 Pro	Yes	Yes	Yes	Yes	Yes
Claude 3.5 Sonnet	Yes	Via API	Yes	Yes	Yes
Llama 3.2 Vision	Yes	No	No	Via fine-tune	Yes
Mistral Pixtral	Yes	No	No	Yes	Yes

Use Case Recommendations

Need	Recommended
Document Q&A	Claude 3.5 Sonnet, GPT-4o
Image understanding	GPT-4o, Claude 3.5
Video understanding	Gemini 1.5 Pro
Real-time audio	GPT-4o-audio
Cost efficiency	Gemini Flash, GPT-4o-mini
Local deployment	Llama 3.2 Vision, Pixtral
Mixed modalities	Gemini 1.5 Pro

Free Tier Summary

Service	Free Offering
Gemini 1.5 Flash	1M tokens/mo
GPT-4o	100k tokens/mo
Claude 3.5	Limited
Llama 3.2 Vision	Local (free)
Pixtral	Local (free)

Related

LLMs - Text-only models
Image AI - Image generation
Audio AI - Audio processing
Embeddings - For multimodal RAG
