Multimodal AI

Compare pricing for models that handle multiple modalities including vision, audio, and document understanding.

Vision + Language Models

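| Vendor | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context | Free Tier |
|---|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | 128k | 100k/mo |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128k | 100k/mo |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200k | Limited |
| Anthropic | Claude 3 Opus | $15.00 | $75.00 | 200k | Limited |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | 2M | 1M/mo |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M | 1M/mo |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | 1M | 1M/mo |
| Meta | Llama 3.2 Vision | $0.90 | $3.60 | 127k | Local |
| Mistral | Pixtral 12B | $0.50 | $1.50 | 128k | Local |
| AWS Bedrock | Claude 3.5 Sonnet | $3.00 | $15.00 | 200k | Via AWS |
| Azure OpenAI | GPT-4o | $2.50 | $10.00 | 128k | $200 credit |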
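
As a rough illustration of how the per-1M-token prices above translate into a per-request cost, here is a minimal sketch. The prices are copied from the table; the `PRICES` dictionary and the `request_cost` helper are hypothetical names introduced for this example, and the calculation assumes plain per-token billing with no caching or batch discounts.

```python
# USD per 1M tokens (input, output), copied from the table above.
# Only a subset of models is listed; extend as needed.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Pro": (1.25, 5.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumes simple linear billing: tokens * price / 1,000,000.
    """
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Compare the same workload across models.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 10_000, 1_000):.5f}")
```

Under these assumptions, a request with 10,000 input tokens and 1,000 output tokens on GPT-4o comes to about $0.035.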

Document Understanding

| Vendor | Service | Price | Notes |
|---|---|---|---|
| OpenAI | GPT-4o + Vision | $2.50/1M img | Per image in document |
| Anthropic | Claude 3 | $0.25-$15/1M img | Size dependent |
| Google | Document AI | $0.05-$0.10/page | Form/table extraction |
| AWS Textract | Standard | $0.015-$0.05/page | Document extraction |
| Microsoft | Azure AI Document | $1.50-$3.50/1000 | Form recognition |
| Cohere | Document OCR | $0.10/page | Document understanding |

Audio Understanding

| Vendor | Model | Price | Notes |
|---|---|---|---|
| OpenAI | GPT-4o-audio | $0.03-$0.12/min | Audio + text |
| AssemblyAI | Unity | $0.05-$0.15/min | Real-time |
| Deepgram | Nova-2 Audio | $0.0043-$0.15/min | Speech recognition |
| Google | Speech-to-Text + LLM | $0.10-$0.30/min | Analysis |

Image Generation + Understanding

| Vendor | Model | Input | Output | Notes |
|---|---|---|---|---|
| OpenAI | DALL-E 3 + Caption | $0.04-$0.12 | $0.04-$0.12 | Per image |
| Google | Imagen 3 | $0.02-$0.15 | N/A | Generation only |
| Runway | Gen-3 + Analysis | $0.05-$0.15 | Per sec | + Text analysis |

All-in-One Multimodal

| Vendor | Model | Capabilities | Price | Free Tier |
|---|---|---|---|---|
| OpenAI | GPT-4o | Text, Vision, Audio, PDF | $2.50/1M | 100k |
| Google | Gemini 1.5 Pro | Text, Vision, Audio | $1.25/1M | 1M |
| Anthropic | Claude 3.5 Sonnet | Text, Vision, PDF | $3.00/1M | Limited |
| Meta | Llama 3.2 | Text, Vision | $0.90/1M | Local |

Pricing by Modality

Image Input (per image)

| Provider | Low | Medium | High |
|---|---|---|---|
| GPT-4o | $0.00085 | $0.00425 | $0.0085 |
| Claude 3.5 | $0.00075 | $0.003 | $0.0075 |
| Gemini 1.5 Pro | $0.00125 | $0.00375 | $0.00625 |
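
Per-image prices are small enough that they only become meaningful at volume. The sketch below scales the figures above to a batch and finds the cheapest provider for a given tier. The values are copied from the table; `IMAGE_PRICES`, `batch_image_cost`, and `cheapest_provider` are hypothetical names, and the Low/Medium/High tiers are used exactly as labeled above without assuming what resolution each one corresponds to.

```python
# USD per image at each tier, copied from the table above.
IMAGE_PRICES = {
    "GPT-4o": {"low": 0.00085, "medium": 0.00425, "high": 0.0085},
    "Claude 3.5": {"low": 0.00075, "medium": 0.003, "high": 0.0075},
    "Gemini 1.5 Pro": {"low": 0.00125, "medium": 0.00375, "high": 0.00625},
}


def batch_image_cost(provider: str, tier: str, n_images: int) -> float:
    """USD cost of sending n_images at the given tier (hypothetical helper)."""
    return IMAGE_PRICES[provider][tier] * n_images


def cheapest_provider(tier: str) -> str:
    """Name of the provider with the lowest per-image price at the given tier."""
    return min(IMAGE_PRICES, key=lambda provider: IMAGE_PRICES[provider][tier])


if __name__ == "__main__":
    for tier in ("low", "medium", "high"):
        best = cheapest_provider(tier)
        cost = batch_image_cost(best, tier, 10_000)
        print(f"{tier}: {best} at ${cost:.2f} per 10,000 images")
```

With these figures, the cheapest option changes with the tier: Claude 3.5 is lowest at Low and Medium, while Gemini 1.5 Pro is lowest at High.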

Audio Input (per minute)

| Provider | Price |
|---|---|
| GPT-4o-audio | $0.03-$0.12 |
| AssemblyAI Unity | $0.05-$0.15 |
| Deepgram | $0.0043-$0.15 |
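
Because audio prices are quoted as ranges, a budget estimate is best expressed as a lower and upper bound. The following sketch does that for a given number of minutes. The ranges are copied from the table above; `AUDIO_PRICES` and `audio_cost_range` are hypothetical names, and the sketch does not model what determines where in the range a given workload falls.

```python
# (low, high) USD per minute, copied from the table above.
AUDIO_PRICES = {
    "GPT-4o-audio": (0.03, 0.12),
    "AssemblyAI Unity": (0.05, 0.15),
    "Deepgram": (0.0043, 0.15),
}


def audio_cost_range(provider: str, minutes: float) -> tuple[float, float]:
    """Return (lowest, highest) USD cost for the given audio duration."""
    low, high = AUDIO_PRICES[provider]
    return (low * minutes, high * minutes)


if __name__ == "__main__":
    # Bounds for ten hours (600 minutes) of audio per provider.
    for name in AUDIO_PRICES:
        low, high = audio_cost_range(name, 600)
        print(f"{name}: ${low:.2f} to ${high:.2f}")
```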

Feature Comparison

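| Model | Vision | Audio | PDF | Tool Use | Code |
|---|---|---|---|---|---|
| GPT-4o | Yes | Yes | Yes | Yes | Yes |
| Gemini 1.5 Pro | Yes | Yes | Yes | Yes | Yes |
| Claude 3.5 Sonnet | Yes | Via API | Yes | Yes | Yes |
| Llama 3.2 Vision | Yes | No | No | Via fine-tune | Yes |
| Mistral Pixtral | Yes | No | No | Yes | Yes |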
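
One way to use the matrix above is to filter it by the capabilities a project needs. The sketch below encodes the table as a dictionary and returns the models that qualify. The cell values are copied from the table; `FEATURES` and `models_supporting` are hypothetical names, and treating "Via API" and "Via fine-tune" as non-native support is an assumption made for this example.

```python
# Capability matrix copied from the table above.
FEATURES = {
    "GPT-4o": {"vision": "Yes", "audio": "Yes", "pdf": "Yes", "tool_use": "Yes", "code": "Yes"},
    "Gemini 1.5 Pro": {"vision": "Yes", "audio": "Yes", "pdf": "Yes", "tool_use": "Yes", "code": "Yes"},
    "Claude 3.5 Sonnet": {"vision": "Yes", "audio": "Via API", "pdf": "Yes", "tool_use": "Yes", "code": "Yes"},
    "Llama 3.2 Vision": {"vision": "Yes", "audio": "No", "pdf": "No", "tool_use": "Via fine-tune", "code": "Yes"},
    "Mistral Pixtral": {"vision": "Yes", "audio": "No", "pdf": "No", "tool_use": "Yes", "code": "Yes"},
}


def models_supporting(required: list[str], native_only: bool = True) -> list[str]:
    """List models that support every feature in `required`.

    With native_only=True only a plain "Yes" counts. With native_only=False,
    qualified entries such as "Via API" also count (an assumption of this sketch).
    """
    matches = []
    for model, caps in FEATURES.items():
        if native_only:
            ok = all(caps[feature] == "Yes" for feature in required)
        else:
            ok = all(caps[feature] != "No" for feature in required)
        if ok:
            matches.append(model)
    return matches


if __name__ == "__main__":
    print(models_supporting(["vision", "audio"]))
    print(models_supporting(["vision", "audio"], native_only=False))
```

Keeping the qualified entries as strings rather than collapsing them to booleans lets the caller decide how strict to be.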

Use Case Recommendations

| Need | Recommended |
|---|---|
| Document Q&A | Claude 3.5 Sonnet, GPT-4o |
| Image understanding | GPT-4o, Claude 3.5 |
| Video understanding | Gemini 1.5 Pro |
| Real-time audio | GPT-4o-audio |
| Cost efficiency | Gemini Flash, GPT-4o-mini |
| Local deployment | Llama 3.2 Vision, Pixtral |
| Mixed modalities | Gemini 1.5 Pro |

Free Tier Summary

| Service | Free Offering |
|---|---|
| Gemini 1.5 Flash | 1M tokens/mo |
| GPT-4o | 100k tokens/mo |
| Claude 3.5 | Limited |
| Llama 3.2 Vision | Local (free) |
| Pixtral | Local (free) |