
Local Models

Run AI models on your own hardware: no per-token API fees, but you buy and maintain the hardware yourself.

Cost Comparison

| Approach | Hardware cost | Running cost | Quality |
| --- | --- | --- | --- |
| API (GPT-4o) | $0 | $2.50 / 1M input tokens | Excellent |
| API (Llama 3.1 70B) | $0 | $0.65 / 1M input tokens | Great |
| Local (RTX 4090) | $1,600 | Electricity only | Good |
| Local (A100) | $15,000 | Electricity only | Great |
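One way to read the table above is as a break-even problem: the hardware pays for itself once the API bill it replaces exceeds its purchase price. A rough sketch using the figures from the table (it ignores electricity, depreciation, and the quality gap between rows):

```python
def breakeven_tokens_millions(hardware_cost: float, api_price_per_m: float) -> float:
    """Millions of input tokens after which owned hardware beats paying API rates."""
    return hardware_cost / api_price_per_m

# RTX 4090 ($1,600) vs GPT-4o at $2.50 per 1M input tokens
breakeven_tokens_millions(1600, 2.50)   # 640.0 million tokens

# A100 ($15,000) vs hosted Llama 3.1 70B at $0.65 per 1M input tokens
breakeven_tokens_millions(15000, 0.65)  # ~23,077 million tokens
```

Below these volumes the API is cheaper in pure dollars; local only wins earlier if you also value privacy or offline use.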

Hardware Requirements

Consumer GPUs

| GPU | VRAM | Models | $/hour |
| --- | --- | --- | --- |
| RTX 4090 | 24 GB | Llama 3.1 8B, Mistral 7B | $0.15 |
| RTX 3090 | 24 GB | Llama 3.1 8B, Mistral 7B | $0.12 |

Data Center GPUs

| GPU | VRAM | Models | $/hour |
| --- | --- | --- | --- |
| A10G | 24 GB | Llama 3.1 8B, Mistral 7B | $1.01 |
| A100 (40 GB) | 40 GB | Llama 70B (4-bit) | $2.93 |
| A100 (80 GB) | 80 GB | Llama 70B (8-bit) | $3.50 |
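A common rule of thumb behind these pairings: weight memory is parameter count × bits per weight ÷ 8, plus roughly 20% headroom for the KV cache and activations. A sketch of that rule (the 1.2 overhead factor is an assumption, not a fixed constant; long contexts need more):

```python
def vram_needed_gb(params_billions: float, bits_per_weight: int,
                   overhead: float = 1.2) -> float:
    """Estimated VRAM in GB: weights plus ~20% headroom for KV cache/activations."""
    return params_billions * bits_per_weight / 8 * overhead

def fits(params_billions: float, bits_per_weight: int, vram_gb: float) -> bool:
    """True if the estimated footprint fits in the given VRAM."""
    return vram_needed_gb(params_billions, bits_per_weight) <= vram_gb

fits(8, 16, 24)   # True: Llama 3.1 8B in FP16 fits a 24 GB card
fits(70, 4, 24)   # False: 70B at 4-bit needs ~42 GB with headroom
fits(70, 4, 80)   # True: comfortable on an 80 GB A100
```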

Local Inference Tools

| Tool | License | Best for |
| --- | --- | --- |
| Ollama | MIT | Easy setup on macOS/Linux |
| LM Studio | Proprietary | Windows-friendly GUI |
| llama.cpp | MIT | CPU + GPU inference with quantized models |
| vLLM | Apache 2.0 | Production serving, high throughput |
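As an example of the easy-setup path, Ollama serves a local REST API on port 11434 by default. A minimal sketch of calling it from Python, assuming `ollama serve` is running and the `llama3.1` model has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint (stream=False for a single reply)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the generated text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Everything stays on your machine: no API key, no data leaving the host.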

Model Quantization

| Format | Size (vs FP16) | Quality loss |
| --- | --- | --- |
| FP16 | 100% (baseline) | None |
| INT8 | 50% | Minimal |
| INT4 | 25% | Small |
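The size column follows directly from bits per weight: weight storage is parameters × bits ÷ 8. For a 70B-parameter model (a sketch; real quantized files add a little per-block metadata on top):

```python
def weights_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (decimal), ignoring quantization metadata."""
    return params_billions * bits_per_weight / 8

weights_size_gb(70, 16)  # 140.0 GB at FP16
weights_size_gb(70, 8)   # 70.0 GB at INT8 (50%)
weights_size_gb(70, 4)   # 35.0 GB at INT4 (25%)
```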

Operating Costs

Home User

| Component | Cost |
| --- | --- |
| Electricity rate | $0.05–0.15 / kWh |
| RTX 4090 (~450 W) at 8 hours/day | ~108 kWh → $5.40–16.20 / month |
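Electricity cost is just power × hours × rate. A sketch assuming an RTX 4090 drawing roughly 450 W under sustained load (actual draw varies with workload; intermittent use scales the result down linearly):

```python
def monthly_electricity_cost(watts: float, hours_per_day: float,
                             rate_per_kwh: float, days: int = 30) -> float:
    """Monthly electricity cost in dollars for a given sustained power draw."""
    kwh_per_month = watts / 1000 * hours_per_day * days
    return kwh_per_month * rate_per_kwh

monthly_electricity_cost(450, 8, 0.05)  # ~$5.40/month at cheap power
monthly_electricity_cost(450, 8, 0.15)  # ~$16.20/month at expensive power
```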

Server (A100)

| Component | Cost |
| --- | --- |
| A100 (1 hour) | $3.50 |
| 100 hours/month | $350 |
| Electricity | $50 |
| **Monthly total** | **$400** |

Comparison: API vs Local

| Factor | API | Local |
| --- | --- | --- |
| Upfront cost | $0 | $1,600–20,000 |
| Monthly cost | $50–500 | $15–400 |
| Setup time | ~5 minutes | 1–10 hours |
| Privacy | Data sent to vendor | 100% local |

Best For

Local models make the most sense for:

  1. Privacy-sensitive data
  2. High-volume applications
  3. Offline operation
  4. Model experimentation
  5. Cost savings at scale