
Self-Hosting LLMs vs API

From Pulsed Media Wiki


Pulsed Media owns its datacenter space and runs its own hardware. GPU inference cards get evaluated against the same cost-per-unit math that prices seedboxes and dedicated servers. In April 2026, the math says: cloud APIs win for quality-sensitive work, self-hosted GPUs win for bulk processing and privacy-constrained workloads.

This article presents the hardware benchmarks, API pricing, and economics that show why.

The quality problem

The first obstacle is not cost. It is quality.

The best open-weight model you can run on a single consumer GPU (24 GB) is Gemma 4 31B. It scores approximately 1450 on the LMSYS Chatbot Arena ELO scale (lmarena.ai, preliminary, April 2026). That is 20 points below Sonnet and 50 below Opus. The gap sounds small until you measure it on hard tasks.

Model Arena ELO (approx) Access Hardware needed
Claude Opus 4.6 ~1500 API / subscription Cloud
Claude Sonnet 4.6 ~1470 API / subscription Cloud
GLM-5 744B (dense) ~1451 Open weights 6x 96 GB GPUs (~$48,000)
Gemma 4 31B (dense) ~1450 Open weights 1x 24 GB GPU at Q4 (~$1,600)
Kimi K2.5 1T (MoE) ~1447 Open weights 8x 96 GB GPUs (~$64,000)
Gemma 4 26B-A4B (MoE) ~1441 Open weights 1x 24 GB GPU at Q4 (~$1,600)
Qwen3-235B-A22B (MoE) ~1422 Open weights 2x 96 GB GPUs (~$16,000)
DeepSeek V3.2 685B (MoE, 37B active) ~1421 Open weights 5x 96 GB GPUs (~$40,000)
DeepSeek R1 671B (MoE, reasoning) ~1398 Open weights / API 5x 96 GB GPUs (~$40,000)
Llama 3.3 70B (dense) ~1350-1400 Open weights 1x 96 GB GPU (~$8,500)

The table splits into two tiers. Models you can run on a single GPU ($1,600-8,500) top out at ELO ~1450. Models that approach Sonnet territory (GLM-5 at 1451, Kimi K2.5 at 1447) require multi-GPU setups. The costs shown assume RTX Pro 6000 cards at ~$8,000 each via PCIe — but without NVLink, tensor parallelism runs at roughly one-third the throughput of datacenter H100s. An NVLink-equipped H100 SXM configuration running the same models costs 3-5x more ($150,000-320,000).

Gemma 4 31B scores 80.0% on LiveCodeBench v6 and 84.3% on GPQA Diamond. These are strong numbers for an open model. But on agentic workflows where models must maintain coherence across 20+ tool calls, open models fail silently: the output compiles but does the wrong thing, or the agent loops instead of converging. These failures are expensive because nobody notices until the result is wrong.

No amount of hardware spending changes this. The quality ceiling is in the model weights. Even $64,000 in GPUs running Kimi K2.5 does not reach Sonnet.

Quality vs hardware: what each GPU tier can actually run

The quality gap depends on what fits on the hardware you own.

VRAM Best model (Q4) Arena ELO (full precision) Quality tier Gap to Sonnet
4 GB Phi-4 Mini 3.8B or smaller ~1100-1200 D ~270+ ELO
8 GB Llama 3.1 8B (tight) ~1250-1300 C/C+ ~170-220 ELO
12 GB Qwen3 14B ~1350 B- ~120 ELO
16 GB Qwen3 14B (comfortable) ~1350 B- ~120 ELO
24 GB Gemma 4 31B ~1450 B+/A- ~20 ELO
32 GB Gemma 4 31B at Q8 ~1450+ A- ~20 ELO
48 GB Llama 3.3 70B Q4 ~1350-1400 B/B+ ~70-120 ELO
96 GB Llama 3.3 70B Q8 ~1400 B+ ~70 ELO

Arena ELO scores above were measured at full precision (BF16/FP16) on the leaderboard, not at Q4. Running at Q4_K_M typically costs 0.3-1 benchmark points (see quantization quality section), so actual Q4 performance is slightly below these numbers — within the "~" approximation for most models, but measurable on hard tasks.

At 4-8 GB VRAM, you are running models that struggle with multi-step reasoning and produce noticeably worse output than a $0.04/MTok API call to Mistral Nemo. At 24 GB, Gemma 4 closes the ELO gap but still falls short on the hardest tasks. Only at 96 GB do you get a 70B model at full quality, and even that is still B+ tier versus API frontier at A+/S.

What 4-8 GB VRAM can run

Most consumer GPUs sold in the last five years have 4-8 GB of VRAM. Can they do anything useful with local LLMs?

4 GB (GTX 1050 Ti, RX 570, integrated graphics)

At 4 GB, the only models that fit are sub-3B at Q4:

Model Size at Q4 Speed (est.) Useful for
Qwen3 0.6B ~0.5 GB 30-50 t/s Text classification, simple extraction
Llama 3.2 1B ~0.8 GB 20-30 t/s Basic summarization, translation
SmolLM2 1.7B ~1 GB 20-30 t/s Summarization, rewriting, function calling
BitNet b1.58 2B 0.4 GB 10-20 t/s (CPU) Classification, simple Q&A (MMLU 53)
Phi-4 Mini 3.8B ~2.3 GB 15-25 t/s Simple coding, Q&A

These models cannot replace a general-purpose assistant: multi-step reasoning, complex code generation, and long conversations are out of reach. But they are not useless. Fine-tuned sub-2B models beat GPT-4o on structured extraction (NuExtract-tiny 0.5B) and outperform zero-shot GPT-4 on classification tasks with as few as 60-75 training examples per class (arxiv 2406.08660). For single-task pipelines — classification, extraction, routing, PII redaction — a fine-tuned 0.5-2B model on a 4 GB GPU is a legitimate production tool, not a toy.

8 GB (RTX 3060 8GB, RTX 4060, RX 6600)

At 8 GB, a 7B/8B model at Q4 fits with room for ~32K context (Q8 KV cache):

Model Size at Q4 Speed (GPU) Max context (Q8 KV) Quality
Llama 3.1 8B ~4.8 GB ~40-50 t/s ~32K C+
Qwen3 8B ~5.2 GB ~35-45 t/s ~28K C+
Gemma 4 E4B ~5 GB ~35-45 t/s ~28K C
Mistral Nemo 12B ~7.4 GB ~25-35 t/s ~4-8K C+

An 8B model on an 8 GB GPU is the minimum setup that produces output comparable to budget APIs. The constraint is context: at Q4 KV cache, 64K tokens are feasible but tight. At Q8 KV (recommended for retrieval tasks), the ceiling is ~32K.

Mistral Nemo 12B fits at Q4 but leaves almost no room for KV cache. Short conversations only.

Memory bandwidth is the bottleneck

LLM token generation is memory-bandwidth-bound. The GPU reads model weights from VRAM for every single token. Read speed determines generation speed.

Platform Memory bandwidth Llama 3.3 70B Q4 tok/s Relative speed
RTX Pro 6000 1,792 GB/s ~34 1.0x
H100 SXM 3,352 GB/s ~40 1.2x
Strix Halo (128 GB unified) 215 GB/s ~4.5 0.13x
DDR5 desktop (dual channel) ~89 GB/s ~2.2 0.06x
DDR4-3200 desktop (dual channel) ~42 GB/s ~1 0.03x

The Pro 6000 and H100 are close on single-card workloads because both have enough bandwidth to keep a 70B model fed. The Pro 6000 compensates with more aggressive quantization (Q4 vs FP8): fewer bytes to read per weight yields similar throughput. The H100 pulls ahead in NVLink multi-GPU tensor parallelism, where its 900 GB/s interconnect leaves PCIe (128 GB/s) behind.

A Strix Halo system has 215 GB/s. Same model, same quantization, 8.3x less bandwidth, roughly 8x slower. A desktop CPU on DDR5 dual-channel has ~89 GB/s. 20x slower than the Pro 6000.

No amount of CPU cores or compiler flags changes this. Bandwidth is physics. The cheapest path to ~1,800 GB/s with enough VRAM to hold a 70B model in April 2026 is the RTX Pro 6000 at $8,500; the RTX 5090 matches the bandwidth for less money but caps out at 32 GB.

The rough formula: max tokens/sec = memory bandwidth (GB/s) / model size (GB). Actual throughput is about 50-70% of theoretical. An EPYC 9554 with ~500 GB/s measured bandwidth hits 50 t/s on an 8B Q4 model (~4.7 GB), just under half of the theoretical ~106 t/s.
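A sketch of that arithmetic (model sizes taken from the quantization tables later in this article):

```python
# Rough speed model for single-stream generation: every token requires reading
# the full set of quantized weights once, so bandwidth / model size is a ceiling.
def theoretical_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Figures from this article (8B Q4 ~ 4.7 GB, 70B Q4 ~ 40 GB):
print(theoretical_tokens_per_sec(1792, 40))   # ~45 t/s ceiling; measured ~34 (~75%)
print(theoretical_tokens_per_sec(500, 4.7))   # ~106 t/s ceiling; measured ~50 (~47%)
print(theoretical_tokens_per_sec(42, 40))     # ~1 t/s ceiling for a 70B on DDR4
```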

Hardware comparison

Professional GPUs

The RTX Pro 6000 is the only GPU under $10,000 that runs 70B models on a single card without quantizing below Q4. Full specs and known issues are on the dedicated page.

Model Quantization Tokens/sec (generation)
Llama 3.1 8B Q4_K_M ~185
Mistral Nemo 12B Q4_K_M 163
Qwen3 30B-A3B (MoE) Q4_K_M 252
Llama 3.3 70B Q4_K_M 34

34 tokens per second on 70B is about 25 words per second. Fast enough that you are not waiting for the model to finish a paragraph.

With vLLM batched serving (concurrent requests), throughput goes well beyond single-stream: 8B at 8,990 t/s, 70B at 1,031 t/s. On models that fit on one card, the Pro 6000 matches the H100 SXM at one-third the cost.
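A minimal offline batching sketch with vLLM; the model name and sampling settings are illustrative, not a tested configuration:

```python
# vLLM offline batched inference: continuous batching schedules many sequences
# per forward pass, which is how aggregate throughput reaches thousands of t/s
# while single-stream generation stays bandwidth-limited.
from vllm import LLM, SamplingParams

prompts = [f"Summarize ticket #{i} in one sentence: ..." for i in range(256)]
params = SamplingParams(temperature=0.2, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model id
outputs = llm.generate(prompts, params)               # all prompts batched together

for out in outputs[:3]:
    print(out.outputs[0].text)
```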

The card has real problems. A virtualization reset bug causes unrecoverable GPU state after VM shutdown. Sustained vLLM inference triggers chip resets at temperatures as low as 28C. The SM120 kernel architecture breaks DeepSeek models. Only the open-source driver works on Blackwell; there is no proprietary option.

Consumer GPUs

GPU VRAM Bandwidth Llama 3.1 8B Q4 t/s Max model (Q4) Price (Apr 2026)
RTX 5090 32 GB GDDR7 1,790 GB/s ~300 ~32B $2,900-4,200
RTX 4090 24 GB GDDR6X 1,008 GB/s ~128 ~32B tight $1,600-2,200
RTX 3090 24 GB GDDR6X 936 GB/s ~87-112 ~32B tight $800-900 used
RTX 4000 SFF Ada 20 GB GDDR6 280 GB/s ~64 ~13B ~$1,250

The RTX 5090 has bandwidth matching the Pro 6000 (1,790 GB/s) but only 32 GB of VRAM, which limits it to 32B-class models at Q4. For sub-32B workloads, the 5090 is better price-performance than the Pro 6000.

The RTX 3090 is the last consumer card with NVLink. Two of them ($1,600-1,800 used) give 48 GB combined, enough for 32B models with generous context. For less than the price of a single RTX 5090, a dual-3090 setup gets similar bandwidth and 50% more VRAM.

Gemma 4 on consumer GPUs: dense vs MoE

Gemma 4 is available in two forms that fit on 24 GB: the 31B dense model and the 26B-A4B MoE. The MoE activates only about 4B parameters per token, making it dramatically faster.

Variant Arena ELO RTX 4090 speed RTX 3090 speed VRAM (Q4) 256K context?
31B Dense ~1450 ~25-35 t/s (short ctx) ~20-35 t/s 19.6 GB No (KV cache fills 24 GB at ~32-64K)
26B-A4B MoE ~1441 ~50-129 t/s ~35-40 t/s ~15.6 GB Yes (8.4 GB headroom for KV)

The 31B dense model has a disproportionately large KV cache (~0.85 MB per token at BF16) because of its 16 KV-head architecture. Despite fitting at Q4, context expansion rapidly consumes remaining VRAM. At VRAM-saturated configurations, the RTX 4090 drops to 7.8 t/s on the 31B.

The 26B-A4B MoE uses 4x less KV cache (~0.21 MB/token), runs 2-4x faster, and scores within 9 ELO points. For interactive use on 24 GB hardware, the MoE is the correct choice. The dense model is worth choosing only for maximum coding quality at short context (<8K tokens).

Compact workstations

The Minisforum MS-02 Ultra won a CES 2026 Innovation Award. Compact 4.8L chassis, Intel Core Ultra 9 285HX, up to 256 GB DDR5 ECC, four NVMe slots. Looks great on paper.

The catch: the PCIe slot is low-profile dual-slot only. The best GPU that fits is an RTX 4000 SFF Ada with 20 GB VRAM and 280 GB/s bandwidth. That gets ~64 t/s on an 8B model and cannot run anything above 13B. The MS-02 Ultra is a homelab machine, not an inference server.

Strix Halo

AMD's Ryzen AI Max+ 395 (Strix Halo) puts 128 GB of unified LPDDR5x memory on a single chip. The iGPU can address up to 96 GB as VRAM, matching the RTX Pro 6000 on capacity.

Bandwidth tells the real story: 215 GB/s measured, versus the Pro 6000's 1,792 GB/s. Every model runs 4-8x slower. 70B at 4.5 tokens per second works, but it is painfully slow.

Where Strix Halo is useful: running large MoE models that would not fit on a 24-32 GB consumer GPU. Qwen3 30B-A3B at 66-72 t/s in 128 GB unified memory, or larger MoE models at 20+ t/s. The 128 GB pool is the point, not the speed.

At EUR 2,000-3,000 for a complete system consuming 120W, it costs 3-5x less than an RTX Pro 6000 setup. Same model capacity, 8x less speed.

CPU inference

Server-class CPUs with enough memory channels can run LLMs at usable speeds for small models:

CPU Memory bandwidth 8B Q4 t/s 70B Q4 t/s Usable for chat?
EPYC 9554 (64c, 12-ch DDR5) ~500 GB/s ~50 ~7 8B yes, 70B batch only
Dual Xeon Gold 5317 ~80 GB/s ~22 ~3 8B marginal, 70B no
Desktop DDR5 (dual-channel) ~89 GB/s ~20 ~2 8B marginal, 70B no
Desktop DDR4 (i5-7500T) ~38 GB/s ~4-5 <1 8B basic completion only, 70B no

CPU inference makes sense when you need to run a model that does not fit in any available VRAM and buying a GPU is not an option. An EPYC with 12-channel DDR5 and 512+ GB RAM can run 70B at 7 t/s. Batch processing, not interactive.

The optimized fork ik_llama.cpp gets 1.8-5x faster prompt processing than mainline llama.cpp on CPUs with AVX-512, though generation speed gains are more modest.

The smallest useful CPU models

For CPU-only machines without a GPU, the models worth considering are limited:

Model Parameters RAM needed Desktop CPU speed Quality
BitNet b1.58 2B4T 2.4B ~0.4 GB 10-20+ t/s Classification, simple Q&A; MMLU 53
SmolLM2 1.7B 1.7B ~1 GB 15-25 t/s Summarization, rewriting; beats Llama 3.2 1B
Llama 3.2 1B 1.3B ~0.8 GB 20-30 t/s Basic tasks; best gains from fine-tuning
Llama 3.2 3B 3.2B ~2 GB 10-15 t/s Simple Q&A, summarization
Phi-4 Mini 3.8B 3.8B ~2.3 GB 7-10 t/s Simple coding, reasoning
Llama 3.1 8B 8B ~4.8 GB 4-5 t/s (DDR4) / 15-20 t/s (DDR5) C+ tier; minimum useful

Models under 3B parameters are not general-purpose assistants, but they have real production uses beyond research. SmolLM2 1.7B outperforms Llama 3.2 1B on most benchmarks. Gemma 3n E2B runs in 2 GB RAM and exceeds 1300 Elo on LMArena — the first sub-10B model to do so. Fine-tuned sub-2B models match or exceed frontier models on focused tasks like structured extraction and classification (see the 4 GB VRAM section).

Where tiny models add the most value in a self-hosted stack:

  • Classification and routing — a 0.5B classifier routes queries to the right model tier, saving inference cost on easy queries
  • Structured extraction — parsing invoices, resumes, forms into JSON
  • Speculative decoding — a tiny draft model proposes tokens, a large model verifies in parallel, achieving 2-3x speedup with zero quality loss (Google Research); see the sketch after this list
  • PII redaction — well-defined token-level task where small models excel
  • Edge deployment — Raspberry Pi 5 runs a 1B model at 7-15 t/s for offline classification or sensor data processing

The 8B tier on CPU is the minimum for genuinely useful general-purpose output. Below that, you are trading generality for specialization — a valid tradeoff for focused pipelines, privacy-constrained environments, and edge deployment.
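Speculative decoding in particular needs no custom infrastructure. A sketch using Hugging Face transformers assisted generation; the model IDs are examples and the draft model must share the target's tokenizer family:

```python
# Speculative decoding via assisted generation: a small draft model proposes
# several tokens, the large model verifies them in one forward pass.
# Output matches running the large model alone; only latency changes.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-8B-Instruct"   # example target model
draft_id = "meta-llama/Llama-3.2-1B-Instruct"    # example draft model (same tokenizer family)

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

inputs = tok("Classify this ticket as billing, technical, or other: ...",
             return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```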

The context window problem

Real workloads routinely exceed 128K tokens. A medium codebase is 200K-500K tokens. A 50-page contract is 40K tokens. An agentic session with 30+ tool calls accumulates 100K-300K tokens. Document collection analysis, multi-session memory, and RAG over large corpora push past 500K easily.

Gemini 2.5 Pro handles 1 million tokens natively. Claude Opus 4.6 and Sonnet 4.6 handle 1M. GPT-4.1 takes 1M. xAI's Grok 4.20 takes 2M. These are the context windows where real work happens.

Open models top out at 128K. That is not a VRAM limitation — it is a training limitation. No open model available in April 2026 was trained on context beyond 128K-256K, and quality degrades well before the stated maximum. This is the single largest gap between self-hosted and API inference, and no amount of hardware closes it.

VRAM limits on context

Every token in context consumes VRAM for its KV cache entry. The formula: 2 x layers x KV_heads x head_dim x bytes_per_element. For Llama 3.1 8B (32 layers, 8 KV heads, 128 head dim), that is ~128 MB per 1,000 tokens at FP16. Q8 KV halves that to ~64 MB/1K with negligible quality loss. Q4 KV halves again but degrades retrieval accuracy.
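That arithmetic as a short sketch (the architecture numbers are the Llama 3.1 8B values quoted above; other models differ):

```python
# KV cache per token: 2 (K and V) x layers x KV heads x head_dim x bytes per element.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
fp16 = kv_bytes_per_token(32, 8, 128, 2)    # ~128 MB per 1,000 tokens
q8   = kv_bytes_per_token(32, 8, 128, 1)    # ~64 MB per 1,000 tokens
q4   = kv_bytes_per_token(32, 8, 128, 0.5)  # ~32 MB per 1,000 tokens

def max_context_tokens(vram_gb: float, model_gb: float, kv_bytes: float,
                       overhead_gb: float = 1.0) -> int:
    """Tokens of context that fit after the weights and runtime overhead."""
    free = (vram_gb - model_gb - overhead_gb) * 1024**3
    return int(free / kv_bytes)

print(max_context_tokens(8, 4.8, q8))    # ~36K, matching the 8 GB row in the table below
print(max_context_tokens(24, 4.8, q8))   # well past the 128K trained limit
```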

Trained context windows cap the usable range regardless of VRAM:

Model Trained context Notes
Llama 3.1 / 3.3 128K Best open 70B; hard ceiling at 128K
Qwen3 128K YaRN-extended; quality degrades past training range
Gemma 4 128K
Mistral Nemo 128K
Gemini 2.5 Pro (API) 1,000K 8x the best open model
Claude Opus 4.6 / Sonnet 4.6 (API) 1,000K 8x the best open model
GPT-4.1 (API) 1,000K 8x the best open model
Grok 4.20 (xAI API) 2,000K 16x the best open model

Every open model stops at 128K. All four frontier API providers — Anthropic, Google, OpenAI, and xAI — now offer 1M-2M token context windows. For workloads that routinely exceed 128K — codebase analysis, long document chains, agent sessions — this gap is not solvable with more VRAM. The model simply was not trained for it.

All context numbers below use Q8 KV cache (recommended — negligible quality loss vs FP16, half the VRAM). Q4 KV doubles these limits but degrades retrieval accuracy. Models at Q4_K_M weights throughout.

VRAM Model Max context (Q8 KV) Max context (Q4 KV) Notes
8 GB 8B ~36K ~73K Minimum useful setup
14B+ Does not fit
12 GB 8B ~112K 128K (trained limit) 128K with Q4 KV only
14B ~43K ~86K
16 GB 8B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
14B ~93K 128K (trained limit)
24 GB 8B 128K (trained limit) 128K (trained limit) Full context
14B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
32B ~35K ~70K Short context only
48 GB 14B 128K (trained limit) 128K (trained limit) Full context
32B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
70B ~46K ~92K Short-to-medium context
80 GB 32B 128K (trained limit) 128K (trained limit) Full context
70B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
96 GB 70B 128K (trained limit) 128K (trained limit) Full context; best single-GPU for 70B
128 GB (Strix Halo / CPU) 70B 128K (trained limit) 128K (trained limit) Full context; ~4.5 t/s (slow)

KV cache at Q8 (per 1K tokens): 8B = 64 MB, 14B = 80 MB, 32B = 128 MB, 70B = 160 MB. Q4 halves these. 1 GB of headroom is reserved for runtime overhead.

For CPU/RAM inference, the same arithmetic applies but "VRAM" becomes "available RAM after OS and model." A 64 GB DDR4 system running 8B Q4 (4.7 GB model + OS overhead) has roughly 55 GB available for KV cache — same capacity as a 48 GB GPU, but 10-15x slower due to DDR4 bandwidth (~42 GB/s vs GPU's 900+ GB/s).

Context rot: accuracy degrades with length

Even when the VRAM fits, model accuracy drops as context grows. The Chroma 2025 study tested 18 models and found 20-50% accuracy degradation from 10K to 100K tokens across every model tested, including frontier APIs. Open models degrade faster.

The worst-case position is the middle of the context window. Beginning and end receive stronger attention (the "lost in the middle" effect, Liu et al. 2024). For retrieval tasks where the target information can appear anywhere, this creates systematic blind spots.

On harder retrieval tasks requiring ordered extraction across long context (Sequential-NIAH, arXiv 2504.04713, EMNLP 2025), the best model tested scored 63.5%. Standard "find the passkey" benchmarks show 90-100% at 128K; the realistic number for semantic retrieval is 50-65%. This is a model capability ceiling, not a hardware limitation.

Non-literal retrieval is worse still. The NoLiMa benchmark (ICML 2025) tested 13 LLMs on tasks requiring semantic similarity matching rather than exact text lookup. Open-source models showed inverted U-shaped performance curves beyond critical context thresholds — accuracy degrades substantially when the task requires reasoning about meaning rather than matching strings.

KV cache quantization compounds these losses. FP16 and Q8 KV cache produce no measurable retrieval degradation. Q4 KV cache adds detectable accuracy loss on top of context rot, particularly on retrieval tasks. For any workload where finding information in context matters, Q8 KV cache is the minimum — Q4 saves VRAM at the cost of making the context less reliable.

TurboQuant: KV cache compression (ICLR 2026)

TurboQuant (Google Research, ICLR 2026) compresses KV cache to 2.5-3.5 bits per channel with near-zero quality loss. This is not weight quantization — it is complementary to Q4_K_M model weights. You can run a model at Q4 weights AND TurboQuant 3-bit KV cache simultaneously.

KV cache method Bits/channel LongBench score Needle retrieval VRAM per 1K tokens (8B model)
FP16 (baseline) 16 50.06 0.997 128 MB
Q8 (current standard) 8 ~50.06 ~0.997 64 MB
TurboQuant 3.5 50.06 0.997 ~28 MB
TurboQuant 2.5 49.44 ~20 MB
Q4 (current) 4 degrades degrades 32 MB
KIVI 3-bit 3 48.50 0.981 ~24 MB

At 3.5 bits, TurboQuant matches FP16 quality exactly — identical LongBench score, identical needle retrieval. At 2.5 bits, it outperforms KIVI at 3 bits. The method is data-oblivious: no calibration data, no per-model tuning. It works by applying a random orthogonal rotation that makes vector coordinates near-independent, then using mathematically optimal scalar quantizers.

If TurboQuant 3-bit KV becomes standard, the VRAM context table above roughly doubles: an 8B model on 24 GB at 3-bit KV would reach 128K comfortably instead of needing Q8 at 128K. A 70B model on 96 GB at 3-bit KV would push past 200K context.

Status (April 2026): No official Google implementation. Community ports exist for llama.cpp (TQ3_0 format, not merged), vLLM (Triton kernels, not merged), and PyTorch/HuggingFace. None are in mainline frameworks yet. Official integration expected Q2-Q3 2026.

Hardware cost to match API context

API capability Best local equivalent Hardware cost What you actually get
GPT-4.1 at 128K (A- quality) 8B Q4 on RTX 4090 ~EUR 1,800 128K context but C+ quality — 2 tiers below
GPT-4.1 at 128K (A- quality) 70B Q4 on 2x RTX 4090 ~EUR 3,500 128K context at B+ quality — 1 tier below
Claude 1M (A+ quality) 70B Q4 on A100 80GB ~EUR 15,000 Capped at 128K (872K shorter), B+ quality
Claude 1M (A+ quality) Nothing No open model trained past 128K
Gemini 1M (A+ quality) Nothing Not achievable locally at any price

The gap is stark beyond 128K. No combination of hardware and open models reaches 200K context. All four frontier API providers now offer 1M-2M at A+ quality. Self-hosted tops out at 128K at B+ quality. For workloads that need the full context — codebase analysis, long document processing, extended agent sessions — APIs are the only option.

Working around the 128K ceiling

Several techniques exist to process more data than fits in a single context window. None of them are transparent substitutes for native long context.

RAG (Retrieval-Augmented Generation) is the most production-ready approach. Split documents into chunks, embed them in a vector database, retrieve only the relevant chunks per query. Quality reaches 70-85% of native long context for retrieval tasks (EMNLP 2024), but fundamentally cannot do cross-document reasoning — connecting a clause on page 40 to a definition on page 3 requires both to be in context simultaneously. RAG turns the problem from "LLM reasons over all data" into "retrieval system selects data, LLM reasons over selection." The retrieval quality becomes the ceiling.
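A minimal RAG retrieval sketch; the embedding model, chunk size, and file name are illustrative choices, not recommendations from this article:

```python
# Minimal RAG retrieval: chunk, embed, retrieve top-k by cosine similarity.
# Only the retrieved chunks go into the LLM context, so the corpus can be
# arbitrarily large while the prompt stays under the 128K limit.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 22M-parameter example model

def chunk(text: str, size: int = 1000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

corpus = chunk(open("contract.txt").read())          # placeholder document
corpus_vecs = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 5) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q                          # cosine similarity (normalized vectors)
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n\n".join(retrieve("What is the termination notice period?"))
# `context` is then prepended to the prompt sent to the local or API model.
```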

Context compression (LLMLingua, Microsoft) removes unimportant tokens to fit more into the same window. At 2-4x compression, quality loss is minimal. At 10-20x, degradation is measurable. A 128K window with 4x compression gives roughly 512K effective tokens — useful, but still below 1M native context and with quality loss on compressed portions.

Map-reduce / chunking processes each chunk independently, then combines results. Works for summarization (60-80% quality). Fails for reasoning that requires cross-chunk information.

Hierarchical summarization — summarize chunks, then summarize summaries — works for gist but fails for precision. Each summarization round is lossy, and hallucination amplification compounds: a fabricated fact in round 2 becomes "established fact" in round 3.

StreamingLLM solves a different problem. It enables continuous generation over arbitrarily long streams by preserving "attention sink" tokens, but the model can only see the current window. Information outside the window is gone. Useful for long chat sessions, not for document analysis.

Technique Production ready? Quality vs native long context Real substitute?
RAG Yes 70-85% (retrieval) / poor (reasoning) Partial — good for search, poor for synthesis
LLMLingua compression Yes 85-95% at 2-4x compression Partial — extends window modestly
Map-reduce / chunking Yes 60-80% depending on task No — loses cross-chunk connections
Hierarchical summarization Yes (for summaries) 50-70% for detail; good for gist No — lossy compression compounds
StreamingLLM Yes (for chat) N/A (different problem) No — does not extend reasoning context

The LaRA benchmark (ICML 2025) tested 2,326 cases across multiple tasks and concluded there is no silver bullet — the best approach depends on task type, model capabilities, and retrieval characteristics. One concrete data point: switching from chunked medical records to full-context patient histories improved diagnostic accuracy by 23% because the model could see temporal patterns across years of data.

The "lost in the middle" problem (Liu et al., TACL 2024) complicates even native long context: models perform best when relevant information is at the beginning or end of the context, with 30%+ performance degradation for information in the middle. This affects all models, including those designed for long context.

For self-hosters, the practical hierarchy is: if the workload fits within 128K, local models are competitive. If it needs 128K-512K, RAG or compression can partially bridge the gap with measurable quality penalties. Beyond 512K, APIs with native 1M context are the only reliable option.

API pricing (April 2026)

API pricing dropped roughly 80% since early 2025. Budget-tier models now cost $0.02-0.40 per million tokens.

Tier Model Input $/MTok Output $/MTok Context Quality
Budget Mistral Nemo $0.02 $0.04 128K C+
Budget GPT-4.1-nano $0.05 $0.20 1M C+
Budget GPT-4.1-mini $0.16 $0.64 1M B-
Mid DeepSeek V3.2 $0.28 $0.42 128K B+
Mid Gemini 2.5 Flash $0.15 $3.50 1M B+
Mid GPT-4.1 $2.00 $8.00 1M A-
Mid Grok 4.1 Fast $0.20 $0.50 2M B+
Mid Grok 4.20 $2.00 $6.00 2M A
Mid Claude Haiku 4.5 $1.00 $5.00 200K A-
Frontier Claude Sonnet 4.6 $3.00 $15.00 1M A+
Frontier Gemini 2.5 Pro $1.25 $10.00 1M A+
Frontier Claude Opus 4.6 $5.00 $25.00 1M S
Frontier OpenAI o3 $2.00 $8.00 200K S (reasoning)

Subscription tiers exist for heavy users: Claude Pro at $20/month, Claude Max at $100-200/month with higher rate limits. Google, OpenAI, and others have similar tiered access. The per-token rates above are the pay-as-you-go ceiling, not the effective cost for regular users.

Cached input pricing drops costs further. Claude cache hits cost 0.1x the base rate. Gemini 2.5 Flash cached input is $0.03/MTok. Workloads with repeated context (RAG, agent loops) benefit from caching.

The cost comparison

Self-hosted at 100% utilization

Best case for self-hosting: a GPU running batch inference around the clock.

Setup Model Monthly cost (3yr amort.) Tokens/month EUR/MTok Quality
RTX Pro 6000 70B vLLM batched ~EUR 380 ~2.7B EUR 0.14 B/B+
RTX Pro 6000 8B vLLM batched ~EUR 380 ~23B EUR 0.017 C+
2x RTX 3090 8B single-stream ~EUR 130 ~0.27B EUR 0.48 C+
Strix Halo 70B single-stream ~EUR 90 ~13M EUR 6.9 B/B+

All costs in EUR. RTX Pro 6000 at ~$8,500 = ~EUR 7,800 at April 2026 exchange rates. 3-year amortization includes hardware + electricity at EUR 0.15/kWh.

At 100% utilization, self-hosted 8B on an RTX Pro 6000 (EUR 0.017/MTok) undercuts Mistral Nemo API pricing (EUR 0.037/MTok) while running the model locally. Self-hosted 70B at EUR 0.14/MTok undercuts DeepSeek V3.2 API (EUR 0.39/MTok) for comparable B+ output.
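The per-token economics in this table and the utilization table below reduce to one formula. A sketch under similar assumptions (3-year amortization, EUR 0.15/kWh, 720W system draw); the exact monthly cost estimate differs slightly from the tables, but the curve is the same:

```python
# EUR per million tokens for a self-hosted GPU, as a function of utilization.
def eur_per_mtok(hardware_eur: float, system_watts: float, tokens_per_sec: float,
                 utilization: float, eur_per_kwh: float = 0.15,
                 amort_months: int = 36) -> float:
    hours_per_month = 730
    amort = hardware_eur / amort_months                                # hardware share
    power = system_watts / 1000 * hours_per_month * eur_per_kwh       # electricity, always on
    tokens_m = tokens_per_sec * 3600 * hours_per_month * utilization / 1e6
    return (amort + power) / tokens_m

# RTX Pro 6000 (~EUR 7,800), 70B batched at ~1,031 t/s
for u in (1.0, 0.5, 0.2, 0.1):
    print(u, round(eur_per_mtok(7800, 720, 1031, u), 2))
# -> roughly 0.11, 0.22, 0.55, 1.09 with these assumptions
```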

The utilization trap

Nobody runs a single-operator GPU at 100%. You ask questions during working hours. The GPU idles overnight, on weekends, during meetings. Realistic utilization for one person or a small team: 10-30%.

Utilization RTX Pro 6000 70B EUR/MTok vs DeepSeek V3.2 API (EUR 0.39)
100% EUR 0.14 Self-hosted 2.8x cheaper
50% EUR 0.28 Self-hosted 1.4x cheaper
20% EUR 0.70 API 1.8x cheaper
10% EUR 1.40 API 3.6x cheaper

At 20% utilization, self-hosted 70B costs EUR 0.70/MTok for B+ quality output. The same money buys Gemini 2.5 Flash at B+ quality via API, with no hardware to maintain and 1M token context.

The GPU does not know you went to lunch. It depreciates whether it generates tokens or not.

Hardware depreciation

The RTX Pro 6000 costs ~EUR 7,800 today. When NVIDIA ships Rubin (next-generation architecture, expected ~2027), the Pro 6000 will be worth roughly EUR 3,700-4,600. Three years out: EUR 2,300-2,800. GPU depreciation runs 40-60% over 2-3 years.

API subscriptions depreciate at 0%. Every month brings the latest models. No replacement planning, no resale.

The lease option (EUR 500/month for 18 months, EUR 9,000 total) costs more than buying outright.

Quantization

Quantization compresses model weights to use less memory and run faster, at some quality cost.
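The file sizes in the tables below follow directly from bits per weight; a sketch:

```python
# Approximate on-disk / in-VRAM size of a quantized model.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bytes -> GB

print(model_size_gb(70, 16))    # ~140 GB  FP16
print(model_size_gb(70, 4.8))   # ~42 GB   Q4_K_M (table says ~40 GB; K-quants mix block types)
print(model_size_gb(8, 4.8))    # ~4.8 GB  close to the ~4.7 GB quoted for Llama 3.1 8B
print(model_size_gb(70, 1.58))  # ~14 GB   hypothetical 70B ternary model
```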

70B model

Quantization Bits/weight 70B model size Quality vs FP16 Speed vs FP16 Minimum GPU
FP16 (full precision) 16.0 ~140 GB 100% 1.0x (baseline) 2x H100 SXM
Q8_0 8.0 ~70 GB ~99% ~1.8-2.0x 1x RTX Pro 6000
Q6_K 6.5 ~54 GB ~99% ~2.3-2.5x 1x RTX Pro 6000
Q5_K_M 5.5 ~48 GB ~96-97% ~2.5-3.0x 1x RTX Pro 6000
Q4_K_M (common default) 4.8 ~40 GB ~96-99% (model-dependent) ~3.0-3.5x 1x RTX Pro 6000
Q3_K_M 3.4 ~33 GB ~90-95% ~3.5-4.0x 2x RTX 3090
Q3_K_S 3.0 ~28 GB ~85-92% ~4.0-4.5x 2x RTX 4060 Ti 16GB
Q2_K (emergency) 2.7 ~27 GB ~75-85% ~4.0-5.0x 1x RTX 5090
AQLM 2-bit (trained codebooks) 2.0 ~18 GB ~95-98% GPU-only 1x RTX 4090
BitNet 1.58-bit (hypothetical) 1.58 ~14 GB unknown at 70B CPU-native 1x RTX 3060 (no model exists)

8B model (Llama 3.1 8B)

Quantization Bits/weight 8B model size Quality vs FP16 Minimum GPU
FP16 16.0 ~16 GB 100% 1x RTX 4090 (24 GB)
Q8_0 8.0 ~8.5 GB ~99% 1x RX 5700 XT (8 GB, tight)
Q6_K 6.5 ~6.3 GB ~99% 1x RX 5700 XT (8 GB)
Q5_K_M 5.5 ~5.7 GB ~98-99% 1x RX 5700 XT (8 GB)
Q4_K_M (common default) 4.8 ~4.7 GB ~96-99% 1x GTX 1060 6GB
Q3_K_M 3.4 ~3.7 GB ~85-90% 1x GTX 1050 Ti (4 GB, tight)
Q2_K 2.7 ~3.1 GB ~75-85% 1x GTX 1050 Ti (4 GB)

Gemma 4 26B-A4B (MoE)

Mixture-of-experts models store all 26B parameters but only activate 4B per token. Storage follows total parameter count; inference speed follows active parameter count.

Quantization Total size (26B stored) Active per token (4B) Minimum GPU Effective speed
FP16 ~52 GB ~8 GB 1x RTX Pro 6000 Baseline
Q8_0 ~26 GB ~4 GB 1x RTX 5090 (32 GB) ~2x
Q4_K_M ~16 GB ~2.4 GB 1x RTX 4060 Ti 16GB ~3-3.5x
Q2_K ~8 GB ~1.3 GB 1x RX 5700 XT (8 GB) ~4-5x, quality loss

At Q4_K_M, the full Gemma 4 MoE fits on a 16 GB card with room for KV cache, while only reading ~2.4 GB of active weights per token. That makes it extremely fast: the RTX Pro 6000 hits 252 t/s on a similar MoE architecture (Qwen3 30B-A3B).

What quantization actually costs in quality

The degradation curve has two phases. From Q8 through Q4_K_M, quality loss is measurable but rarely perceptible. Below Q4, quality collapses non-linearly.

Quant Size reduction Perplexity loss Benchmark avg (Llama 3.1 8B) Verdict
Q8_0 47% +0.01% 69.41 (vs 69.47 FP16) Indistinguishable from FP16
Q6_K 59% +0.06% 69.23 Negligible loss
Q5_K_M 64% +0.2% 69.36 Blind testers cannot distinguish from FP16
Q4_K_M 69% +0.7% 69.15 Recommended default — 0.32 points below FP16
Q3_K_M 75% +3.3% 68.07 Usable but noticeable on reasoning tasks
Q3_K_S 77% +5.7% 65.49 Math drops from 77.6 to 68.3 (GSM8K)
Q2_K 82% +22% Extreme quality loss; perplexity climbs non-linearly
IQ2_XXS 85% +18% Catastrophic on small models (7% accuracy vs 42% at Q4)

Sources: llama.cpp Discussion #2094 (canonical perplexity), arxiv 2601.14277 (full benchmark table, Jan 2025), GFMath (IQ2_XXS catastrophe on ~70 models).

A blind test with 500+ votes on Mistral 7B confirmed: human evaluators cannot distinguish Q5_K from FP16, but consistently identify IQ1_S as worse.

What the quality loss looks like in practice: At Q8 and Q5, output is indistinguishable from full precision — same word choices, same reasoning chains, same code. At Q4_K_M, output reads identically on most prompts; edge cases in math and multilingual tasks occasionally produce different (slightly worse) answers. At Q3, reasoning tasks start showing wrong intermediate steps — the model "almost" gets it but takes a wrong turn more often. At Q2_K and below, output visibly degrades: repetitive phrasing, lost coherence on longer responses, math errors on problems the full-precision model solves correctly, and noticeably worse code generation. The blind test data matches: humans cannot spot the difference until around Q3, and confidently identify degradation at IQ1_S.

Tasks hurt differently. Coding and STEM degrade most from quantization (IJCAI 2025). Multilingual loses 15-20% at Q4 (ionio.ai). Instruction following (IFEval) drops >10% at Q4 on some model families. Red Hat's 500K evaluations on Llama 3.1 show >99% recovery at 8-bit and 96-99% at 4-bit with calibrated methods.

Model family matters: Qwen2.5 is the most quantization-tolerant family. LLaMA 3.3 is the most fragile — not recommended at Q4 or below.

Bigger quantized beats smaller full precision

If VRAM is the constraint, always choose the larger model at lower precision over a smaller model at full precision. The data is clear:

Configuration Perplexity VRAM Model family
13B at Q4 5.41 ~8 GB Llama 1
7B at FP16 5.96 ~14 GB Llama 1
65B at Q2_K 4.10 ~20 GB Llama 1
33B at FP16 4.16 ~66 GB Llama 1

These numbers are from the original Llama 1 family (llama.cpp canonical perplexity data). The principle holds across newer model families, though quantization tolerance varies: Qwen2.5 is the most tolerant, LLaMA 3.3 the most fragile.

The 13B at Q4 uses half the VRAM of the 7B at FP16 and produces better output. The 65B at Q2_K matches the 33B at FP16 in one-third the memory. Parameter count beats precision until you hit extreme quantization (below Q3), where both advantages erode.

All four major 4-bit methods (GPTQ, AWQ, EXL2, GGUF Q4_K_M) achieve nearly identical quality (perplexity 4.31-4.36 on Llama-2-13B). The differences are speed (EXL2 fastest) and VRAM efficiency (GPTQ most efficient), not quality.

Extreme quantization: ternary weights and sub-2-bit

All of the above methods are post-training quantization, compressing an FP16 model after it has been trained. A different approach trains the model with quantized weights from the start.

BitNet b1.58 (Microsoft Research / Tsinghua University) constrains every weight to one of three values: {-1, 0, +1}. That is log2(3) = 1.58 bits per weight. Multiplications become additions. A 2B parameter model fits in 0.4 GB instead of ~4 GB at FP16, roughly 10x smaller.

The only production-quality open model is bitnet-b1.58-2B-4T (2 billion parameters, 4 trillion training tokens, Apache 2.0 license). Community models exist at 8B but are experimental.

Model MMLU (5-shot) Size Speed (i7-13800H) EUR/MTok (15W laptop)
BitNet b1.58 2B 53.17 0.4 GB ~34 t/s EUR 0.006
Llama 3.2 1B (FP16) 49.30 ~2.5 GB ~21 t/s EUR 0.010
Qwen2.5 1.5B (FP16) 60.25 ~3 GB ~15 t/s EUR 0.014

The energy savings are real: 71-82% less energy than FP16 on x86, 55-70% on ARM. These numbers come from Microsoft's bitnet.cpp paper and have been verified across multiple sources. The inference framework (bitnet.cpp) is production-ready on CPU, with GPU support added in May 2025.

There is one catch that matters for anyone evaluating this: speed and energy benefits only appear when using bitnet.cpp. Loading the model in standard HuggingFace transformers gives you the smaller memory footprint but none of the ternary kernel speedups. You cannot use Ollama, LM Studio, or any other llama.cpp frontend for native BitNet inference as of April 2026.

Post-training methods can also push below 2 bits. AQLM (Yandex/IST Austria, ICML 2024) compresses Llama 2 7B to 2.02 bits with a WikiText2 perplexity of 6.59, versus 5.47 at FP16, roughly 2-5% task quality degradation. QTIP (NeurIPS 2024 Spotlight) achieves similar quality at 2-bit with >3x inference speedup. These are GPU-only methods: they reduce VRAM usage, not power draw.

Does extreme quantization change the economics?

At first glance, 10x smaller models should flip the self-hosting math. A 2B model in 0.4 GB running on a 15W laptop at 25 tokens per second costs about EUR 0.006 per million tokens in electricity, 6x cheaper than Mistral Nemo API.

The problem is quality. BitNet 2B scores MMLU 53. Mistral Nemo 12B scores around 70. These are not the same tier of model. A 2B model, regardless of quantization method, cannot follow complex instructions, write useful code, or maintain coherence over long conversations. Saving EUR 0.03 per million tokens while getting dramatically worse output is not a savings.

The real promise is running larger models on cheaper hardware. A hypothetical 70B BitNet model would need ~18 GB instead of 140 GB, fitting on a single RTX 3090. But the model zoo has not caught up to the technique. Microsoft's verified model is 2B. PrismML released Bonsai 8B in March 2026, claiming 1.15 GB footprint and 368 t/s on an RTX 4090, but these are vendor claims and independent benchmarks are still sparse. Until larger ternary models are independently verified, extreme quantization changes what is theoretically possible without changing what you can actually run today.

Software

Running a model locally in 2026 is straightforward. The tooling is good.

Ollama wraps llama.cpp behind a REST API with model management. Install, pull a model, start chatting.
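A minimal call against Ollama's local REST API; the model tag is an example, any pulled model works:

```python
# Ollama exposes llama.cpp behind a local REST API on port 11434.
# Pull a model first:  ollama pull llama3.1:8b
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",      # example model tag
        "prompt": "Summarize: the GPU idles overnight and depreciates anyway.",
        "stream": False,             # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])
```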

llama.cpp is the foundation. CUDA, Vulkan, Metal, CPU with AVX-512/AMX. GGUF quantization format.

ik_llama.cpp is an optimized fork of llama.cpp with rewritten CPU kernels. 1.8-5.2x faster prompt processing on AVX2/AVX-512 CPUs. Same GGUF models, drop-in replacement. Use this instead of mainline llama.cpp for CPU-bound workloads. Benchmarks in the CPU comparison section.

vLLM handles concurrent serving with PagedAttention and continuous batching. Use this when serving multiple users from one GPU.

bitnet.cpp is required for native ternary inference. Separate from llama.cpp. CPU-primary, with CUDA support since May 2025. Requires Clang 18+ to build.

The cheapest hardware myth

A common misconception: "I already own the hardware, so inference is free." It is not. Electricity costs money, and consumer GPUs draw 230-415W under inference load. Even ignoring the purchase price entirely, the electricity to generate tokens locally costs more per token than calling a budget API.

Proof: idle desktop CPU

Take an Intel i5-7500T in an HP ProDesk 400 G3 Desktop Mini. A machine that costs EUR 150 used and sits idle most of the day. DDR4-2400 dual-channel gives ~38 GB/s theoretical memory bandwidth. At 50% efficiency with 4 threads, that translates to roughly 4-5 tokens per second on an 8B Q4 model.

The system draws about 22W idle and 45W under inference load. The marginal 23W for inference costs EUR 2.48 per month in electricity at Finnish rates (EUR 0.15/kWh).
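The per-token electricity figures in the table below come from one line of arithmetic; a sketch (this CPU section uses marginal watts, the GPU tables later use full system watts, the formula is the same):

```python
# Electricity cost per million tokens from power draw and generation speed.
def electricity_eur_per_mtok(watts: float, tokens_per_sec: float,
                             eur_per_kwh: float = 0.15) -> float:
    hours_per_mtok = 1e6 / tokens_per_sec / 3600
    return watts / 1000 * hours_per_mtok * eur_per_kwh

print(electricity_eur_per_mtok(23, 4.5))   # i5-7500T marginal watts, 8B Q4  -> ~0.21 EUR/MTok
print(electricity_eur_per_mtok(23, 40))    # i5-7500T marginal watts, 0.6B   -> ~0.024 EUR/MTok
```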

Model Size on disk Tokens/sec (est.) Tokens/month (24/7) EUR/MTok (electricity only) vs Mistral Nemo API
Qwen3 0.6B Q4 ~0.4 GB ~40 103.7M EUR 0.024 1.5x cheaper, but trivial tasks only
BitNet b1.58 2B (ternary) 0.4 GB ~15-25 51.8M EUR 0.048 1.3x more expensive, D-tier quality
TinyLlama 1.1B Q4 ~0.7 GB ~28 72.6M EUR 0.034 Break-even, basic completion only
Bonsai 8B (1-bit, unverified) 1.15 GB ~15 38.9M EUR 0.064 1.7x more expensive, claims 8B quality
Phi-4 Mini 3.8B Q4 ~2.2 GB ~7.6 19.7M EUR 0.126 3.4x more expensive
Llama 3.1 8B Q4 ~4.7 GB ~4-5 11.7M EUR 0.212 5.7x more expensive

BitNet b1.58 2B runs via bitnet.cpp (not llama.cpp). The ternary weights reduce memory reads, but the i5-7500T's 4 cores with AVX2-only still cap throughput around 15-25 t/s. Bonsai 8B speed is estimated from memory bandwidth; PrismML's claimed 368 t/s is on an RTX 4090 GPU. Independent CPU benchmarks for Bonsai are sparse as of April 2026.

At the same C+ quality tier (8B model), the API costs EUR 0.037 per million tokens. The i5-7500T costs EUR 0.212. The API is 5.7 times cheaper on electricity alone, before counting the computer's purchase price.

The extreme quantization models do not change the math. BitNet 2B at EUR 0.048/MTok is 1.3x more expensive than the API and produces D-tier output (MMLU 53 vs Mistral Nemo's ~70). The only models where electricity beats the API are sub-1B parameter models at Q4, and their output quality is so far below Mistral Nemo that the comparison is meaningless.

CPU comparison: DDR4 to DDR5

The i5-7500T is the cheapest case. How do faster CPUs compare? Memory bandwidth is the bottleneck — more channels, faster memory, more tokens per second.

CPU Memory Bandwidth (theoretical) Cores System idle/load Llama 3.1 8B Q4 t/s (est.) EUR/MTok (marginal electricity)
Intel N100 DDR5-4800 1-ch 38.4 GB/s 4E 8W / 16W ~3 EUR 0.111
Intel i3-N305 DDR5-4800 1-ch 38.4 GB/s 8E 10W / 22W ~3.5 EUR 0.143
Intel i5-7500T DDR4-2400 2-ch 38.4 GB/s 4C/4T 22W / 45W ~4-5 EUR 0.212
Intel i5-8500T DDR4-2666 2-ch 42.7 GB/s 6C/6T 22W / 48W ~5-6 EUR 0.180
AMD Ryzen 5 5600X DDR4-3200 2-ch 51.2 GB/s 6C/12T 45W / 95W ~6-7 EUR 0.297
AMD Ryzen 9 5900X DDR4-3200 2-ch 51.2 GB/s 12C/24T 55W / 120W ~6-8 EUR 0.338
AMD Ryzen 9 7950X DDR5-5200 2-ch 83.2 GB/s 16C/32T 65W / 145W ~10-12 EUR 0.277

Electricity rate: EUR 0.15/kWh. EUR/MTok calculated from marginal power (load minus idle) at estimated token rate, 24/7 operation. N100 and i3-N305 are single-channel memory, which caps bandwidth despite DDR5 speeds. The Ryzen 7950X is 2-3x faster than the i5-7500T thanks to DDR5 dual-channel (83 GB/s vs 38 GB/s).

Every CPU in this table costs 3x to 9x more per token in electricity than the Mistral Nemo API (EUR 0.037/MTok). The fastest desktop CPU tested (Ryzen 7950X) still costs 7.5x more. More cores do not help — dual-channel DDR4 or DDR5 memory bandwidth is the wall.

One exception to the "llama.cpp is llama.cpp" assumption: ik_llama.cpp, an optimized fork, achieves 1.8-5.2x faster prompt processing than mainline llama.cpp on the same hardware. Benchmarks on a Ryzen 7950X (16 threads, AVX2):

Quantization ik_llama.cpp (t/s) llama.cpp (t/s) Speedup
BF16 256.9 78.6 3.3x
Q8_0 268.2 147.9 1.8x
Q4_0 273.5 153.5 1.8x
IQ3_S 156.5 30.2 5.2x
Q8_K_R8 370.1 N/A new format

These are prompt processing (prefill) speeds, not generation. Generation improves a more modest 1.0-2.1x. The key finding: on the Ryzen 7950X, the slowest quantization type in ik_llama.cpp is faster than the fastest type in mainline llama.cpp for prompt processing. For CPU inference workloads that involve large prompts — document summarization, RAG extraction, batch classification — the fork makes a material difference. It does not change the electricity cost comparison (the watts are the same), but it reduces wall-clock time per job, which matters for throughput-limited batch work.

Proof: every consumer GPU loses

The pattern holds across dedicated GPUs. Inference power draw runs about 55-75% of the card's TDP (the decode phase is memory-bandwidth-bound, not compute-bound). Add 120W for the rest of the system.

GPU VRAM Llama 3.1 8B Q4 tok/s System watts EUR/MTok (electricity) vs API
RX 5700 XT 8 GB ~55 (Vulkan, no ROCm) 266W EUR 0.201 5.4x
RX 6700 XT 12 GB ~45 270W EUR 0.250 6.8x
RX 6900 XT 16 GB ~60 315W EUR 0.219 5.9x
RTX 3060 12 GB ~42 230W EUR 0.228 6.2x
RTX 3090 24 GB ~87 345W EUR 0.165 4.5x
RTX 4060 Ti 16GB 16 GB ~48 227W EUR 0.197 5.3x
RTX 4090 24 GB ~128 413W EUR 0.134 3.6x
RX 7900 XTX 24 GB ~80 351W EUR 0.183 4.9x

API baseline: Mistral Nemo at EUR 0.037/MTok ($0.04/MTok output). Electricity rate: EUR 0.15/kWh. RX 5700 XT runs Vulkan only (ROCm dropped Navi 10 support); speed scaled from llama.cpp Vulkan scoreboard data.

Every consumer GPU costs 3.6 to 6.8 times more in electricity per token than the API. The RTX 4090, the fastest consumer card, still costs 3.6x more. At German electricity rates (EUR 0.30/kWh), it costs 7.2x more.

The ex-mining AMD cards are fast for their price but still lose to the API. The RX 5700 XT draws 266W system power for ~55 tokens per second — its wide 256-bit memory bus (448 GB/s) makes it the fastest sub-$200 used card. The RX 6900 XT draws 315W for ~60 t/s: 18% more power for 9% more speed over the 5700 XT, because both share a 256-bit bus and differ mainly in clock speed.

For the best consumer GPU (RTX 4090) to break even with Mistral Nemo API on electricity alone, the electricity rate would need to be EUR 0.041/kWh. That is below any residential rate in Europe.

Enterprise GPUs: faster but still losing

Enterprise GPUs have HBM memory with 2-4x the bandwidth of consumer GDDR. They are faster per token but draw more power and sit in servers with higher base consumption. System power assumes +200W for a server chassis (vs +120W for a desktop).

GPU VRAM Bandwidth 8B Q4 tok/s (est.) System watts EUR/MTok (electricity) vs API Purchase price (Apr 2026 est.)
L40S 48 GB GDDR6 864 GB/s ~110 550W EUR 0.208 5.6x $7,500-9,000
A100 40GB PCIe 40 GB HBM2e 1,555 GB/s ~195 450W EUR 0.096 2.6x $5,000-8,000 (used)
A100 80GB SXM 80 GB HBM2e 2,039 GB/s ~260 600W EUR 0.096 2.6x $12,000-18,000
RTX Pro 6000 96 GB GDDR7 1,792 GB/s ~279 (7B bench) 720W EUR 0.107 2.9x $8,000-9,200
H100 SXM 80 GB HBM3 3,352 GB/s ~400 900W EUR 0.094 2.5x $22,000-30,000

The H100, the fastest GPU in this comparison, still costs 2.5x more per token in electricity than the API. The L40S is worse than consumer GPUs per token because its 864 GB/s bandwidth is slower than the RTX 4090's 1,008 GB/s but draws more system power in a server chassis.

Enterprise GPUs close the gap but do not flip the math. Where they earn their price is batched serving: an H100 running vLLM with 32 concurrent users achieves aggregate throughput that amortizes the power draw across requests. Single-stream single-user inference, which is what this table measures, is the worst case for expensive hardware.

What this means

Hardware cost, depreciation, driver maintenance, cooling, and noise all come on top of electricity. If electricity alone already loses to the API, the total cost is not getting better.

Self-hosting has real advantages: privacy, zero network latency, offline capability, fine-tuning. Those are legitimate reasons to run local inference. "It is cheaper" is not one of them.

Fine-tuning: where self-hosting wins decisively

Fine-tuning is the strongest capability advantage for self-hosting. A fine-tuned 7B model routinely beats a generic 70B+ model on the specific task it was trained for (arxiv 2406.08660, Stanford). Multi-task fine-tuned Phi-3-Mini (3.8B) surpassed GPT-4o on financial benchmarks (ACM/OpenReview). A fine-tuned 350M model beat ChatGPT by 3x on structured tool calling.

What you need

Component Minimum Recommended
Dataset 50-100 examples (classification) 500-2,000 examples (generation tasks)
GPU (QLoRA) RTX 4060 Ti 16 GB (up to 7B) RTX 4090 24 GB (up to 14-20B)
GPU (full fine-tune) A100 80 GB (7B) 4x A100 (32B)
Time per run 1-2 hours (7B, QLoRA) 3-6 hours (14B, QLoRA)
Framework LLaMA-Factory (beginner, web UI) Unsloth (2x speed, 70% less VRAM)

QLoRA is the key: it reduces VRAM requirements by 4-8x versus full fine-tuning with minimal quality loss. A 7B model that needs 60-80 GB for full fine-tuning needs only 5-10 GB with QLoRA. Quality is within 1-2% of full fine-tuning on most tasks.
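A hedged QLoRA setup sketch using the Hugging Face stack; the model name, rank, and target modules are illustrative, and the exact trainer wiring depends on the transformers/peft versions installed:

```python
# QLoRA: load the base model in 4-bit (NF4), freeze it, and train small
# LoRA adapter matrices on top. Only the adapters receive gradients.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"   # example base model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of total parameters
# Training then proceeds with a standard Trainer / SFTTrainer over the curated dataset;
# the saved adapter is small and can later be merged into the base weights.
```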

500 carefully curated examples outperform 5,000 noisy ones. The LIMA paper demonstrated that 1,000 well-crafted examples could produce a competitive instruction-following model. Quality of training data matters more than quantity.

API fine-tuning is limited

Provider Models available for fine-tuning Export weights? Pricing
OpenAI GPT-4.1, 4.1 Mini, 4o Mini No $0.80-3.00/M training tokens
Google Gemini Flash/Pro (Vertex AI only) No Per compute-hour
Anthropic Claude 3 Haiku only (AWS Bedrock only) No AWS compute pricing
xAI None N/A N/A

No API provider lets you export model weights. Your fine-tuned model only works through their API, at their pricing, subject to their deprecation schedule. Self-hosted fine-tuning produces weights you own, deploy anywhere, and keep forever.

API fine-tuning also cannot do: LoRA merging (combine multiple adapters), DPO/GRPO (preference alignment), model merging (TIES/DARE/SLERP), continued pretraining, or knowledge distillation. These techniques are only available with local hardware.

Cost break-even

API costs below include both fine-tuning compute (training runs at $0.80-3.00/MTok) and inference on the fine-tuned model (at elevated fine-tuned model rates). Self-hosted costs include hardware amortization (RTX 4090, 3 years) plus electricity.

Usage pattern API cost (1 year, training + inference) Self-hosted cost (1 year) Break-even
Occasional (1 training run/month, light inference) ~EUR 150-300 ~EUR 1,700 (hardware + power) Never (API stays cheaper)
Moderate (weekly retraining, daily inference) ~EUR 2,000-5,000 ~EUR 1,750 ~4 months
Heavy (daily retraining, continuous inference) ~EUR 15,000-50,000 ~EUR 2,000 2-4 weeks

The inference cost is the driver. Self-hosted inference on a fine-tuned model costs only electricity after hardware; API inference on a fine-tuned model costs $0.80-3.00/M tokens input, forever. At moderate-to-heavy inference volume, the hardware pays for itself quickly.

When NOT to fine-tune

Fine-tuning teaches behavior and domain patterns, not reasoning ability. A fine-tuned 7B cannot match GPT-4 on general reasoning. It can match or beat GPT-4 on the specific narrow task it was trained for.

If knowledge changes frequently, use RAG instead — retrieval stays current without retraining. If factual accuracy with source citation is paramount, RAG provides provenance that fine-tuning cannot. The RAFT study (UC Berkeley) found that combining fine-tuning for behavior with RAG for knowledge outperforms either approach alone.

Catastrophic forgetting is real: fine-tuning performance and general knowledge loss have a strong inverse relationship. LoRA/QLoRA reduces this (fewer parameters modified) but does not eliminate it.

When self-hosting makes sense

If data cannot leave your premises (regulatory, air-gapped, classified), API access is off the table. Self-hosted is the only option and cost comparisons do not apply.

Bulk classification, embedding generation, and document processing with a 7-8B model on modest hardware is cheap and fast. A used RTX 3090 ($800-900) runs Llama 3.1 8B at 87-112 t/s depending on configuration. These tasks do not need frontier quality.

Genuine 24/7 batch workloads keeping a GPU above 80% utilization get real per-token savings. This applies to organizations running inference as a service, not individuals using LLMs as assistants.

Fine-tuning on proprietary data is the clearest self-hosting win — see the section above. A $1,600 RTX 4090 with QLoRA handles models up to 14-20B, and the resulting model is yours to deploy anywhere.

Needle-in-a-haystack retrieval and semantic search over large private document collections are where local inference earns its cost. RAG pipelines that embed and search hundreds of thousands of documents generate millions of tokens per day in embedding and extraction queries. These are high-volume, low-quality-bar tasks where an 8B model is sufficient and API costs would compound. A single GPU running embeddings at 80-100% utilization processes the volume at a fraction of API pricing because the per-token overhead is amortized over sustained throughput.

Embedding models are often the strongest case for local deployment. Embedding models (22M-334M parameters) are distinct from generative LLMs — they produce vector representations for semantic search and RAG, not text. They run on any CPU at hundreds to thousands of embeddings per second with negligible power draw and 1-2 GB RAM total.

Model Parameters Use case Speed (CPU) RAM
all-minilm-l6-v2 22M fast filtering, similarity ~5,000 embed/sec <1 GB
nomic-embed-text 274M semantic search, RAG ~1,000 embed/sec ~1 GB
mxbai-embed-large 334M high-quality embeddings ~500 embed/sec ~1.5 GB

A 22M embedding model running on any desktop CPU powers local semantic search over thousands of documents with zero API cost. For infrastructure use cases — log analysis, document similarity, RAG retrieval — embedding models deliver more production value than 1-4B generative models. They are the one category where "I already own the hardware" is genuinely true: the compute is trivial, the electricity is negligible, and the privacy benefit is real. API-based embedding (OpenAI ada-002 at $0.10/MTok) adds up fast at scale; local embedding is effectively free after the one-time setup.
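A minimal local embedding sketch; the corpus and query are placeholders:

```python
# Local embedding on CPU: thousands of documents, no per-token billing.
import time
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # 22M params, runs fine on CPU

docs = [f"log line {i}: disk usage at {i % 100}%" for i in range(5000)]  # placeholder corpus
t0 = time.time()
doc_vecs = model.encode(docs, batch_size=256, convert_to_tensor=True)
print(f"{len(docs) / (time.time() - t0):.0f} embeddings/sec")

query_vec = model.encode(["which hosts are nearly full?"], convert_to_tensor=True)
hits = util.semantic_search(query_vec, doc_vecs, top_k=5)
print(hits[0])   # indices and scores of the closest documents
```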

Document summarization at scale follows the same pattern. Summarizing 10,000 PDFs through an API costs real money at $0.04-3.00 per million tokens depending on the model tier. On local hardware, the marginal cost is electricity and the model runs as fast as bandwidth allows. A Ryzen 7950X with 64 GB RAM can summarize documents at ~10 t/s on an 8B model continuously without per-request billing. An RTX 3090 does the same at ~87 t/s.

Model stability: no silent changes

APIs ship updates you do not see. Between the February 2026 launch of Claude Opus 4.6 and late March 2026, Anthropic shifted default behavior in ways that measurably degraded output quality for heavy users. The 5 February 2026 Opus 4.6 launch made adaptive thinking the recommended mode, letting the model decide its own reasoning budget per turn. Anthropic's own documentation states plainly: "At the default effort level (`high`), Claude almost always thinks. At lower effort levels, Claude may skip thinking for simpler problems," and warns in the tuning section that "Steering Claude to think less often may reduce quality on tasks that benefit from reasoning." Claude Code harness defaults shifted during this window, and trade press reported peak-hour session-cap tightening in late March. Each change appeared in changelogs or third-party reporting. None was broadcast to users as a quality change. The combined effect was indistinguishable from a silent quality regression.

What Anthropic actually promises

In its September 2025 engineering postmortem, Anthropic states the narrow, auditable commitment: "We never reduce model quality due to demand, time of day, or server load." That is a specific promise about one mechanism class — load-based quality throttling. It is not a promise that user-perceptible experience is stable. The same postmortem disclosed three concurrent infrastructure bugs affecting Sonnet 4, Opus 4.1, Opus 4, and Haiku 3.5 between 5 August and 18 September 2025: a context-window routing error that peaked at 16% of Sonnet 4 requests on 31 August, an output-corruption bug that produced Thai and Chinese characters in English responses between 25 August and 2 September, and an approximate top-k XLA:TPU miscompilation affecting Haiku 3.5 from 25 August onward. All three were infrastructure problems, not deliberate model changes. None were externally visible as defects until Anthropic published the postmortem.

Anthropic's own documentation also contains a striking admission about the Opus 4.7 rollout. The adaptive thinking documentation notes that the `thinking.display` default changed from `summarized` to `omitted` when Opus 4.7 shipped, and describes the change in its own words: "This is a silent change from Claude Opus 4.6, where the default was `summarized`." The provider itself uses the phrase "silent change" to describe a user-visible behavior shift between model versions at the same identifier pattern. This is the structural issue, not Anthropic-specific malice: API defaults drift between releases, and the only way a customer learns is by reading every changelog entry against every model version they depend on.

Evidence, measured and debunked

Telemetry from 6,852 Claude Code sessions filed in GitHub issue #42796 by AMD engineer Stella Laurenzo measured a 73% collapse in visible reasoning depth between January and March 2026 (median thinking chars falling from ~2,200 to ~600), a Read:Edit tool-call ratio falling from 6.6 to 2.0, and an 80x increase in deduplicated API request volume. Concurrent session scale-up explains roughly 5-10x of that multiplier; the remaining 8-16x tracks degradation-induced thrashing — retries, corrections, and wrong outputs that would not have been needed had the model reasoned properly the first time. Input-token volume rose 170x between February and March (120M to 20.5B), estimated daily cost rose 122x ($12 to $1,504), reasoning loops per thousand tool calls rose 156%, and user-interrupt events per thousand tool calls rose 556%. Anthropic shipped Opus 4.7 within two weeks of the report gaining traction, and raised the default Claude Code harness effort level to `xhigh`.

Not every "nerf proof" holds up. The most-shared visual evidence — a BridgeBench comparison showing Opus scoring 83% before the alleged nerf and 68% after — failed basic methodological review: the first run measured 6 tasks and the retest measured 30 tasks, so the two scores were not comparing like with like. On the overlapping six tasks, the before-and-after numbers were 87.6% and 85.4% — within normal run-to-run variance. The viral result was noise amplified by confirmation bias. Rumors persist that providers silently quantize weights during peak load — dropping Opus to Int4 or 1.58-bit precision, or compressing the KV cache, to save compute on expensive models. No weight probe, perplexity study, or latency fingerprint has been published showing quantization artifacts, and the symptoms heavy users describe match reasoning-budget cuts and harness bloat, not the perplexity spikes that characterize aggressive quantization. The quantization theory remains unverified speculation — but it is unfalsifiable without provider-side access, which is precisely the problem. The operator of a hosted API has the ability to do this; the customer has no way to prove they did or did not.

The silent-downgrade bug class

A separate cluster of complaints — "I paid for Opus, I got Sonnet" or "I got Haiku" — turns out to be a real, documented bug class, but not a policy. GitHub issue #19468 and its duplicates (#3434, #6602, #13242, #17966) document client-side failures in the Claude Code harness: falling back to Sonnet after an Opus quota cap, serving Sonnet despite explicit Opus configuration, configuration files being ignored after harness updates, and reauthentication silently reverting the selected model to Haiku. These are implementation defects in the client, not backend model substitution. The user experience is identical to a hidden downgrade, which is what makes the deception theory persistent even when the mechanism is actually a bug. The transparency failure is real; the malice is not established.

The cross-provider pattern

This pattern is not Anthropic-specific. Chen, Zaharia, and Zou (2023) measured GPT-4 falling from 84% to 51% on prime-number identification between March and June 2023 on identical prompts — a controlled snapshot comparison published as an arXiv preprint. OpenAI acknowledged GPT-4 Turbo "laziness" in December 2023 and shipped a fix in January 2024. Independent researchers found GPT-4 producing roughly 5% shorter responses when prompted with a December date vs. a May date, suggesting the model had internalized holiday-slowdown patterns from training data. Gemini 2.5 Pro faced similar community complaints in Q4 2025 and Q1 2026 — hallucinations, timeouts, degraded reasoning — with users speculating Google was starving Pro of TPU capacity to make room for 3.0 (unconfirmed). The underlying causes do not require malice: a task-difficulty ratchet as users push harder problems, novelty effect wearing off, RLHF regression, silent product-configuration changes, infrastructure bugs, and selection bias among power users who post loudly when things feel worse. The universality is Bayesian evidence against any specific vendor being uniquely deceptive. It is also Bayesian evidence that any API-dependent operation will eventually experience this.

Reproducibility is structural

A self-hosted model pinned to a specific weights file with a known SHA256 hash produces the same output distribution today, in a year, and in five years. The reasoning budget is whatever VRAM and latency tolerance you allocate, not what a classifier at the provider decides under peak load. There is no adaptive-thinking mis-budgeting, no prompt-cache bug inflating cost 10-20x, no 1M context recall degrading past roughly 48% utilization as documented in GitHub issue #35296, no silent client-side downgrade from Opus to Sonnet after a quota hit, no silent display-default flip between model versions, no auto-compact summarizing away the context you needed, and no rumored peak-hour quantization that cannot be confirmed or denied. Changes happen when you update the weights file. Reverting is copying the old file back. Pinning to a specific checksum is a guarantee the provider cannot make.
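
In practice, checksum pinning is a few lines. The sketch below is minimal; the filename and the pinned digest are placeholders, not real values:

```python
import hashlib
from pathlib import Path

# Digest recorded when the deployment was validated (placeholder value).
PINNED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"
WEIGHTS = Path("models/gemma-4-31b-Q4_K_M.gguf")   # hypothetical local path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so multi-GB weights never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

actual = sha256_of(WEIGHTS)
if actual != PINNED_SHA256:
    raise SystemExit(f"Refusing to serve: weights changed ({actual[:12]}... != pinned {PINNED_SHA256[:12]}...)")
print("Weights match the validated checksum; starting inference server.")
```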

For API-dependent workloads that cannot move to local inference, the practical mitigations are narrow: set `effort` to `high` or `max` explicitly rather than relying on defaults, pin `thinking.type` and `thinking.display` explicitly rather than letting model-version upgrades change them silently, treat 256K–400K as the practical context limit even on advertised 1M windows, and document the exact model snapshot (e.g. `claude-opus-4-6-20260201`) your pipeline was validated against. Priority Tier for production workloads gives predictable quota behavior. Cross-provider redundancy reduces single-vendor dependency. None of these mitigations restore byte-identical reproducibility — they buy you fewer surprises, not zero.
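
As a rough illustration of what pinning everything explicitly looks like in a request body: the endpoint, headers, model, and messages fields below are the standard Messages API shape, while the `effort` and `thinking` field names follow this article's description of the adaptive-thinking controls and should be checked against the current API reference before use.

```python
import os
import requests

# Every behavior-relevant knob is set explicitly so a model-version upgrade
# cannot silently change it. The "effort" and "thinking" field names follow
# this article's description of the adaptive-thinking controls (assumption).
payload = {
    "model": "claude-opus-4-6-20260201",         # exact validated snapshot, not an alias
    "max_tokens": 4096,
    "effort": "high",                             # do not rely on the default
    "thinking": {"type": "enabled", "display": "summarized"},
    "messages": [{"role": "user", "content": "Refactor the retry logic in worker.py"}],
}

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])
```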

For workloads where reproducibility matters more than frontier quality — regulated industries with audit requirements, long-horizon research spanning model versions, agent pipelines where behavior changes break production — model stability is a feature APIs structurally cannot match. It does not show up in cost-per-token comparisons. It compounds over years.

When APIs win

For coding, multi-step reasoning, and agentic workflows, APIs win on both quality and effective cost.

A B+ local model looks adequate on simple prompts. The gap shows on hard problems: multi-file refactoring, agent coherence past 20 turns, judgment calls in ambiguous situations. The output compiles. It does the wrong thing. These are the tasks where LLMs are worth using, and where frontier models justify their pricing.

At 20-30% utilization, APIs often cost less per token than self-hosted inference. Add maintenance (driver bugs, cooling, kernel compatibility, GPU replacement) and the gap widens.
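
The utilization argument is easiest to see as arithmetic. In the sketch below, every number is an assumption chosen to be in the ballpark of figures used elsewhere in this article (EUR 8,000 card, three-year depreciation, EUR 0.15/kWh, roughly 800 W system draw, vLLM-style batched throughput), not a measurement:

```python
HOURS_PER_MONTH = 730

def self_hosted_eur_per_mtok(hardware_eur: float, depreciation_months: int,
                             system_watts: float, eur_per_kwh: float,
                             tokens_per_second: float, utilization: float) -> float:
    """Fully loaded cost per million output tokens for an owned GPU."""
    fixed = hardware_eur / depreciation_months
    energy = (system_watts / 1000) * HOURS_PER_MONTH * utilization * eur_per_kwh
    tokens = tokens_per_second * utilization * HOURS_PER_MONTH * 3600
    return (fixed + energy) / (tokens / 1e6)

# Interactive, single-stream 70B on one 96 GB card (~30 t/s assumed), 25% busy:
print(self_hosted_eur_per_mtok(8000, 36, 800, 0.15, 30, 0.25))    # ~12 EUR/MTok
# Saturated batched serving (~1,000 t/s aggregate assumed), 90% busy:
print(self_hosted_eur_per_mtok(8000, 36, 800, 0.15, 1000, 0.90))  # ~0.13 EUR/MTok
```

The interactive number sits well above what third-party APIs typically charge to serve the same open-weight model; only the saturated batched case is competitive, which is exactly the bulk-processing scenario where this article recommends self-hosting.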

The DeepSeek V3 case study

DeepSeek V3.2 API costs about EUR 260/month for a moderate coding workload (130K queries/month averaging ~5K tokens each — typical for code completion, short Q&A, and lightweight agent calls). At heavier usage with long context (e.g. 74K tokens per query), the same 130K queries would cost ~EUR 2,500/month. Self-hosting the full 685B model needs serious hardware: 8x H100 SXM GPUs at native FP8, or 2x MI300X (192 GB each) running a quantized build.

Configuration Upfront cost Monthly operating vs API (EUR 260/mo, moderate) vs API (EUR 2,500/mo, long-context)
DeepSeek V3 on 8x H100 (new) EUR 320K EUR 9,600 37x more 3.8x more
DeepSeek V3 on 8x H100 (used) EUR 200K EUR 6,260 24x more 2.5x more
DeepSeek V3 on 2x MI300X EUR 37K EUR 1,230 5x more 0.5x (cheaper)
Qwen3-30B-A3B on 2x RTX 4090 (substitute) EUR 7,600 EUR 350 35% more (lower quality) 7x cheaper (much lower quality)

At moderate usage (EUR 260/month), self-hosting loses badly. At heavy long-context usage (EUR 2,500/month), self-hosting on MI300X starts to break even — but you are running the full 685B model, which requires specialized hardware. The consumer GPU substitute (Qwen3-30B-A3B) is a different, smaller model with lower quality.

The API is almost certainly priced below marginal cost for individual users. DeepSeek benefits from massive scale, strategic pricing, and Chinese electricity rates. At moderate usage, competing on price by buying your own hardware is a losing proposition. At heavy long-context usage, self-hosting can break even on operating costs — but the upfront hardware investment (EUR 37K-320K for the full model) means years to payback.
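
The payback arithmetic, using the table's own figures as inputs (a sketch, not a procurement model):

```python
def payback_months(upfront_eur: float, operating_eur_per_month: float,
                   api_eur_per_month: float) -> float:
    """Months until hardware pays for itself via avoided API spend.
    Returns infinity if operating costs already exceed the API bill."""
    savings = api_eur_per_month - operating_eur_per_month
    return float("inf") if savings <= 0 else upfront_eur / savings

# Heavy long-context usage (EUR 2,500/month API bill):
print(payback_months(37_000, 1_230, 2_500))    # 2x MI300X: ~29 months
print(payback_months(200_000, 6_260, 2_500))   # 8x H100 used: never pays back
# Moderate usage (EUR 260/month API bill):
print(payback_months(37_000, 1_230, 260))      # never pays back
```

Roughly two and a half years to break even on the one configuration that wins, and that assumes the workload stays heavy for the entire period.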

API pricing is variable: pay for what you use. GPU hardware is a fixed cost that depreciates regardless. Workloads that vary week to week waste less money on APIs.

At Pulsed Media

Pulsed Media runs its own hardware in owned datacenter space in Finland. GPU inference cards like the RTX Pro 6000 go through the same depreciation, power cost, and rack space analysis as any other piece of infrastructure.

For most workloads in 2026, cloud APIs deliver better quality per euro than local inference. Bulk classification, embeddings, and privacy-constrained tasks are where the GPU math works. 16 years of hardware purchasing says: buy when the math supports it, use the API when it does not.

Seedboxes and dedicated servers from Pulsed Media run on the same Finnish datacenter infrastructure, with the same cost-per-unit discipline applied to every piece of hardware. See current plans.

References

Benchmarks and performance data

  • LMSYS Chatbot Arena ELO Leaderboard — model quality rankings used for Arena ELO scores throughout this article
  • llama.cpp Vulkan GPU Scoreboard — community-submitted GPU inference benchmarks (Llama 2 7B Q4_0, tg128). Source for RX 5700 XT (70.73 t/s), GTX 1050 Ti (20.96 t/s), RTX 3060 (75.94-80.59 t/s), and other consumer GPU speeds
  • llama.cpp — the inference engine behind Ollama and most local LLM benchmarks
  • vLLM — batched serving engine; source for Pro 6000 throughput numbers (8B at 8,990 t/s, 70B at 1,031 t/s)

CPU inference and memory bandwidth

  • AMD EPYC 9554 llama.cpp benchmarks (ahelpme.com) — EPYC 9554 achieves 50 t/s on 8B Q4_K_M, confirming memory bandwidth as primary bottleneck
  • OpenBenchmarking.org llama.cpp results — 88 public CPU benchmark results, median 14.2 t/s on 8B Q8_0
  • ik_llama.cpp CPU performance — Ryzen 7950X benchmarks showing 1.8-5.2x speedup over mainline llama.cpp for prompt processing
  • ik_llama.cpp — optimized llama.cpp fork with improved CPU kernels
  • CPU token rate formula: theoretical max t/s ≈ memory bandwidth (GB/s) / model size (GB); actual ~50-70% of theoretical due to overhead

Hardware specifications

  • Speeds in this article use Q4_K_M quantization unless otherwise noted. Consumer GPU system power includes GPU + ~120W for the rest of the system; enterprise GPUs use +200W for server chassis. Electricity rate: EUR 0.15/kWh (Finnish residential average). Enterprise GPU purchase prices are estimated April 2026 market rates and shift rapidly
  • VRAM context limits use Llama 3.1 8B KV cache parameters (128 MB/1K tokens at FP16, 64 MB/1K at Q8, 32 MB/1K at Q4). Other model architectures vary; Qwen3 8B uses ~144 MB/1K and Gemma 4 31B dense uses ~850 MB/1K due to its 16 KV-head architecture

Model stability and provider changes

Primary sources (Anthropic)
  • Anthropic engineering postmortem (September 2025) — discloses the three concurrent infrastructure bugs of August-September 2025 and states the "never reduce model quality due to demand" commitment quoted in this article
  • Anthropic adaptive thinking documentation — source of the `thinking.display` default change between Opus 4.6 and 4.7, described there as a "silent change"
User telemetry and bug reports
  • GitHub issue anthropics/claude-code #42796 — AMD engineer Stella Laurenzo's telemetry from 6,852 Claude Code sessions: 73% reasoning-depth collapse (~2,200 → ~600 chars), Read:Edit ratio collapse (6.6 → 2.0), 80x increase in deduplicated API request volume, 170x input-token growth, 122x daily-cost growth, 156% reasoning-loop growth, 556% user-interrupt growth (January → March 2026)
  • GitHub issue anthropics/claude-code #35296 (filed 17 March 2026) — Claude Code user documenting 1M context window degradation, including the model itself recommending session restart by 48% context utilization
  • GitHub issue anthropics/claude-code #19468 — tracks the client-side Opus→Sonnet/Haiku downgrade bug class (duplicates #3434, #6602, #13242, #17966): settings ignored, quota fallback without notice, reauthentication reverting to Haiku
  • GitHub issue anthropics/claude-code #16073 — early January 2026 quality-regression complaint with session-level detail
  • GitHub issue anthropics/claude-code #49593 — ~14% context-window bloat introduced in Claude Code v2.1.111; Tool Search cut MCP overhead 46.9% (51K → 8.5K tokens), indicating the baseline was severe
Cross-provider and historical context
  • Chen, Zaharia, Zou (2023), "How Is ChatGPT's Behavior Changing over Time?" (arXiv preprint) — the controlled GPT-4 snapshot comparison cited above (84% → 51% on prime-number identification, March vs June 2023)
Trade press coverage of the 2026 Claude episode
  • Fortune (14 April 2026) — customer-backlash reporting tied to the adaptive-thinking default and effort-level reductions
  • The Register (31 March 2026) — coverage of the Claude Code quota-blowup episode, including Anthropic's acknowledgment that users were hitting limits faster than expected
  • InfoWorld (late March 2026) — reporting on Claude Code peak-hour session-cap tightening (US weekday 5am–11am PT / 1pm–7pm GMT window)

Trade-press citations above are included for context and were used as corroboration; primary-source material from Anthropic's own documentation, postmortems, and GitHub issue tracker is preferred for any factual claim in this article.

See also

  • NVIDIA RTX Pro 6000 (Blackwell) — full specs, benchmarks, and known issues for the 96 GB GPU referenced throughout this article
  • Seedbox — PM's primary product, where the same hardware cost economics apply
  • NVMe — storage interface used alongside GPU inference servers
  • RAID — storage redundancy in PM's infrastructure