Self-Hosting LLMs vs API
Pulsed Media owns its datacenter space and runs its own hardware. GPU inference cards get evaluated against the same cost-per-unit math that prices seedboxes and dedicated servers. In April 2026, the math says: cloud APIs win for quality-sensitive work, self-hosted GPUs win for bulk processing and privacy-constrained workloads.
This article has the hardware benchmarks, API pricing, and economics to show why.
The quality problem
The first obstacle is not cost. It is quality.
The best open-weight model you can run on a single consumer GPU (24 GB) is Gemma 4 31B. It scores approximately 1450 on the LMSYS Chatbot Arena ELO scale (arena.ai, preliminary, April 2026). That is 20 points below Sonnet and 50 below Opus. The gap sounds small until you measure it on hard tasks.
| Model | Arena ELO (approx) | Access | Hardware needed |
|---|---|---|---|
| Claude Opus 4.6 | ~1500 | API / subscription | Cloud |
| Claude Sonnet 4.6 | ~1470 | API / subscription | Cloud |
| GLM-5 744B (dense) | ~1451 | Open weights | 6x 96 GB GPUs (~$48,000) |
| Gemma 4 31B (dense) | ~1450 | Open weights | 1x 24 GB GPU at Q4 (~$1,600) |
| Kimi K2.5 1T (MoE) | ~1447 | Open weights | 8x 96 GB GPUs (~$64,000) |
| Gemma 4 26B-A4B (MoE) | ~1441 | Open weights | 1x 24 GB GPU at Q4 (~$1,600) |
| Qwen3-235B-A22B (MoE) | ~1422 | Open weights | 2x 96 GB GPUs (~$16,000) |
| DeepSeek V3.2 685B (MoE, 37B active) | ~1421 | Open weights | 5x 96 GB GPUs (~$40,000) |
| DeepSeek R1 671B (MoE, reasoning) | ~1398 | Open weights / API | 5x 96 GB GPUs (~$40,000) |
| Llama 3.3 70B (dense) | ~1350-1400 | Open weights | 1x 96 GB GPU (~$8,500) |
The table splits into two tiers. Models you can run on a single GPU ($1,600-8,500) top out at ELO ~1450. Models that approach Sonnet territory (GLM-5 at 1451, Kimi K2.5 at 1447) require multi-GPU setups. The costs shown assume RTX Pro 6000 cards at ~$8,000 each via PCIe — but without NVLink, tensor parallelism runs at roughly one-third the throughput of datacenter H100s. An NVLink-equipped H100 SXM configuration running the same models costs 3-5x more ($150,000-320,000).
Gemma 4 31B scores 80.0% on LiveCodeBench v6 and 84.3% on GPQA Diamond. These are strong numbers for an open model. But on agentic workflows where models must maintain coherence across 20+ tool calls, open models fail silently: the output compiles but does the wrong thing, or the agent loops instead of converging. These failures are expensive because nobody notices until the result is wrong.
No amount of hardware spending changes this. The quality ceiling is in the model weights. Even $64,000 in GPUs running Kimi K2.5 does not reach Sonnet.
Quality vs hardware: what each GPU tier can actually run
The quality gap depends on what fits on the hardware you own.
| VRAM | Best model (Q4) | Arena ELO (full precision) | Quality tier | Gap to Sonnet |
|---|---|---|---|---|
| 4 GB | Phi-4 Mini 3.8B or smaller | ~1100-1200 | D | ~270+ ELO |
| 8 GB | Llama 3.1 8B (tight) | ~1250-1300 | C/C+ | ~170-220 ELO |
| 12 GB | Qwen3 14B | ~1350 | B- | ~120 ELO |
| 16 GB | Qwen3 14B (comfortable) | ~1350 | B- | ~120 ELO |
| 24 GB | Gemma 4 31B | ~1450 | B+/A- | ~20 ELO |
| 32 GB | Gemma 4 31B at Q8 | ~1450+ | A- | ~20 ELO |
| 48 GB | Llama 3.3 70B Q4 | ~1350-1400 | B/B+ | ~70-120 ELO |
| 96 GB | Llama 3.3 70B Q8 | ~1400 | B+ | ~70 ELO |
Arena ELO scores above were measured at full precision (BF16/FP16) on the leaderboard, not at Q4. Running at Q4_K_M typically costs 0.3-1 benchmark points (see quantization quality section), so actual Q4 performance is slightly below these numbers — within the "~" approximation for most models, but measurable on hard tasks.
At 4-8 GB VRAM, you are running models that struggle with multi-step reasoning and produce noticeably worse output than a $0.04/MTok API call to Mistral Nemo. At 24 GB, Gemma 4 closes the ELO gap but still falls short on the hardest tasks. Only at 96 GB do you get a 70B model at full quality, and even that is still B+ tier versus API frontier at A+/S.
What 4-8 GB VRAM can run
Most consumer GPUs sold in the last five years have 4-8 GB of VRAM. Can they do anything useful with local LLMs?
4 GB (GTX 1050 Ti, RX 570, integrated graphics)
At 4 GB, the only models that fit are sub-3B at Q4:
| Model | Size at Q4 | Speed (est.) | Useful for |
|---|---|---|---|
| Qwen3 0.6B | ~0.5 GB | 30-50 t/s | Text classification, simple extraction |
| Llama 3.2 1B | ~0.8 GB | 20-30 t/s | Basic summarization, translation |
| SmolLM2 1.7B | ~1 GB | 20-30 t/s | Summarization, rewriting, function calling |
| BitNet b1.58 2B | 0.4 GB | 10-20 t/s (CPU) | Classification, simple Q&A (MMLU 53) |
| Phi-4 Mini 3.8B | ~2.3 GB | 15-25 t/s | Simple coding, Q&A |
These models cannot replace a general-purpose assistant: multi-step reasoning, complex code generation, and long conversations are out of reach. But they are not useless. Fine-tuned sub-2B models beat GPT-4o on structured extraction (NuExtract-tiny 0.5B) and outperform zero-shot GPT-4 on classification tasks with as few as 60-75 training examples per class (arxiv 2406.08660). For single-task pipelines — classification, extraction, routing, PII redaction — a fine-tuned 0.5-2B model on a 4 GB GPU is a legitimate production tool, not a toy.
8 GB (RTX 3060 8GB, RTX 4060, RX 6600)
At 8 GB, a 7B/8B model at Q4 fits with room for ~32K context (Q8 KV cache):
| Model | Size at Q4 | Speed (GPU) | Max context (Q8 KV) | Quality |
|---|---|---|---|---|
| Llama 3.1 8B | ~4.8 GB | ~40-50 t/s | ~32K | C+ |
| Qwen3 8B | ~5.2 GB | ~35-45 t/s | ~28K | C+ |
| Gemma 4 E4B | ~5 GB | ~35-45 t/s | ~28K | C |
| Mistral Nemo 12B | ~7.4 GB | ~25-35 t/s | ~4-8K | C+ |
An 8B model on an 8 GB GPU is the minimum setup that produces output comparable to budget APIs. The constraint is context: at Q4 KV cache, 64K tokens are feasible but tight. At Q8 KV (recommended for retrieval tasks), the ceiling is ~32K.
Mistral Nemo 12B fits at Q4 but leaves almost no room for KV cache. Short conversations only.
Memory bandwidth is the bottleneck
LLM token generation is memory-bandwidth-bound. The GPU reads model weights from VRAM for every single token. Read speed determines generation speed.
| Platform | Memory bandwidth | Llama 3.3 70B Q4 tok/s | Relative speed |
|---|---|---|---|
| RTX Pro 6000 | 1,792 GB/s | ~34 | 1.0x |
| H100 SXM | 3,352 GB/s | ~40 | 1.2x |
| Strix Halo (128 GB unified) | 215 GB/s | ~4.5 | 0.13x |
| DDR5 desktop (dual channel) | ~89 GB/s | ~2.2 | 0.06x |
| DDR4-3200 desktop (dual channel) | ~42 GB/s | ~1 | 0.03x |
The Pro 6000 and H100 are close on single-card workloads because both have enough bandwidth to keep a 70B model fed. The Pro 6000 compensates for its lower bandwidth with more aggressive quantization: Q4 instead of FP8 halves the bytes read per weight, yielding similar throughput. The H100 pulls ahead in NVLink multi-GPU tensor parallelism, where its 900 GB/s interconnect leaves PCIe (128 GB/s) behind.
A Strix Halo system has 215 GB/s. Same model, same quantization, 8.3x less bandwidth, roughly 8x slower. A desktop CPU on DDR5 dual-channel has ~89 GB/s. 20x slower than the Pro 6000.
No amount of CPU cores or compiler flags changes this. Bandwidth is physics. The cheapest path to 1,792 GB/s in April 2026 is the RTX Pro 6000 at $8,500.
The rough formula: max tokens/sec = memory bandwidth (GB/s) / model size (GB). Actual throughput is about 50-70% of theoretical. An EPYC 9554 with ~500 GB/s measured bandwidth hits 50 t/s on an 8B Q4 model (~4.5 GB), about 45% of the theoretical 111 t/s.
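The formula translates directly into a quick estimator. A sketch: the efficiency factors below are back-solved from the measured speeds quoted in this article, not universal constants.

```python
def est_tokens_per_sec(bandwidth_gbs: float, model_size_gb: float,
                       efficiency: float = 0.55) -> float:
    """Single-stream generation estimate for a memory-bandwidth-bound LLM.

    Every generated token requires reading the full (quantized) weight
    file from memory, so tokens/sec <= bandwidth / model size. Real
    systems reach roughly half to three-quarters of that ceiling.
    """
    return efficiency * bandwidth_gbs / model_size_gb

# EPYC 9554: ~500 GB/s, 8B Q4 (~4.5 GB) -> theoretical ceiling 500/4.5 ~= 111 t/s
print(round(est_tokens_per_sec(500, 4.5, efficiency=0.45)))   # 50, the measured figure
# RTX Pro 6000: 1,792 GB/s, 70B Q4 (~40 GB)
print(round(est_tokens_per_sec(1792, 40, efficiency=0.75)))   # 34, the measured figure
```

The same function predicts the Strix Halo and DDR5 rows in the table above: plug in 215 or 89 GB/s and the estimate lands within rounding of the listed speeds.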
Hardware comparison
Professional GPUs
The RTX Pro 6000 is the only GPU under $10,000 that runs 70B models on a single card without quantizing below Q4. Full specs and known issues are on the dedicated page.
| Model | Quantization | Tokens/sec (generation) |
|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~185 |
| Mistral Nemo 12B | Q4_K_M | 163 |
| Qwen3 30B-A3B (MoE) | Q4_K_M | 252 |
| Llama 3.3 70B | Q4_K_M | 34 |
34 tokens per second on 70B is about 25 words per second. Fast enough that you are not waiting for the model to finish a paragraph.
With vLLM batched serving (concurrent requests), throughput goes well beyond single-stream: 8B at 8,990 t/s, 70B at 1,031 t/s. On models that fit on one card, the Pro 6000 matches the H100 SXM at one-third the cost.
The card has real problems. A virtualization reset bug causes unrecoverable GPU state after VM shutdown. Sustained vLLM inference triggers chip resets at temperatures as low as 28C. The SM120 kernel architecture breaks DeepSeek models. Only the open-source driver works on Blackwell; there is no proprietary option.
Consumer GPUs
| GPU | VRAM | Bandwidth | Llama 3.1 8B Q4 t/s | Max model (Q4) | Price (Apr 2026) |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | 1,790 GB/s | ~300 | ~32B | $2,900-4,200 |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | ~128 | ~24B | $1,600-2,200 |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | ~87-112 | ~32B tight | $800-900 used |
| RTX 4000 SFF Ada | 20 GB GDDR6 | 280 GB/s | ~64 | ~13B | ~$1,250 |
The RTX 5090 has bandwidth matching the Pro 6000 (1,790 GB/s) but only 32 GB of VRAM, which limits it to 32B-class models at Q4. For sub-32B workloads, the 5090 is better price-performance than the Pro 6000.
The RTX 3090 is the last consumer card with NVLink. Two of them ($1,600-1,800 used) give 48 GB combined, enough for 32B models with generous context. For less than the price of a single RTX 5090, a dual-3090 setup gets similar aggregate bandwidth and 50% more VRAM.
Gemma 4 on consumer GPUs: dense vs MoE
Gemma 4 is available in two forms that fit on 24 GB: the 31B dense model and the 26B-A4B MoE. The MoE activates only 3.8B parameters per token, making it dramatically faster.
| Variant | Arena ELO | RTX 4090 speed | RTX 3090 speed | VRAM (Q4) | 256K context? |
|---|---|---|---|---|---|
| 31B Dense | ~1450 | ~25-35 t/s (short ctx) | ~20-35 t/s | 19.6 GB | No (KV cache fills 24 GB at ~32-64K) |
| 26B-A4B MoE | ~1441 | ~50-129 t/s | ~35-40 t/s | ~15.6 GB | Yes (8.4 GB headroom for KV) |
The 31B dense model has a disproportionately large KV cache (~0.85 MB per token at BF16) because of its 16 KV-head architecture. Despite fitting at Q4, context expansion rapidly consumes remaining VRAM. At VRAM-saturated configurations, the RTX 4090 drops to 7.8 t/s on the 31B.
The 26B-A4B MoE uses 4x less KV cache (~0.21 MB/token), runs 2-4x faster, and scores within 9 ELO points. For interactive use on 24 GB hardware, the MoE is the correct choice. The dense model is worth choosing only for maximum coding quality at short context (<8K tokens).
Compact workstations
The Minisforum MS-02 Ultra won a CES 2026 Innovation Award. Compact 4.8L chassis, Intel Core Ultra 9 285HX, up to 256 GB DDR5 ECC, four NVMe slots. Looks great on paper.
The catch: the PCIe slot is low-profile dual-slot only. The best GPU that fits is an RTX 4000 SFF Ada with 20 GB VRAM and 280 GB/s bandwidth. That gets ~64 t/s on an 8B model and cannot run anything above 13B. The MS-02 Ultra is a homelab machine, not an inference server.
Strix Halo
AMD's Ryzen AI Max+ 395 (Strix Halo) puts 128 GB of unified LPDDR5x memory on a single chip. The iGPU can address up to 96 GB as VRAM, matching the RTX Pro 6000 on capacity.
Bandwidth tells the real story: 215 GB/s measured, versus the Pro 6000's 1,792 GB/s. Every model runs 4-8x slower. 70B at 4.5 tokens per second works, but it is painfully slow.
Where Strix Halo is useful: running large MoE models that would not fit on a 24-32 GB consumer GPU. Qwen3 30B-A3B at 66-72 t/s in 128 GB unified memory, or larger MoE models at 20+ t/s. The 128 GB pool is the point, not the speed.
At EUR 2,000-3,000 for a complete system consuming 120W, it costs 3-5x less than an RTX Pro 6000 setup. Same model capacity, 8x less speed.
CPU inference
Server-class CPUs with enough memory channels can run LLMs at usable speeds for small models:
| CPU | Memory bandwidth | 8B Q4 t/s | 70B Q4 t/s | Usable for chat? |
|---|---|---|---|---|
| EPYC 9554 (64c, 8-ch DDR5) | ~500 GB/s | ~50 | ~7 | 8B yes, 70B batch only |
| Dual Xeon Gold 5317 | ~80 GB/s | ~22 | ~3 | 8B marginal, 70B no |
| Desktop DDR5 (dual-channel) | ~89 GB/s | ~20 | ~2 | 8B marginal, 70B no |
| Desktop DDR4 (i5-7500T) | ~38 GB/s | ~4-5 | — | Basic completion only |
CPU inference makes sense when you need to run a model that does not fit in any available VRAM and buying a GPU is not an option. An EPYC with 12-channel DDR5 and 512+ GB RAM can run 70B at 7 t/s. Batch processing, not interactive.
The optimized fork ik_llama.cpp gets 1.8-5x faster prompt processing than mainline llama.cpp on CPUs with AVX-512, though generation speed gains are more modest.
The smallest useful CPU models
For CPU-only machines without a GPU, the models worth considering are limited:
| Model | Parameters | RAM needed | Desktop CPU speed | Quality |
|---|---|---|---|---|
| BitNet b1.58 2B4T | 2.4B | ~0.4 GB | 10-20+ t/s | Classification, simple Q&A; MMLU 53 |
| SmolLM2 1.7B | 1.7B | ~1 GB | 15-25 t/s | Summarization, rewriting; beats Llama 3.2 1B |
| Llama 3.2 1B | 1.3B | ~0.8 GB | 20-30 t/s | Basic tasks; best gains from fine-tuning |
| Llama 3.2 3B | 3.2B | ~2 GB | 10-15 t/s | Simple Q&A, summarization |
| Phi-4 Mini 3.8B | 3.8B | ~2.3 GB | 7-10 t/s | Simple coding, reasoning |
| Llama 3.1 8B | 8B | ~4.8 GB | 4-5 t/s (DDR4) / 15-20 t/s (DDR5) | C+ tier; minimum useful |
Models under 3B parameters are not general-purpose assistants, but they have real production uses beyond research. SmolLM2 1.7B outperforms Llama 3.2 1B on most benchmarks. Gemma 3n E2B runs in 2 GB RAM and exceeds 1300 Elo on LMArena — the first sub-10B model to do so. Fine-tuned sub-2B models match or exceed frontier models on focused tasks like structured extraction and classification (see the 4 GB VRAM section).
Where tiny models add the most value in a self-hosted stack:
- Classification and routing — a 0.5B classifier routes queries to the right model tier, saving inference cost on easy queries
- Structured extraction — parsing invoices, resumes, forms into JSON
- Speculative decoding — a tiny draft model proposes tokens, a large model verifies in parallel, achieving 2-3x speedup with zero quality loss (Google Research)
- PII redaction — well-defined token-level task where small models excel
- Edge deployment — Raspberry Pi 5 runs a 1B model at 7-15 t/s for offline classification or sensor data processing
The 8B tier on CPU is the minimum for genuinely useful general-purpose output. Below that, you are trading generality for specialization — a valid tradeoff for focused pipelines, privacy-constrained environments, and edge deployment.
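The routing economics from the list above are easy to model. A sketch: the prices are output rates from the API pricing table later in this article, and the 70% easy-traffic share is an assumed figure for illustration, not a measurement.

```python
def blended_cost_per_mtok(easy_frac: float, cheap: float, frontier: float) -> float:
    """Blended $/MTok when a local classifier routes the easy fraction
    of traffic to a cheap model and sends the rest to a frontier model."""
    return easy_frac * cheap + (1 - easy_frac) * frontier

# Assume 70% of queries are easy enough for Mistral Nemo ($0.04/MTok output)
# and the remaining 30% need Claude Sonnet ($15/MTok output):
blended = blended_cost_per_mtok(0.70, 0.04, 15.00)
print(round(blended, 2))   # 4.53 per MTok, vs 15.00 sending everything to Sonnet
```

Even a mediocre router pays for itself: the savings scale with the fraction of traffic a sub-2B classifier can confidently mark as easy.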
The context window problem
Real workloads routinely exceed 128K tokens. A medium codebase is 200K-500K tokens. A 50-page contract is 40K tokens. An agentic session with 30+ tool calls accumulates 100K-300K tokens. Document collection analysis, multi-session memory, and RAG over large corpora push past 500K easily.
Gemini 2.5 Pro handles 1 million tokens natively. Claude Opus 4.6 and Sonnet 4.6 handle 1M. GPT-4.1 takes 1M. xAI's Grok 4.20 takes 2M. These are the context windows where real work happens.
Open models top out at 128K. That is not a VRAM limitation — it is a training limitation. No open model available in April 2026 was trained on context beyond 128K-256K, and quality degrades well before the stated maximum. This is the single largest gap between self-hosted and API inference, and no amount of hardware closes it.
VRAM limits on context
Every token in context consumes VRAM for its KV cache entry. The formula: 2 x layers x KV_heads x head_dim x bytes_per_element. For Llama 3.1 8B (32 layers, 8 KV heads, 128 head dim), that is ~128 MB per 1,000 tokens at FP16. Q8 KV halves that to ~64 MB/1K with negligible quality loss. Q4 KV halves again but degrades retrieval accuracy.
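The formula can be checked in a few lines. The Llama 3.1 8B shape is from the text; the max-context helper assumes the same ~1 GB runtime overhead the tables in this section reserve.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache per context token: K and V each store one
    (kv_heads x head_dim) vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def max_context_tokens(vram_gb: float, model_gb: float,
                       per_token_bytes: float, overhead_gb: float = 1.0) -> int:
    """Tokens of KV cache that fit after weights and runtime overhead."""
    free_bytes = (vram_gb - model_gb - overhead_gb) * 2**30
    return int(free_bytes / per_token_bytes)

fp16 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # Llama 3.1 8B
print(fp16)        # 131072 bytes per token at FP16, i.e. ~128 MB per 1K tokens
q8 = fp16 / 2      # Q8 KV cache halves it

# 8 GB GPU with 8B Q4 weights (~4.8 GB): reproduces the ~36K figure below
print(max_context_tokens(8, 4.8, q8) // 1000)   # 36
```

Swapping in the per-token rates for 14B, 32B, and 70B reproduces the rest of the VRAM/context table.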
Trained context windows cap the usable range regardless of VRAM:
| Model | Trained context | Notes |
|---|---|---|
| Llama 3.1 / 3.3 | 128K | Best open 70B; hard ceiling at 128K |
| Qwen3 | 128K | YaRN-extended; quality degrades past training range |
| Gemma 4 | 128K | |
| Mistral Nemo | 128K | |
| Gemini 2.5 Pro (API) | 1,000K | 8x the best open model |
| Claude Opus 4.6 / Sonnet 4.6 (API) | 1,000K | 8x the best open model |
| GPT-4.1 (API) | 1,000K | 8x the best open model |
| Grok 4.20 (xAI API) | 2,000K | 16x the best open model |
Every open model stops at 128K. All four frontier API providers — Anthropic, Google, OpenAI, and xAI — now offer 1M-2M token context windows. For workloads that routinely exceed 128K — codebase analysis, long document chains, agent sessions — this gap is not solvable with more VRAM. The model simply was not trained for it.
All context numbers below use Q8 KV cache (recommended — negligible quality loss vs FP16, half the VRAM). Q4 KV doubles these limits but degrades retrieval accuracy. Models at Q4_K_M weights throughout.
| VRAM | Model | Max context (Q8 KV) | Max context (Q4 KV) | Notes |
|---|---|---|---|---|
| 8 GB | 8B | ~36K | ~73K | Minimum useful setup |
| 8 GB | 14B+ | Does not fit | Does not fit | |
| 12 GB | 8B | ~112K | 128K (trained limit) | 128K with Q4 KV only |
| 12 GB | 14B | ~43K | ~86K | |
| 16 GB | 8B | 128K (trained limit) | 128K (trained limit) | Full context at Q8 KV |
| 16 GB | 14B | ~93K | 128K (trained limit) | |
| 24 GB | 8B | 128K (trained limit) | 128K (trained limit) | Full context |
| 24 GB | 14B | 128K (trained limit) | 128K (trained limit) | Full context at Q8 KV |
| 24 GB | 32B | ~35K | ~70K | Short context only |
| 48 GB | 14B | 128K (trained limit) | 128K (trained limit) | Full context |
| 48 GB | 32B | 128K (trained limit) | 128K (trained limit) | Full context at Q8 KV |
| 48 GB | 70B | ~46K | ~92K | Short-to-medium context |
| 80 GB | 32B | 128K (trained limit) | 128K (trained limit) | Full context |
| 80 GB | 70B | ~125K | 128K (trained limit) | Full context at Q4 KV; near-full at Q8 |
| 96 GB | 70B | 128K (trained limit) | 128K (trained limit) | Full context; best single-GPU for 70B |
| 128 GB (Strix Halo / CPU) | 70B | 128K (trained limit) | 128K (trained limit) | Full context; ~4.5 t/s (slow) |
KV cache per token at Q8: 8B = 64 MB/1K, 14B = 80 MB/1K, 32B = 128 MB/1K, 70B = 160 MB/1K. Q4 halves these. 1 GB headroom reserved for runtime overhead.
For CPU/RAM inference, the same arithmetic applies but "VRAM" becomes "available RAM after OS and model." A 64 GB DDR4 system running 8B Q4 (4.7 GB model + OS overhead) has roughly 55 GB available for KV cache — same capacity as a 48 GB GPU, but 10-15x slower due to DDR4 bandwidth (~42 GB/s vs GPU's 900+ GB/s).
Context rot: accuracy degrades with length
Even when the VRAM fits, model accuracy drops as context grows. The Chroma 2025 study tested 18 models and found 20-50% accuracy degradation from 10K to 100K tokens across every model tested, including frontier APIs. Open models degrade faster.
The worst-case position is the middle of the context window. Beginning and end receive stronger attention (the "lost in the middle" effect, Liu et al. 2024). For retrieval tasks where the target information can appear anywhere, this creates systematic blind spots.
On harder retrieval tasks requiring ordered extraction across long context (Sequential-NIAH, arXiv 2504.04713, EMNLP 2025), the best model tested scored 63.5%. Standard "find the passkey" benchmarks show 90-100% at 128K; the realistic number for semantic retrieval is 50-65%. This is a model capability ceiling, not a hardware limitation.
Non-literal retrieval is worse still. The NoLiMa benchmark (ICML 2025) tested 13 LLMs on tasks requiring semantic similarity matching rather than exact text lookup. Open-source models showed inverted U-shaped performance curves beyond critical context thresholds — accuracy degrades substantially when the task requires reasoning about meaning rather than matching strings.
KV cache quantization compounds these losses. FP16 and Q8 KV cache produce no measurable retrieval degradation. Q4 KV cache adds detectable accuracy loss on top of context rot, particularly on retrieval tasks. For any workload where finding information in context matters, Q8 KV cache is the minimum — Q4 saves VRAM at the cost of making the context less reliable.
TurboQuant: KV cache compression (ICLR 2026)
TurboQuant (Google Research, ICLR 2026) compresses KV cache to 2.5-3.5 bits per channel with near-zero quality loss. This is not weight quantization — it is complementary to Q4_K_M model weights. You can run a model at Q4 weights AND TurboQuant 3-bit KV cache simultaneously.
| KV cache method | Bits/channel | LongBench score | Needle retrieval | VRAM per 1K tokens (8B model) |
|---|---|---|---|---|
| FP16 (baseline) | 16 | 50.06 | 0.997 | 128 MB |
| Q8 (current standard) | 8 | ~50.06 | ~0.997 | 64 MB |
| TurboQuant | 3.5 | 50.06 | 0.997 | ~28 MB |
| TurboQuant | 2.5 | 49.44 | — | ~20 MB |
| Q4 (current) | 4 | degrades | degrades | 32 MB |
| KIVI 3-bit | 3 | 48.50 | 0.981 | ~24 MB |
At 3.5 bits, TurboQuant matches FP16 quality exactly — identical LongBench score, identical needle retrieval. At 2.5 bits, it outperforms KIVI at 3 bits. The method is data-oblivious: no calibration data, no per-model tuning. It works by applying a random orthogonal rotation that makes vector coordinates near-independent, then using mathematically optimal scalar quantizers.
If TurboQuant 3-bit KV becomes standard, the VRAM context table above roughly doubles: an 8B model on 24 GB at 3-bit KV would reach 128K comfortably instead of needing Q8 at 128K. A 70B model on 96 GB at 3-bit KV would push past 200K context.
Status (April 2026): No official Google implementation. Community ports exist for llama.cpp (TQ3_0 format, not merged), vLLM (Triton kernels, not merged), and PyTorch/HuggingFace. None are in mainline frameworks yet. Official integration expected Q2-Q3 2026.
Hardware cost to match API context
| API capability | Best local equivalent | Hardware cost | What you actually get |
|---|---|---|---|
| GPT-4.1 at 128K (A- quality) | 8B Q4 on RTX 4090 | ~EUR 1,800 | 128K context but C+ quality — 2 tiers below |
| GPT-4.1 at 128K (A- quality) | 70B Q4 on 2x RTX 4090 | ~EUR 3,500 | 128K context at B+ quality — 1 tier below |
| Claude 1M (A+ quality) | 70B Q4 on A100 80GB | ~EUR 15,000 | Capped at 128K (872K shorter), B+ quality |
| Claude 1M (A+ quality) | Nothing | — | No open model trained past 128K |
| Gemini 1M (A+ quality) | Nothing | — | Not achievable locally at any price |
The gap is stark beyond 128K. No combination of hardware and open models reaches 200K context. All four frontier API providers now offer 1M-2M at A+ quality. Self-hosted tops out at 128K at B+ quality. For workloads that need the full context — codebase analysis, long document processing, extended agent sessions — APIs are the only option.
Working around the 128K ceiling
Several techniques exist to process more data than fits in a single context window. None of them are transparent substitutes for native long context.
RAG (Retrieval-Augmented Generation) is the most production-ready approach. Split documents into chunks, embed them in a vector database, retrieve only the relevant chunks per query. Quality reaches 70-85% of native long context for retrieval tasks (EMNLP 2024), but fundamentally cannot do cross-document reasoning — connecting a clause on page 40 to a definition on page 3 requires both to be in context simultaneously. RAG turns the problem from "LLM reasons over all data" into "retrieval system selects data, LLM reasons over selection." The retrieval quality becomes the ceiling.
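A minimal sketch of the retrieval step. The bag-of-words "embedding" is a toy stand-in: a real pipeline would use an embedding model and a vector database, but the shape of the pipeline is the same: chunk, embed, score, keep top-k.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words term counts. A production RAG
    # pipeline would call an embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Score every chunk against the query, keep the top-k.
    # Only these chunks enter the LLM's context window.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Clause 12: termination requires 90 days written notice.",
    "The annual licence fee is payable in advance.",
    "Definitions: 'notice' means notice delivered by registered mail.",
]
top = retrieve("how much notice is needed to terminate", chunks)
```

The failure mode described above is visible even in this toy: clause 12 and the definition of "notice" are separate chunks, so reasoning that connects them only works if both survive the top-k cut.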
Context compression (LLMLingua, Microsoft) removes unimportant tokens to fit more into the same window. At 2-4x compression, quality loss is minimal. At 10-20x, degradation is measurable. A 128K window with 4x compression gives roughly 512K effective tokens — useful, but still below 1M native context and with quality loss on compressed portions.
Map-reduce / chunking processes each chunk independently, then combines results. Works for summarization (60-80% quality). Fails for reasoning that requires cross-chunk information.
Hierarchical summarization — summarize chunks, then summarize summaries — works for gist but fails for precision. Each summarization round is lossy, and hallucination amplification compounds: a fabricated fact in round 2 becomes "established fact" in round 3.
StreamingLLM solves a different problem. It enables continuous generation over arbitrarily long streams by preserving "attention sink" tokens, but the model can only see the current window. Information outside the window is gone. Useful for long chat sessions, not for document analysis.
| Technique | Production ready? | Quality vs native long context | Real substitute? |
|---|---|---|---|
| RAG | Yes | 70-85% (retrieval) / poor (reasoning) | Partial — good for search, poor for synthesis |
| LLMLingua compression | Yes | 85-95% at 2-4x compression | Partial — extends window modestly |
| Map-reduce / chunking | Yes | 60-80% depending on task | No — loses cross-chunk connections |
| Hierarchical summarization | Yes (for summaries) | 50-70% for detail; good for gist | No — lossy compression compounds |
| StreamingLLM | Yes (for chat) | N/A (different problem) | No — does not extend reasoning context |
The LaRA benchmark (ICML 2025) tested 2,326 cases across multiple tasks and concluded there is no silver bullet — the best approach depends on task type, model capabilities, and retrieval characteristics. One concrete data point: switching from chunked medical records to full-context patient histories improved diagnostic accuracy by 23% because the model could see temporal patterns across years of data.
The "lost in the middle" problem (Liu et al., TACL 2024) complicates even native long context: models perform best when relevant information is at the beginning or end of the context, with 30%+ performance degradation for information in the middle. This affects all models, including those designed for long context.
For self-hosters, the practical hierarchy is: if the workload fits within 128K, local models are competitive. If it needs 128K-512K, RAG or compression can partially bridge the gap with measurable quality penalties. Beyond 512K, APIs with native 1M context are the only reliable option.
API pricing (April 2026)
API pricing dropped roughly 80% since early 2025. Budget-tier models now cost $0.02-0.40 per million tokens.
| Tier | Model | Input $/MTok | Output $/MTok | Context | Quality |
|---|---|---|---|---|---|
| Budget | Mistral Nemo | $0.02 | $0.04 | 128K | C+ |
| Budget | GPT-4.1-nano | $0.05 | $0.20 | 1M | C+ |
| Budget | GPT-4.1-mini | $0.16 | $0.64 | 1M | B- |
| Mid | DeepSeek V3.2 | $0.28 | $0.42 | 128K | B+ |
| Mid | Gemini 2.5 Flash | $0.15 | $3.50 | 1M | B+ |
| Mid | GPT-4.1 | $2.00 | $8.00 | 1M | A- |
| Mid | Grok 4.1 Fast | $0.20 | $0.50 | 2M | B+ |
| Mid | Grok 4.20 | $2.00 | $6.00 | 2M | A |
| Mid | Claude Haiku 4.5 | $1.00 | $5.00 | 200K | A- |
| Frontier | Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | A+ |
| Frontier | Gemini 2.5 Pro | $1.25 | $10.00 | 1M | A+ |
| Frontier | Claude Opus 4.6 | $5.00 | $25.00 | 1M | S |
| Frontier | OpenAI o3 | $2.00 | $8.00 | 200K | S (reasoning) |
Subscription tiers exist for heavy users: Claude Pro at $20/month, Claude Max at $100-200/month with higher rate limits. Google, OpenAI, and others have similar tiered access. The per-token rates above are the pay-as-you-go ceiling, not the effective cost for regular users.
Cached input pricing drops costs further. Claude cache hits cost 0.1x the base rate. Gemini 2.5 Flash cached input is $0.03/MTok. Workloads with repeated context (RAG, agent loops) benefit from caching.
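A rough per-request cost model shows why caching matters for agent loops. The 0.1x cache-hit multiplier is Claude's ratio from the text; the 50K-context, 20-turn loop is an assumed workload for illustration.

```python
def request_cost_usd(input_tok: int, output_tok: int,
                     in_price: float, out_price: float,
                     cached_tok: int = 0, cache_mult: float = 0.1) -> float:
    """Cost of one API call in dollars, with cached_tok of the input
    billed at cache_mult times the normal input rate. Prices are $/MTok."""
    fresh = input_tok - cached_tok
    return (fresh * in_price
            + cached_tok * in_price * cache_mult
            + output_tok * out_price) / 1e6

# Sonnet 4.6 ($3 in / $15 out): agent loop re-sending 50K tokens of
# context for 20 turns, 1K tokens of output per turn.
uncached = 20 * request_cost_usd(50_000, 1_000, 3, 15)
cached = (request_cost_usd(50_000, 1_000, 3, 15)          # first turn writes the cache
          + 19 * request_cost_usd(50_000, 1_000, 3, 15, cached_tok=50_000))
print(round(uncached / cached, 1))   # caching makes this loop ~4.5x cheaper
```

The bigger the repeated prefix relative to the output, the larger the multiple, which is exactly the shape of RAG and agent workloads.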
The cost comparison
Self-hosted at 100% utilization
Best case for self-hosting: a GPU running batch inference around the clock.
| Setup | Model | Monthly cost (3yr amort.) | Tokens/month | EUR/MTok | Quality |
|---|---|---|---|---|---|
| RTX Pro 6000 | 70B vLLM batched | ~EUR 380 | ~2.7B | EUR 0.14 | B/B+ |
| RTX Pro 6000 | 8B vLLM batched | ~EUR 380 | ~23B | EUR 0.017 | C+ |
| 2x RTX 3090 | 8B single-stream | ~EUR 130 | ~270M | EUR 0.48 | C+ |
| Strix Halo | 70B single-stream | ~EUR 90 | ~13M | EUR ~6.90 | B/B+ |
All costs in EUR. RTX Pro 6000 at ~$8,500 = ~EUR 7,800 at April 2026 exchange rates. 3-year amortization includes hardware + electricity at EUR 0.15/kWh.
At 100% utilization, self-hosted 8B on an RTX Pro 6000 (EUR 0.017/MTok) undercuts even Mistral Nemo's budget API pricing (EUR 0.034/MTok) while keeping the model local. Self-hosted 70B at EUR 0.14/MTok undercuts DeepSeek V3.2 API (EUR 0.39/MTok) for comparable B+ output.
The utilization trap
Nobody runs a single-operator GPU at 100%. You ask questions during working hours. The GPU idles overnight, on weekends, during meetings. Realistic utilization for one person or a small team: 10-30%.
| Utilization | RTX Pro 6000 70B EUR/MTok | vs DeepSeek V3.2 API (EUR 0.39) |
|---|---|---|
| 100% | EUR 0.14 | Self-hosted 2.8x cheaper |
| 50% | EUR 0.28 | Self-hosted 1.4x cheaper |
| 20% | EUR 0.70 | API 1.8x cheaper |
| 10% | EUR 1.40 | API 3.6x cheaper |
At 20% utilization, self-hosted 70B costs EUR 0.70/MTok for B+ quality output. The same money buys Gemini 2.5 Flash at B+ quality via API, with no hardware to maintain and 1M token context.
The GPU does not know you went to lunch. It depreciates whether it generates tokens or not.
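The utilization math in one place, a sketch using the RTX Pro 6000 numbers from the tables above:

```python
def eur_per_mtok(monthly_eur: float, capacity_mtok: float,
                 utilization: float) -> float:
    """Effective EUR/MTok for an amortized GPU at a given duty cycle.
    capacity_mtok is millions of tokens per month at 100% utilization."""
    return monthly_eur / (capacity_mtok * utilization)

def break_even_utilization(monthly_eur: float, capacity_mtok: float,
                           api_eur_per_mtok: float) -> float:
    """Duty cycle above which self-hosting beats a given API price."""
    return monthly_eur / (capacity_mtok * api_eur_per_mtok)

# RTX Pro 6000 running 70B batched: EUR 380/month amortized, ~2,700 MTok/month
print(round(eur_per_mtok(380, 2700, 1.0), 2))   # 0.14 at 100% utilization
print(round(eur_per_mtok(380, 2700, 0.2), 2))   # 0.7 at 20%
print(round(break_even_utilization(380, 2700, 0.39), 2))   # 0.36 vs DeepSeek V3.2
```

At these numbers, self-hosting beats DeepSeek V3.2's API price only above roughly 36% utilization, consistent with the table: cheaper at 50%, more expensive at 20%.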
Hardware depreciation
The RTX Pro 6000 costs ~EUR 7,800 today. When NVIDIA ships Rubin (next-generation architecture, expected ~2027), the Pro 6000 will be worth roughly EUR 3,700-4,600. Three years out: EUR 2,300-2,800. GPU depreciation runs 40-60% over 2-3 years.
API subscriptions depreciate at 0%. Every month brings the latest models. No replacement planning, no resale.
The lease option (EUR 500/month for 18 months, EUR 9,000 total) costs more than buying outright.
Quantization
Quantization compresses model weights to use less memory and run faster, at some quality cost.
70B model
| Quantization | Bits/weight | 70B model size | Quality vs FP16 | Speed vs FP16 | Minimum GPU |
|---|---|---|---|---|---|
| FP16 (full precision) | 16.0 | ~140 GB | 100% | 1.0x (baseline) | 2x H100 SXM |
| Q8_0 | 8.0 | ~70 GB | ~99% | ~1.8-2.0x | 1x RTX Pro 6000 |
| Q6_K | 6.5 | ~54 GB | ~99% | ~2.3-2.5x | 1x RTX Pro 6000 |
| Q5_K_M | 5.5 | ~48 GB | ~96-97% | ~2.5-3.0x | 1x RTX Pro 6000 |
| Q4_K_M (common default) | 4.8 | ~40 GB | ~96-99% (model-dependent) | ~3.0-3.5x | 1x RTX Pro 6000 |
| Q3_K_M | 3.4 | ~33 GB | ~90-95% | ~3.5-4.0x | 2x RTX 3090 |
| Q3_K_S | 3.0 | ~28 GB | ~85-92% | ~4.0-4.5x | 2x RTX 4060 Ti 16GB |
| Q2_K (emergency) | 2.7 | ~27 GB | ~75-85% | ~4.0-5.0x | 1x RTX 5090 |
| AQLM 2-bit (trained codebooks) | 2.0 | ~18 GB | ~95-98% | GPU-only | 1x RTX 4090 |
| BitNet 1.58-bit (hypothetical) | 1.58 | ~14 GB | unknown at 70B | CPU-native | 1x RTX 3060 (no model exists) |
8B model (Llama 3.1 8B)
| Quantization | Bits/weight | 8B model size | Quality vs FP16 | Minimum GPU |
|---|---|---|---|---|
| FP16 | 16.0 | ~16 GB | 100% | 1x RTX 4090 (24 GB) |
| Q8_0 | 8.0 | ~8.5 GB | ~99% | 1x RX 5700 XT (8 GB, needs partial CPU offload) |
| Q6_K | 6.5 | ~6.3 GB | ~99% | 1x RX 5700 XT (8 GB) |
| Q5_K_M | 5.5 | ~5.7 GB | ~98-99% | 1x RX 5700 XT (8 GB) |
| Q4_K_M (common default) | 4.8 | ~4.7 GB | ~96-99% | 1x GTX 1060 6GB |
| Q3_K_M | 3.4 | ~3.7 GB | ~85-90% | 1x GTX 1050 Ti (4 GB, tight) |
| Q2_K | 2.7 | ~3.1 GB | ~75-85% | 1x GTX 1050 Ti (4 GB) |
Gemma 4 26B-A4B (MoE)
Mixture-of-experts models store all 26B parameters but only activate 4B per token. Storage follows total parameter count; inference speed follows active parameter count.
| Quantization | Total size (26B stored) | Active per token (4B) | Minimum GPU | Effective speed |
|---|---|---|---|---|
| FP16 | ~52 GB | ~8 GB | 1x RTX Pro 6000 | Baseline |
| Q8_0 | ~26 GB | ~4 GB | 1x RTX 5090 (32 GB) | ~2x |
| Q4_K_M | ~16 GB | ~2.4 GB | 1x RTX 4060 Ti 16GB | ~3-3.5x |
| Q2_K | ~8 GB | ~1.3 GB | 1x RX 5700 XT (8 GB) | ~4-5x, quality loss |
At Q4_K_M, the full Gemma 4 MoE fits on a 16 GB card with room for KV cache, while only reading ~2.4 GB of active weights per token. That makes it extremely fast: the RTX Pro 6000 hits 252 t/s on a similar MoE architecture (Qwen3 30B-A3B).
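The bandwidth formula from earlier applies to MoE with active bytes in place of total bytes, but treat the result as an upper bound: measured MoE speeds land well below it because expert routing scatters memory reads. A sketch with the Gemma 4 numbers above:

```python
def moe_ceiling_tps(bandwidth_gbs: float, active_gb: float) -> float:
    """Bandwidth-bound ceiling for MoE generation: only the active
    experts' weights are read per token, not the full stored model."""
    return bandwidth_gbs / active_gb

# RTX 4090 (1,008 GB/s) on Gemma 4 26B-A4B at Q4 (~2.4 GB active per token):
print(round(moe_ceiling_tps(1008, 2.4)))    # 420 t/s ceiling; measured 50-129
# The 31B dense model on the same card reads ~19.6 GB per token:
print(round(moe_ceiling_tps(1008, 19.6)))   # 51 t/s ceiling; measured 25-35
```

The dense model runs near its ceiling while the MoE runs far below its own, yet the MoE is still 2-4x faster in absolute terms, which is the whole argument for it on 24 GB hardware.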
What quantization actually costs in quality
The degradation curve has two phases. From Q8 through Q4_K_M, quality loss is measurable but rarely perceptible. Below Q4, quality collapses non-linearly.
| Quant | Size reduction | Perplexity loss | Benchmark avg (Llama 3.1 8B) | Verdict |
|---|---|---|---|---|
| Q8_0 | 47% | +0.01% | 69.41 (vs 69.47 FP16) | Indistinguishable from FP16 |
| Q6_K | 59% | +0.06% | 69.23 | Negligible loss |
| Q5_K_M | 64% | +0.2% | 69.36 | Blind testers cannot distinguish from FP16 |
| Q4_K_M | 69% | +0.7% | 69.15 | Recommended default — 0.32 points below FP16 |
| Q3_K_M | 75% | +3.3% | 68.07 | Usable but noticeable on reasoning tasks |
| Q3_K_S | 77% | +5.7% | 65.49 | Math drops from 77.6 to 68.3 (GSM8K) |
| Q2_K | 82% | +22% | — | Extreme quality loss; perplexity climbs non-linearly |
| IQ2_XXS | 85% | +18% | — | Catastrophic on small models (7% accuracy vs 42% at Q4) |
Sources: llama.cpp Discussion #2094 (canonical perplexity), arxiv 2601.14277 (full benchmark table, Jan 2026), GFMath (IQ2_XXS catastrophe on ~70 models).
A blind test with 500+ votes on Mistral 7B confirmed: human evaluators cannot distinguish Q5_K from FP16, but consistently identify IQ1_S as worse.
What the quality loss looks like in practice: At Q8 and Q5, output is indistinguishable from full precision — same word choices, same reasoning chains, same code. At Q4_K_M, output reads identically on most prompts; edge cases in math and multilingual tasks occasionally produce different (slightly worse) answers. At Q3, reasoning tasks start showing wrong intermediate steps — the model "almost" gets it but takes a wrong turn more often. At Q2_K and below, output visibly degrades: repetitive phrasing, lost coherence on longer responses, math errors on problems the full-precision model solves correctly, and noticeably worse code generation. The blind test data matches: humans cannot spot the difference until around Q3, and confidently identify degradation at IQ1_S.
Different tasks degrade unevenly. Coding and STEM suffer most from quantization (IJCAI 2025). Multilingual loses 15-20% at Q4 (ionio.ai). Instruction following (IFEval) drops >10% at Q4 on some model families. Red Hat's 500K evaluations on Llama 3.1 show >99% recovery at 8-bit and 96-99% at 4-bit with calibrated methods.
Model family matters: Qwen2.5 is the most quantization-tolerant family. LLaMA 3.3 is the most fragile — not recommended at Q4 or below.
Bigger quantized beats smaller full precision
If VRAM is the constraint, always choose the larger model at lower precision over a smaller model at full precision. The data is clear:
| Configuration | Perplexity | VRAM | Model family |
|---|---|---|---|
| 13B at Q4 | 5.41 | ~8 GB | Llama 1 |
| 7B at FP16 | 5.96 | ~14 GB | Llama 1 |
| 65B at Q2_K | 4.10 | ~20 GB | Llama 1 |
| 33B at FP16 | 4.16 | ~66 GB | Llama 1 |
These numbers are from the original Llama 1 family (llama.cpp canonical perplexity data). The principle holds across newer model families, though quantization tolerance varies: Qwen2.5 is the most tolerant, LLaMA 3.3 the most fragile.
The 13B at Q4 uses half the VRAM of the 7B at FP16 and produces better output. The 65B at Q2_K matches the 33B at FP16 in one-third the memory. Parameter count beats precision until you hit extreme quantization (below Q3), where both advantages erode.
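The selection rule reduces to a fit check against the VRAM budget. A minimal sketch (the flat 2 GB KV-cache allowance is an illustrative assumption):

```python
def fits(params_b: float, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if quantized weights plus a rough KV-cache allowance fit in VRAM."""
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

# On a 12 GB card: 13B at Q4_K_M (~7.8 GB weights) fits,
# 7B at FP16 (~14 GB) does not -- and the table above shows
# the quantized 13B also wins on perplexity.
assert fits(13, 4.8, 12)
assert not fits(7, 16.0, 12)
```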
All four major 4-bit methods (GPTQ, AWQ, EXL2, GGUF Q4_K_M) achieve nearly identical quality (perplexity 4.31-4.36 on Llama-2-13B). The differences are speed (EXL2 fastest) and VRAM efficiency (GPTQ most efficient), not quality.
Extreme quantization: ternary weights and sub-2-bit
All of the above methods are post-training quantization, compressing an FP16 model after it has been trained. A different approach trains the model with quantized weights from the start.
BitNet b1.58 (Microsoft Research / Tsinghua University) constrains every weight to one of three values: {-1, 0, +1}. That is log2(3) = 1.58 bits per weight. Multiplications become additions. A 2B parameter model fits in 0.4 GB instead of ~4 GB at FP16, roughly 10x smaller.
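The size arithmetic, as a sketch. The log2(3) figure is the information-theoretic minimum for three states; real bitnet.cpp files pack weights slightly differently but land near the same size:

```python
import math

bits_per_ternary_weight = math.log2(3)                 # three states -> ~1.58 bits
size_gb = 2e9 * bits_per_ternary_weight / 8 / 1e9      # 2B parameters, packed
fp16_gb = 2e9 * 16 / 8 / 1e9                           # same model at FP16

print(f"{bits_per_ternary_weight:.2f} bits/weight")    # 1.58
print(f"2B ternary ~{size_gb:.2f} GB vs FP16 ~{fp16_gb:.1f} GB")  # ~0.40 GB vs 4.0 GB
```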
The only production-quality open model is bitnet-b1.58-2B-4T (2 billion parameters, 4 trillion training tokens, Apache 2.0 license). Community models exist at 8B but are experimental.
| Model | MMLU (5-shot) | Size | Speed (i7-13800H) | EUR/MTok (15W laptop) |
|---|---|---|---|---|
| BitNet b1.58 2B | 53.17 | 0.4 GB | ~34 t/s | EUR 0.006 |
| Llama 3.2 1B (FP16) | 49.30 | ~2.5 GB | ~21 t/s | EUR 0.010 |
| Qwen2.5 1.5B (FP16) | 60.25 | ~3 GB | ~15 t/s | EUR 0.014 |
The energy savings are real: 71-82% less energy than FP16 on x86, 55-70% on ARM. These numbers come from Microsoft's bitnet.cpp paper and have been verified across multiple sources. The inference framework (bitnet.cpp) is production-ready on CPU, with GPU support added in May 2025.
There is one catch that matters for anyone evaluating this: speed and energy benefits only appear when using bitnet.cpp. Loading the model in standard HuggingFace transformers gives you the smaller memory footprint but none of the ternary kernel speedups. You cannot use Ollama, LM Studio, or any other llama.cpp frontend for native BitNet inference as of April 2026.
Post-training methods can also push below 2 bits. AQLM (Yandex/IST Austria, ICML 2024) compresses Llama 2 7B to 2.02 bits with a WikiText2 perplexity of 6.59, versus 5.47 at FP16, roughly 2-5% task quality degradation. QTIP (NeurIPS 2024 Spotlight) achieves similar quality at 2-bit with >3x inference speedup. These are GPU-only methods: they reduce VRAM usage, not power draw.
Does extreme quantization change the economics?
At first glance, 10x smaller models should flip the self-hosting math. A 2B model in 0.4 GB running on a 15W laptop at 25 tokens per second costs about EUR 0.006 per million tokens in electricity, 6x cheaper than Mistral Nemo API.
The problem is quality. BitNet 2B scores MMLU 53. Mistral Nemo 12B scores around 70. These are not the same tier of model. A 2B model, regardless of quantization method, cannot follow complex instructions, write useful code, or maintain coherence over long conversations. Saving EUR 0.03 per million tokens while getting dramatically worse output is not a savings.
The real promise is running larger models on cheaper hardware. A hypothetical 70B BitNet model would need ~14 GB instead of 140 GB, fitting on a single RTX 3090. But the model zoo has not caught up to the technique. Microsoft's verified model is 2B. PrismML released Bonsai 8B in March 2026, claiming 1.15 GB footprint and 368 t/s on an RTX 4090, but these are vendor claims and independent benchmarks are still sparse. Until larger ternary models are independently verified, extreme quantization changes what is theoretically possible without changing what you can actually run today.
Software
Running a model locally in 2026 is straightforward. The tooling is good.
Ollama wraps llama.cpp behind a REST API with model management. Install, pull a model, start chatting.
llama.cpp is the foundation. CUDA, Vulkan, Metal, CPU with AVX-512/AMX. GGUF quantization format.
ik_llama.cpp is an optimized fork of llama.cpp with rewritten CPU kernels. 1.8-5.2x faster prompt processing on AVX2/AVX-512 CPUs. Same GGUF models, drop-in replacement. Use this instead of mainline llama.cpp for CPU-bound workloads. Benchmarks in the CPU comparison section.
vLLM handles concurrent serving with PagedAttention and continuous batching. Use this when serving multiple users from one GPU.
bitnet.cpp is required for native ternary inference. Separate from llama.cpp. CPU-primary, with CUDA support since May 2025. Requires Clang 18+ to build.
The cheapest hardware myth
A common misconception: "I already own the hardware, so inference is free." It is not. Electricity costs money, and consumer GPUs draw 230-415W under inference load. Even ignoring the purchase price entirely, the electricity to generate tokens locally costs more per token than calling a budget API.
Proof: idle desktop CPU
Take an Intel i5-7500T in an HP ProDesk 400 G3 Desktop Mini, a machine that costs EUR 150 used and sits idle most of the day. DDR4-2400 dual-channel gives ~38 GB/s theoretical memory bandwidth. At 50% efficiency with 4 threads, that translates to roughly 4-5 tokens per second on an 8B Q4 model.
The system draws about 22W idle and 45W under inference load. The marginal 23W for inference costs EUR 2.48 per month in electricity at Finnish rates (EUR 0.15/kWh).
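Those figures follow from two one-line formulas. A minimal sketch, using the article's EUR 0.15/kWh rate and the 23W marginal draw:

```python
RATE = 0.15  # EUR per kWh (Finnish rate used throughout the article)

def monthly_cost_eur(marginal_watts: float, hours: float = 720) -> float:
    """Electricity for running inference 24/7 for a month (~720 h)."""
    return marginal_watts / 1000 * hours * RATE

def eur_per_mtok(marginal_watts: float, tokens_per_sec: float) -> float:
    """Electricity cost per million generated tokens."""
    hours_per_mtok = 1e6 / tokens_per_sec / 3600
    return marginal_watts / 1000 * hours_per_mtok * RATE

print(monthly_cost_eur(23))    # ~2.48 EUR/month
print(eur_per_mtok(23, 4.5))   # ~0.21 EUR/MTok for Llama 3.1 8B Q4
```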
| Model | Size on disk | Tokens/sec (est.) | Tokens/month (24/7) | EUR/MTok (electricity only) | vs Mistral Nemo API |
|---|---|---|---|---|---|
| Qwen3 0.6B Q4 | ~0.4 GB | ~40 | 103.7M | EUR 0.024 | 1.5x cheaper, but trivial tasks only |
| BitNet b1.58 2B (ternary) | 0.4 GB | ~15-25 | 51.8M | EUR 0.048 | 1.3x more expensive, D-tier quality |
| TinyLlama 1.1B Q4 | ~0.7 GB | ~28 | 72.6M | EUR 0.034 | Break-even, basic completion only |
| Bonsai 8B (1-bit, unverified) | 1.15 GB | ~15 | 38.9M | EUR 0.064 | 1.7x more expensive, claims 8B quality |
| Phi-4 Mini 3.8B Q4 | ~2.2 GB | ~7.6 | 19.7M | EUR 0.126 | 3.4x more expensive |
| Llama 3.1 8B Q4 | ~4.7 GB | ~4-5 | 11.7M | EUR 0.212 | 5.7x more expensive |
BitNet b1.58 2B runs via bitnet.cpp (not llama.cpp). The ternary weights reduce memory reads, but the i5-7500T's 4 cores with AVX2-only still cap throughput around 15-25 t/s. Bonsai 8B speed is estimated from memory bandwidth; PrismML's claimed 368 t/s is on an RTX 4090 GPU. Independent CPU benchmarks for Bonsai are sparse as of April 2026.
At the same C+ quality tier (8B model), the API costs EUR 0.037 per million tokens. The i5-7500T costs EUR 0.212. The API is 5.7 times cheaper on electricity alone, before counting the computer's purchase price.
The extreme quantization models do not change the math. BitNet 2B at EUR 0.048/MTok is 1.3x more expensive than the API and produces D-tier output (MMLU 53 vs Mistral Nemo's ~70). The only models where electricity beats the API are sub-1B parameter models at Q4, and their output quality is so far below Mistral Nemo that the comparison is meaningless.
CPU comparison: DDR4 to DDR5
The i5-7500T is the cheapest case. How do faster CPUs compare? Memory bandwidth is the bottleneck — more channels, faster memory, more tokens per second.
| CPU | Memory | Bandwidth (theoretical) | Cores | System idle/load | Llama 3.1 8B Q4 t/s (est.) | EUR/MTok (marginal electricity) |
|---|---|---|---|---|---|---|
| Intel N100 | DDR5-4800 1-ch | 38.4 GB/s | 4E | 8W / 16W | ~3 | EUR 0.111 |
| Intel i3-N305 | DDR5-4800 1-ch | 38.4 GB/s | 8E | 10W / 22W | ~3.5 | EUR 0.143 |
| Intel i5-7500T | DDR4-2400 2-ch | 38.4 GB/s | 4C/4T | 22W / 45W | ~4-5 | EUR 0.212 |
| Intel i5-8500T | DDR4-2666 2-ch | 42.7 GB/s | 6C/6T | 22W / 48W | ~5-6 | EUR 0.180 |
| AMD Ryzen 5 5600X | DDR4-3200 2-ch | 51.2 GB/s | 6C/12T | 45W / 95W | ~6-7 | EUR 0.297 |
| AMD Ryzen 9 5900X | DDR4-3200 2-ch | 51.2 GB/s | 12C/24T | 55W / 120W | ~6-8 | EUR 0.338 |
| AMD Ryzen 9 7950X | DDR5-5200 2-ch | 83.2 GB/s | 16C/32T | 65W / 145W | ~10-12 | EUR 0.277 |
Electricity rate: EUR 0.15/kWh. EUR/MTok calculated from marginal power (load minus idle) at estimated token rate, 24/7 operation. N100 and i3-N305 are single-channel memory, which caps bandwidth despite DDR5 speeds. The Ryzen 7950X is 2-3x faster than the i5-7500T thanks to DDR5 dual-channel (83 GB/s vs 38 GB/s).
Every CPU in this table costs 3x to 9x more per token in electricity than the Mistral Nemo API (EUR 0.037/MTok). The fastest desktop CPU tested (Ryzen 7950X) still costs 7.5x more. More cores do not help — dual-channel DDR4 or DDR5 memory bandwidth is the wall.
One exception to the "llama.cpp is llama.cpp" assumption: ik_llama.cpp, an optimized fork, achieves 1.8-5.2x faster prompt processing than mainline llama.cpp on the same hardware. Benchmarks on a Ryzen 7950X (16 threads, AVX2):
| Quantization | ik_llama.cpp (t/s) | llama.cpp (t/s) | Speedup |
|---|---|---|---|
| BF16 | 256.9 | 78.6 | 3.3x |
| Q8_0 | 268.2 | 147.9 | 1.8x |
| Q4_0 | 273.5 | 153.5 | 1.8x |
| IQ3_S | 156.5 | 30.2 | 5.2x |
| Q8_K_R8 | 370.1 | N/A | new format |
These are prompt processing (prefill) speeds, not generation. Generation improves a more modest 1.0-2.1x. The key finding: on the Ryzen 7950X, the slowest quantization type in ik_llama.cpp is faster than the fastest type in mainline llama.cpp for prompt processing. For CPU inference workloads that involve large prompts — document summarization, RAG extraction, batch classification — the fork makes a material difference. It does not change the electricity cost comparison (the watts are the same), but it reduces wall-clock time per job, which matters for throughput-limited batch work.
Proof: every consumer GPU loses
The pattern holds across dedicated GPUs. Inference power draw runs about 55-75% of the card's TDP (the decode phase is memory-bandwidth-bound, not compute-bound). Add 120W for the rest of the system.
| GPU | VRAM | Llama 3.1 8B Q4 tok/s | System watts | EUR/MTok (electricity) | vs API |
|---|---|---|---|---|---|
| RX 5700 XT | 8 GB | ~55 (Vulkan, no ROCm) | 266W | EUR 0.201 | 5.4x |
| RX 6700 XT | 12 GB | ~45 | 270W | EUR 0.250 | 6.8x |
| RX 6900 XT | 16 GB | ~60 | 315W | EUR 0.219 | 5.9x |
| RTX 3060 | 12 GB | ~42 | 230W | EUR 0.228 | 6.2x |
| RTX 3090 | 24 GB | ~87 | 345W | EUR 0.165 | 4.5x |
| RTX 4060 Ti 16GB | 16 GB | ~48 | 227W | EUR 0.197 | 5.3x |
| RTX 4090 | 24 GB | ~128 | 413W | EUR 0.134 | 3.6x |
| RX 7900 XTX | 24 GB | ~80 | 351W | EUR 0.183 | 4.9x |
API baseline: Mistral Nemo at EUR 0.037/MTok ($0.04/MTok output). Electricity rate: EUR 0.15/kWh. RX 5700 XT runs Vulkan only (ROCm dropped Navi 10 support); speed scaled from llama.cpp Vulkan scoreboard data.
Every consumer GPU costs 3.6 to 6.8 times more in electricity per token than the API. The RTX 4090, the fastest consumer card, still costs 3.6x more. At German electricity rates (EUR 0.30/kWh), it costs 7.2x more.
The ex-mining AMD cards are fast for their price but still lose to the API. The RX 5700 XT draws 266W system power for ~55 tokens per second — its wide 256-bit memory bus (448 GB/s) makes it the fastest sub-$200 used card. The RX 6900 XT draws 315W for ~60 t/s: 18% more power for 9% more speed over the 5700 XT, because both share a 256-bit bus and differ mainly in clock speed.
For the best consumer GPU (RTX 4090) to break even with Mistral Nemo API on electricity alone, the electricity rate would need to be EUR 0.041/kWh. That is below any residential rate in Europe.
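The break-even rate falls out of inverting the electricity formula. A sketch:

```python
def break_even_rate(system_watts: float, tokens_per_sec: float,
                    api_eur_per_mtok: float = 0.037) -> float:
    """Electricity price at which local generation matches the API per token."""
    kwh_per_mtok = system_watts / 1000 * (1e6 / tokens_per_sec / 3600)
    return api_eur_per_mtok / kwh_per_mtok

# RTX 4090 system: 413W at ~128 t/s, vs Mistral Nemo at EUR 0.037/MTok
print(break_even_rate(413, 128))   # ~0.041 EUR/kWh
```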
Enterprise GPUs: faster but still losing
Enterprise GPUs have HBM memory with 2-4x the bandwidth of consumer GDDR. They are faster per token but draw more power and sit in servers with higher base consumption. System power assumes +200W for a server chassis (vs +120W for a desktop).
| GPU | VRAM | Bandwidth | 8B Q4 tok/s (est.) | System watts | EUR/MTok (electricity) | vs API | Purchase price (Apr 2026 est.) |
|---|---|---|---|---|---|---|---|
| L40S | 48 GB GDDR6 | 864 GB/s | ~110 | 550W | EUR 0.208 | 5.6x | $7,500-9,000 |
| A100 40GB PCIe | 40 GB HBM2e | 1,555 GB/s | ~195 | 450W | EUR 0.096 | 2.6x | $5,000-8,000 (used) |
| A100 80GB SXM | 80 GB HBM2e | 2,039 GB/s | ~260 | 600W | EUR 0.096 | 2.6x | $12,000-18,000 |
| RTX Pro 6000 | 96 GB GDDR7 | 1,792 GB/s | ~279 (7B bench) | 720W | EUR 0.107 | 2.9x | $8,000-9,200 |
| H100 SXM | 80 GB HBM3 | 3,352 GB/s | ~400 | 900W | EUR 0.094 | 2.5x | $22,000-30,000 |
The H100, the fastest GPU in this comparison, still costs 2.5x more per token in electricity than the API. The L40S is worse than consumer GPUs per token because its 864 GB/s bandwidth is slower than the RTX 4090's 1,008 GB/s but draws more system power in a server chassis.
Enterprise GPUs close the gap but do not flip the math. Where they earn their price is batched serving: an H100 running vLLM with 32 concurrent users achieves aggregate throughput that amortizes the power draw across requests. Single-stream single-user inference, which is what this table measures, is the worst case for expensive hardware.
What this means
Hardware cost, depreciation, driver maintenance, cooling, and noise all come on top of electricity. If electricity alone already loses to the API, the total cost is not getting better.
Self-hosting has real advantages: privacy, zero network latency, offline capability, fine-tuning. Those are legitimate reasons to run local inference. "It is cheaper" is not one of them.
Fine-tuning: where self-hosting wins decisively
Fine-tuning is the strongest capability advantage for self-hosting. A fine-tuned 7B model routinely beats a generic 70B+ model on the specific task it was trained for (arxiv 2406.08660, Stanford). Multi-task fine-tuned Phi-3-Mini (3.8B) surpassed GPT-4o on financial benchmarks (ACM/OpenReview). A fine-tuned 350M model beat ChatGPT by 3x on structured tool calling.
What you need
| Component | Minimum | Recommended |
|---|---|---|
| Dataset | 50-100 examples (classification) | 500-2,000 examples (generation tasks) |
| GPU (QLoRA) | RTX 4060 Ti 16 GB (up to 7B) | RTX 4090 24 GB (up to 14-20B) |
| GPU (full fine-tune) | A100 80 GB (7B) | 4x A100 (32B) |
| Time per run | 1-2 hours (7B, QLoRA) | 3-6 hours (14B, QLoRA) |
| Framework | LLaMA-Factory (beginner, web UI) | Unsloth (2x speed, 70% less VRAM) |
QLoRA is the key: it reduces VRAM requirements by 4-8x versus full fine-tuning with minimal quality loss. A 7B model that needs 60-80 GB for full fine-tuning needs only 5-10 GB with QLoRA. Quality is within 1-2% of full fine-tuning on most tasks.
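A rough VRAM estimator consistent with those figures. The 0.5 bytes/param NF4 base, the adapter fraction, and the flat activations/optimizer allowance are illustrative assumptions, not measured values:

```python
def qlora_vram_gb(params_b: float, lora_frac: float = 0.01,
                  overhead_gb: float = 3.0) -> float:
    """Rough QLoRA footprint: 4-bit (NF4) base weights plus FP16 LoRA
    adapters plus a flat activations/optimizer allowance."""
    base = params_b * 0.5               # 4-bit base model, GB
    adapters = params_b * lora_frac * 2  # FP16 adapter params, GB
    return base + adapters + overhead_gb

print(qlora_vram_gb(7))    # ~6.6 GB -> within the 5-10 GB range above
print(qlora_vram_gb(14))   # ~10.3 GB -> fits a 24 GB RTX 4090
```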
500 carefully curated examples outperform 5,000 noisy ones. The LIMA paper demonstrated that 1,000 well-crafted examples could produce a competitive instruction-following model. Quality of training data matters more than quantity.
API fine-tuning is limited
| Provider | Models available for fine-tuning | Export weights? | Pricing |
|---|---|---|---|
| OpenAI | GPT-4.1, 4.1 Mini, 4o Mini | No | $0.80-3.00/M training tokens |
| Google | Gemini Flash/Pro (Vertex AI only) | No | Per compute-hour |
| Anthropic | Claude 3 Haiku only (AWS Bedrock only) | No | AWS compute pricing |
| xAI | None | N/A | N/A |
No API provider lets you export model weights. Your fine-tuned model only works through their API, at their pricing, subject to their deprecation schedule. Self-hosted fine-tuning produces weights you own, deploy anywhere, and keep forever.
API fine-tuning also cannot do: LoRA merging (combine multiple adapters), DPO/GRPO (preference alignment), model merging (TIES/DARE/SLERP), continued pretraining, or knowledge distillation. These techniques are only available with local hardware.
Cost break-even
API costs below include both fine-tuning compute (training runs at $0.80-3.00/MTok) and inference on the fine-tuned model (at elevated fine-tuned model rates). Self-hosted costs include hardware amortization (RTX 4090, 3 years) plus electricity.
| Usage pattern | API cost (1 year, training + inference) | Self-hosted cost (1 year) | Break-even |
|---|---|---|---|
| Occasional (1 training run/month, light inference) | ~EUR 150-300 | ~EUR 1,700 (hardware + power) | ~8-12 months |
| Moderate (weekly retraining, daily inference) | ~EUR 2,000-5,000 | ~EUR 1,750 | ~4 months |
| Heavy (daily retraining, continuous inference) | ~EUR 15,000-50,000 | ~EUR 2,000 | 2-4 weeks |
The inference cost is the driver. Self-hosted inference on a fine-tuned model costs only electricity after hardware; API inference on a fine-tuned model costs $0.80-3.00/M tokens input, forever. At moderate-to-heavy inference volume, the hardware pays for itself quickly.
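The payback arithmetic can be sketched as a simple break-even calculation. The monthly figures in the example are illustrative assumptions in the moderate tier, not measurements:

```python
def break_even_months(hardware_eur: float, monthly_power_eur: float,
                      monthly_api_eur: float):
    """Months until cumulative API spend exceeds hardware plus running cost.
    Returns None if the API is cheaper per month than electricity alone."""
    monthly_saving = monthly_api_eur - monthly_power_eur
    if monthly_saving <= 0:
        return None
    return hardware_eur / monthly_saving

# Illustrative: RTX 4090 at EUR 1,600, ~EUR 12/month electricity,
# displacing an assumed EUR 350/month API bill
print(break_even_months(1600, 12, 350))   # ~4.7 months
```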
When NOT to fine-tune
Fine-tuning teaches behavior and domain patterns, not reasoning ability. A fine-tuned 7B cannot match GPT-4 on general reasoning. It can match or beat GPT-4 on the specific narrow task it was trained for.
If knowledge changes frequently, use RAG instead — retrieval stays current without retraining. If factual accuracy with source citation is paramount, RAG provides provenance that fine-tuning cannot. The RAFT study (UC Berkeley) found that combining fine-tuning for behavior with RAG for knowledge outperforms either approach alone.
Catastrophic forgetting is real: fine-tuning performance and general knowledge loss have a strong inverse relationship. LoRA/QLoRA reduces this (fewer parameters modified) but does not eliminate it.
When self-hosting makes sense
If data cannot leave your premises (regulatory, air-gapped, classified), API access is off the table. Self-hosted is the only option and cost comparisons do not apply.
Bulk classification, embedding generation, and document processing with a 7-8B model on modest hardware is cheap and fast. A used RTX 3090 ($800-900) runs Llama 3.1 8B at 87-112 t/s depending on configuration. These tasks do not need frontier quality.
Genuine 24/7 batch workloads keeping a GPU above 80% utilization get real per-token savings. This applies to organizations running inference as a service, not individuals using LLMs as assistants.
Fine-tuning on proprietary data is the clearest self-hosting win — see the section above. A $1,600 RTX 4090 with QLoRA handles models up to 14-20B, and the resulting model is yours to deploy anywhere.
Needle-in-a-haystack retrieval and semantic search over large private document collections are where local inference earns its cost. RAG pipelines that embed and search hundreds of thousands of documents generate millions of tokens per day in embedding and extraction queries. These are high-volume, low-quality-bar tasks where an 8B model is sufficient and API costs would compound. A single GPU running embeddings at 80-100% utilization processes the volume at a fraction of API pricing because the per-token overhead is amortized over sustained throughput.
Embedding models are often the strongest case for local deployment. Embedding models (22M-334M parameters) are distinct from generative LLMs — they produce vector representations for semantic search and RAG, not text. They run on any CPU at hundreds to thousands of embeddings per second with negligible power draw and 1-2 GB RAM total.
| Model | Parameters | Use case | Speed (CPU) | RAM |
|---|---|---|---|---|
| all-minilm-l6-v2 | 22M | fast filtering, similarity | ~5,000 embed/sec | <1 GB |
| nomic-embed-text | 274M | semantic search, RAG | ~1,000 embed/sec | ~1 GB |
| mxbai-embed-large | 334M | high-quality embeddings | ~500 embed/sec | ~1.5 GB |
A 22M embedding model running on any desktop CPU powers local semantic search over thousands of documents with zero API cost. For infrastructure use cases — log analysis, document similarity, RAG retrieval — embedding models deliver more production value than 1-4B generative models. They are the one category where "I already own the hardware" is genuinely true: the compute is trivial, the electricity is negligible, and the privacy benefit is real. API-based embedding (OpenAI ada-002 at $0.10/MTok) adds up fast at scale; local embedding is effectively free after the one-time setup.
Document summarization at scale follows the same pattern. Summarizing 10,000 PDFs through an API costs real money at $0.04-3.00 per million tokens depending on the model tier. On local hardware, the marginal cost is electricity and the model runs as fast as bandwidth allows. A Ryzen 7950X with 64 GB RAM can summarize documents at ~10 t/s on an 8B model continuously without per-request billing. An RTX 3090 does the same at ~87 t/s.
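The wall-clock difference is easy to estimate. A sketch, assuming ~3,000 tokens processed per document (an illustrative figure):

```python
def batch_hours(num_docs: int, tokens_per_doc: int,
                tokens_per_sec: float) -> float:
    """Wall-clock hours to push a batch through a throughput-limited model."""
    return num_docs * tokens_per_doc / tokens_per_sec / 3600

print(batch_hours(10_000, 3_000, 10))   # ~833 h (~35 days) at CPU speed
print(batch_hours(10_000, 3_000, 87))   # ~96 h (~4 days) on an RTX 3090
```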
Model stability: no silent changes
APIs ship updates you do not see. Between the February 2026 launch of Claude Opus 4.6 and late March 2026, Anthropic shifted default behavior in ways that measurably degraded output quality for heavy users. The 5 February 2026 Opus 4.6 launch made adaptive thinking the recommended mode, letting the model decide its own reasoning budget per turn. Anthropic's own documentation states plainly: "At the default effort level (`high`), Claude almost always thinks. At lower effort levels, Claude may skip thinking for simpler problems," and warns in the tuning section that "Steering Claude to think less often may reduce quality on tasks that benefit from reasoning." Claude Code harness defaults shifted during this window, and trade press reported peak-hour session-cap tightening in late March. Each change appeared in changelogs or third-party reporting. None was broadcast to users as a quality change. The combined effect was indistinguishable from a silent quality regression.
What Anthropic actually promises
In its September 2025 engineering postmortem, Anthropic states the narrow, auditable commitment: "We never reduce model quality due to demand, time of day, or server load." That is a specific promise about one mechanism class — load-based quality throttling. It is not a promise that user-perceptible experience is stable. The same postmortem disclosed three concurrent infrastructure bugs affecting Sonnet 4, Opus 4.1, Opus 4, and Haiku 3.5 between 5 August and 18 September 2025: a context-window routing error that peaked at 16% of Sonnet 4 requests on 31 August, an output-corruption bug that produced Thai and Chinese characters in English responses between 25 August and 2 September, and an approximate top-k XLA:TPU miscompilation affecting Haiku 3.5 from 25 August onward. All three were infrastructure problems, not deliberate model changes. None were externally visible as defects until Anthropic published the postmortem.
Anthropic's own documentation also contains a striking admission about the Opus 4.7 rollout. The adaptive thinking documentation notes that the `thinking.display` default changed from `summarized` to `omitted` when Opus 4.7 shipped, and describes the change in its own words: "This is a silent change from Claude Opus 4.6, where the default was `summarized`." The provider itself uses the phrase "silent change" to describe a user-visible behavior shift between model versions at the same identifier pattern. This is the structural issue, not Anthropic-specific malice: API defaults drift between releases, and the only way a customer learns is by reading every changelog entry against every model version they depend on.
Evidence, measured and debunked
Telemetry from 6,852 Claude Code sessions filed in GitHub issue #42796 by AMD engineer Stella Laurenzo measured a 73% collapse in visible reasoning depth between January and March 2026 (median thinking chars falling from ~2,200 to ~600), a Read:Edit tool-call ratio falling from 6.6 to 2.0, and an 80x increase in deduplicated API request volume. Concurrent session scale-up explains roughly 5-10x of that multiplier; the remaining 8-16x tracks degradation-induced thrashing — retries, corrections, and wrong outputs that would not have been needed had the model reasoned properly the first time. Input-token volume rose 170x between February and March (120M to 20.5B), estimated daily cost rose 122x ($12 to $1,504), reasoning loops per thousand tool calls rose 156%, and user-interrupt events per thousand tool calls rose 556%. Anthropic shipped Opus 4.7 within two weeks of the report gaining traction, and raised the default Claude Code harness effort level to `xhigh`.
Not every "nerf proof" holds up. The most-shared visual evidence — a BridgeBench comparison showing Opus scoring 83% before the alleged nerf and 68% after — failed basic methodological review: the first run measured 6 tasks and the retest measured 30 tasks, so the two scores were not comparing like with like. On the overlapping six tasks, the before-and-after numbers were 87.6% and 85.4% — within normal run-to-run variance. The viral result was noise amplified by confirmation bias. Rumors persist that providers silently quantize weights during peak load — dropping Opus to Int4 or 1.58-bit precision, or compressing the KV cache, to save compute on expensive models. No weight probe, perplexity study, or latency fingerprint has been published showing quantization artifacts, and the symptoms heavy users describe match reasoning-budget cuts and harness bloat, not the perplexity spikes that characterize aggressive quantization. The quantization theory remains unverified speculation — but it is unfalsifiable without provider-side access, which is precisely the problem. The operator of a hosted API has the ability to do this; the customer has no way to prove they did or did not.
The silent-downgrade bug class
A separate cluster of complaints — "I paid for Opus, I got Sonnet" or "I got Haiku" — turns out to be a real, documented bug class, but not a policy. GitHub issue #19468 and its duplicates (#3434, #6602, #13242, #17966) document client-side failures in the Claude Code harness: falling back to Sonnet after an Opus quota cap, serving Sonnet despite explicit Opus configuration, configuration files being ignored after harness updates, and reauthentication silently reverting the selected model to Haiku. These are implementation defects in the client, not backend model substitution. The user experience is identical to a hidden downgrade, which is what makes the deception theory persistent even when the mechanism is actually a bug. The transparency failure is real; the malice is not established.
The cross-provider pattern
This pattern is not Anthropic-specific. Chen, Zaharia, and Zou (2023) measured GPT-4 falling from 84% to 51% on prime-number identification between March and June 2023 on identical prompts — a controlled snapshot comparison published as a preprint and since peer-reviewed. OpenAI acknowledged GPT-4 Turbo "laziness" in December 2023 and shipped a fix in January 2024. Independent researchers found GPT-4 producing roughly 5% shorter responses when prompted with a December date vs. a May date, suggesting the model had internalized holiday-slowdown patterns from training data. Gemini 2.5 Pro faced identical community complaints in Q4 2025 and Q1 2026 — hallucinations, timeouts, degraded reasoning — with users speculating Google was starving Pro of TPU capacity to make room for 3.0 (unconfirmed). The underlying causes need not require malice: task-difficulty ratchet as users push harder problems, novelty effect wearing off, RLHF regression, silent product-configuration changes, infrastructure bugs, and selection bias among power users who post loudly when things feel worse. The universality is Bayesian evidence against any specific vendor being uniquely deceptive. It is also Bayesian evidence that any API-dependent operation will eventually experience this.
Reproducibility is structural
A self-hosted model with a given SHA256 hash produces the same output distribution today, in a year, and in five years. The reasoning budget is whatever VRAM and latency tolerance you allocate, not what a classifier at the provider decides under peak load. There is no adaptive-thinking mis-budgeting, no prompt-cache bug inflating cost 10-20x, no 1M context recall degrading past roughly 48% utilization as documented in GitHub issue #35296, no silent client-side downgrade from Opus to Sonnet after a quota hit, no silent display-default flip between model versions, no auto-compact summarising away the context you needed, and no rumored peak-hour quantization that cannot be confirmed or denied. Changes happen when you update the weights file. Reverting is copying the old file back. Pinning to a specific checksum is a guarantee the provider cannot make.
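Pinning to a checksum is a one-liner in practice. A minimal sketch hashing a stand-in file; a real deployment would hash the actual GGUF at startup and refuse to serve on mismatch:

```python
import hashlib
import pathlib
import tempfile

def sha256_of(path) -> str:
    """Stream a file through SHA256 in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The exact weights this pipeline was validated against
# (this is the SHA256 of the stand-in bytes b"test", for illustration):
PINNED = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

with tempfile.TemporaryDirectory() as d:
    weights = pathlib.Path(d) / "model.gguf"
    weights.write_bytes(b"test")             # stand-in for a real GGUF file
    assert sha256_of(weights) == PINNED      # refuse to serve on mismatch
```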
For API-dependent workloads that cannot move to local inference, the practical mitigations are narrow: set `effort` to `high` or `max` explicitly rather than relying on defaults, pin `thinking.type` and `thinking.display` explicitly rather than letting model-version upgrades change them silently, treat 256K–400K as the practical context limit even on advertised 1M windows, and document the exact model snapshot (e.g. `claude-opus-4-6-20260201`) your pipeline was validated against. Priority Tier for production workloads gives predictable quota behavior. Cross-provider redundancy reduces single-vendor dependency. None of these mitigations restore byte-identical reproducibility — they buy you fewer surprises, not zero.
For workloads where reproducibility matters more than frontier quality — regulated industries with audit requirements, long-horizon research spanning model versions, agent pipelines where behavior changes break production — model stability is a feature APIs structurally cannot match. It does not show up in cost-per-token comparisons. It compounds over years.
When APIs win
For coding, multi-step reasoning, and agentic workflows, APIs win on both quality and effective cost.
A B+ local model looks adequate on simple prompts. The gap shows on hard problems: multi-file refactoring, agent coherence past 20 turns, judgment calls in ambiguous situations. The output compiles. It does the wrong thing. These are the tasks where LLMs are worth using, and where frontier models justify their pricing.
At 20-30% utilization, APIs often cost less per token than self-hosted inference. Add maintenance (driver bugs, cooling, kernel compatibility, GPU replacement) and the gap widens.
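The utilization argument is straightforward amortization arithmetic. A sketch with illustrative figures — the hardware price, opex, and throughput below are assumptions for a single 96 GB card, not measurements:

```python
def self_hosted_cost_per_mtok(
    hardware_eur: float,
    amortization_months: float,
    monthly_opex_eur: float,
    peak_tokens_per_sec: float,
    utilization: float,
) -> float:
    """EUR per million tokens for a GPU box at a given average utilization."""
    tokens_per_month = peak_tokens_per_sec * utilization * 3600 * 24 * 30
    monthly_cost = hardware_eur / amortization_months + monthly_opex_eur
    return monthly_cost / (tokens_per_month / 1e6)

# Illustrative: EUR 8,500 card over 36 months, EUR 150/mo power + space,
# 1,000 t/s batched peak throughput.
for util in (0.05, 0.25, 0.80):
    cost = self_hosted_cost_per_mtok(8500, 36, 150, 1000, util)
    print(f"{util:.0%} utilization: EUR {cost:.3f}/Mtok")
```

Because the monthly cost is fixed, cost per token scales as 1/utilization: a box idle 75% of the time costs 16x more per token than one idle 5% of the time, which is why low-utilization workloads favor the API.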
The DeepSeek V3 case study
DeepSeek V3.2 API costs about EUR 260/month for a moderate coding workload (130K queries/month averaging ~5K tokens each — typical for code completion, short Q&A, and lightweight agent calls). At heavier usage with long context (e.g. 74K tokens per query), the same 130K queries would cost ~EUR 2,500/month. Self-hosting the full DeepSeek V3 model requires 8x H100 SXM GPUs.
| Configuration | Upfront cost | Monthly operating | vs API (EUR 260/mo, moderate) | vs API (EUR 2,500/mo, long-context) |
|---|---|---|---|---|
| DeepSeek V3 on 8x H100 (new) | EUR 320K | EUR 9,600 | 37x more | 3.8x more |
| DeepSeek V3 on 8x H100 (used) | EUR 200K | EUR 6,260 | 24x more | 2.5x more |
| DeepSeek V3 on 2x MI300X | EUR 37K | EUR 1,230 | 5x more | 0.5x (cheaper) |
| Qwen3-30B-A3B on 2x RTX 4090 (substitute) | EUR 7,600 | EUR 350 | 35% more (lower quality) | 7x cheaper (much lower quality) |
At moderate usage (EUR 260/month), self-hosting loses badly. At heavy long-context usage (EUR 2,500/month), self-hosting on MI300X starts to break even — but you are running the full 685B model, which requires specialized hardware. The consumer GPU substitute (Qwen3-30B-A3B) is a different, smaller model with lower quality.
The API is almost certainly priced below marginal cost for individual users. DeepSeek benefits from massive scale, strategic pricing, and Chinese electricity rates. At moderate usage, competing on price by buying your own hardware is a losing proposition. At heavy long-context usage, self-hosting can break even on operating costs — but the upfront hardware investment (EUR 37K-320K for the full model) means years to payback.
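The payback arithmetic can be checked directly from the table: months to recoup the hardware equal the upfront cost divided by the monthly operating saving versus the API.

```python
def payback_months(upfront_eur: float, api_monthly_eur: float, opex_monthly_eur: float) -> float:
    """Months to recoup hardware from the monthly saving vs the API; inf if it never pays back."""
    saving = api_monthly_eur - opex_monthly_eur
    return float("inf") if saving <= 0 else upfront_eur / saving

# Figures from the table above (EUR), long-context scenario.
print(payback_months(37_000, 2_500, 1_230))   # 2x MI300X: ~29 months
print(payback_months(200_000, 2_500, 6_260))  # used 8x H100: never pays back
```

Roughly two and a half years for the cheapest full-model configuration, and that assumes the workload, the model's relevance, and DeepSeek's pricing all hold steady for the whole period.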
API pricing is variable: you pay for what you use. GPU hardware is a fixed cost that depreciates whether it runs or not. Workloads that vary week to week waste less money on APIs.
At Pulsed Media
Pulsed Media runs its own hardware in owned datacenter space in Finland. GPU inference cards like the RTX Pro 6000 go through the same depreciation, power cost, and rack space analysis as any other piece of infrastructure.
For most workloads in 2026, cloud APIs deliver better quality per euro than local inference. Bulk classification, embeddings, and privacy-constrained tasks are where the GPU math works. 16 years of hardware purchasing says: buy when the math supports it, use the API when it does not.
Seedboxes and dedicated servers from Pulsed Media run on the same Finnish datacenter infrastructure, with the same cost-per-unit discipline applied to every piece of hardware. See current plans.
References
Benchmarks and performance data
- LMSYS Chatbot Arena ELO Leaderboard — model quality rankings used for Arena ELO scores throughout this article
- llama.cpp Vulkan GPU Scoreboard — community-submitted GPU inference benchmarks (Llama 2 7B Q4_0, tg128). Source for RX 5700 XT (70.73 t/s), GTX 1050 Ti (20.96 t/s), RTX 3060 (75.94-80.59 t/s), and other consumer GPU speeds
- llama.cpp — the inference engine behind Ollama and most local LLM benchmarks
- vLLM — batched serving engine; source for Pro 6000 throughput numbers (8B at 8,990 t/s, 70B at 1,031 t/s)
Model cards and technical reports
- Gemma 4 model family — 31B dense and 26B-A4B MoE variants. Arena ELO ~1450 (31B) and ~1441 (26B-A4B)
- Llama 3.3 70B — the 70B dense model used for Pro 6000 and consumer GPU benchmarks
- Llama 3.1 8B — the 8B model used as baseline for electricity cost and bandwidth comparisons
- Qwen3 model family — 0.6B through 235B; 30B-A3B MoE benchmark source
- Phi-4 Mini 3.8B model card — MMLU 67.3%, GSM8K 88.6%
- BitNet b1.58 2B-4T model card — MMLU 53.2%, energy and latency benchmarks
- bitnet.cpp — native ternary inference framework
Research papers
- bitnet.cpp: Efficient Edge Inference for Ternary LLMs (Microsoft Research, 2024) — energy per token and CPU latency measurements for BitNet models
- AQLM: Extreme Compression of LLMs via Additive Quantization (ICML 2024) — 2-bit quantization with 2-5% quality degradation
- QTIP: Quantization with Trellises and Incoherence Processing (NeurIPS 2024 Spotlight) — 2-bit quantization with 3x+ inference speedup
- Chroma 2025 study — context rot: 20-50% accuracy degradation from 10K to 100K tokens across 18 frontier models
- Liu et al. 2024 — "lost in the middle" effect: systematic attention bias toward beginning and end of context
- Sequential-NIAH (EMNLP 2025, arXiv 2504.04713) — best model scored 63.5% on ordered extraction across long context
- NoLiMa (ICML 2025) — non-literal long-context evaluation: semantic retrieval degrades worse than passkey benchmarks suggest
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Google Research, ICLR 2026) — KV cache compression to 2.5-3.5 bits with near-zero quality loss
CPU inference and memory bandwidth
- AMD EPYC 9554 llama.cpp benchmarks (ahelpme.com) — EPYC 9554 achieves 50 t/s on 8B Q4_K_M, confirming memory bandwidth as primary bottleneck
- OpenBenchmarking.org llama.cpp results — 88 public CPU benchmark results, median 14.2 t/s on 8B Q8_0
- ik_llama.cpp CPU performance — Ryzen 7950X benchmarks showing 1.8-5.2x speedup over mainline llama.cpp for prompt processing
- ik_llama.cpp — optimized llama.cpp fork with improved CPU kernels
- CPU token rate formula: theoretical max t/s ≈ memory bandwidth (GB/s) / model size (GB); actual ~50-70% of theoretical due to overhead
Hardware specifications
- Speeds in this article use Q4_K_M quantization unless otherwise noted. Consumer GPU system power includes GPU + ~120W for the rest of the system; enterprise GPUs use +200W for server chassis. Electricity rate: EUR 0.15/kWh (Finnish residential average). Enterprise GPU purchase prices are estimated April 2026 market rates and shift rapidly
- VRAM context limits use Llama 3.1 8B KV cache parameters (128 MB/1K tokens at FP16, 64 MB/1K at Q8, 32 MB/1K at Q4). Other model architectures vary; Qwen3 8B uses ~144 MB/1K and Gemma 4 31B dense uses ~850 MB/1K due to its 16 KV-head architecture
Model stability and provider changes
- Primary sources (Anthropic)
- Anthropic Engineering — postmortem of three recent issues (September 2025) — context-window routing error (5 Aug–18 Sep, peaked at 16% of Sonnet 4 requests), output corruption with Thai/Chinese characters (25 Aug–2 Sep), and top-k XLA:TPU miscompilation affecting Haiku 3.5 (25 Aug onward). Contains the quality commitment quote: "We never reduce model quality due to demand, time of day, or server load."
- Anthropic — Adaptive Thinking documentation — defines effort levels (`low`, `medium`, `high` default, `xhigh`, `max`) and acknowledges that at lower effort levels Claude may skip thinking for simpler problems; includes the "silent change" phrasing for the Opus 4.7 `display` default shift from `summarized` to `omitted`
- Anthropic — Extended Thinking documentation — reference for manual `budget_tokens` mode, now deprecated on Opus 4.6 and Sonnet 4.6, and removed entirely on Opus 4.7
- Anthropic API release notes — dated changelog including the 5 February 2026 Opus 4.6 launch (adaptive thinking recommended), 16 April 2026 Opus 4.7 launch, and interim platform changes
- Anthropic — Claude Opus 4.6 launch announcement
- Anthropic — Claude Opus 4.7 launch announcement (16 April 2026) — raises the default Claude Code harness effort to `xhigh`; on the API, adaptive thinking becomes the only supported thinking mode for Opus 4.7
- User telemetry and bug reports
- GitHub issue anthropics/claude-code #42796 — AMD engineer Stella Laurenzo's telemetry from 6,852 Claude Code sessions: 73% reasoning-depth collapse (~2,200 → ~600 chars), Read:Edit ratio collapse (6.6 → 2.0), 80x increase in deduplicated API request volume, 170x input-token growth, 122x daily-cost growth, 156% reasoning-loop growth, 556% user-interrupt growth (January → March 2026)
- GitHub issue anthropics/claude-code #35296 (filed 17 March 2026) — Claude Code user documenting 1M context window degradation, including the model itself recommending session restart by 48% context utilization
- GitHub issue anthropics/claude-code #19468 — tracks the client-side Opus→Sonnet/Haiku downgrade bug class (duplicates #3434, #6602, #13242, #17966): settings ignored, quota fallback without notice, reauthentication reverting to Haiku
- GitHub issue anthropics/claude-code #16073 — early January 2026 quality-regression complaint with session-level detail
- GitHub issue anthropics/claude-code #49593 — ~14% context-window bloat introduced in Claude Code v2.1.111; Tool Search cut MCP overhead 46.9% (51K → 8.5K tokens), indicating the baseline was severe
- Cross-provider and historical context
- Chen, Zaharia, Zou — How is ChatGPT's Behavior Changing over Time? (arXiv 2307.09009, July 2023) — controlled measurement of GPT-4 falling from 84% to 51% on prime identification between March and June 2023 on identical prompts; the canonical academic citation for frontier-model drift
- OpenAI public acknowledgment of GPT-4 Turbo "laziness" (December 2023) — statement that the model had not been updated since 11 November, yet user behavior had shifted; fix shipped January 2024
- Trade press coverage of the 2026 Claude episode
- Fortune (14 April 2026) — customer-backlash reporting tied to the adaptive-thinking default and effort-level reductions
- The Register (31 March 2026) — coverage of the Claude Code quota-blowup episode, including Anthropic's acknowledgment that users were hitting limits faster than expected
- InfoWorld (late March 2026) — reporting on Claude Code peak-hour session-cap tightening (US weekday 5am–11am PT / 1pm–7pm GMT window)
Trade-press citations above are included for context and were used as corroboration; primary-source material from Anthropic's own documentation, postmortems, and GitHub issue tracker is preferred for any factual claim in this article.
See also
- NVIDIA RTX Pro 6000 (Blackwell) — full specs, benchmarks, and known issues for the 96 GB GPU referenced throughout this article
- Seedbox — PM's primary product, where the same hardware cost economics apply
- NVMe — storage interface used alongside GPU inference servers
- RAID — storage redundancy in PM's infrastructure