
Self-Hosting LLMs vs API

From Pulsed Media Wiki


Pulsed Media owns its datacenter space and runs its own hardware. GPU inference cards get evaluated against the same cost-per-unit math that prices seedboxes and dedicated servers. In April 2026, the math says: cloud APIs win for quality-sensitive work, self-hosted GPUs win for bulk processing and privacy-constrained workloads.

This article presents the hardware benchmarks, API pricing, and economics that show why.

The quality problem

The first obstacle is not cost. It is quality.

The best open-weight model you can run on a single consumer GPU (24 GB) is Gemma 4 31B. It scores approximately 1450 on the LMSYS Chatbot Arena ELO scale (lmarena.ai, preliminary, April 2026). That is 20 points below Sonnet and 50 below Opus. The gap sounds small until you measure it on hard tasks.

Model Arena ELO (approx) Access Hardware needed
Claude Opus 4.6 ~1500 API / subscription Cloud
Claude Sonnet 4.6 ~1470 API / subscription Cloud
GLM-5 744B (dense) ~1451 Open weights 6x 96 GB GPUs (~$48,000)
Gemma 4 31B (dense) ~1450 Open weights 1x 24 GB GPU at Q4 (~$1,600)
Kimi K2.5 1T (MoE) ~1447 Open weights 8x 96 GB GPUs (~$64,000)
Gemma 4 26B-A4B (MoE) ~1441 Open weights 1x 24 GB GPU at Q4 (~$1,600)
Qwen3-235B-A22B (MoE) ~1422 Open weights 2x 96 GB GPUs (~$16,000)
DeepSeek V3.2 685B (MoE, 37B active) ~1421 Open weights 5x 96 GB GPUs (~$40,000)
DeepSeek R1 671B (MoE, reasoning) ~1398 Open weights / API 5x 96 GB GPUs (~$40,000)
Llama 3.3 70B (dense) ~1350-1400 Open weights 1x 96 GB GPU (~$8,500)

The table splits into two tiers. Models you can run on a single GPU ($1,600-8,500) top out at ELO ~1450. Models that approach Sonnet territory (GLM-5 at 1451, Kimi K2.5 at 1447) require multi-GPU setups. The costs shown assume RTX Pro 6000 cards at ~$8,000 each via PCIe — but without NVLink, tensor parallelism runs at roughly one-third the throughput of datacenter H100s. An NVLink-equipped H100 SXM configuration running the same models costs 3-5x more ($150,000-320,000).

Gemma 4 31B scores 80.0% on LiveCodeBench v6 and 84.3% on GPQA Diamond. These are strong numbers for an open model. But on agentic workflows where models must maintain coherence across 20+ tool calls, open models fail silently: the output compiles but does the wrong thing, or the agent loops instead of converging. These failures are expensive because nobody notices until the result is wrong.

No amount of hardware spending changes this. The quality ceiling is in the model weights. Even $64,000 in GPUs running Kimi K2.5 does not reach Sonnet.

Quality vs hardware: what each GPU tier can actually run

The quality gap depends on what fits on the hardware you own.

VRAM Best model (Q4) Arena ELO (full precision) Quality tier Gap to Sonnet
4 GB Phi-4 Mini 3.8B or smaller ~1100-1200 D ~270+ ELO
8 GB Llama 3.1 8B (tight) ~1250-1300 C/C+ ~170-220 ELO
12 GB Qwen3 14B ~1350 B- ~120 ELO
16 GB Qwen3 14B (comfortable) ~1350 B- ~120 ELO
24 GB Gemma 4 31B ~1450 B+/A- ~20 ELO
32 GB Gemma 4 31B at Q8 ~1450+ A- ~20 ELO
48 GB Llama 3.3 70B Q4 ~1350-1400 B/B+ ~70-120 ELO
96 GB Llama 3.3 70B Q8 ~1400 B+ ~70 ELO

Arena ELO scores above were measured at full precision (BF16/FP16) on the leaderboard, not at Q4. Running at Q4_K_M typically costs 0.3-1 benchmark points (see quantization quality section), so actual Q4 performance is slightly below these numbers — within the "~" approximation for most models, but measurable on hard tasks.

At 4-8 GB VRAM, you are running models that struggle with multi-step reasoning and produce noticeably worse output than a $0.04/MTok API call to Mistral Nemo. At 24 GB, Gemma 4 closes the ELO gap but still falls short on the hardest tasks. Only at 96 GB do you get a 70B model at full quality, and even that is still B+ tier versus API frontier at A+/S.

What 4-8 GB VRAM can run

Most consumer GPUs sold in the last five years have 4-8 GB of VRAM. Can they do anything useful with local LLMs?

4 GB (GTX 1050 Ti, RX 570, integrated graphics)

At 4 GB, the only models that fit are sub-3B at Q4:

Model Size at Q4 Speed (est.) Useful for
Qwen3 0.6B ~0.5 GB 30-50 t/s Text classification, simple extraction
Llama 3.2 1B ~0.8 GB 20-30 t/s Basic summarization, translation
SmolLM2 1.7B ~1 GB 20-30 t/s Summarization, rewriting, function calling
BitNet b1.58 2B 0.4 GB 10-20 t/s (CPU) Classification, simple Q&A (MMLU 53)
Phi-4 Mini 3.8B ~2.3 GB 15-25 t/s Simple coding, Q&A

These models cannot replace a general-purpose assistant: multi-step reasoning, complex code generation, and long conversations are out of reach. But they are not useless. Fine-tuned sub-2B models beat GPT-4o on structured extraction (NuExtract-tiny 0.5B) and outperform zero-shot GPT-4 on classification tasks with as few as 60-75 training examples per class (arxiv 2406.08660). For single-task pipelines — classification, extraction, routing, PII redaction — a fine-tuned 0.5-2B model on a 4 GB GPU is a legitimate production tool, not a toy.

8 GB (RTX 3060 8GB, RTX 4060, RX 6600)

At 8 GB, a 7B/8B model at Q4 fits with room for ~32K context (Q8 KV cache):

Model Size at Q4 Speed (GPU) Max context (Q8 KV) Quality
Llama 3.1 8B ~4.8 GB ~40-50 t/s ~32K C+
Qwen3 8B ~5.2 GB ~35-45 t/s ~28K C+
Gemma 4 E4B ~5 GB ~35-45 t/s ~28K C
Mistral Nemo 12B ~7.4 GB ~25-35 t/s ~4-8K C+

An 8B model on an 8 GB GPU is the minimum setup that produces output comparable to budget APIs. The constraint is context: at Q4 KV cache, 64K tokens are feasible but tight. At Q8 KV (recommended for retrieval tasks), the ceiling is ~32K.

Mistral Nemo 12B fits at Q4 but leaves almost no room for KV cache. Short conversations only.

Memory bandwidth is the bottleneck

LLM token generation is memory-bandwidth-bound. The GPU reads model weights from VRAM for every single token. Read speed determines generation speed.

Platform Memory bandwidth Llama 3.3 70B Q4 tok/s Relative speed
RTX Pro 6000 1,792 GB/s ~34 1.0x
H100 SXM 3,352 GB/s ~40 1.2x
Strix Halo (128 GB unified) 215 GB/s ~4.5 0.13x
DDR5 desktop (dual channel) ~89 GB/s ~2.2 0.06x
DDR4-3200 desktop (dual channel) ~42 GB/s ~1 0.03x

The Pro 6000 and H100 are close on single-card workloads because both have enough bandwidth to keep a 70B model fed. The Pro 6000 compensates with more aggressive quantization (Q4 vs FP8): fewer bytes to read per weight yields similar throughput. The H100 pulls ahead in NVLink multi-GPU tensor parallelism, where its 900 GB/s interconnect leaves PCIe (128 GB/s) behind.

A Strix Halo system has 215 GB/s. Same model, same quantization, 8.3x less bandwidth, roughly 8x slower. A desktop CPU on DDR5 dual-channel has ~89 GB/s. 20x slower than the Pro 6000.

No amount of CPU cores or compiler flags changes this. Bandwidth is physics. The cheapest path to ~1,800 GB/s with enough VRAM to hold a 70B model in April 2026 is the RTX Pro 6000 at $8,500; the RTX 5090 matches the bandwidth for less money but caps out at 32 GB.

The rough formula: max tokens/sec = memory bandwidth (GB/s) / model size (GB). Actual throughput is about 50-70% of theoretical. An EPYC 9554 with ~500 GB/s measured bandwidth hits 50 t/s on an 8B Q4 model (~4.7 GB), just under half of the theoretical ~106 t/s.
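A sketch of that arithmetic (model sizes taken from the quantization tables later in this article):

```python
# Rough speed model for single-stream generation: every token requires reading
# the full set of quantized weights once, so bandwidth / model size is a ceiling.
def theoretical_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Figures from this article (8B Q4 ~ 4.7 GB, 70B Q4 ~ 40 GB):
print(theoretical_tokens_per_sec(1792, 40))   # ~45 t/s ceiling; measured ~34 (~75%)
print(theoretical_tokens_per_sec(500, 4.7))   # ~106 t/s ceiling; measured ~50 (~47%)
print(theoretical_tokens_per_sec(42, 40))     # ~1 t/s ceiling for a 70B on DDR4
```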

Hardware comparison

Professional GPUs

The RTX Pro 6000 is the only GPU under $10,000 that runs 70B models on a single card without quantizing below Q4. Full specs and known issues are on the dedicated page.

Model Quantization Tokens/sec (generation)
Llama 3.1 8B Q4_K_M ~185
Mistral Nemo 12B Q4_K_M 163
Qwen3 30B-A3B (MoE) Q4_K_M 252
Llama 3.3 70B Q4_K_M 34

34 tokens per second on 70B is about 25 words per second. Fast enough that you are not waiting for the model to finish a paragraph.

With vLLM batched serving (concurrent requests), throughput goes well beyond single-stream: 8B at 8,990 t/s, 70B at 1,031 t/s. On models that fit on one card, the Pro 6000 matches the H100 SXM at one-third the cost.
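A minimal offline batching sketch with vLLM; the model name and sampling settings are illustrative, not a tested configuration:

```python
# vLLM offline batched inference: continuous batching schedules many sequences
# per forward pass, which is how aggregate throughput reaches thousands of t/s
# while single-stream generation stays bandwidth-limited.
from vllm import LLM, SamplingParams

prompts = [f"Summarize ticket #{i} in one sentence: ..." for i in range(256)]
params = SamplingParams(temperature=0.2, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model id
outputs = llm.generate(prompts, params)               # all prompts batched together

for out in outputs[:3]:
    print(out.outputs[0].text)
```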

The card has real problems. A virtualization reset bug causes unrecoverable GPU state after VM shutdown. Sustained vLLM inference triggers chip resets at temperatures as low as 28C. The SM120 kernel architecture breaks DeepSeek models. Only the open-source driver works on Blackwell; there is no proprietary option.

Consumer GPUs

GPU VRAM Bandwidth Llama 3.1 8B Q4 t/s Max model (Q4) Price (Apr 2026)
RTX 5090 32 GB GDDR7 1,790 GB/s ~300 ~32B $2,900-4,200
RTX 4090 24 GB GDDR6X 1,008 GB/s ~128 ~32B tight $1,600-2,200
RTX 3090 24 GB GDDR6X 936 GB/s ~87-112 ~32B tight $800-900 used
RTX 4000 SFF Ada 20 GB GDDR6 280 GB/s ~64 ~13B ~$1,250

The RTX 5090 has bandwidth matching the Pro 6000 (1,790 GB/s) but only 32 GB of VRAM, which limits it to 32B-class models at Q4. For sub-32B workloads, the 5090 is better price-performance than the Pro 6000.

The RTX 3090 is the last consumer card with NVLink. Two of them ($1,600-1,800 used) give 48 GB combined, enough for 32B models with generous context. For less than the price of a single RTX 5090, a dual-3090 setup gets similar bandwidth and 50% more VRAM.

Gemma 4 on consumer GPUs: dense vs MoE

Gemma 4 is available in two forms that fit on 24 GB: the 31B dense model and the 26B-A4B MoE. The MoE activates only about 4B parameters per token, making it dramatically faster.

Variant Arena ELO RTX 4090 speed RTX 3090 speed VRAM (Q4) 256K context?
31B Dense ~1450 ~25-35 t/s (short ctx) ~20-35 t/s 19.6 GB No (KV cache fills 24 GB at ~32-64K)
26B-A4B MoE ~1441 ~50-129 t/s ~35-40 t/s ~15.6 GB Yes (8.4 GB headroom for KV)

The 31B dense model has a disproportionately large KV cache (~0.85 MB per token at BF16) because of its 16 KV-head architecture. Despite fitting at Q4, context expansion rapidly consumes remaining VRAM. At VRAM-saturated configurations, the RTX 4090 drops to 7.8 t/s on the 31B.

The 26B-A4B MoE uses 4x less KV cache (~0.21 MB/token), runs 2-4x faster, and scores within 9 ELO points. For interactive use on 24 GB hardware, the MoE is the correct choice. The dense model is worth choosing only for maximum coding quality at short context (<8K tokens).

Compact workstations

The Minisforum MS-02 Ultra won a CES 2026 Innovation Award. Compact 4.8L chassis, Intel Core Ultra 9 285HX, up to 256 GB DDR5 ECC, four NVMe slots. Looks great on paper.

The catch: the PCIe slot is low-profile dual-slot only. The best GPU that fits is an RTX 4000 SFF Ada with 20 GB VRAM and 280 GB/s bandwidth. That gets ~64 t/s on an 8B model and cannot run anything above 13B. The MS-02 Ultra is a homelab machine, not an inference server.

Strix Halo

AMD's Ryzen AI Max+ 395 (Strix Halo) puts 128 GB of unified LPDDR5x memory on a single chip. The iGPU can address up to 96 GB as VRAM, matching the RTX Pro 6000 on capacity.

Bandwidth tells the real story: 215 GB/s measured, versus the Pro 6000's 1,792 GB/s. Every model runs 4-8x slower. 70B at 4.5 tokens per second works, but it is painfully slow.

Where Strix Halo is useful: running large MoE models that would not fit on a 24-32 GB consumer GPU. Qwen3 30B-A3B at 66-72 t/s in 128 GB unified memory, or larger MoE models at 20+ t/s. The 128 GB pool is the point, not the speed.

At EUR 2,000-3,000 for a complete system consuming 120W, it costs 3-5x less than an RTX Pro 6000 setup. Same model capacity, 8x less speed.

CPU inference

Server-class CPUs with enough memory channels can run LLMs at usable speeds for small models:

CPU Memory bandwidth 8B Q4 t/s 70B Q4 t/s Usable for chat?
EPYC 9554 (64c, 12-ch DDR5) ~500 GB/s ~50 ~7 8B yes, 70B batch only
Dual Xeon Gold 5317 ~80 GB/s ~22 ~3 8B marginal, 70B no
Desktop DDR5 (dual-channel) ~89 GB/s ~20 ~2 8B marginal, 70B no
Desktop DDR4 (i5-7500T) ~38 GB/s ~4-5 <1 8B basic completion only, 70B no

CPU inference makes sense when you need to run a model that does not fit in any available VRAM and buying a GPU is not an option. An EPYC with 12-channel DDR5 and 512+ GB RAM can run 70B at 7 t/s. Batch processing, not interactive.

The optimized fork ik_llama.cpp gets 1.8-5x faster prompt processing than mainline llama.cpp on CPUs with AVX-512, though generation speed gains are more modest.

The smallest useful CPU models

For CPU-only machines without a GPU, the models worth considering are limited:

Model Parameters RAM needed Desktop CPU speed Quality
BitNet b1.58 2B4T 2.4B ~0.4 GB 10-20+ t/s Classification, simple Q&A; MMLU 53
SmolLM2 1.7B 1.7B ~1 GB 15-25 t/s Summarization, rewriting; beats Llama 3.2 1B
Llama 3.2 1B 1.3B ~0.8 GB 20-30 t/s Basic tasks; best gains from fine-tuning
Llama 3.2 3B 3.2B ~2 GB 10-15 t/s Simple Q&A, summarization
Phi-4 Mini 3.8B 3.8B ~2.3 GB 7-10 t/s Simple coding, reasoning
Llama 3.1 8B 8B ~4.8 GB 4-5 t/s (DDR4) / 15-20 t/s (DDR5) C+ tier; minimum useful

Models under 3B parameters are not general-purpose assistants, but they have real production uses beyond research. SmolLM2 1.7B outperforms Llama 3.2 1B on most benchmarks. Gemma 3n E2B runs in 2 GB RAM and exceeds 1300 Elo on LMArena — the first sub-10B model to do so. Fine-tuned sub-2B models match or exceed frontier models on focused tasks like structured extraction and classification (see the 4 GB VRAM section).

Where tiny models add the most value in a self-hosted stack:

  • Classification and routing — a 0.5B classifier routes queries to the right model tier, saving inference cost on easy queries
  • Structured extraction — parsing invoices, resumes, forms into JSON
  • Speculative decoding — a tiny draft model proposes tokens, a large model verifies in parallel, achieving 2-3x speedup with zero quality loss (Google Research); see the sketch after this list
  • PII redaction — well-defined token-level task where small models excel
  • Edge deployment — Raspberry Pi 5 runs a 1B model at 7-15 t/s for offline classification or sensor data processing

The 8B tier on CPU is the minimum for genuinely useful general-purpose output. Below that, you are trading generality for specialization — a valid tradeoff for focused pipelines, privacy-constrained environments, and edge deployment.
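Speculative decoding in particular needs no custom infrastructure. A sketch using Hugging Face transformers assisted generation; the model IDs are examples and the draft model must share the target's tokenizer family:

```python
# Speculative decoding via assisted generation: a small draft model proposes
# several tokens, the large model verifies them in one forward pass.
# Output matches running the large model alone; only latency changes.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-8B-Instruct"   # example target model
draft_id = "meta-llama/Llama-3.2-1B-Instruct"    # example draft model (same tokenizer family)

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

inputs = tok("Classify this ticket as billing, technical, or other: ...",
             return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```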

The context window problem

Real workloads routinely exceed 128K tokens. A medium codebase is 200K-500K tokens. A 50-page contract is 40K tokens. An agentic session with 30+ tool calls accumulates 100K-300K tokens. Document collection analysis, multi-session memory, and RAG over large corpora push past 500K easily.

Gemini 2.5 Pro handles 1 million tokens natively. Claude Opus 4.6 and Sonnet 4.6 handle 1M. GPT-4.1 takes 1M. xAI's Grok 4.20 takes 2M. These are the context windows where real work happens.

Open models top out at 128K. That is not a VRAM limitation — it is a training limitation. No open model available in April 2026 was trained on context beyond 128K-256K, and quality degrades well before the stated maximum. This is the single largest gap between self-hosted and API inference, and no amount of hardware closes it.

VRAM limits on context

Every token in context consumes VRAM for its KV cache entry. The formula: 2 x layers x KV_heads x head_dim x bytes_per_element. For Llama 3.1 8B (32 layers, 8 KV heads, 128 head dim), that is ~128 MB per 1,000 tokens at FP16. Q8 KV halves that to ~64 MB/1K with negligible quality loss. Q4 KV halves again but degrades retrieval accuracy.
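That arithmetic as a short sketch (the architecture numbers are the Llama 3.1 8B values quoted above; other models differ):

```python
# KV cache per token: 2 (K and V) x layers x KV heads x head_dim x bytes per element.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
fp16 = kv_bytes_per_token(32, 8, 128, 2)    # ~128 MB per 1,000 tokens
q8   = kv_bytes_per_token(32, 8, 128, 1)    # ~64 MB per 1,000 tokens
q4   = kv_bytes_per_token(32, 8, 128, 0.5)  # ~32 MB per 1,000 tokens

def max_context_tokens(vram_gb: float, model_gb: float, kv_bytes: float,
                       overhead_gb: float = 1.0) -> int:
    """Tokens of context that fit after the weights and runtime overhead."""
    free = (vram_gb - model_gb - overhead_gb) * 1024**3
    return int(free / kv_bytes)

print(max_context_tokens(8, 4.8, q8))    # ~36K, matching the 8 GB row in the table below
print(max_context_tokens(24, 4.8, q8))   # well past the 128K trained limit
```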

Trained context windows cap the usable range regardless of VRAM:

Model Trained context Notes
Llama 3.1 / 3.3 128K Best open 70B; hard ceiling at 128K
Qwen3 128K YaRN-extended; quality degrades past training range
Gemma 4 128K
Mistral Nemo 128K
Gemini 2.5 Pro (API) 1,000K 8x the best open model
Claude Opus 4.6 / Sonnet 4.6 (API) 1,000K 8x the best open model
GPT-4.1 (API) 1,000K 8x the best open model
Grok 4.20 (xAI API) 2,000K 16x the best open model

Every open model stops at 128K. All four frontier API providers — Anthropic, Google, OpenAI, and xAI — now offer 1M-2M token context windows. For workloads that routinely exceed 128K — codebase analysis, long document chains, agent sessions — this gap is not solvable with more VRAM. The model simply was not trained for it.

All context numbers below use Q8 KV cache (recommended — negligible quality loss vs FP16, half the VRAM). Q4 KV doubles these limits but degrades retrieval accuracy. Models at Q4_K_M weights throughout.

VRAM Model Max context (Q8 KV) Max context (Q4 KV) Notes
8 GB 8B ~36K ~73K Minimum useful setup
14B+ Does not fit
12 GB 8B ~112K 128K (trained limit) 128K with Q4 KV only
14B ~43K ~86K
16 GB 8B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
14B ~93K 128K (trained limit)
24 GB 8B 128K (trained limit) 128K (trained limit) Full context
14B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
32B ~35K ~70K Short context only
48 GB 14B 128K (trained limit) 128K (trained limit) Full context
32B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
70B ~46K ~92K Short-to-medium context
80 GB 32B 128K (trained limit) 128K (trained limit) Full context
70B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
96 GB 70B 128K (trained limit) 128K (trained limit) Full context; best single-GPU for 70B
128 GB (Strix Halo / CPU) 70B 128K (trained limit) 128K (trained limit) Full context; ~4.5 t/s (slow)

KV cache at Q8 (per 1K tokens): 8B = 64 MB, 14B = 80 MB, 32B = 128 MB, 70B = 160 MB. Q4 halves these. 1 GB of headroom is reserved for runtime overhead.

For CPU/RAM inference, the same arithmetic applies but "VRAM" becomes "available RAM after OS and model." A 64 GB DDR4 system running 8B Q4 (4.7 GB model + OS overhead) has roughly 55 GB available for KV cache — same capacity as a 48 GB GPU, but 10-15x slower due to DDR4 bandwidth (~42 GB/s vs GPU's 900+ GB/s).

Context rot: accuracy degrades with length

Even when the VRAM fits, model accuracy drops as context grows. The Chroma 2025 study tested 18 models and found 20-50% accuracy degradation from 10K to 100K tokens across every model tested, including frontier APIs. Open models degrade faster.

The worst-case position is the middle of the context window. Beginning and end receive stronger attention (the "lost in the middle" effect, Liu et al. 2024). For retrieval tasks where the target information can appear anywhere, this creates systematic blind spots.

On harder retrieval tasks requiring ordered extraction across long context (Sequential-NIAH, arXiv 2504.04713, EMNLP 2025), the best model tested scored 63.5%. Standard "find the passkey" benchmarks show 90-100% at 128K; the realistic number for semantic retrieval is 50-65%. This is a model capability ceiling, not a hardware limitation.

Non-literal retrieval is worse still. The NoLiMa benchmark (ICML 2025) tested 13 LLMs on tasks requiring semantic similarity matching rather than exact text lookup. Open-source models showed inverted U-shaped performance curves beyond critical context thresholds — accuracy degrades substantially when the task requires reasoning about meaning rather than matching strings.

KV cache quantization compounds these losses. FP16 and Q8 KV cache produce no measurable retrieval degradation. Q4 KV cache adds detectable accuracy loss on top of context rot, particularly on retrieval tasks. For any workload where finding information in context matters, Q8 KV cache is the minimum — Q4 saves VRAM at the cost of making the context less reliable.

TurboQuant: KV cache compression (ICLR 2026)

TurboQuant (Google Research, ICLR 2026) compresses KV cache to 2.5-3.5 bits per channel with near-zero quality loss. This is not weight quantization — it is complementary to Q4_K_M model weights. You can run a model at Q4 weights AND TurboQuant 3-bit KV cache simultaneously.

KV cache method Bits/channel LongBench score Needle retrieval VRAM per 1K tokens (8B model)
FP16 (baseline) 16 50.06 0.997 128 MB
Q8 (current standard) 8 ~50.06 ~0.997 64 MB
TurboQuant 3.5 50.06 0.997 ~28 MB
TurboQuant 2.5 49.44 ~20 MB
Q4 (current) 4 degrades degrades 32 MB
KIVI 3-bit 3 48.50 0.981 ~24 MB

At 3.5 bits, TurboQuant matches FP16 quality exactly — identical LongBench score, identical needle retrieval. At 2.5 bits, it outperforms KIVI at 3 bits. The method is data-oblivious: no calibration data, no per-model tuning. It works by applying a random orthogonal rotation that makes vector coordinates near-independent, then using mathematically optimal scalar quantizers.

If TurboQuant 3-bit KV becomes standard, the VRAM context table above roughly doubles: an 8B model on 24 GB at 3-bit KV would reach 128K comfortably instead of needing Q8 at 128K. A 70B model on 96 GB at 3-bit KV would push past 200K context.

Status (April 2026): No official Google implementation. Community ports exist for llama.cpp (TQ3_0 format, not merged), vLLM (Triton kernels, not merged), and PyTorch/HuggingFace. None are in mainline frameworks yet. Official integration expected Q2-Q3 2026.

Hardware cost to match API context

API capability Best local equivalent Hardware cost What you actually get
GPT-4.1 at 128K (A- quality) 8B Q4 on RTX 4090 ~EUR 1,800 128K context but C+ quality — 2 tiers below
GPT-4.1 at 128K (A- quality) 70B Q4 on 2x RTX 4090 ~EUR 3,500 128K context at B+ quality — 1 tier below
Claude 1M (A+ quality) 70B Q4 on A100 80GB ~EUR 15,000 Capped at 128K (872K shorter), B+ quality
Claude 1M (A+ quality) Nothing No open model trained past 128K
Gemini 1M (A+ quality) Nothing Not achievable locally at any price

The gap is stark beyond 128K. No combination of hardware and open models reaches 200K context. All four frontier API providers now offer 1M-2M at A+ quality. Self-hosted tops out at 128K at B+ quality. For workloads that need the full context — codebase analysis, long document processing, extended agent sessions — APIs are the only option.

Working around the 128K ceiling

Several techniques exist to process more data than fits in a single context window. None of them are transparent substitutes for native long context.

RAG (Retrieval-Augmented Generation) is the most production-ready approach. Split documents into chunks, embed them in a vector database, retrieve only the relevant chunks per query. Quality reaches 70-85% of native long context for retrieval tasks (EMNLP 2024), but fundamentally cannot do cross-document reasoning — connecting a clause on page 40 to a definition on page 3 requires both to be in context simultaneously. RAG turns the problem from "LLM reasons over all data" into "retrieval system selects data, LLM reasons over selection." The retrieval quality becomes the ceiling.
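A minimal RAG retrieval sketch; the embedding model, chunk size, and file name are illustrative choices, not recommendations from this article:

```python
# Minimal RAG retrieval: chunk, embed, retrieve top-k by cosine similarity.
# Only the retrieved chunks go into the LLM context, so the corpus can be
# arbitrarily large while the prompt stays under the 128K limit.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 22M-parameter example model

def chunk(text: str, size: int = 1000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

corpus = chunk(open("contract.txt").read())          # placeholder document
corpus_vecs = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 5) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q                          # cosine similarity (normalized vectors)
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n\n".join(retrieve("What is the termination notice period?"))
# `context` is then prepended to the prompt sent to the local or API model.
```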

Context compression (LLMLingua, Microsoft) removes unimportant tokens to fit more into the same window. At 2-4x compression, quality loss is minimal. At 10-20x, degradation is measurable. A 128K window with 4x compression gives roughly 512K effective tokens — useful, but still below 1M native context and with quality loss on compressed portions.

Map-reduce / chunking processes each chunk independently, then combines results. Works for summarization (60-80% quality). Fails for reasoning that requires cross-chunk information.

Hierarchical summarization — summarize chunks, then summarize summaries — works for gist but fails for precision. Each summarization round is lossy, and hallucination amplification compounds: a fabricated fact in round 2 becomes "established fact" in round 3.

StreamingLLM solves a different problem. It enables continuous generation over arbitrarily long streams by preserving "attention sink" tokens, but the model can only see the current window. Information outside the window is gone. Useful for long chat sessions, not for document analysis.

Technique Production ready? Quality vs native long context Real substitute?
RAG Yes 70-85% (retrieval) / poor (reasoning) Partial — good for search, poor for synthesis
LLMLingua compression Yes 85-95% at 2-4x compression Partial — extends window modestly
Map-reduce / chunking Yes 60-80% depending on task No — loses cross-chunk connections
Hierarchical summarization Yes (for summaries) 50-70% for detail; good for gist No — lossy compression compounds
StreamingLLM Yes (for chat) N/A (different problem) No — does not extend reasoning context

The LaRA benchmark (ICML 2025) tested 2,326 cases across multiple tasks and concluded there is no silver bullet — the best approach depends on task type, model capabilities, and retrieval characteristics. One concrete data point: switching from chunked medical records to full-context patient histories improved diagnostic accuracy by 23% because the model could see temporal patterns across years of data.

The "lost in the middle" problem (Liu et al., TACL 2024) complicates even native long context: models perform best when relevant information is at the beginning or end of the context, with 30%+ performance degradation for information in the middle. This affects all models, including those designed for long context.

For self-hosters, the practical hierarchy is: if the workload fits within 128K, local models are competitive. If it needs 128K-512K, RAG or compression can partially bridge the gap with measurable quality penalties. Beyond 512K, APIs with native 1M context are the only reliable option.

API pricing (April 2026)

API pricing dropped roughly 80% since early 2025. Budget-tier models now cost $0.02-0.40 per million tokens.

Tier Model Input $/MTok Output $/MTok Context Quality
Budget Mistral Nemo $0.02 $0.04 128K C+
Budget GPT-4.1-nano $0.05 $0.20 1M C+
Budget GPT-4.1-mini $0.16 $0.64 1M B-
Mid DeepSeek V3.2 $0.28 $0.42 128K B+
Mid Gemini 2.5 Flash $0.15 $3.50 1M B+
Mid GPT-4.1 $2.00 $8.00 1M A-
Mid Grok 4.1 Fast $0.20 $0.50 2M B+
Mid Grok 4.20 $2.00 $6.00 2M A
Mid Claude Haiku 4.5 $1.00 $5.00 200K A-
Frontier Claude Sonnet 4.6 $3.00 $15.00 1M A+
Frontier Gemini 2.5 Pro $1.25 $10.00 1M A+
Frontier Claude Opus 4.6 $5.00 $25.00 1M S
Frontier OpenAI o3 $2.00 $8.00 200K S (reasoning)

Subscription tiers exist for heavy users: Claude Pro at $20/month, Claude Max at $100-200/month with higher rate limits. Google, OpenAI, and others have similar tiered access. The per-token rates above are the pay-as-you-go ceiling, not the effective cost for regular users.

Cached input pricing drops costs further. Claude cache hits cost 0.1x the base rate. Gemini 2.5 Flash cached input is $0.03/MTok. Workloads with repeated context (RAG, agent loops) benefit from caching.

The cost comparison

Self-hosted at 100% utilization

Best case for self-hosting: a GPU running batch inference around the clock.

Setup Model Monthly cost (3yr amort.) Tokens/month EUR/MTok Quality
RTX Pro 6000 70B vLLM batched ~EUR 380 ~2.7B EUR 0.14 B/B+
RTX Pro 6000 8B vLLM batched ~EUR 380 ~23B EUR 0.017 C+
2x RTX 3090 8B single-stream ~EUR 130 ~0.27B EUR 0.48 C+
Strix Halo 70B single-stream ~EUR 90 ~13M EUR 6.9 B/B+

All costs in EUR. RTX Pro 6000 at ~$8,500 = ~EUR 7,800 at April 2026 exchange rates. 3-year amortization includes hardware + electricity at EUR 0.15/kWh.

At 100% utilization, self-hosted 8B on an RTX Pro 6000 (EUR 0.017/MTok) undercuts Mistral Nemo API pricing (EUR 0.037/MTok) while running the model locally. Self-hosted 70B at EUR 0.14/MTok undercuts DeepSeek V3.2 API (EUR 0.39/MTok) for comparable B+ output.
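The per-token economics in this table and the utilization table below reduce to one formula. A sketch under similar assumptions (3-year amortization, EUR 0.15/kWh, 720W system draw); the exact monthly cost estimate differs slightly from the tables, but the curve is the same:

```python
# EUR per million tokens for a self-hosted GPU, as a function of utilization.
def eur_per_mtok(hardware_eur: float, system_watts: float, tokens_per_sec: float,
                 utilization: float, eur_per_kwh: float = 0.15,
                 amort_months: int = 36) -> float:
    hours_per_month = 730
    amort = hardware_eur / amort_months                                # hardware share
    power = system_watts / 1000 * hours_per_month * eur_per_kwh       # electricity, always on
    tokens_m = tokens_per_sec * 3600 * hours_per_month * utilization / 1e6
    return (amort + power) / tokens_m

# RTX Pro 6000 (~EUR 7,800), 70B batched at ~1,031 t/s
for u in (1.0, 0.5, 0.2, 0.1):
    print(u, round(eur_per_mtok(7800, 720, 1031, u), 2))
# -> roughly 0.11, 0.22, 0.55, 1.09 with these assumptions
```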

The utilization trap

Nobody runs a single-operator GPU at 100%. You ask questions during working hours. The GPU idles overnight, on weekends, during meetings. Realistic utilization for one person or a small team: 10-30%.

Utilization RTX Pro 6000 70B EUR/MTok vs DeepSeek V3.2 API (EUR 0.39)
100% EUR 0.14 Self-hosted 2.8x cheaper
50% EUR 0.28 Self-hosted 1.4x cheaper
20% EUR 0.70 API 1.8x cheaper
10% EUR 1.40 API 3.6x cheaper

At 20% utilization, self-hosted 70B costs EUR 0.70/MTok for B+ quality output. The same money buys Gemini 2.5 Flash at B+ quality via API, with no hardware to maintain and 1M token context.

The GPU does not know you went to lunch. It depreciates whether it generates tokens or not.

Hardware depreciation

The RTX Pro 6000 costs ~EUR 7,800 today. When NVIDIA ships Rubin (next-generation architecture, expected ~2027), the Pro 6000 will be worth roughly EUR 3,700-4,600. Three years out: EUR 2,300-2,800. GPU depreciation runs 40-60% over 2-3 years.

API subscriptions depreciate at 0%. Every month brings the latest models. No replacement planning, no resale.

The lease option (EUR 500/month for 18 months, EUR 9,000 total) costs more than buying outright.

Quantization

Quantization compresses model weights to use less memory and run faster, at some quality cost.
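The file sizes in the tables below follow directly from bits per weight; a sketch:

```python
# Approximate on-disk / in-VRAM size of a quantized model.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bytes -> GB

print(model_size_gb(70, 16))    # ~140 GB  FP16
print(model_size_gb(70, 4.8))   # ~42 GB   Q4_K_M (table says ~40 GB; K-quants mix block types)
print(model_size_gb(8, 4.8))    # ~4.8 GB  close to the ~4.7 GB quoted for Llama 3.1 8B
print(model_size_gb(70, 1.58))  # ~14 GB   hypothetical 70B ternary model
```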

70B model

Quantization Bits/weight 70B model size Quality vs FP16 Speed vs FP16 Minimum GPU
FP16 (full precision) 16.0 ~140 GB 100% 1.0x (baseline) 2x H100 SXM
Q8_0 8.0 ~70 GB ~99% ~1.8-2.0x 1x RTX Pro 6000
Q6_K 6.5 ~54 GB ~99% ~2.3-2.5x 1x RTX Pro 6000
Q5_K_M 5.5 ~48 GB ~96-97% ~2.5-3.0x 1x RTX Pro 6000
Q4_K_M (common default) 4.8 ~40 GB ~96-99% (model-dependent) ~3.0-3.5x 1x RTX Pro 6000
Q3_K_M 3.4 ~33 GB ~90-95% ~3.5-4.0x 2x RTX 3090
Q3_K_S 3.0 ~28 GB ~85-92% ~4.0-4.5x 2x RTX 4060 Ti 16GB
Q2_K (emergency) 2.7 ~27 GB ~75-85% ~4.0-5.0x 1x RTX 5090
AQLM 2-bit (trained codebooks) 2.0 ~18 GB ~95-98% GPU-only 1x RTX 4090
BitNet 1.58-bit (hypothetical) 1.58 ~14 GB unknown at 70B CPU-native 1x RTX 3060 (no model exists)

8B model (Llama 3.1 8B)

Quantization Bits/weight 8B model size Quality vs FP16 Minimum GPU
FP16 16.0 ~16 GB 100% 1x RTX 4090 (24 GB)
Q8_0 8.0 ~8.5 GB ~99% 1x RX 5700 XT (8 GB, tight)
Q6_K 6.5 ~6.3 GB ~99% 1x RX 5700 XT (8 GB)
Q5_K_M 5.5 ~5.7 GB ~98-99% 1x RX 5700 XT (8 GB)
Q4_K_M (common default) 4.8 ~4.7 GB ~96-99% 1x GTX 1060 6GB
Q3_K_M 3.4 ~3.7 GB ~85-90% 1x GTX 1050 Ti (4 GB, tight)
Q2_K 2.7 ~3.1 GB ~75-85% 1x GTX 1050 Ti (4 GB)

Gemma 4 26B-A4B (MoE)

Mixture-of-experts models store all 26B parameters but only activate 4B per token. Storage follows total parameter count; inference speed follows active parameter count.

Quantization Total size (26B stored) Active per token (4B) Minimum GPU Effective speed
FP16 ~52 GB ~8 GB 1x RTX Pro 6000 Baseline
Q8_0 ~26 GB ~4 GB 1x RTX 5090 (32 GB) ~2x
Q4_K_M ~16 GB ~2.4 GB 1x RTX 4060 Ti 16GB ~3-3.5x
Q2_K ~8 GB ~1.3 GB 1x RX 5700 XT (8 GB) ~4-5x, quality loss

At Q4_K_M, the full Gemma 4 MoE fits on a 16 GB card with room for KV cache, while only reading ~2.4 GB of active weights per token. That makes it extremely fast: the RTX Pro 6000 hits 252 t/s on a similar MoE architecture (Qwen3 30B-A3B).

What quantization actually costs in quality

The degradation curve has two phases. From Q8 through Q4_K_M, quality loss is measurable but rarely perceptible. Below Q4, quality collapses non-linearly.

Quant Size reduction Perplexity loss Benchmark avg (Llama 3.1 8B) Verdict
Q8_0 47% +0.01% 69.41 (vs 69.47 FP16) Indistinguishable from FP16
Q6_K 59% +0.06% 69.23 Negligible loss
Q5_K_M 64% +0.2% 69.36 Blind testers cannot distinguish from FP16
Q4_K_M 69% +0.7% 69.15 Recommended default — 0.32 points below FP16
Q3_K_M 75% +3.3% 68.07 Usable but noticeable on reasoning tasks
Q3_K_S 77% +5.7% 65.49 Math drops from 77.6 to 68.3 (GSM8K)
Q2_K 82% +22% Extreme quality loss; perplexity climbs non-linearly
IQ2_XXS 85% +18% Catastrophic on small models (7% accuracy vs 42% at Q4)

Sources: llama.cpp Discussion #2094 (canonical perplexity), arxiv 2601.14277 (full benchmark table, Jan 2025), GFMath (IQ2_XXS catastrophe on ~70 models).

A blind test with 500+ votes on Mistral 7B confirmed: human evaluators cannot distinguish Q5_K from FP16, but consistently identify IQ1_S as worse.

What the quality loss looks like in practice: At Q8 and Q5, output is indistinguishable from full precision — same word choices, same reasoning chains, same code. At Q4_K_M, output reads identically on most prompts; edge cases in math and multilingual tasks occasionally produce different (slightly worse) answers. At Q3, reasoning tasks start showing wrong intermediate steps — the model "almost" gets it but takes a wrong turn more often. At Q2_K and below, output visibly degrades: repetitive phrasing, lost coherence on longer responses, math errors on problems the full-precision model solves correctly, and noticeably worse code generation. The blind test data matches: humans cannot spot the difference until around Q3, and confidently identify degradation at IQ1_S.

Tasks hurt differently. Coding and STEM degrade most from quantization (IJCAI 2025). Multilingual loses 15-20% at Q4 (ionio.ai). Instruction following (IFEval) drops >10% at Q4 on some model families. Red Hat's 500K evaluations on Llama 3.1 show >99% recovery at 8-bit and 96-99% at 4-bit with calibrated methods.

Model family matters: Qwen2.5 is the most quantization-tolerant family. LLaMA 3.3 is the most fragile — not recommended at Q4 or below.

Bigger quantized beats smaller full precision

If VRAM is the constraint, always choose the larger model at lower precision over a smaller model at full precision. The data is clear:

Configuration Perplexity VRAM Model family
13B at Q4 5.41 ~8 GB Llama 1
7B at FP16 5.96 ~14 GB Llama 1
65B at Q2_K 4.10 ~20 GB Llama 1
33B at FP16 4.16 ~66 GB Llama 1

These numbers are from the original Llama 1 family (llama.cpp canonical perplexity data). The principle holds across newer model families, though quantization tolerance varies: Qwen2.5 is the most tolerant, LLaMA 3.3 the most fragile.

The 13B at Q4 uses half the VRAM of the 7B at FP16 and produces better output. The 65B at Q2_K matches the 33B at FP16 in one-third the memory. Parameter count beats precision until you hit extreme quantization (below Q3), where both advantages erode.

All four major 4-bit methods (GPTQ, AWQ, EXL2, GGUF Q4_K_M) achieve nearly identical quality (perplexity 4.31-4.36 on Llama-2-13B). The differences are speed (EXL2 fastest) and VRAM efficiency (GPTQ most efficient), not quality.

Extreme quantization: ternary weights and sub-2-bit

All of the above methods are post-training quantization, compressing an FP16 model after it has been trained. A different approach trains the model with quantized weights from the start.

BitNet b1.58 (Microsoft Research / Tsinghua University) constrains every weight to one of three values: {-1, 0, +1}. That is log2(3) = 1.58 bits per weight. Multiplications become additions. A 2B parameter model fits in 0.4 GB instead of ~4 GB at FP16, roughly 10x smaller.

The only production-quality open model is bitnet-b1.58-2B-4T (2 billion parameters, 4 trillion training tokens, Apache 2.0 license). Community models exist at 8B but are experimental.

Model MMLU (5-shot) Size Speed (i7-13800H) EUR/MTok (15W laptop)
BitNet b1.58 2B 53.17 0.4 GB ~34 t/s EUR 0.006
Llama 3.2 1B (FP16) 49.30 ~2.5 GB ~21 t/s EUR 0.010
Qwen2.5 1.5B (FP16) 60.25 ~3 GB ~15 t/s EUR 0.014

The energy savings are real: 71-82% less energy than FP16 on x86, 55-70% on ARM. These numbers come from Microsoft's bitnet.cpp paper and have been verified across multiple sources. The inference framework (bitnet.cpp) is production-ready on CPU, with GPU support added in May 2025.

There is one catch that matters for anyone evaluating this: speed and energy benefits only appear when using bitnet.cpp. Loading the model in standard HuggingFace transformers gives you the smaller memory footprint but none of the ternary kernel speedups. You cannot use Ollama, LM Studio, or any other llama.cpp frontend for native BitNet inference as of April 2026.

Post-training methods can also push below 2 bits. AQLM (Yandex/IST Austria, ICML 2024) compresses Llama 2 7B to 2.02 bits with a WikiText2 perplexity of 6.59, versus 5.47 at FP16, roughly 2-5% task quality degradation. QTIP (NeurIPS 2024 Spotlight) achieves similar quality at 2-bit with >3x inference speedup. These are GPU-only methods: they reduce VRAM usage, not power draw.

Does extreme quantization change the economics?

At first glance, 10x smaller models should flip the self-hosting math. A 2B model in 0.4 GB running on a 15W laptop at 25 tokens per second costs about EUR 0.006 per million tokens in electricity, 6x cheaper than Mistral Nemo API.

The problem is quality. BitNet 2B scores MMLU 53. Mistral Nemo 12B scores around 70. These are not the same tier of model. A 2B model, regardless of quantization method, cannot follow complex instructions, write useful code, or maintain coherence over long conversations. Saving EUR 0.03 per million tokens while getting dramatically worse output is not a savings.

The real promise is running larger models on cheaper hardware. A hypothetical 70B BitNet model would need ~18 GB instead of 140 GB, fitting on a single RTX 3090. But the model zoo has not caught up to the technique. Microsoft's verified model is 2B. PrismML released Bonsai 8B in March 2026, claiming 1.15 GB footprint and 368 t/s on an RTX 4090, but these are vendor claims and independent benchmarks are still sparse. Until larger ternary models are independently verified, extreme quantization changes what is theoretically possible without changing what you can actually run today.

Software

Running a model locally in 2026 is straightforward. The tooling is good.

Ollama wraps llama.cpp behind a REST API with model management. Install, pull a model, start chatting.
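A minimal call against Ollama's local REST API; the model tag is an example, any pulled model works:

```python
# Ollama exposes llama.cpp behind a local REST API on port 11434.
# Pull a model first:  ollama pull llama3.1:8b
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",      # example model tag
        "prompt": "Summarize: the GPU idles overnight and depreciates anyway.",
        "stream": False,             # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])
```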

llama.cpp is the foundation. CUDA, Vulkan, Metal, CPU with AVX-512/AMX. GGUF quantization format.

ik_llama.cpp is an optimized fork of llama.cpp with rewritten CPU kernels. 1.8-5.2x faster prompt processing on AVX2/AVX-512 CPUs. Same GGUF models, drop-in replacement. Use this instead of mainline llama.cpp for CPU-bound workloads. Benchmarks in the CPU comparison section.

vLLM handles concurrent serving with PagedAttention and continuous batching. Use this when serving multiple users from one GPU.

bitnet.cpp is required for native ternary inference. Separate from llama.cpp. CPU-primary, with CUDA support since May 2025. Requires Clang 18+ to build.

The cheapest hardware myth

A common misconception: "I already own the hardware, so inference is free." It is not. Electricity costs money, and consumer GPUs draw 230-415W under inference load. Even ignoring the purchase price entirely, the electricity to generate tokens locally costs more per token than calling a budget API.

Proof: idle desktop CPU

Take an Intel i5-7500T in an HP ProDesk 400 G3 Desktop Mini. A machine that costs EUR 150 used and sits idle most of the day. DDR4-2400 dual-channel gives ~38 GB/s theoretical memory bandwidth. At 50% efficiency with 4 threads, that translates to roughly 4-5 tokens per second on an 8B Q4 model.

The system draws about 22W idle and 45W under inference load. The marginal 23W for inference costs EUR 2.48 per month in electricity at Finnish rates (EUR 0.15/kWh).
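The per-token electricity figures in the table below come from one line of arithmetic; a sketch (this CPU section uses marginal watts, the GPU tables later use full system watts, the formula is the same):

```python
# Electricity cost per million tokens from power draw and generation speed.
def electricity_eur_per_mtok(watts: float, tokens_per_sec: float,
                             eur_per_kwh: float = 0.15) -> float:
    hours_per_mtok = 1e6 / tokens_per_sec / 3600
    return watts / 1000 * hours_per_mtok * eur_per_kwh

print(electricity_eur_per_mtok(23, 4.5))   # i5-7500T marginal watts, 8B Q4  -> ~0.21 EUR/MTok
print(electricity_eur_per_mtok(23, 40))    # i5-7500T marginal watts, 0.6B   -> ~0.024 EUR/MTok
```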

Model Size on disk Tokens/sec (est.) Tokens/month (24/7) EUR/MTok (electricity only) vs Mistral Nemo API
Qwen3 0.6B Q4 ~0.4 GB ~40 103.7M EUR 0.024 1.5x cheaper, but trivial tasks only
BitNet b1.58 2B (ternary) 0.4 GB ~15-25 51.8M EUR 0.048 1.3x more expensive, D-tier quality
TinyLlama 1.1B Q4 ~0.7 GB ~28 72.6M EUR 0.034 Break-even, basic completion only
Bonsai 8B (1-bit, unverified) 1.15 GB ~15 38.9M EUR 0.064 1.7x more expensive, claims 8B quality
Phi-4 Mini 3.8B Q4 ~2.2 GB ~7.6 19.7M EUR 0.126 3.4x more expensive
Llama 3.1 8B Q4 ~4.7 GB ~4-5 11.7M EUR 0.212 5.7x more expensive

BitNet b1.58 2B runs via bitnet.cpp (not llama.cpp). The ternary weights reduce memory reads, but the i5-7500T's 4 cores with AVX2-only still cap throughput around 15-25 t/s. Bonsai 8B speed is estimated from memory bandwidth; PrismML's claimed 368 t/s is on an RTX 4090 GPU. Independent CPU benchmarks for Bonsai are sparse as of April 2026.

At the same C+ quality tier (8B model), the API costs EUR 0.037 per million tokens. The i5-7500T costs EUR 0.212. The API is 5.7 times cheaper on electricity alone, before counting the computer's purchase price.

The extreme quantization models do not change the math. BitNet 2B at EUR 0.048/MTok is 1.3x more expensive than the API and produces D-tier output (MMLU 53 vs Mistral Nemo's ~70). The only models where electricity beats the API are sub-1B parameter models at Q4, and their output quality is so far below Mistral Nemo that the comparison is meaningless.

CPU comparison: DDR4 to DDR5

The i5-7500T is the cheapest case. How do faster CPUs compare? Memory bandwidth is the bottleneck — more channels, faster memory, more tokens per second.

CPU Memory Bandwidth (theoretical) Cores System idle/load Llama 3.1 8B Q4 t/s (est.) EUR/MTok (marginal electricity)
Intel N100 DDR5-4800 1-ch 38.4 GB/s 4E 8W / 16W ~3 EUR 0.111
Intel i3-N305 DDR5-4800 1-ch 38.4 GB/s 8E 10W / 22W ~3.5 EUR 0.143
Intel i5-7500T DDR4-2400 2-ch 38.4 GB/s 4C/4T 22W / 45W ~4-5 EUR 0.212
Intel i5-8500T DDR4-2666 2-ch 42.7 GB/s 6C/6T 22W / 48W ~5-6 EUR 0.180
AMD Ryzen 5 5600X DDR4-3200 2-ch 51.2 GB/s 6C/12T 45W / 95W ~6-7 EUR 0.297
AMD Ryzen 9 5900X DDR4-3200 2-ch 51.2 GB/s 12C/24T 55W / 120W ~6-8 EUR 0.338
AMD Ryzen 9 7950X DDR5-5200 2-ch 83.2 GB/s 16C/32T 65W / 145W ~10-12 EUR 0.277

Electricity rate: EUR 0.15/kWh. EUR/MTok calculated from marginal power (load minus idle) at estimated token rate, 24/7 operation. N100 and i3-N305 are single-channel memory, which caps bandwidth despite DDR5 speeds. The Ryzen 7950X is 2-3x faster than the i5-7500T thanks to DDR5 dual-channel (83 GB/s vs 38 GB/s).

Every CPU in this table costs 3x to 9x more per token in electricity than the Mistral Nemo API (EUR 0.037/MTok). The fastest desktop CPU tested (Ryzen 7950X) still costs 7.5x more. More cores do not help — dual-channel DDR4 or DDR5 memory bandwidth is the wall.

One exception to the "llama.cpp is llama.cpp" assumption: ik_llama.cpp, an optimized fork, achieves 1.8-5.2x faster prompt processing than mainline llama.cpp on the same hardware. Benchmarks on a Ryzen 7950X (16 threads, AVX2):

Quantization ik_llama.cpp (t/s) llama.cpp (t/s) Speedup
BF16 256.9 78.6 3.3x
Q8_0 268.2 147.9 1.8x
Q4_0 273.5 153.5 1.8x
IQ3_S 156.5 30.2 5.2x
Q8_K_R8 370.1 N/A new format

These are prompt processing (prefill) speeds, not generation. Generation improves a more modest 1.0-2.1x. The key finding: on the Ryzen 7950X, the slowest quantization type in ik_llama.cpp is faster than the fastest type in mainline llama.cpp for prompt processing. For CPU inference workloads that involve large prompts — document summarization, RAG extraction, batch classification — the fork makes a material difference. It does not change the electricity cost comparison (the watts are the same), but it reduces wall-clock time per job, which matters for throughput-limited batch work.

Proof: every consumer GPU loses

The pattern holds across dedicated GPUs. Inference power draw runs about 55-75% of the card's TDP (the decode phase is memory-bandwidth-bound, not compute-bound). Add 120W for the rest of the system.

GPU VRAM Llama 3.1 8B Q4 tok/s System watts EUR/MTok (electricity) vs API
RX 5700 XT 8 GB ~55 (Vulkan, no ROCm) 266W EUR 0.201 5.4x
RX 6700 XT 12 GB ~45 270W EUR 0.250 6.8x
RX 6900 XT 16 GB ~60 315W EUR 0.219 5.9x
RTX 3060 12 GB ~42 230W EUR 0.228 6.2x
RTX 3090 24 GB ~87 345W EUR 0.165 4.5x
RTX 4060 Ti 16GB 16 GB ~48 227W EUR 0.197 5.3x
RTX 4090 24 GB ~128 413W EUR 0.134 3.6x
RX 7900 XTX 24 GB ~80 351W EUR 0.183 4.9x

API baseline: Mistral Nemo at EUR 0.037/MTok ($0.04/MTok output). Electricity rate: EUR 0.15/kWh. RX 5700 XT runs Vulkan only (ROCm dropped Navi 10 support); speed scaled from llama.cpp Vulkan scoreboard data.

Every consumer GPU costs 3.6 to 6.8 times more in electricity per token than the API. The RTX 4090, the fastest consumer card, still costs 3.6x more. At German electricity rates (EUR 0.30/kWh), it costs 7.2x more.

The ex-mining AMD cards are fast for their price but still lose to the API. The RX 5700 XT draws 266W system power for ~55 tokens per second — its wide 256-bit memory bus (448 GB/s) makes it the fastest sub-$200 used card. The RX 6900 XT draws 315W for ~60 t/s: 18% more power for 9% more speed over the 5700 XT, because both share a 256-bit bus and differ mainly in clock speed.

For the best consumer GPU (RTX 4090) to break even with Mistral Nemo API on electricity alone, the electricity rate would need to be EUR 0.041/kWh. That is below any residential rate in Europe.

Enterprise GPUs: faster but still losing

Enterprise GPUs have HBM memory with 2-4x the bandwidth of consumer GDDR. They are faster per token but draw more power and sit in servers with higher base consumption. System power assumes +200W for a server chassis (vs +120W for a desktop).

GPU VRAM Bandwidth 8B Q4 tok/s (est.) System watts EUR/MTok (electricity) vs API Purchase price (Apr 2026 est.)
L40S 48 GB GDDR6 864 GB/s ~110 550W EUR 0.208 5.6x $7,500-9,000
A100 40GB PCIe 40 GB HBM2e 1,555 GB/s ~195 450W EUR 0.096 2.6x $5,000-8,000 (used)
A100 80GB SXM 80 GB HBM2e 2,039 GB/s ~260 600W EUR 0.096 2.6x $12,000-18,000
RTX Pro 6000 96 GB GDDR7 1,792 GB/s ~279 (7B bench) 720W EUR 0.107 2.9x $8,000-9,200
H100 SXM 80 GB HBM3 3,352 GB/s ~400 900W EUR 0.094 2.5x $22,000-30,000

The H100, the fastest GPU in this comparison, still costs 2.5x more per token in electricity than the API. The L40S is worse than consumer GPUs per token because its 864 GB/s bandwidth is slower than the RTX 4090's 1,008 GB/s but draws more system power in a server chassis.

Enterprise GPUs close the gap but do not flip the math. Where they earn their price is batched serving: an H100 running vLLM with 32 concurrent users achieves aggregate throughput that amortizes the power draw across requests. Single-stream single-user inference, which is what this table measures, is the worst case for expensive hardware.

What this means

Hardware cost, depreciation, driver maintenance, cooling, and noise all come on top of electricity. If electricity alone already loses to the API, the total cost is not getting better.

Self-hosting has real advantages: privacy, zero network latency, offline capability, fine-tuning. Those are legitimate reasons to run local inference. "It is cheaper" is not one of them.

Fine-tuning: where self-hosting wins decisively

Fine-tuning is the strongest capability advantage for self-hosting. A fine-tuned 7B model routinely beats a generic 70B+ model on the specific task it was trained for (arxiv 2406.08660, Stanford). Multi-task fine-tuned Phi-3-Mini (3.8B) surpassed GPT-4o on financial benchmarks (ACM/OpenReview). A fine-tuned 350M model beat ChatGPT by 3x on structured tool calling.

What you need

Component Minimum Recommended
Dataset 50-100 examples (classification) 500-2,000 examples (generation tasks)
GPU (QLoRA) RTX 4060 Ti 16 GB (up to 7B) RTX 4090 24 GB (up to 14-20B)
GPU (full fine-tune) A100 80 GB (7B) 4x A100 (32B)
Time per run 1-2 hours (7B, QLoRA) 3-6 hours (14B, QLoRA)
Framework LLaMA-Factory (beginner, web UI) Unsloth (2x speed, 70% less VRAM)

QLoRA is the key: it reduces VRAM requirements by 4-8x versus full fine-tuning with minimal quality loss. A 7B model that needs 60-80 GB for full fine-tuning needs only 5-10 GB with QLoRA. Quality is within 1-2% of full fine-tuning on most tasks.
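A hedged QLoRA setup sketch using the Hugging Face stack; the model name, rank, and target modules are illustrative, and the exact trainer wiring depends on the transformers/peft versions installed:

```python
# QLoRA: load the base model in 4-bit (NF4), freeze it, and train small
# LoRA adapter matrices on top. Only the adapters receive gradients.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"   # example base model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of total parameters
# Training then proceeds with a standard Trainer / SFTTrainer over the curated dataset;
# the saved adapter is small and can later be merged into the base weights.
```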

500 carefully curated examples outperform 5,000 noisy ones. The LIMA paper demonstrated that 1,000 well-crafted examples could produce a competitive instruction-following model. Quality of training data matters more than quantity.

API fine-tuning is limited

Provider Models available for fine-tuning Export weights? Pricing
OpenAI GPT-4.1, 4.1 Mini, 4o Mini No $0.80-3.00/M training tokens
Google Gemini Flash/Pro (Vertex AI only) No Per compute-hour
Anthropic Claude 3 Haiku only (AWS Bedrock only) No AWS compute pricing
xAI None N/A N/A

No API provider lets you export model weights. Your fine-tuned model only works through their API, at their pricing, subject to their deprecation schedule. Self-hosted fine-tuning produces weights you own, deploy anywhere, and keep forever.

API fine-tuning also cannot do: LoRA merging (combine multiple adapters), DPO/GRPO (preference alignment), model merging (TIES/DARE/SLERP), continued pretraining, or knowledge distillation. These techniques are only available with local hardware.

Cost break-even

API costs below include both fine-tuning compute (training runs at $0.80-3.00/MTok) and inference on the fine-tuned model (at elevated fine-tuned model rates). Self-hosted costs include hardware amortization (RTX 4090, 3 years) plus electricity.

Usage pattern API cost (1 year, training + inference) Self-hosted cost (1 year) Break-even
Occasional (1 training run/month, light inference) ~EUR 150-300 ~EUR 1,700 (hardware + power) Never (API stays cheaper)
Moderate (weekly retraining, daily inference) ~EUR 2,000-5,000 ~EUR 1,750 ~4 months
Heavy (daily retraining, continuous inference) ~EUR 15,000-50,000 ~EUR 2,000 2-4 weeks

The inference cost is the driver. Self-hosted inference on a fine-tuned model costs only electricity after hardware; API inference on a fine-tuned model costs $0.80-3.00/M tokens input, forever. At moderate-to-heavy inference volume, the hardware pays for itself quickly.

When NOT to fine-tune

Fine-tuning teaches behavior and domain patterns, not reasoning ability. A fine-tuned 7B cannot match GPT-4 on general reasoning. It can match or beat GPT-4 on the specific narrow task it was trained for.

If knowledge changes frequently, use RAG instead — retrieval stays current without retraining. If factual accuracy with source citation is paramount, RAG provides provenance that fine-tuning cannot. The RAFT study (UC Berkeley) found that combining fine-tuning for behavior with RAG for knowledge outperforms either approach alone.

Catastrophic forgetting is real: fine-tuning performance and general knowledge loss have a strong inverse relationship. LoRA/QLoRA reduces this (fewer parameters modified) but does not eliminate it.

When self-hosting makes sense

If data cannot leave your premises (regulatory, air-gapped, classified), API access is off the table. Self-hosted is the only option and cost comparisons do not apply.

Bulk classification, embedding generation, and document processing with a 7-8B model on modest hardware is cheap and fast. A used RTX 3090 ($800-900) runs Llama 3.1 8B at 87-112 t/s depending on configuration. These tasks do not need frontier quality.

Genuine 24/7 batch workloads keeping a GPU above 80% utilization get real per-token savings. This applies to organizations running inference as a service, not individuals using LLMs as assistants.

Fine-tuning on proprietary data is the clearest self-hosting win — see the section above. A $1,600 RTX 4090 with QLoRA handles models up to 14-20B, and the resulting model is yours to deploy anywhere.

Needle-in-a-haystack retrieval and semantic search over large private document collections are where local inference earns its cost. RAG pipelines that embed and search hundreds of thousands of documents generate millions of tokens per day in embedding and extraction queries. These are high-volume, low-quality-bar tasks where an 8B model is sufficient and API costs would compound. A single GPU running embeddings at 80-100% utilization processes the volume at a fraction of API pricing because the per-token overhead is amortized over sustained throughput.

Embedding models are often the strongest case for local deployment. Embedding models (22M-334M parameters) are distinct from generative LLMs — they produce vector representations for semantic search and RAG, not text. They run on any CPU at hundreds to thousands of embeddings per second with negligible power draw and 1-2 GB RAM total.

Model Parameters Use case Speed (CPU) RAM
all-minilm-l6-v2 22M fast filtering, similarity ~5,000 embed/sec <1 GB
nomic-embed-text 274M semantic search, RAG ~1,000 embed/sec ~1 GB
mxbai-embed-large 334M high-quality embeddings ~500 embed/sec ~1.5 GB

A 22M embedding model running on any desktop CPU powers local semantic search over thousands of documents with zero API cost. For infrastructure use cases — log analysis, document similarity, RAG retrieval — embedding models deliver more production value than 1-4B generative models. They are the one category where "I already own the hardware" is genuinely true: the compute is trivial, the electricity is negligible, and the privacy benefit is real. API-based embedding (OpenAI ada-002 at $0.10/MTok) adds up fast at scale; local embedding is effectively free after the one-time setup.
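A minimal local embedding sketch; the corpus and query are placeholders:

```python
# Local embedding on CPU: thousands of documents, no per-token billing.
import time
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # 22M params, runs fine on CPU

docs = [f"log line {i}: disk usage at {i % 100}%" for i in range(5000)]  # placeholder corpus
t0 = time.time()
doc_vecs = model.encode(docs, batch_size=256, convert_to_tensor=True)
print(f"{len(docs) / (time.time() - t0):.0f} embeddings/sec")

query_vec = model.encode(["which hosts are nearly full?"], convert_to_tensor=True)
hits = util.semantic_search(query_vec, doc_vecs, top_k=5)
print(hits[0])   # indices and scores of the closest documents
```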

Document summarization at scale follows the same pattern. Summarizing 10,000 PDFs through an API costs real money at $0.04-3.00 per million tokens depending on the model tier. On local hardware, the marginal cost is electricity and the model runs as fast as bandwidth allows. A Ryzen 7950X with 64 GB RAM can summarize documents at ~10 t/s on an 8B model continuously without per-request billing. An RTX 3090 does the same at ~87 t/s.

Model stability: no silent changes

APIs ship updates you do not see. Between the February 2026 launch of Claude Opus 4.6 and late March 2026, Anthropic shifted default behavior in ways that measurably degraded output quality for heavy users. The 5 February 2026 Opus 4.6 launch made adaptive thinking the recommended mode, letting the model decide its own reasoning budget per turn. Anthropic's own documentation states plainly: "At the default effort level (`high`), Claude almost always thinks. At lower effort levels, Claude may skip thinking for simpler problems," and warns in the tuning section that "Steering Claude to think less often may reduce quality on tasks that benefit from reasoning." Claude Code harness defaults shifted during this window, and trade press reported peak-hour session-cap tightening in late March. Each change appeared in changelogs or third-party reporting. None was broadcast to users as a quality change. The combined effect was indistinguishable from a silent quality regression.

What Anthropic actually promises

In its September 2025 engineering postmortem, Anthropic states the narrow, auditable commitment: "We never reduce model quality due to demand, time of day, or server load." That is a specific promise about one mechanism class — load-based quality throttling. It is not a promise that user-perceptible experience is stable. The same postmortem disclosed three concurrent infrastructure bugs affecting Sonnet 4, Opus 4.1, Opus 4, and Haiku 3.5 between 5 August and 18 September 2025: a context-window routing error that peaked at 16% of Sonnet 4 requests on 31 August, an output-corruption bug that produced Thai and Chinese characters in English responses between 25 August and 2 September, and an approximate top-k XLA:TPU miscompilation affecting Haiku 3.5 from 25 August onward. All three were infrastructure problems, not deliberate model changes. None were externally visible as defects until Anthropic published the postmortem.

Anthropic's own documentation also contains a striking admission about the Opus 4.7 rollout. The adaptive thinking documentation notes that the `thinking.display` default changed from `summarized` to `omitted` when Opus 4.7 shipped, and describes the change in its own words: "This is a silent change from Claude Opus 4.6, where the default was `summarized`." The provider itself uses the phrase "silent change" to describe a user-visible behavior shift between model versions at the same identifier pattern. This is the structural issue, not Anthropic-specific malice: API defaults drift between releases, and the only way a customer learns is by reading every changelog entry against every model version they depend on.

Evidence, measured and debunked

Telemetry from 6,852 Claude Code sessions filed in GitHub issue #42796 by AMD engineer Stella Laurenzo measured a 73% collapse in visible reasoning depth between January and March 2026 (median thinking chars falling from ~2,200 to ~600), a Read:Edit tool-call ratio falling from 6.6 to 2.0, and an 80x increase in deduplicated API request volume. Concurrent session scale-up explains roughly 5-10x of that multiplier; the remaining 8-16x tracks degradation-induced thrashing — retries, corrections, and wrong outputs that would not have been needed had the model reasoned properly the first time. Input-token volume rose 170x between February and March (120M to 20.5B), estimated daily cost rose 122x ($12 to $1,504), reasoning loops per thousand tool calls rose 156%, and user-interrupt events per thousand tool calls rose 556%. Anthropic shipped Opus 4.7 within two weeks of the report gaining traction, and raised the default Claude Code harness effort level to `xhigh`.

Not every "nerf proof" holds up. The most-shared visual evidence — a BridgeBench comparison showing Opus scoring 83% before the alleged nerf and 68% after — failed basic methodological review: the first run measured 6 tasks and the retest measured 30 tasks, so the two scores were not comparing like with like. On the overlapping six tasks, the before-and-after numbers were 87.6% and 85.4% — within normal run-to-run variance. The viral result was noise amplified by confirmation bias. Rumors persist that providers silently quantize weights during peak load — dropping Opus to Int4 or 1.58-bit precision, or compressing the KV cache, to save compute on expensive models. No weight probe, perplexity study, or latency fingerprint has been published showing quantization artifacts, and the symptoms heavy users describe match reasoning-budget cuts and harness bloat, not the perplexity spikes that characterize aggressive quantization. The quantization theory remains unverified speculation — but it is unfalsifiable without provider-side access, which is precisely the problem. The operator of a hosted API has the ability to do this; the customer has no way to prove they did or did not.

The silent-downgrade bug class

A separate cluster of complaints — "I paid for Opus, I got Sonnet" or "I got Haiku" — turns out to be a real, documented bug class, but not a policy. GitHub issue #19468 and its duplicates (#3434, #6602, #13242, #17966) document client-side failures in the Claude Code harness: falling back to Sonnet after an Opus quota cap, serving Sonnet despite explicit Opus configuration, configuration files being ignored after harness updates, and reauthentication silently reverting the selected model to Haiku. These are implementation defects in the client, not backend model substitution. The user experience is identical to a hidden downgrade, which is what makes the deception theory persistent even when the mechanism is actually a bug. The transparency failure is real; the malice is not established.

The cross-provider pattern

This pattern is not Anthropic-specific. Chen, Zaharia, and Zou (2023) measured GPT-4 falling from 84% to 51% on prime-number identification between March and June 2023 on identical prompts — a controlled snapshot comparison published as an arXiv preprint. OpenAI acknowledged GPT-4 Turbo "laziness" in December 2023 and shipped a fix in January 2024. Independent researchers found GPT-4 producing roughly 5% shorter responses when prompted with a December date vs. a May date, suggesting the model had internalized holiday-slowdown patterns from training data. Gemini 2.5 Pro faced similar community complaints in Q4 2025 and Q1 2026 — hallucinations, timeouts, degraded reasoning — with users speculating Google was starving Pro of TPU capacity to make room for 3.0 (unconfirmed). The underlying causes do not require malice: a task-difficulty ratchet as users push harder problems, novelty effect wearing off, RLHF regression, silent product-configuration changes, infrastructure bugs, and selection bias among power users who post loudly when things feel worse. The universality is Bayesian evidence against any specific vendor being uniquely deceptive. It is also Bayesian evidence that any API-dependent operation will eventually experience this.

Reproducibility is structural

A self-hosted model pinned to a specific weights file with a known SHA256 hash produces the same output distribution today, in a year, and in five years. The reasoning budget is whatever VRAM and latency tolerance you allocate, not what a classifier at the provider decides under peak load. There is no adaptive-thinking mis-budgeting, no prompt-cache bug inflating cost 10-20x, no 1M context recall degrading past roughly 48% utilization as documented in GitHub issue #35296, no silent client-side downgrade from Opus to Sonnet after a quota hit, no silent display-default flip between model versions, no auto-compact summarizing away the context you needed, and no rumored peak-hour quantization that cannot be confirmed or denied. Changes happen when you update the weights file. Reverting is copying the old file back. Pinning to a specific checksum is a guarantee the provider cannot make.
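
In practice, checksum pinning is a few lines. The sketch below is minimal; the filename and the pinned digest are placeholders, not real values:

```python
import hashlib
from pathlib import Path

# Digest recorded when the deployment was validated (placeholder value).
PINNED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"
WEIGHTS = Path("models/gemma-4-31b-Q4_K_M.gguf")   # hypothetical local path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so multi-GB weights never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

actual = sha256_of(WEIGHTS)
if actual != PINNED_SHA256:
    raise SystemExit(f"Refusing to serve: weights changed ({actual[:12]}... != pinned {PINNED_SHA256[:12]}...)")
print("Weights match the validated checksum; starting inference server.")
```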

For API-dependent workloads that cannot move to local inference, the practical mitigations are narrow: set `effort` to `high` or `max` explicitly rather than relying on defaults, pin `thinking.type` and `thinking.display` explicitly rather than letting model-version upgrades change them silently, treat 256K–400K as the practical context limit even on advertised 1M windows, and document the exact model snapshot (e.g. `claude-opus-4-6-20260201`) your pipeline was validated against. Priority Tier for production workloads gives predictable quota behavior. Cross-provider redundancy reduces single-vendor dependency. None of these mitigations restore byte-identical reproducibility — they buy you fewer surprises, not zero.
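
As a rough illustration of what pinning everything explicitly looks like in a request body: the endpoint, headers, model, and messages fields below are the standard Messages API shape, while the `effort` and `thinking` field names follow this article's description of the adaptive-thinking controls and should be checked against the current API reference before use.

```python
import os
import requests

# Every behavior-relevant knob is set explicitly so a model-version upgrade
# cannot silently change it. The "effort" and "thinking" field names follow
# this article's description of the adaptive-thinking controls (assumption).
payload = {
    "model": "claude-opus-4-6-20260201",         # exact validated snapshot, not an alias
    "max_tokens": 4096,
    "effort": "high",                             # do not rely on the default
    "thinking": {"type": "enabled", "display": "summarized"},
    "messages": [{"role": "user", "content": "Refactor the retry logic in worker.py"}],
}

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])
```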

For workloads where reproducibility matters more than frontier quality — regulated industries with audit requirements, long-horizon research spanning model versions, agent pipelines where behavior changes break production — model stability is a feature APIs structurally cannot match. It does not show up in cost-per-token comparisons. It compounds over years.

When APIs win

For coding, multi-step reasoning, and agentic workflows, APIs win on both quality and effective cost.

A B+ local model looks adequate on simple prompts. The gap shows on hard problems: multi-file refactoring, agent coherence past 20 turns, judgment calls in ambiguous situations. The output compiles. It does the wrong thing. These are the tasks where LLMs are worth using, and where frontier models justify their pricing.

At 20-30% utilization, APIs often cost less per token than self-hosted inference. Add maintenance (driver bugs, cooling, kernel compatibility, GPU replacement) and the gap widens.
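
The utilization argument is easiest to see as arithmetic. In the sketch below, every number is an assumption chosen to be in the ballpark of figures used elsewhere in this article (EUR 8,000 card, three-year depreciation, EUR 0.15/kWh, roughly 800 W system draw, vLLM-style batched throughput), not a measurement:

```python
HOURS_PER_MONTH = 730

def self_hosted_eur_per_mtok(hardware_eur: float, depreciation_months: int,
                             system_watts: float, eur_per_kwh: float,
                             tokens_per_second: float, utilization: float) -> float:
    """Fully loaded cost per million output tokens for an owned GPU."""
    fixed = hardware_eur / depreciation_months
    energy = (system_watts / 1000) * HOURS_PER_MONTH * utilization * eur_per_kwh
    tokens = tokens_per_second * utilization * HOURS_PER_MONTH * 3600
    return (fixed + energy) / (tokens / 1e6)

# Interactive, single-stream 70B on one 96 GB card (~30 t/s assumed), 25% busy:
print(self_hosted_eur_per_mtok(8000, 36, 800, 0.15, 30, 0.25))    # ~12 EUR/MTok
# Saturated batched serving (~1,000 t/s aggregate assumed), 90% busy:
print(self_hosted_eur_per_mtok(8000, 36, 800, 0.15, 1000, 0.90))  # ~0.13 EUR/MTok
```

The interactive number sits well above what third-party APIs typically charge to serve the same open-weight model; only the saturated batched case is competitive, which is exactly the bulk-processing scenario where this article recommends self-hosting.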

The DeepSeek V3 case study

DeepSeek V3.2 API costs about EUR 260/month for a moderate coding workload (130K queries/month averaging ~5K tokens each — typical for code completion, short Q&A, and lightweight agent calls). At heavier usage with long context (e.g. 74K tokens per query), the same 130K queries would cost ~EUR 2,500/month. Self-hosting the full 685B model needs serious hardware: 8x H100 SXM GPUs at native FP8, or 2x MI300X (192 GB each) running a quantized build.

Configuration Upfront cost Monthly operating vs API (EUR 260/mo, moderate) vs API (EUR 2,500/mo, long-context)
DeepSeek V3 on 8x H100 (new) EUR 320K EUR 9,600 37x more 3.8x more
DeepSeek V3 on 8x H100 (used) EUR 200K EUR 6,260 24x more 2.5x more
DeepSeek V3 on 2x MI300X EUR 37K EUR 1,230 5x more 0.5x (cheaper)
Qwen3-30B-A3B on 2x RTX 4090 (substitute) EUR 7,600 EUR 350 35% more (lower quality) 7x cheaper (much lower quality)

At moderate usage (EUR 260/month), self-hosting loses badly. At heavy long-context usage (EUR 2,500/month), self-hosting on MI300X starts to break even — but you are running the full 685B model, which requires specialized hardware. The consumer GPU substitute (Qwen3-30B-A3B) is a different, smaller model with lower quality.

The API is almost certainly priced below marginal cost for individual users. DeepSeek benefits from massive scale, strategic pricing, and Chinese electricity rates. At moderate usage, competing on price by buying your own hardware is a losing proposition. At heavy long-context usage, self-hosting can break even on operating costs — but the upfront hardware investment (EUR 37K-320K for the full model) means years to payback.
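
The payback arithmetic, using the table's own figures as inputs (a sketch, not a procurement model):

```python
def payback_months(upfront_eur: float, operating_eur_per_month: float,
                   api_eur_per_month: float) -> float:
    """Months until hardware pays for itself via avoided API spend.
    Returns infinity if operating costs already exceed the API bill."""
    savings = api_eur_per_month - operating_eur_per_month
    return float("inf") if savings <= 0 else upfront_eur / savings

# Heavy long-context usage (EUR 2,500/month API bill):
print(payback_months(37_000, 1_230, 2_500))    # 2x MI300X: ~29 months
print(payback_months(200_000, 6_260, 2_500))   # 8x H100 used: never pays back
# Moderate usage (EUR 260/month API bill):
print(payback_months(37_000, 1_230, 260))      # never pays back
```

Roughly two and a half years to break even on the one configuration that wins, and that assumes the workload stays heavy for the entire period.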

API pricing is variable: pay for what you use. GPU hardware is a fixed cost that depreciates regardless. Workloads that vary week to week waste less money on APIs.

At Pulsed Media

Pulsed Media runs its own hardware in owned datacenter space in Finland. GPU inference cards like the RTX Pro 6000 go through the same depreciation, power cost, and rack space analysis as any other piece of infrastructure.

For most workloads in 2026, cloud APIs deliver better quality per euro than local inference. Bulk classification, embeddings, and privacy-constrained tasks are where the GPU math works. 16 years of hardware purchasing says: buy when the math supports it, use the API when it does not.

Seedboxes and dedicated servers from Pulsed Media run on the same Finnish datacenter infrastructure, with the same cost-per-unit discipline applied to every piece of hardware. See current plans.

References

Benchmarks and performance data

  • LMSYS Chatbot Arena ELO Leaderboard — model quality rankings used for Arena ELO scores throughout this article
  • llama.cpp Vulkan GPU Scoreboard — community-submitted GPU inference benchmarks (Llama 2 7B Q4_0, tg128). Source for RX 5700 XT (70.73 t/s), GTX 1050 Ti (20.96 t/s), RTX 3060 (75.94-80.59 t/s), and other consumer GPU speeds
  • llama.cpp — the inference engine behind Ollama and most local LLM benchmarks
  • vLLM — batched serving engine; source for Pro 6000 throughput numbers (8B at 8,990 t/s, 70B at 1,031 t/s)

CPU inference and memory bandwidth

  • AMD EPYC 9554 llama.cpp benchmarks (ahelpme.com) — EPYC 9554 achieves 50 t/s on 8B Q4_K_M, confirming memory bandwidth as primary bottleneck
  • OpenBenchmarking.org llama.cpp results — 88 public CPU benchmark results, median 14.2 t/s on 8B Q8_0
  • ik_llama.cpp CPU performance — Ryzen 7950X benchmarks showing 1.8-5.2x speedup over mainline llama.cpp for prompt processing
  • ik_llama.cpp — optimized llama.cpp fork with improved CPU kernels
  • CPU token rate formula: theoretical max t/s ≈ memory bandwidth (GB/s) / model size (GB); actual ~50-70% of theoretical due to overhead

Hardware specifications

  • Speeds in this article use Q4_K_M quantization unless otherwise noted. Consumer GPU system power includes GPU + ~120W for the rest of the system; enterprise GPUs use +200W for server chassis. Electricity rate: EUR 0.15/kWh (Finnish residential average). Enterprise GPU purchase prices are estimated April 2026 market rates and shift rapidly
  • VRAM context limits use Llama 3.1 8B KV cache parameters (128 MB/1K tokens at FP16, 64 MB/1K at Q8, 32 MB/1K at Q4). Other model architectures vary; Qwen3 8B uses ~144 MB/1K and Gemma 4 31B dense uses ~850 MB/1K due to its 16 KV-head architecture

Model stability and provider changes

Primary sources (Anthropic)
  • Anthropic engineering postmortem (September 2025) — discloses the three concurrent infrastructure bugs of August-September 2025 and states the "never reduce model quality due to demand" commitment quoted in this article
  • Anthropic adaptive thinking documentation — source of the `thinking.display` default change between Opus 4.6 and 4.7, described there as a "silent change"
User telemetry and bug reports
  • GitHub issue anthropics/claude-code #42796 — AMD engineer Stella Laurenzo's telemetry from 6,852 Claude Code sessions: 73% reasoning-depth collapse (~2,200 → ~600 chars), Read:Edit ratio collapse (6.6 → 2.0), 80x increase in deduplicated API request volume, 170x input-token growth, 122x daily-cost growth, 156% reasoning-loop growth, 556% user-interrupt growth (January → March 2026)
  • GitHub issue anthropics/claude-code #35296 (filed 17 March 2026) — Claude Code user documenting 1M context window degradation, including the model itself recommending session restart by 48% context utilization
  • GitHub issue anthropics/claude-code #19468 — tracks the client-side Opus→Sonnet/Haiku downgrade bug class (duplicates #3434, #6602, #13242, #17966): settings ignored, quota fallback without notice, reauthentication reverting to Haiku
  • GitHub issue anthropics/claude-code #16073 — early January 2026 quality-regression complaint with session-level detail
  • GitHub issue anthropics/claude-code #49593 — ~14% context-window bloat introduced in Claude Code v2.1.111; Tool Search cut MCP overhead 46.9% (51K → 8.5K tokens), indicating the baseline was severe
Cross-provider and historical context
  • Chen, Zaharia, Zou (2023), "How Is ChatGPT's Behavior Changing over Time?" (arXiv preprint) — the controlled GPT-4 snapshot comparison cited above (84% → 51% on prime-number identification, March vs June 2023)
Trade press coverage of the 2026 Claude episode
  • Fortune (14 April 2026) — customer-backlash reporting tied to the adaptive-thinking default and effort-level reductions
  • The Register (31 March 2026) — coverage of the Claude Code quota-blowup episode, including Anthropic's acknowledgment that users were hitting limits faster than expected
  • InfoWorld (late March 2026) — reporting on Claude Code peak-hour session-cap tightening (US weekday 5am–11am PT / 1pm–7pm GMT window)

Trade-press citations above are included for context and were used as corroboration; primary-source material from Anthropic's own documentation, postmortems, and GitHub issue tracker is preferred for any factual claim in this article.

See also

  • NVIDIA RTX Pro 6000 (Blackwell) — full specs, benchmarks, and known issues for the 96 GB GPU referenced throughout this article
  • Seedbox — PM's primary product, where the same hardware cost economics apply
  • NVMe — storage interface used alongside GPU inference servers
  • RAID — storage redundancy in PM's infrastructure