Self-Hosting LLMs vs API
Pulsed Media owns its datacenter space and runs its own hardware. GPU inference cards get evaluated against the same cost-per-unit math that prices seedboxes and dedicated servers. In April 2026, the math says: cloud APIs win for quality-sensitive work, self-hosted GPUs win for bulk processing and privacy-constrained workloads.
This article lays out the hardware benchmarks, API pricing, and economics behind that conclusion.
The quality problem
The first obstacle is not cost. It is quality.
The best open-weight model you can run on a single consumer GPU (24 GB) is Gemma 4 31B. It scores approximately 1450 on the LMSYS Chatbot Arena ELO scale (arena.ai, preliminary, April 2026). That is 20 points below Sonnet and 50 below Opus. The gap sounds small until you measure it on hard tasks.
| Model | Arena ELO (approx) | Access | Hardware needed |
|---|---|---|---|
| Claude Opus 4.6 | ~1500 | API / subscription | Cloud |
| Claude Sonnet 4.6 | ~1470 | API / subscription | Cloud |
| GLM-5 744B (dense) | ~1451 | Open weights | 6x 96 GB GPUs (~$48,000) |
| Gemma 4 31B (dense) | ~1450 | Open weights | 1x 24 GB GPU at Q4 (~$1,600) |
| Kimi K2.5 1T (MoE) | ~1447 | Open weights | 8x 96 GB GPUs (~$64,000) |
| Gemma 4 26B-A4B (MoE) | ~1441 | Open weights | 1x 24 GB GPU at Q4 (~$1,600) |
| Qwen3-235B-A22B (MoE) | ~1422 | Open weights | 2x 96 GB GPUs (~$16,000) |
| DeepSeek V3.2 685B (MoE, 37B active) | ~1421 | Open weights | 5x 96 GB GPUs (~$40,000) |
| DeepSeek R1 671B (MoE, reasoning) | ~1398 | Open weights / API | 5x 96 GB GPUs (~$40,000) |
| Llama 3.3 70B (dense) | ~1350-1400 | Open weights | 1x 96 GB GPU (~$8,500) |
The table splits into two tiers. Models you can run on a single GPU ($1,600-8,500) top out at ELO ~1450. Models that approach Sonnet territory (GLM-5 at 1451, Kimi K2.5 at 1447) require multi-GPU setups. The costs shown assume RTX Pro 6000 cards at ~$8,000 each via PCIe — but without NVLink, tensor parallelism runs at roughly one-third the throughput of datacenter H100s. An NVLink-equipped H100 SXM configuration running the same models costs 3-5x more ($150,000-320,000).
Gemma 4 31B scores 80.0% on LiveCodeBench v6 and 84.3% on GPQA Diamond. These are strong numbers for an open model. But on agentic workflows where models must maintain coherence across 20+ tool calls, open models fail silently: the output compiles but does the wrong thing, or the agent loops instead of converging. These failures are expensive because nobody notices until the result is wrong.
No amount of hardware spending changes this. The quality ceiling is in the model weights. Even $64,000 in GPUs running Kimi K2.5 does not reach Sonnet.
Quality vs hardware: what each GPU tier can actually run
The quality gap depends on what fits on the hardware you own.
| VRAM | Best model (Q4) | Arena ELO | Quality tier | Gap to Sonnet |
|---|---|---|---|---|
| 4 GB | Phi-4 Mini 3.8B or smaller | ~1100-1200 | D | ~270+ ELO |
| 8 GB | Llama 3.1 8B (tight) | ~1250-1300 | C/C+ | ~170-220 ELO |
| 12 GB | Qwen3 14B | ~1350 | B- | ~120 ELO |
| 16 GB | Qwen3 14B (comfortable) | ~1350 | B- | ~120 ELO |
| 24 GB | Gemma 4 31B | ~1450 | B+/A- | ~20 ELO |
| 32 GB | Gemma 4 31B at Q8 | ~1450+ | A- | ~20 ELO |
| 48 GB | Llama 3.3 70B Q4 | ~1350-1400 | B/B+ | ~70-120 ELO |
| 96 GB | Llama 3.3 70B Q8 | ~1400 | B+ | ~70 ELO |
At 4-8 GB VRAM, you are running models that struggle with multi-step reasoning and produce noticeably worse output than a $0.04/MTok API call to Mistral Nemo. At 24 GB, Gemma 4 closes the ELO gap but still falls short on the hardest tasks. Only at 96 GB do you get a 70B model at full quality, and even that is still B+ tier versus API frontier at A+/S.
What 4-8 GB VRAM can run
Most consumer GPUs sold in the last five years have 4-8 GB of VRAM. Can they do anything useful with local LLMs?
4 GB (GTX 1050 Ti, RX 570, integrated graphics)
At 4 GB, the only models that fit are sub-3B at Q4:
| Model | Size at Q4 | Speed (est.) | Useful for |
|---|---|---|---|
| Qwen3 0.6B | ~0.5 GB | 30-50 t/s | Text classification, simple extraction |
| Llama 3.2 1B | ~0.8 GB | 20-30 t/s | Basic summarization, translation |
| TinyLlama 1.1B | ~0.7 GB | 25-35 t/s | Research, experimentation |
| BitNet b1.58 2B | 0.4 GB | 10-20 t/s (CPU) | Research only |
| Phi-4 Mini 3.8B | ~2.3 GB | 15-25 t/s | Simple coding, Q&A |
These models cannot write useful code, follow complex instructions, or maintain coherence past a few exchanges. A 1B model completing a sentence is not the same as a 70B model reasoning about a problem. The quality difference is categorical.
8 GB (RTX 3060 8GB, RTX 4060, RX 6600)
At 8 GB, a 7B/8B model at Q4 fits with room for ~32K context (Q8 KV cache):
| Model | Size at Q4 | Speed (GPU) | Max context (Q8 KV) | Quality |
|---|---|---|---|---|
| Llama 3.1 8B | ~4.8 GB | ~40-50 t/s | ~32K | C+ |
| Qwen3 8B | ~5.2 GB | ~35-45 t/s | ~28K | C+ |
| Gemma 4 E4B | ~5 GB | ~35-45 t/s | ~28K | C |
| Mistral Nemo 12B | ~7.4 GB | ~25-35 t/s | ~4-8K | C+ |
An 8B model on an 8 GB GPU is the minimum setup that produces output comparable to budget APIs. The constraint is context: at Q4 KV cache, 64K tokens are feasible but tight. At Q8 KV (recommended for retrieval tasks), the ceiling is ~32K.
Mistral Nemo 12B fits at Q4 but leaves almost no room for KV cache. Short conversations only.
Memory bandwidth is the bottleneck
LLM token generation is memory-bandwidth-bound. The GPU reads model weights from VRAM for every single token. Read speed determines generation speed.
| Platform | Memory bandwidth | Llama 3.3 70B Q4 tok/s | Relative speed |
|---|---|---|---|
| RTX Pro 6000 | 1,792 GB/s | ~34 | 1.0x |
| H100 SXM | 3,352 GB/s | ~40 | 1.2x |
| Strix Halo (128 GB unified) | 215 GB/s | ~4.5 | 0.13x |
| DDR5 desktop (dual channel) | ~89 GB/s | ~2.2 | 0.06x |
| DDR4-3200 desktop (dual channel) | ~42 GB/s | ~1 | 0.03x |
The Pro 6000 and H100 are close on single-card workloads because both have enough bandwidth to keep a 70B model fed. The Pro 6000 compensates with more aggressive quantization (Q4 vs FP8): fewer bytes read per weight, similar throughput. The H100 pulls ahead in NVLink multi-GPU tensor parallelism, where its 900 GB/s interconnect leaves PCIe (128 GB/s) behind.
A Strix Halo system has 215 GB/s. Same model, same quantization, 8.3x less bandwidth, roughly 8x slower. A desktop CPU on DDR5 dual-channel has ~89 GB/s. 20x slower than the Pro 6000.
No amount of CPU cores or compiler flags changes this. Bandwidth is physics. The cheapest path to 1,792 GB/s in April 2026 is the RTX Pro 6000 at $8,500.
The rough formula: max tokens/sec = memory bandwidth (GB/s) / model size (GB). Actual throughput is about 50-70% of theoretical on GPUs, somewhat lower on CPUs. An EPYC 9554 with ~500 GB/s measured bandwidth hits 50 t/s on an 8B Q4 model, roughly 45% of the theoretical 111 t/s.
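The formula translates directly into a quick estimator. A sketch: the 0.6 efficiency default is an assumed midpoint of the 50-70% range, and real numbers vary by runtime, batch size, and platform.

```python
def tokens_per_sec(bandwidth_gb_s: float, model_gb: float,
                   efficiency: float = 0.6) -> float:
    """Bandwidth-bound generation estimate: every token requires reading
    all model weights once, so theoretical t/s = bandwidth / model size.
    Real runtimes reach roughly 50-70% of that; 0.6 is an assumed midpoint."""
    return efficiency * bandwidth_gb_s / model_gb

# RTX Pro 6000 (1,792 GB/s) on a 70B Q4 model (~40 GB):
print(round(tokens_per_sec(1792, 40), 1))     # ~27 t/s estimated; measured ~34
# EPYC 9554 (~500 GB/s) on an 8B Q4 model (~4.5 GB):
print(round(tokens_per_sec(500, 4.5), 1))     # theoretical ceiling ~111 at efficiency=1.0
```

The estimator is deliberately crude: it ignores prompt processing (compute-bound, not bandwidth-bound) and KV cache reads, which is why measured numbers drift above or below it.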
Hardware comparison
Professional GPUs
The RTX Pro 6000 is the only GPU under $10,000 that runs 70B models on a single card without quantizing below Q4. Full specs and known issues are on the dedicated page.
| Model | Quantization | Tokens/sec (generation) |
|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~185 |
| Mistral Nemo 12B | Q4_K_M | 163 |
| Qwen3 30B-A3B (MoE) | Q4_K_M | 252 |
| Llama 3.3 70B | Q4_K_M | 34 |
34 tokens per second on 70B is about 25 words per second. Fast enough that you are not waiting for the model to finish a paragraph.
With vLLM batched serving (concurrent requests), throughput goes well beyond single-stream: 8B at 8,990 t/s, 70B at 1,031 t/s. On models that fit on one card, the Pro 6000 matches the H100 SXM at one-third the cost.
The card has real problems. A virtualization reset bug causes unrecoverable GPU state after VM shutdown. Sustained vLLM inference triggers chip resets at temperatures as low as 28C. The SM120 kernel architecture breaks DeepSeek models. Only the open-source driver works on Blackwell; there is no proprietary option.
Consumer GPUs
| GPU | VRAM | Bandwidth | Llama 3.1 8B Q4 t/s | Max model (Q4) | Price (Apr 2026) |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | 1,790 GB/s | ~300 | ~32B | $2,900-4,200 |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | ~128 | ~32B tight | $1,600-2,200 |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | ~87-112 | ~32B tight | $800-900 used |
| RTX 4000 SFF Ada | 20 GB GDDR6 | 280 GB/s | ~64 | ~13B | ~$1,250 |
The RTX 5090 has bandwidth matching the Pro 6000 (1,790 GB/s) but only 32 GB of VRAM, which limits it to 32B-class models at Q4. For sub-32B workloads, the 5090 is better price-performance than the Pro 6000.
The RTX 3090 is the last consumer card with NVLink. Two of them ($1,600-1,800 used) give 48 GB combined, enough for 32B models with generous context. For the price of a single RTX 5090, a dual-3090 setup gets similar bandwidth and 50% more VRAM.
Gemma 4 on consumer GPUs: dense vs MoE
Gemma 4 is available in two forms that fit on 24 GB: the 31B dense model and the 26B-A4B MoE. The MoE activates only 3.8B parameters per token, making it dramatically faster.
| Variant | Arena ELO | RTX 4090 speed | RTX 3090 speed | VRAM (Q4) | 256K context? |
|---|---|---|---|---|---|
| 31B Dense | ~1450 | ~25-35 t/s (short ctx) | ~20-35 t/s | 19.6 GB | No (KV cache fills 24 GB at ~32-64K) |
| 26B-A4B MoE | ~1441 | ~50-129 t/s | ~35-40 t/s | ~15.6 GB | Yes (8.4 GB headroom for KV) |
The 31B dense model has a disproportionately large KV cache (~0.85 MB per token at BF16) because of its 16 KV-head architecture. Despite fitting at Q4, context expansion rapidly consumes remaining VRAM. At VRAM-saturated configurations, the RTX 4090 drops to 7.8 t/s on the 31B.
The 26B-A4B MoE uses 4x less KV cache (~0.21 MB/token), runs 2-4x faster, and scores within 9 ELO points. For interactive use on 24 GB hardware, the MoE is the correct choice. The dense model is worth choosing only for maximum coding quality at short context (<8K tokens).
Compact workstations
The Minisforum MS-02 Ultra won a CES 2026 Innovation Award. Compact 4.8L chassis, Intel Core Ultra 9 285HX, up to 256 GB DDR5 ECC, four NVMe slots. Looks great on paper.
The catch: the PCIe slot is low-profile dual-slot only. The best GPU that fits is an RTX 4000 SFF Ada with 20 GB VRAM and 280 GB/s bandwidth. That gets ~64 t/s on an 8B model and cannot run anything above 13B. The MS-02 Ultra is a homelab machine, not an inference server.
Strix Halo
AMD's Ryzen AI Max+ 395 (Strix Halo) puts 128 GB of unified LPDDR5x memory on a single chip. The iGPU can address up to 96 GB as VRAM, matching the RTX Pro 6000 on capacity.
Bandwidth tells the real story: 215 GB/s measured, versus the Pro 6000's 1,792 GB/s. Every model runs 4-8x slower. 70B at 4.5 tokens per second works, but it is painfully slow.
Where Strix Halo is useful: running large MoE models that would not fit on a 24-32 GB consumer GPU. Qwen3 30B-A3B at 66-72 t/s in 128 GB unified memory, or larger MoE models at 20+ t/s. The 128 GB pool is the point, not the speed.
At EUR 2,000-3,000 for a complete system consuming 120W, it costs 3-5x less than an RTX Pro 6000 setup. Same model capacity, 8x less speed.
CPU inference
Server-class CPUs with enough memory channels can run LLMs at usable speeds for small models:
| CPU | Memory bandwidth | 8B Q4 t/s | 70B Q4 t/s | Usable for chat? |
|---|---|---|---|---|
| EPYC 9554 (64c, 8-ch DDR5) | ~500 GB/s | ~50 | ~7 | 8B yes, 70B batch only |
| Dual Xeon Gold 5317 | ~80 GB/s | ~22 | ~3 | 8B marginal, 70B no |
| Desktop DDR5 (dual-channel) | ~89 GB/s | ~20 | ~2 | 8B marginal, 70B no |
| Desktop DDR4 (i5-7500T) | ~38 GB/s | ~4-5 | — | Basic completion only |
CPU inference makes sense when you need to run a model that does not fit in any available VRAM and buying a GPU is not an option. An EPYC with 12-channel DDR5 and 512+ GB RAM can run 70B at 7 t/s. Batch processing, not interactive.
The optimized fork ik_llama.cpp gets 1.8-5x faster prompt processing than mainline llama.cpp on CPUs with AVX-512, though generation speed gains are more modest.
The smallest useful CPU models
For CPU-only machines without a GPU, the models worth considering are limited:
| Model | Parameters | RAM needed | Desktop CPU speed | Quality |
|---|---|---|---|---|
| BitNet b1.58 2B4T | 2.4B | ~0.4 GB | 10-20+ t/s | Research-grade; MMLU 53 |
| Llama 3.2 1B | 1.3B | ~0.8 GB | 20-30 t/s | Basic tasks only |
| Llama 3.2 3B | 3.2B | ~2 GB | 10-15 t/s | Simple Q&A, summarization |
| Phi-4 Mini 3.8B | 3.8B | ~2.3 GB | 7-10 t/s | Simple coding, reasoning |
| Llama 3.1 8B | 8B | ~4.8 GB | 4-5 t/s (DDR4) / 15-20 t/s (DDR5) | C+ tier; minimum useful |
Models under 3B parameters exist primarily for research, experimentation, and edge deployment (phones, IoT). TinyLlama, SmolLM, and similar projects show what sub-3B models can do, but their output quality is far below what budget APIs provide for $0.04/MTok.
The 8B tier on CPU is the minimum for genuinely useful output. Below that, you are trading quality for the ability to run offline. That is a valid tradeoff for privacy-constrained or air-gapped environments, not a cost savings.
The context window problem
Real workloads routinely exceed 128K tokens. A medium codebase is 200K-500K tokens. A 50-page contract is 40K tokens. An agentic session with 30+ tool calls accumulates 100K-300K tokens. Document collection analysis, multi-session memory, and RAG over large corpora push past 500K easily.
Gemini 2.5 Pro handles 1 million tokens natively. Claude Opus 4.6 and Sonnet 4.6 handle 1M. GPT-4.1 takes 1M. These are the context windows where real work happens.
Open models top out at 128K. That is not a VRAM limitation — it is a training limitation. No open model available in April 2026 was trained on context beyond 128K-256K, and quality degrades well before the stated maximum. This is the single largest gap between self-hosted and API inference, and no amount of hardware closes it.
VRAM limits on context
Every token in context consumes VRAM for its KV cache entry. The formula: 2 x layers x KV_heads x head_dim x bytes_per_element. For Llama 3.1 8B (32 layers, 8 KV heads, 128 head dim), that is ~128 MB per 1,000 tokens at FP16. Q8 KV halves that to ~64 MB/1K with negligible quality loss. Q4 KV halves again but degrades retrieval accuracy.
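The formula is easy to check in a few lines, using the Llama 3.1 8B shape parameters quoted above:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> int:
    # Keys and values are both cached, hence the factor of 2.
    return int(2 * layers * kv_heads * head_dim * bytes_per_elem)

# Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128, FP16 (2 bytes/element)
per_tok = kv_bytes_per_token(32, 8, 128, 2)
print(per_tok // 1024, "KB/token")   # 128 KB/token -> ~128 MB per 1K tokens
# Q8 KV cache (1 byte/element) halves it:
print(kv_bytes_per_token(32, 8, 128, 1) // 1024, "KB/token")   # 64
```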
Trained context windows cap the usable range regardless of VRAM:
| Model | Trained context | Notes |
|---|---|---|
| Llama 3.1 / 3.3 | 128K | Best open 70B; hard ceiling at 128K |
| Qwen3 | 128K | YaRN-extended; quality degrades past training range |
| Gemma 4 | 128K | |
| Mistral Nemo | 128K | |
| Gemini 2.5 Pro (API) | 1,000K | 8x the best open model |
| Claude Opus 4.6 / Sonnet 4.6 (API) | 1,000K | 8x the best open model |
| GPT-4.1 (API) | 1,000K | 8x the best open model |
Every open model stops at 128K. All three frontier API providers — Anthropic, Google, and OpenAI — now offer 1M token context windows. For workloads that routinely exceed 128K — codebase analysis, long document chains, agent sessions — this gap is not solvable with more VRAM. The model simply was not trained for it.
All context numbers below use Q8 KV cache (recommended — negligible quality loss vs FP16, half the VRAM). Q4 KV doubles these limits but degrades retrieval accuracy. Models at Q4_K_M weights throughout.
| VRAM | Model | Max context (Q8 KV) | Max context (Q4 KV) | Notes |
|---|---|---|---|---|
| 8 GB | 8B | ~36K | ~73K | Minimum useful setup |
| | 14B+ | Does not fit | — | — |
| 12 GB | 8B | ~112K | 128K (trained limit) | 128K with Q4 KV only |
| | 14B | ~43K | ~86K | |
| 16 GB | 8B | 128K (trained limit) | 128K (trained limit) | Full context at Q8 KV |
| | 14B | ~93K | 128K (trained limit) | |
| 24 GB | 8B | 128K (trained limit) | 128K (trained limit) | Full context |
| | 14B | 128K (trained limit) | 128K (trained limit) | Full context at Q8 KV |
| | 32B | ~35K | ~70K | Short context only |
| 48 GB | 14B | 128K (trained limit) | 128K (trained limit) | Full context |
| | 32B | 128K (trained limit) | 128K (trained limit) | Full context at Q8 KV |
| | 70B | ~46K | ~92K | Short-to-medium context |
| 80 GB | 32B | 128K (trained limit) | 128K (trained limit) | Full context |
| | 70B | ~125K | 128K (trained limit) | Full context at Q4 KV; near-full at Q8 |
| 96 GB | 70B | 128K (trained limit) | 128K (trained limit) | Full context; best single-GPU for 70B |
| 128 GB (Strix Halo / CPU) | 70B | 128K (trained limit) | 128K (trained limit) | Full context; ~4.5 t/s (slow) |
KV cache at Q8: 8B = 64 MB per 1K tokens, 14B = 80 MB, 32B = 128 MB, 70B = 160 MB. Q4 KV halves these. 1 GB is reserved for runtime overhead.
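The table's arithmetic can be sketched from the per-1K-token KV sizes in the footnote, assuming 1 GB of runtime headroom; rounding in the table differs slightly from this estimate.

```python
def max_context_tokens(vram_gb: float, model_gb: float,
                       kv_mb_per_1k: float, headroom_gb: float = 1.0) -> int:
    """Tokens of KV cache that fit after model weights and runtime headroom."""
    free_mb = (vram_gb - model_gb - headroom_gb) * 1024
    if free_mb <= 0:
        return 0          # model does not leave room for any context
    return int(free_mb / kv_mb_per_1k * 1000)

# 8 GB card, 8B at Q4 (~4.7 GB), Q8 KV at 64 MB per 1K tokens:
print(max_context_tokens(8, 4.7, 64))     # ~37K; the table lists ~36K
# 48 GB card, 70B at Q4 (~40 GB), Q8 KV at 160 MB per 1K tokens:
print(max_context_tokens(48, 40, 160))    # ~45K; the table lists ~46K
```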
For CPU/RAM inference, the same arithmetic applies but "VRAM" becomes "available RAM after OS and model." A 64 GB DDR4 system running 8B Q4 (4.7 GB model + OS overhead) has roughly 55 GB available for KV cache, more than a 48 GB GPU leaves free after model weights, but it runs 10-15x slower due to DDR4 bandwidth (~42 GB/s vs GPU's 900+ GB/s).
Context rot: accuracy degrades with length
Even when the VRAM fits, model accuracy drops as context grows. The Chroma 2025 study tested 18 models and found 20-50% accuracy degradation from 10K to 100K tokens across every model tested, including frontier APIs. Open models degrade faster.
The worst-case position is the middle of the context window. Beginning and end receive stronger attention (the "lost in the middle" effect, Liu et al. 2024). For retrieval tasks where the target information can appear anywhere, this creates systematic blind spots.
On harder retrieval tasks requiring ordered extraction across long context (Sequential-NIAH, arXiv 2504.04713, EMNLP 2025), the best model tested scored 63.5%. Standard "find the passkey" benchmarks show 90-100% at 128K; the realistic number for semantic retrieval is 50-65%. This is a model capability ceiling, not a hardware limitation.
Non-literal retrieval is worse still. The NoLiMa benchmark (ICML 2025) tested 13 LLMs on tasks requiring semantic similarity matching rather than exact text lookup. Open-source models showed inverted U-shaped performance curves beyond critical context thresholds — accuracy degrades substantially when the task requires reasoning about meaning rather than matching strings.
KV cache quantization compounds these losses. FP16 and Q8 KV cache produce no measurable retrieval degradation. Q4 KV cache adds detectable accuracy loss on top of context rot, particularly on retrieval tasks. For any workload where finding information in context matters, Q8 KV cache is the minimum — Q4 saves VRAM at the cost of making the context less reliable.
TurboQuant: KV cache compression (ICLR 2026)
TurboQuant (Google Research, ICLR 2026) compresses KV cache to 2.5-3.5 bits per channel with near-zero quality loss. This is not weight quantization — it is complementary to Q4_K_M model weights. You can run a model at Q4 weights AND TurboQuant 3-bit KV cache simultaneously.
| KV cache method | Bits/channel | LongBench score | Needle retrieval | VRAM per 1K tokens (8B model) |
|---|---|---|---|---|
| FP16 (baseline) | 16 | 50.06 | 0.997 | 128 MB |
| Q8 (current standard) | 8 | ~50.06 | ~0.997 | 64 MB |
| TurboQuant | 3.5 | 50.06 | 0.997 | ~28 MB |
| TurboQuant | 2.5 | 49.44 | — | ~20 MB |
| Q4 (current) | 4 | degrades | degrades | 32 MB |
| KIVI 3-bit | 3 | 48.50 | 0.981 | ~24 MB |
At 3.5 bits, TurboQuant matches FP16 quality exactly — identical LongBench score, identical needle retrieval. At 2.5 bits, it outperforms KIVI at 3 bits. The method is data-oblivious: no calibration data, no per-model tuning. It works by applying a random orthogonal rotation that makes vector coordinates near-independent, then using mathematically optimal scalar quantizers.
If TurboQuant 3-bit KV becomes standard, the VRAM context table above gains roughly 2-3x more KV room versus Q8: a 32B model on 24 GB would go from ~35K to ~90K of context, and mid-range cards would reach the full 128K far more easily. Since no open model is trained past 128K, the extra headroom pays off only once training context catches up.
Status (April 2026): No official Google implementation. Community ports exist for llama.cpp (TQ3_0 format, not merged), vLLM (Triton kernels, not merged), and PyTorch/HuggingFace. None are in mainline frameworks yet. Official integration expected Q2-Q3 2026.
Hardware cost to match API context
| API capability | Best local equivalent | Hardware cost | What you actually get |
|---|---|---|---|
| GPT-4.1 at 128K (A- quality) | 8B Q4 on RTX 4090 | ~EUR 1,800 | 128K context but C+ quality — 2 tiers below |
| GPT-4.1 at 128K (A- quality) | 70B Q4 on 2x RTX 4090 | ~EUR 3,500 | 128K context at B+ quality — 1 tier below |
| Claude 1M (A+ quality) | 70B Q4 on A100 80GB | ~EUR 15,000 | Capped at 128K (872K shorter), B+ quality |
| Claude 1M (A+ quality) | Nothing | — | No open model trained past 128K |
| Gemini 1M (A+ quality) | Nothing | — | Not achievable locally at any price |
The gap is stark beyond 128K. No combination of hardware and open models reaches 200K context. All three frontier API providers now offer 1M at A+ quality. Self-hosted tops out at 128K at B+ quality. For workloads that need the full context — codebase analysis, long document processing, extended agent sessions — APIs are the only option.
Working around the 128K ceiling
Several techniques exist to process more data than fits in a single context window. None of them are transparent substitutes for native long context.
RAG (Retrieval-Augmented Generation) is the most production-ready approach. Split documents into chunks, embed them in a vector database, retrieve only the relevant chunks per query. Quality reaches 70-85% of native long context for retrieval tasks (EMNLP 2024), but fundamentally cannot do cross-document reasoning — connecting a clause on page 40 to a definition on page 3 requires both to be in context simultaneously. RAG turns the problem from "LLM reasons over all data" into "retrieval system selects data, LLM reasons over selection." The retrieval quality becomes the ceiling.
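The chunk-retrieve-assemble loop is simple enough to sketch. The version below is a toy: word-overlap scoring stands in for the embedding model and vector database a production RAG stack would use, and the chunk size and `top_k` values are illustrative.

```python
def chunk(text: str, size: int = 200) -> list[str]:
    # Split a long document into fixed-size word windows.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query: str, passage: str) -> float:
    # Toy relevance: fraction of query words present in the passage.
    # Production systems use embedding cosine similarity instead.
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / max(len(q), 1)

def retrieve(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    # Select the top_k most relevant chunks to place in the model's context.
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:top_k]

contract = "the termination clause requires ninety days notice " + "boilerplate text " * 500
context = "\n---\n".join(retrieve("termination notice period", chunk(contract, 50)))
# Only the selected chunks reach the model. Connections between chunks
# that were not retrieved together are invisible to it.
```

The last comment is the whole limitation: retrieval quality becomes the ceiling, exactly as described above.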
Context compression (LLMLingua, Microsoft) removes unimportant tokens to fit more into the same window. At 2-4x compression, quality loss is minimal. At 10-20x, degradation is measurable. A 128K window with 4x compression gives roughly 512K effective tokens — useful, but still below 1M native context and with quality loss on compressed portions.
Map-reduce / chunking processes each chunk independently, then combines results. Works for summarization (60-80% quality). Fails for reasoning that requires cross-chunk information.
Hierarchical summarization — summarize chunks, then summarize summaries — works for gist but fails for precision. Each summarization round is lossy, and hallucination amplification compounds: a fabricated fact in round 2 becomes "established fact" in round 3.
StreamingLLM solves a different problem. It enables continuous generation over arbitrarily long streams by preserving "attention sink" tokens, but the model can only see the current window. Information outside the window is gone. Useful for long chat sessions, not for document analysis.
| Technique | Production ready? | Quality vs native long context | Real substitute? |
|---|---|---|---|
| RAG | Yes | 70-85% (retrieval) / poor (reasoning) | Partial — good for search, poor for synthesis |
| LLMLingua compression | Yes | 85-95% at 2-4x compression | Partial — extends window modestly |
| Map-reduce / chunking | Yes | 60-80% depending on task | No — loses cross-chunk connections |
| Hierarchical summarization | Yes (for summaries) | 50-70% for detail; good for gist | No — lossy compression compounds |
| StreamingLLM | Yes (for chat) | N/A (different problem) | No — does not extend reasoning context |
The LaRA benchmark (ICML 2025) tested 2,326 cases across multiple tasks and concluded there is no silver bullet — the best approach depends on task type, model capabilities, and retrieval characteristics. One concrete data point: switching from chunked medical records to full-context patient histories improved diagnostic accuracy by 23% because the model could see temporal patterns across years of data.
The "lost in the middle" problem (Liu et al., TACL 2024) complicates even native long context: models perform best when relevant information is at the beginning or end of the context, with 30%+ performance degradation for information in the middle. This affects all models, including those designed for long context.
For self-hosters, the practical hierarchy is: if the workload fits within 128K, local models are competitive. If it needs 128K-512K, RAG or compression can partially bridge the gap with measurable quality penalties. Beyond 512K, APIs with native 1M context are the only reliable option.
API pricing (April 2026)
API pricing dropped roughly 80% since early 2025. Budget-tier models now start at $0.02 per million input tokens.
| Tier | Model | Input $/MTok | Output $/MTok | Context | Quality |
|---|---|---|---|---|---|
| Budget | Mistral Nemo | $0.02 | $0.04 | 128K | C+ |
| Budget | GPT-4.1-nano | $0.05 | $0.20 | 1M | C+ |
| Budget | GPT-4.1-mini | $0.16 | $0.64 | 1M | B- |
| Mid | DeepSeek V3.2 | $0.28 | $0.42 | 128K | B+ |
| Mid | Gemini 2.5 Flash | $0.15 | $3.50 | 1M | B+ |
| Mid | GPT-4.1 | $2.00 | $8.00 | 1M | A- |
| Mid | Grok 3 | $3.00 | $15.00 | 128K | A- |
| Mid | Claude Haiku 4.5 | $1.00 | $5.00 | 200K | A- |
| Frontier | Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | A+ |
| Frontier | Gemini 2.5 Pro | $1.25 | $10.00 | 1M | A+ |
| Frontier | Claude Opus 4.6 | $5.00 | $25.00 | 1M | S |
| Frontier | OpenAI o3 | $2.00 | $8.00 | 200K | S (reasoning) |
Subscription tiers exist for heavy users: Claude Pro at $20/month, Claude Max at $100-200/month with higher rate limits. Google, OpenAI, and others have similar tiered access. The per-token rates above are the pay-as-you-go ceiling, not the effective cost for regular users.
Cached input pricing drops costs further. Claude cache hits cost 0.1x the base rate. Gemini 2.5 Flash cached input is $0.03/MTok. Workloads with repeated context (RAG, agent loops) benefit from caching.
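The effect on a cache-heavy workload is one line of arithmetic. A sketch assuming a Claude-style 0.1x cache-read multiplier; it ignores the one-time cache-write surcharge some providers add.

```python
def blended_input_cost(base_per_mtok: float, cache_hit_ratio: float,
                       cache_read_multiplier: float = 0.1) -> float:
    """Effective input price when a fraction of prompt tokens are cache reads."""
    miss = 1.0 - cache_hit_ratio
    return base_per_mtok * (miss + cache_hit_ratio * cache_read_multiplier)

# Sonnet-class input at $3.00/MTok, agent loop re-sending 80% cached context:
print(round(blended_input_cost(3.00, 0.8), 2))   # 0.84, a ~3.6x saving on input
```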
The cost comparison
Self-hosted at 100% utilization
Best case for self-hosting: a GPU running batch inference around the clock.
| Setup | Model | Monthly cost (3yr amort.) | Tokens/month | EUR/MTok | Quality |
|---|---|---|---|---|---|
| RTX Pro 6000 | 70B vLLM batched | ~EUR 380 | ~2.7B | EUR 0.14 | B/B+ |
| RTX Pro 6000 | 8B vLLM batched | ~EUR 380 | ~23B | EUR 0.017 | C+ |
| 2x RTX 3090 | 8B single-stream | ~EUR 130 | ~0.26B | EUR 0.48 | C+ |
| Strix Halo | 70B single-stream | ~EUR 90 | ~13M | EUR 6.90 | B/B+ |
All costs in EUR. RTX Pro 6000 at ~$8,500 = ~EUR 7,800 at April 2026 exchange rates. 3-year amortization includes hardware + electricity at EUR 0.15/kWh.
At 100% utilization, self-hosted 8B on an RTX Pro 6000 comes in at half the price of Mistral Nemo API output (EUR 0.034/MTok) while running the model locally. Self-hosted 70B at EUR 0.14/MTok undercuts DeepSeek V3.2 API (EUR 0.39/MTok) for comparable B+ output.
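The amortization math is easy to reproduce. A sketch: the ~1 kW sustained system draw and EUR 0.15/kWh rate are assumptions, so it lands near, not exactly on, the table's figures.

```python
def eur_per_mtok(hardware_eur: float, system_watts: float, tok_per_sec: float,
                 years: float = 3.0, eur_per_kwh: float = 0.15,
                 utilization: float = 1.0) -> float:
    """Hardware amortization plus electricity, per million generated tokens."""
    hours = years * 365 * 24 * utilization              # hours actually generating
    electricity = system_watts / 1000 * hours * eur_per_kwh
    mtok = tok_per_sec * 3600 * hours / 1e6
    return (hardware_eur + electricity) / mtok

# Pro 6000 server, 70B vLLM batched at 1,031 t/s, assumed ~1 kW system draw:
print(round(eur_per_mtok(7800, 1000, 1031), 2))   # 0.12, near the table's EUR 0.14
```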
The utilization trap
Nobody runs a single-operator GPU at 100%. You ask questions during working hours. The GPU idles overnight, on weekends, during meetings. Realistic utilization for one person or a small team: 10-30%.
| Utilization | RTX Pro 6000 70B EUR/MTok | vs DeepSeek V3.2 API (EUR 0.39) |
|---|---|---|
| 100% | EUR 0.14 | Self-hosted 2.8x cheaper |
| 50% | EUR 0.28 | Self-hosted 1.4x cheaper |
| 20% | EUR 0.70 | API 1.8x cheaper |
| 10% | EUR 1.40 | API 3.6x cheaper |
At 20% utilization, self-hosted 70B costs EUR 0.70/MTok for B+ quality output. The same money buys Gemini 2.5 Flash at B+ quality via API, with no hardware to maintain and 1M token context.
The GPU does not know you went to lunch. It depreciates whether it generates tokens or not.
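Because hardware cost per token scales as cost-at-100% divided by utilization, the breakeven point against any API price is a single division. A sketch using the numbers from the table above:

```python
def breakeven_utilization(self_cost_full_util: float, api_cost: float) -> float:
    # Self-hosted cost per MTok scales as cost_at_100% / utilization,
    # because the hardware depreciates whether or not it generates tokens.
    return self_cost_full_util / api_cost

# 70B on a Pro 6000 (EUR 0.14/MTok at 100%) vs DeepSeek V3.2 API (EUR 0.39):
print(f"{breakeven_utilization(0.14, 0.39):.0%}")   # 36%: below that, the API wins
```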
Hardware depreciation
The RTX Pro 6000 costs ~EUR 7,800 today. When NVIDIA ships Rubin (next-generation architecture, expected ~2027), the Pro 6000 will be worth roughly EUR 3,700-4,600. Three years out: EUR 2,300-2,800. GPU depreciation runs 40-60% over 2-3 years.
API subscriptions depreciate at 0%. Every month brings the latest models. No replacement planning, no resale.
The lease option (EUR 500/month for 18 months, EUR 9,000 total) costs more than buying outright.
Quantization
Quantization compresses model weights to use less memory and run faster, at some quality cost.
70B model
| Quantization | Bits/weight | 70B model size | Quality vs FP16 | Speed vs FP16 | Minimum GPU |
|---|---|---|---|---|---|
| FP16 (full precision) | 16.0 | ~140 GB | 100% | 1.0x (baseline) | 2x H100 SXM |
| Q8_0 | 8.0 | ~70 GB | ~99% | ~1.8-2.0x | 1x RTX Pro 6000 |
| Q6_K | 6.5 | ~54 GB | ~99% | ~2.3-2.5x | 1x RTX Pro 6000 |
| Q5_K_M | 5.5 | ~48 GB | ~96-97% | ~2.5-3.0x | 1x RTX Pro 6000 |
| Q4_K_M (sweet spot) | 4.8 | ~40 GB | ~92-95% | ~3.0-3.5x | 1x RTX Pro 6000 |
| Q3_K_M | 3.4 | ~33 GB | ~85-90% | ~3.5-4.0x | 2x RTX 3090 |
| Q3_K_S | 3.0 | ~28 GB | ~82-88% | ~4.0-4.5x | 2x RTX 4060 Ti 16GB |
| Q2_K (emergency) | 2.7 | ~27 GB | ~75-85% | ~4.0-5.0x | 1x RTX 5090 |
| AQLM 2-bit (trained codebooks) | 2.0 | ~18 GB | ~95-98% | GPU-only | 1x RTX 4090 |
| BitNet 1.58-bit (hypothetical) | 1.58 | ~14 GB | unknown at 70B | CPU-native | 1x RTX 3060 (no model exists) |
8B model (Llama 3.1 8B)
| Quantization | Bits/weight | 8B model size | Quality vs FP16 | Minimum GPU |
|---|---|---|---|---|
| FP16 | 16.0 | ~16 GB | 100% | 1x RTX 4090 (24 GB) |
| Q8_0 | 8.0 | ~8.5 GB | ~99% | 1x RX 5700 XT (8 GB, tight) |
| Q6_K | 6.5 | ~6.3 GB | ~99% | 1x RX 5700 XT (8 GB) |
| Q5_K_M | 5.5 | ~5.7 GB | ~98-99% | 1x RX 5700 XT (8 GB) |
| Q4_K_M (sweet spot) | 4.8 | ~4.7 GB | ~92-95% | 1x GTX 1060 6GB |
| Q3_K_M | 3.4 | ~3.7 GB | ~85-90% | 1x GTX 1050 Ti (4 GB, tight) |
| Q2_K | 2.7 | ~3.1 GB | ~75-85% | 1x GTX 1050 Ti (4 GB) |
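The size columns in both tables follow from bits-per-weight arithmetic. A sketch; GGUF "effective bits" already average the mixed-precision layers, so the results approximate the table values rather than match them exactly.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # Effective bits/weight for GGUF quants fold in mixed-precision layers,
    # so a plain bits/8 conversion approximates the file size.
    return params_billion * bits_per_weight / 8

print(round(model_size_gb(70, 4.8), 1))    # 42.0: the table's ~40 GB Q4_K_M 70B
print(round(model_size_gb(8, 4.8), 1))     # 4.8: the table's ~4.7 GB Q4_K_M 8B
print(round(model_size_gb(70, 1.58), 1))   # 13.8: the hypothetical BitNet 70B
```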
Gemma 4 26B-A4B (MoE)
Mixture-of-experts models store all 26B parameters but only activate 4B per token. Storage follows total parameter count; inference speed follows active parameter count.
| Quantization | Total size (26B stored) | Active per token (4B) | Minimum GPU | Effective speed |
|---|---|---|---|---|
| FP16 | ~52 GB | ~8 GB | 1x RTX Pro 6000 | Baseline |
| Q8_0 | ~26 GB | ~4 GB | 1x RTX 5090 (32 GB) | ~2x |
| Q4_K_M | ~16 GB | ~2.4 GB | 1x RTX 4060 Ti 16GB | ~3-3.5x |
| Q2_K | ~8 GB | ~1.3 GB | 1x RX 5700 XT (8 GB) | ~4-5x, quality loss |
At Q4_K_M, the full Gemma 4 MoE fits on a 16 GB card with room for KV cache, while only reading ~2.4 GB of active weights per token. That makes it extremely fast: the RTX Pro 6000 hits 252 t/s on a similar MoE architecture (Qwen3 30B-A3B).
Reading the tables
Q4_K_M is the standard tradeoff: 75% size reduction with 92-95% quality retention. Nearly every local LLM benchmark uses Q4_K_M as default.
Q6_K and Q5_K_M sit between Q8 and Q4 — negligible quality loss but smaller than Q8. Useful when you have the VRAM to spare but not enough for Q8.
Going below Q4 hurts. Q3_K_M is usable but noticeably worse on complex reasoning. Q2_K loses 15-25% of model quality. At that point the model is fighting quantization artifacts on top of its limitations versus frontier models. A 70B at Q2 produces worse output than a 32B at Q4.
AQLM 2-bit is the exception: trained codebooks retain far more quality than naive Q2_K at the same bit width. The tradeoff is GPU-only inference and limited model availability.
Extreme quantization: ternary weights and sub-2-bit
All of the above methods are post-training quantization, compressing an FP16 model after it has been trained. A different approach trains the model with quantized weights from the start.
BitNet b1.58 (Microsoft Research / Tsinghua University) constrains every weight to one of three values: {-1, 0, +1}. That is log2(3) = 1.58 bits per weight. Multiplications become additions. A 2B parameter model fits in 0.4 GB instead of ~4 GB at FP16, roughly 10x smaller.
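The bit-width and footprint claims are a two-line calculation:

```python
import math

# Three possible weight values {-1, 0, +1} carry log2(3) bits of information.
bits = math.log2(3)
print(round(bits, 2))        # 1.58

# A 2B-parameter ternary model packed near that limit:
size_gb = 2e9 * bits / 8 / 1e9
print(round(size_gb, 2))     # 0.4 GB, vs ~4 GB at FP16 (roughly 10x smaller)
```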
The only production-quality open model is bitnet-b1.58-2B-4T (2 billion parameters, 4 trillion training tokens, Apache 2.0 license). Community models exist at 8B but are experimental.
| Model | MMLU (5-shot) | Size | Speed (i7-13800H) | EUR/MTok (15W laptop) |
|---|---|---|---|---|
| BitNet b1.58 2B | 53.17 | 0.4 GB | ~34 t/s | EUR 0.006 |
| Llama 3.2 1B (FP16) | 49.30 | ~2.5 GB | ~21 t/s | EUR 0.010 |
| Qwen2.5 1.5B (FP16) | 60.25 | ~3 GB | ~15 t/s | EUR 0.014 |
The energy savings are real: 71-82% less energy than FP16 on x86, 55-70% on ARM. These numbers come from Microsoft's bitnet.cpp paper and have been verified across multiple sources. The inference framework (bitnet.cpp) is production-ready on CPU, with GPU support added in May 2025.
There is one catch that matters for anyone evaluating this: speed and energy benefits only appear when using bitnet.cpp. Loading the model in standard HuggingFace transformers gives you the smaller memory footprint but none of the ternary kernel speedups. You cannot use Ollama, LM Studio, or any other llama.cpp frontend for native BitNet inference as of April 2026.
Post-training methods can also push below 2 bits. AQLM (Yandex/IST Austria, ICML 2024) compresses Llama 2 7B to 2.02 bits with a WikiText2 perplexity of 6.59, versus 5.47 at FP16, roughly 2-5% task quality degradation. QTIP (NeurIPS 2024 Spotlight) achieves similar quality at 2-bit with >3x inference speedup. These are GPU-only methods: they reduce VRAM usage, not power draw.
Does extreme quantization change the economics?
At first glance, 10x smaller models should flip the self-hosting math. A 2B model in 0.4 GB running on a 15W laptop at 25 tokens per second costs about EUR 0.006 per million tokens in electricity, 6x cheaper than Mistral Nemo API.
The problem is quality. BitNet 2B scores MMLU 53. Mistral Nemo 12B scores around 70. These are not the same tier of model. A 2B model, regardless of quantization method, cannot follow complex instructions, write useful code, or maintain coherence over long conversations. Saving EUR 0.03 per million tokens while getting dramatically worse output is not a savings.
The real promise is running larger models on cheaper hardware. A hypothetical 70B BitNet model would need ~18 GB instead of 140 GB, fitting on a single RTX 3090. But the model zoo has not caught up to the technique. Microsoft's verified model is 2B. PrismML released Bonsai 8B in March 2026, claiming 1.15 GB footprint and 368 t/s on an RTX 4090, but these are vendor claims and independent benchmarks are still sparse. Until larger ternary models are independently verified, extreme quantization changes what is theoretically possible without changing what you can actually run today.
Software
Running a model locally in 2026 is straightforward. The tooling is good.
Ollama wraps llama.cpp behind a REST API with model management. Install, pull a model, start chatting.
llama.cpp is the foundation. CUDA, Vulkan, Metal, CPU with AVX-512/AMX. GGUF quantization format.
ik_llama.cpp is an optimized fork of llama.cpp with rewritten CPU kernels. 1.8-5.2x faster prompt processing on AVX2/AVX-512 CPUs. Same GGUF models, drop-in replacement. Use this instead of mainline llama.cpp for CPU-bound workloads. Benchmarks in the CPU comparison section.
vLLM handles concurrent serving with PagedAttention and continuous batching. Use this when serving multiple users from one GPU.
bitnet.cpp is required for native ternary inference. Separate from llama.cpp. CPU-primary, with CUDA support since May 2025. Requires Clang 18+ to build.
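Since Ollama exposes a plain REST API, a minimal client needs only the Python standard library. This is a sketch against Ollama's documented `/api/generate` endpoint; the model tag is an example and must already be pulled locally:

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"  # Ollama's default listen address

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks Ollama for one complete JSON body
    # instead of a stream of partial chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.1:8b") -> str:
    req = urllib.request.Request(
        f"{OLLAMA}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # the non-streaming response carries the text in "response"
        return json.loads(resp.read())["response"]

# no network call at import time; just show the request body
print(build_payload("llama3.1:8b", "Explain KV cache in one sentence."))
```

Call `generate(...)` with the daemon running; vLLM and llama.cpp's own server expose OpenAI-compatible endpoints instead, so this exact shape applies only to Ollama.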
The cheapest hardware myth
A common misconception: "I already own the hardware, so inference is free." It is not. Electricity costs money, and consumer GPUs draw 230-415W under inference load. Even ignoring the purchase price entirely, for any model capable enough to be worth running, the electricity to generate tokens locally costs more per token than calling a budget API.
Proof: idle desktop CPU
Take an Intel i5-7500T in an HP ProDesk 400 G3 Desktop Mini: a machine that costs EUR 150 used and sits idle most of the day. DDR4-2400 dual-channel gives ~38 GB/s of theoretical memory bandwidth. At 50% efficiency with 4 threads, that translates to roughly 4-5 tokens per second on an 8B Q4 model.
The system draws about 22W idle and 45W under inference load. The marginal 23W for inference costs EUR 2.48 per month in electricity at Finnish rates (EUR 0.15/kWh).
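The electricity figure is straightforward to reproduce:

```python
marginal_watts = 45 - 22   # load minus idle, i5-7500T
rate_eur_kwh = 0.15        # Finnish residential average
hours_month = 24 * 30      # 24/7 operation

kwh = marginal_watts / 1000 * hours_month
print(f"{kwh:.1f} kWh -> EUR {kwh * rate_eur_kwh:.2f}/month")
```

Using marginal power (load minus idle) charges inference only for the watts it adds; the machine would draw its idle 22W either way.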
| Model | Size on disk | Tokens/sec (est.) | Tokens/month (24/7) | EUR/MTok (electricity only) | vs Mistral Nemo API |
|---|---|---|---|---|---|
| Qwen3 0.6B Q4 | ~0.4 GB | ~40 | 103.7M | EUR 0.024 | 1.5x cheaper, but trivial tasks only |
| BitNet b1.58 2B (ternary) | 0.4 GB | ~15-25 | 51.8M | EUR 0.048 | 1.3x more expensive, D-tier quality |
| TinyLlama 1.1B Q4 | ~0.7 GB | ~28 | 72.6M | EUR 0.034 | Break-even, basic completion only |
| Bonsai 8B (1-bit, unverified) | 1.15 GB | ~15 | 38.9M | EUR 0.064 | 1.7x more expensive, claims 8B quality |
| Phi-4 Mini 3.8B Q4 | ~2.2 GB | ~7.6 | 19.7M | EUR 0.126 | 3.4x more expensive |
| Llama 3.1 8B Q4 | ~4.7 GB | ~4-5 | 11.7M | EUR 0.212 | 5.7x more expensive |
BitNet b1.58 2B runs via bitnet.cpp (not llama.cpp). The ternary weights reduce memory reads, but the i5-7500T's 4 cores with AVX2-only still cap throughput around 15-25 t/s. Bonsai 8B speed is estimated from memory bandwidth; PrismML's claimed 368 t/s is on an RTX 4090 GPU. Independent CPU benchmarks for Bonsai are sparse as of April 2026.
At the same C+ quality tier (8B model), the API costs EUR 0.037 per million tokens. The i5-7500T costs EUR 0.212. The API is 5.7 times cheaper on electricity alone, before counting the computer's purchase price.
The extreme quantization models do not change the math. BitNet 2B at EUR 0.048/MTok is 1.3x more expensive than the API and produces D-tier output (MMLU 53 vs Mistral Nemo's ~70). The only models where electricity beats the API are sub-1B parameter models at Q4, and their output quality is so far below Mistral Nemo that the comparison is meaningless.
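All the per-token electricity numbers in these tables reduce to one formula over watts, tokens per second, and the electricity rate. A sketch:

```python
def eur_per_mtok(marginal_watts: float, tokens_per_sec: float,
                 rate_eur_kwh: float = 0.15) -> float:
    """Electricity cost per million generated tokens."""
    kwh_per_mtok = marginal_watts / 1000 / (tokens_per_sec * 3600) * 1e6
    return kwh_per_mtok * rate_eur_kwh

# i5-7500T on Llama 3.1 8B Q4: 23W marginal at ~4.5 t/s
print(f"EUR {eur_per_mtok(23, 4.5):.3f}/MTok")
```

That evaluates to roughly EUR 0.213/MTok, matching the table to rounding. The GPU tables later in the article apply the same formula with full system watts instead of marginal watts.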
CPU comparison: DDR4 to DDR5
The i5-7500T is the cheapest case. How do faster CPUs compare? Memory bandwidth is the bottleneck — more channels, faster memory, more tokens per second.
| CPU | Memory | Bandwidth (theoretical) | Cores | System idle/load | Llama 3.1 8B Q4 t/s (est.) | EUR/MTok (marginal electricity) |
|---|---|---|---|---|---|---|
| Intel N100 | DDR5-4800 1-ch | 38.4 GB/s | 4E | 8W / 16W | ~3 | EUR 0.111 |
| Intel i3-N305 | DDR5-4800 1-ch | 38.4 GB/s | 8E | 10W / 22W | ~3.5 | EUR 0.143 |
| Intel i5-7500T | DDR4-2400 2-ch | 38.4 GB/s | 4C/4T | 22W / 45W | ~4-5 | EUR 0.212 |
| Intel i5-8500T | DDR4-2666 2-ch | 42.7 GB/s | 6C/6T | 22W / 48W | ~5-6 | EUR 0.180 |
| AMD Ryzen 5 5600X | DDR4-3200 2-ch | 51.2 GB/s | 6C/12T | 45W / 95W | ~6-7 | EUR 0.297 |
| AMD Ryzen 9 5900X | DDR4-3200 2-ch | 51.2 GB/s | 12C/24T | 55W / 120W | ~6-8 | EUR 0.338 |
| AMD Ryzen 9 7950X | DDR5-5200 2-ch | 83.2 GB/s | 16C/32T | 65W / 145W | ~10-12 | EUR 0.277 |
Electricity rate: EUR 0.15/kWh. EUR/MTok calculated from marginal power (load minus idle) at estimated token rate, 24/7 operation. N100 and i3-N305 are single-channel memory, which caps bandwidth despite DDR5 speeds. The Ryzen 7950X is 2-3x faster than the i5-7500T thanks to DDR5 dual-channel (83 GB/s vs 38 GB/s).
Every CPU in this table costs 3x to 9x more per token in electricity than the Mistral Nemo API (EUR 0.037/MTok). The fastest desktop CPU tested (Ryzen 7950X) still costs 7.5x more. More cores do not help — dual-channel DDR4 or DDR5 memory bandwidth is the wall.
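The tokens-per-second estimates come from the memory-bandwidth rule of thumb stated in the references (theoretical t/s ≈ bandwidth / model size, with real-world efficiency around 50-70%):

```python
def est_tokens_per_sec(bandwidth_gbs: float, model_gb: float,
                       efficiency: float = 0.5) -> float:
    """Decode is memory-bound: each token streams the full model once."""
    return bandwidth_gbs / model_gb * efficiency

# Llama 3.1 8B Q4 (~4.7 GB on disk)
print(est_tokens_per_sec(38.4, 4.7))        # i5-7500T, DDR4-2400 2-ch
print(est_tokens_per_sec(83.2, 4.7, 0.65))  # Ryzen 7950X, DDR5-5200 2-ch
```

That gives ~4.1 t/s for the i5-7500T and ~11.5 t/s for the 7950X at a slightly higher efficiency, consistent with the table's estimates. Efficiency varies with thread count, kernel quality, and memory timings, so these are estimates rather than measurements.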
One caveat before treating every llama.cpp build as equal: ik_llama.cpp, an optimized fork, achieves 1.8-5.2x faster prompt processing than mainline llama.cpp on the same hardware. Benchmarks on a Ryzen 7950X (16 threads, AVX2):
| Quantization | ik_llama.cpp (t/s) | llama.cpp (t/s) | Speedup |
|---|---|---|---|
| BF16 | 256.9 | 78.6 | 3.3x |
| Q8_0 | 268.2 | 147.9 | 1.8x |
| Q4_0 | 273.5 | 153.5 | 1.8x |
| IQ3_S | 156.5 | 30.2 | 5.2x |
| Q8_K_R8 | 370.1 | N/A | new format |
These are prompt processing (prefill) speeds, not generation. Generation improves a more modest 1.0-2.1x. The key finding: on the Ryzen 7950X, the slowest quantization type in ik_llama.cpp is faster than the fastest type in mainline llama.cpp for prompt processing. For CPU inference workloads that involve large prompts — document summarization, RAG extraction, batch classification — the fork makes a material difference. It does not change the electricity cost comparison (the watts are the same), but it reduces wall-clock time per job, which matters for throughput-limited batch work.
Proof: every consumer GPU loses
The pattern holds across dedicated GPUs. Inference power draw runs about 55-75% of the card's TDP (the decode phase is memory-bandwidth-bound, not compute-bound). Add 120W for the rest of the system.
| GPU | VRAM | Llama 3.1 8B Q4 tok/s | System watts | EUR/MTok (electricity) | vs API |
|---|---|---|---|---|---|
| RX 5700 XT | 8 GB | ~55 (Vulkan, no ROCm) | 266W | EUR 0.201 | 5.4x |
| RX 6700 XT | 12 GB | ~45 | 270W | EUR 0.250 | 6.8x |
| RX 6900 XT | 16 GB | ~60 | 315W | EUR 0.219 | 5.9x |
| RTX 3060 | 12 GB | ~42 | 230W | EUR 0.228 | 6.2x |
| RTX 3090 | 24 GB | ~87 | 345W | EUR 0.165 | 4.5x |
| RTX 4060 Ti 16GB | 16 GB | ~48 | 227W | EUR 0.197 | 5.3x |
| RTX 4090 | 24 GB | ~128 | 413W | EUR 0.134 | 3.6x |
| RX 7900 XTX | 24 GB | ~80 | 351W | EUR 0.183 | 4.9x |
API baseline: Mistral Nemo at EUR 0.037/MTok ($0.04/MTok output). Electricity rate: EUR 0.15/kWh. RX 5700 XT runs Vulkan only (ROCm dropped Navi 10 support); speed scaled from llama.cpp Vulkan scoreboard data.
Every consumer GPU costs 3.6 to 6.8 times more in electricity per token than the API. The RTX 4090, the fastest consumer card, still costs 3.6x more. At German electricity rates (EUR 0.30/kWh), it costs 7.2x more.
The ex-mining AMD cards are fast for their price but still lose to the API. The RX 5700 XT draws 266W system power for ~55 tokens per second — its wide 256-bit memory bus (448 GB/s) makes it the fastest sub-$200 used card. The RX 6900 XT draws 315W for ~60 t/s: 18% more power for 9% more speed over the 5700 XT, because both share a 256-bit bus and differ mainly in clock speed.
For the best consumer GPU (RTX 4090) to break even with Mistral Nemo API on electricity alone, the electricity rate would need to be EUR 0.041/kWh. That is below any residential rate in Europe.
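The break-even rate falls out of the same electricity math, solved for the rate at which the RTX 4090's cost per million tokens equals the API price:

```python
api_eur_mtok = 0.037    # Mistral Nemo output pricing
system_watts = 413      # RTX 4090 system under inference load
tokens_per_sec = 128

kwh_per_mtok = system_watts / 1000 / (tokens_per_sec * 3600) * 1e6
breakeven_rate = api_eur_mtok / kwh_per_mtok
print(f"break-even at EUR {breakeven_rate:.3f}/kWh")
```

At ~0.90 kWh per million tokens, the electricity rate would have to drop to about EUR 0.041/kWh for the 4090 to match the API.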
Enterprise GPUs: faster but still losing
Enterprise GPUs have HBM memory with 2-4x the bandwidth of consumer GDDR. They are faster per token but draw more power and sit in servers with higher base consumption. System power assumes +200W for a server chassis (vs +120W for a desktop).
| GPU | VRAM | Bandwidth | 8B Q4 tok/s (est.) | System watts | EUR/MTok (electricity) | vs API | Purchase price (Apr 2026 est.) |
|---|---|---|---|---|---|---|---|
| L40S | 48 GB GDDR6 | 864 GB/s | ~110 | 550W | EUR 0.208 | 5.6x | $7,500-9,000 |
| A100 40GB PCIe | 40 GB HBM2e | 1,555 GB/s | ~195 | 450W | EUR 0.096 | 2.6x | $5,000-8,000 (used) |
| A100 80GB SXM | 80 GB HBM2e | 2,039 GB/s | ~260 | 600W | EUR 0.096 | 2.6x | $12,000-18,000 |
| RTX Pro 6000 | 96 GB GDDR7 | 1,792 GB/s | ~279 (7B bench) | 720W | EUR 0.107 | 2.9x | $8,000-9,200 |
| H100 SXM | 80 GB HBM3 | 3,352 GB/s | ~400 | 900W | EUR 0.094 | 2.5x | $22,000-30,000 |
The H100, the fastest GPU in this comparison, still costs 2.5x more per token in electricity than the API. The L40S costs more per token than the best consumer GPUs: its 864 GB/s of GDDR6 bandwidth trails the RTX 4090's 1,008 GB/s, and the server chassis draws more system power.
Enterprise GPUs close the gap but do not flip the math. Where they earn their price is batched serving: an H100 running vLLM with 32 concurrent users achieves aggregate throughput that amortizes the power draw across requests. Single-stream single-user inference, which is what this table measures, is the worst case for expensive hardware.
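To see how much batching changes the picture, apply the same electricity formula to the vLLM aggregate throughput cited in the references (8B at 8,990 t/s on the RTX Pro 6000). Power draw under full batch load is likely higher than the single-stream figure reused here, so treat the batched number as an optimistic illustration:

```python
def eur_per_mtok(system_watts: float, tokens_per_sec: float,
                 rate_eur_kwh: float = 0.15) -> float:
    return system_watts / 1000 / (tokens_per_sec * 3600) * 1e6 * rate_eur_kwh

watts = 720                              # RTX Pro 6000 server, from the table
print(eur_per_mtok(watts, 279))          # single-stream: ~EUR 0.107/MTok
print(eur_per_mtok(watts, 8990))         # batched aggregate: ~EUR 0.003/MTok
```

Batched, the per-token electricity cost drops more than 30x and lands well below the EUR 0.037/MTok API baseline. This is why the economics only work for operators who can keep the batch queue full.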
What this means
Hardware cost, depreciation, driver maintenance, cooling, and noise all come on top of electricity. If electricity alone already loses to the API, the total cost is not getting better.
Self-hosting has real advantages: privacy, zero network latency, offline capability, fine-tuning. Those are legitimate reasons to run local inference. "It is cheaper" is not one of them.
When self-hosting makes sense
If data cannot leave your premises (regulatory, air-gapped, classified), API access is off the table. Self-hosted is the only option and cost comparisons do not apply.
Bulk classification, embedding generation, and document processing with a 7-8B model on modest hardware is cheap and fast. A used RTX 3090 ($800-900) runs Llama 3.1 8B at 87-112 t/s depending on configuration. These tasks do not need frontier quality.
Genuine 24/7 batch workloads keeping a GPU above 80% utilization get real per-token savings. This applies to organizations running inference as a service, not individuals using LLMs as assistants.
Fine-tuning requires local hardware. You cannot fine-tune through a subscription. LoRA adapters for an 8B model train on a single RTX 3090 in hours. Full fine-tuning of a 70B model needs 2-4x 80 GB GPUs. If your use case requires a custom model trained on proprietary data, self-hosting is not optional.
Needle-in-a-haystack retrieval and semantic search over large private document collections are where local inference earns its cost. RAG pipelines that embed and search hundreds of thousands of documents generate millions of tokens per day in embedding and extraction queries. These are high-volume, low-quality-bar tasks where an 8B model is sufficient and API costs would compound. A single GPU running embeddings at 80-100% utilization processes the volume at a fraction of API pricing because the per-token overhead is amortized over sustained throughput.
Embedding models are often the strongest case for local deployment. At 22M-334M parameters, they are distinct from generative LLMs: they produce vector representations for semantic search and RAG, not text, and they run on any CPU at hundreds to thousands of embeddings per second with negligible power draw and 1-2 GB of RAM total.
| Model | Parameters | Use case | Speed (CPU) | RAM |
|---|---|---|---|---|
| all-minilm-l6-v2 | 22M | fast filtering, similarity | ~5,000 embed/sec | <1 GB |
| nomic-embed-text | 274M | semantic search, RAG | ~1,000 embed/sec | ~1 GB |
| mxbai-embed-large | 334M | high-quality embeddings | ~500 embed/sec | ~1.5 GB |
A 22M embedding model running on any desktop CPU powers local semantic search over thousands of documents with zero API cost. For infrastructure use cases — log analysis, document similarity, RAG retrieval — embedding models deliver more production value than 1-4B generative models. They are the one category where "I already own the hardware" is genuinely true: the compute is trivial, the electricity is negligible, and the privacy benefit is real. API-based embedding (OpenAI ada-002 at $0.10/MTok) adds up fast at scale; local embedding is effectively free after the one-time setup.
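A rough cost comparison illustrates the scale argument. The corpus size, tokens per document, and CPU power draw below are hypothetical round numbers:

```python
docs = 500_000          # hypothetical corpus
tokens_per_doc = 500    # assumed average document length
total_mtok = docs * tokens_per_doc / 1e6   # 250 MTok

# API: OpenAI ada-002 at $0.10 per million tokens
api_cost_usd = total_mtok * 0.10

# local: ~274M-param model at ~1,000 embed/sec on a ~50W desktop CPU
seconds = docs / 1000
kwh = 50 / 1000 * seconds / 3600
electricity_eur = kwh * 0.15

print(f"API: ${api_cost_usd:.0f}, local electricity: EUR {electricity_eur:.4f}")
```

The one-time embedding run costs $25 via the API versus a fraction of a cent in local electricity, and re-embedding on model upgrades or continuous ingest repeats the API bill while the local cost stays negligible.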
Document summarization at scale follows the same pattern. Summarizing 10,000 PDFs through an API costs real money at $0.04-3.00 per million tokens depending on the model tier. On local hardware, the marginal cost is electricity and the model runs as fast as bandwidth allows. A Ryzen 7950X with 64 GB RAM can summarize documents at ~10 t/s on an 8B model continuously without per-request billing. An RTX 3090 does the same at ~87 t/s.
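With an assumed token budget per document (the 2,000-token figure below is hypothetical), the wall-clock arithmetic looks like this:

```python
docs = 10_000
tokens_per_doc = 2_000   # assumed generated tokens per PDF summary
total_tokens = docs * tokens_per_doc

for name, tps in [("Ryzen 7950X (~10 t/s)", 10), ("RTX 3090 (~87 t/s)", 87)]:
    days = total_tokens / tps / 86_400
    print(f"{name}: {days:.1f} days of continuous generation")
```

Roughly 23 days on the CPU versus under 3 days on the RTX 3090, with no per-request billing either way; only the electricity meter runs.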
When APIs win
For coding, multi-step reasoning, and agentic workflows, APIs win on both quality and effective cost.
A B+ local model looks adequate on simple prompts. The gap shows on hard problems: multi-file refactoring, agent coherence past 20 turns, judgment calls in ambiguous situations. The output compiles. It does the wrong thing. These are the tasks where LLMs are worth using, and where frontier models justify their pricing.
At 20-30% utilization, APIs often cost less per token than self-hosted inference. Add maintenance (driver bugs, cooling, kernel compatibility, GPU replacement) and the gap widens.
The DeepSeek V3 case study
DeepSeek V3 API costs about EUR 260/month for a heavy workload (130K queries/month at 74K token context). Self-hosting the same model requires 8x H100 SXM GPUs.
| Configuration | Upfront cost | Monthly operating | vs API (EUR 260/mo) |
|---|---|---|---|
| DeepSeek V3 on 8x H100 (new) | EUR 320K | EUR 9,600 | 37x more |
| DeepSeek V3 on 8x H100 (used) | EUR 200K | EUR 6,260 | 24x more |
| DeepSeek V3 on 2x MI300X | EUR 37K | EUR 1,230 | 5x more |
| Qwen3-30B-A3B on 2x RTX 4090 (substitute) | EUR 7,600 | EUR 350 | 35% more (lower quality) |
Even the cheapest alternative that might work, a smaller model on consumer GPUs, still runs a higher monthly bill than the API. Counting electricity alone, the Qwen3-30B-A3B option saves about EUR 105 per month, which puts hardware break-even at EUR 7,600 / EUR 105 ≈ 72 months (6 years). Fold in 3-year hardware amortization and break-even is never reached: the cards are fully depreciated before they pay for themselves.
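The break-even arithmetic:

```python
hardware_eur = 7_600          # 2x RTX 4090 for Qwen3-30B-A3B
monthly_savings_eur = 105     # electricity-only savings vs the API

months = hardware_eur / monthly_savings_eur
print(f"{months:.0f} months ({months / 12:.0f} years)")
```

Six years exceeds both the typical 3-year amortization window and, realistically, the useful life of consumer cards under 24/7 load.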
The API is almost certainly priced below marginal cost for individual users. DeepSeek benefits from massive scale, strategic pricing, and Chinese electricity rates. Competing on price with a subsidized API by buying your own hardware is a losing proposition at any scale short of running your own inference service.
API pricing is variable: you pay for what you use. GPU hardware is a fixed cost that depreciates whether it is busy or idle. For workloads that vary week to week, the API wastes less money.
At Pulsed Media
Pulsed Media runs its own hardware in owned datacenter space in Finland. GPU inference cards like the RTX Pro 6000 go through the same depreciation, power cost, and rack space analysis as any other piece of infrastructure.
For most workloads in 2026, cloud APIs deliver better quality per euro than local inference. Bulk classification, embeddings, and privacy-constrained tasks are where the GPU math works. 16 years of hardware purchasing says: buy when the math supports it, use the API when it does not.
Seedboxes and dedicated servers from Pulsed Media run on the same Finnish datacenter infrastructure, with the same cost-per-unit discipline applied to every piece of hardware. See current plans.
References
Benchmarks and performance data
- LMSYS Chatbot Arena ELO Leaderboard — model quality rankings used for Arena ELO scores throughout this article
- llama.cpp Vulkan GPU Scoreboard — community-submitted GPU inference benchmarks (Llama 2 7B Q4_0, tg128). Source for RX 5700 XT (70.73 t/s), GTX 1050 Ti (20.96 t/s), RTX 3060 (75.94-80.59 t/s), and other consumer GPU speeds
- llama.cpp — the inference engine behind Ollama and most local LLM benchmarks
- vLLM — batched serving engine; source for Pro 6000 throughput numbers (8B at 8,990 t/s, 70B at 1,031 t/s)
Model cards and technical reports
- Gemma 4 model family — 31B dense and 26B-A4B MoE variants. Arena ELO ~1450 (31B) and ~1441 (26B-A4B)
- Llama 3.3 70B — the 70B dense model used for Pro 6000 and consumer GPU benchmarks
- Llama 3.1 8B — the 8B model used as baseline for electricity cost and bandwidth comparisons
- Qwen3 model family — 0.6B through 235B; 30B-A3B MoE benchmark source
- Phi-4 Mini 3.8B model card — MMLU 67.3%, GSM8K 88.6%
- BitNet b1.58 2B-4T model card — MMLU 53.2%, energy and latency benchmarks
- bitnet.cpp — native ternary inference framework
Research papers
- bitnet.cpp: Efficient Edge Inference for Ternary LLMs (Microsoft Research, 2024) — energy per token and CPU latency measurements for BitNet models
- AQLM: Extreme Compression of LLMs via Additive Quantization (ICML 2024) — 2-bit quantization with 2-5% quality degradation
- QTIP: Quantization with Trellises and Incoherence Processing (NeurIPS 2024 Spotlight) — 2-bit quantization with 3x+ inference speedup
- Chroma 2025 study — context rot: 20-50% accuracy degradation from 10K to 100K tokens across 18 frontier models
- Liu et al. 2024 — "lost in the middle" effect: systematic attention bias toward beginning and end of context
- Sequential-NIAH (EMNLP 2025, arXiv 2504.04713) — best model scored 63.5% on ordered extraction across long context
- NoLiMa (ICML 2025) — non-literal long-context evaluation: semantic retrieval degrades worse than passkey benchmarks suggest
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Google Research, ICLR 2026) — KV cache compression to 2.5-3.5 bits with near-zero quality loss
CPU inference and memory bandwidth
- AMD EPYC 9554 llama.cpp benchmarks (ahelpme.com) — EPYC 9554 achieves 50 t/s on 8B Q4_K_M, confirming memory bandwidth as primary bottleneck
- OpenBenchmarking.org llama.cpp results — 88 public CPU benchmark results, median 14.2 t/s on 8B Q8_0
- ik_llama.cpp CPU performance — Ryzen 7950X benchmarks showing 1.8-5.2x speedup over mainline llama.cpp for prompt processing
- ik_llama.cpp — optimized llama.cpp fork with improved CPU kernels
- CPU token rate formula: theoretical max t/s ≈ memory bandwidth (GB/s) / model size (GB); actual ~50-70% of theoretical due to overhead
Hardware specifications
- Speeds in this article use Q4_K_M quantization unless otherwise noted. Consumer GPU system power includes GPU + ~120W for the rest of the system; enterprise GPUs use +200W for server chassis. Electricity rate: EUR 0.15/kWh (Finnish residential average). Enterprise GPU purchase prices are estimated April 2026 market rates and shift rapidly
- VRAM context limits use Llama 3.1 8B KV cache parameters (128 MB/1K tokens at FP16, 64 MB/1K at Q8, 32 MB/1K at Q4). Other model architectures vary; Qwen3 8B uses ~144 MB/1K and Gemma 4 31B dense uses ~850 MB/1K due to its 16 KV-head architecture
See also
- NVIDIA RTX Pro 6000 (Blackwell) — full specs, benchmarks, and known issues for the 96 GB GPU referenced throughout this article
- Seedbox — PM's primary product, where the same hardware cost economics apply
- NVMe — storage interface used alongside GPU inference servers
- RAID — storage redundancy in PM's infrastructure