Self-Hosting LLMs vs API

From Pulsed Media Wiki


Pulsed Media owns its datacenter space and runs its own hardware. GPU inference cards get evaluated against the same cost-per-unit math that prices seedboxes and dedicated servers. In April 2026, the math says: cloud APIs win for quality-sensitive work, self-hosted GPUs win for bulk processing and privacy-constrained workloads.

This article lays out the hardware benchmarks, API pricing, and economics behind that conclusion.

The quality problem

The first obstacle is not cost. It is quality.

The best open-weight model you can run on a single consumer GPU (24 GB) is Gemma 4 31B. It scores approximately 1450 on the LMSYS Chatbot Arena ELO scale (arena.ai, preliminary, April 2026). That is 20 points below Sonnet and 50 below Opus. The gap sounds small until you measure it on hard tasks.

Model Arena ELO (approx) Access Hardware needed
Claude Opus 4.6 ~1500 API / subscription Cloud
Claude Sonnet 4.6 ~1470 API / subscription Cloud
GLM-5 744B (dense) ~1451 Open weights 6x 96 GB GPUs (~$48,000)
Gemma 4 31B (dense) ~1450 Open weights 1x 24 GB GPU at Q4 (~$1,600)
Kimi K2.5 1T (MoE) ~1447 Open weights 8x 96 GB GPUs (~$64,000)
Gemma 4 26B-A4B (MoE) ~1441 Open weights 1x 24 GB GPU at Q4 (~$1,600)
Qwen3-235B-A22B (MoE) ~1422 Open weights 2x 96 GB GPUs (~$16,000)
DeepSeek V3.2 685B (MoE, 37B active) ~1421 Open weights 5x 96 GB GPUs (~$40,000)
DeepSeek R1 671B (MoE, reasoning) ~1398 Open weights / API 5x 96 GB GPUs (~$40,000)
Llama 3.3 70B (dense) ~1350-1400 Open weights 1x 96 GB GPU (~$8,500)

The table splits into two tiers. Models you can run on a single GPU ($1,600-8,500) top out at ELO ~1450. Models that approach Sonnet territory (GLM-5 at 1451, Kimi K2.5 at 1447) require multi-GPU setups. The costs shown assume RTX Pro 6000 cards at ~$8,000 each via PCIe — but without NVLink, tensor parallelism runs at roughly one-third the throughput of datacenter H100s. An NVLink-equipped H100 SXM configuration running the same models costs 3-5x more ($150,000-320,000).

Gemma 4 31B scores 80.0% on LiveCodeBench v6 and 84.3% on GPQA Diamond. These are strong numbers for an open model. But on agentic workflows where models must maintain coherence across 20+ tool calls, open models fail silently: the output compiles but does the wrong thing, or the agent loops instead of converging. These failures are expensive because nobody notices until the result is wrong.

No amount of hardware spending changes this. The quality ceiling is in the model weights. Even $64,000 in GPUs running Kimi K2.5 does not reach Sonnet.

Quality vs hardware: what each GPU tier can actually run

The quality gap depends on what fits on the hardware you own.

VRAM Best model (Q4) Arena ELO Quality tier Gap to Sonnet
4 GB Phi-4 Mini 3.8B or smaller ~1100-1200 D ~270+ ELO
8 GB Llama 3.1 8B (tight) ~1250-1300 C/C+ ~170-220 ELO
12 GB Qwen3 14B ~1350 B- ~120 ELO
16 GB Qwen3 14B (comfortable) ~1350 B- ~120 ELO
24 GB Gemma 4 31B ~1450 B+/A- ~20 ELO
32 GB Gemma 4 31B at Q8 ~1450+ A- ~20 ELO
48 GB Llama 3.3 70B Q4 ~1350-1400 B/B+ ~70-120 ELO
96 GB Llama 3.3 70B Q8 ~1400 B+ ~70 ELO

At 4-8 GB VRAM, you are running models that struggle with multi-step reasoning and produce noticeably worse output than a $0.04/MTok API call to Mistral Nemo. At 24 GB, Gemma 4 closes the ELO gap but still falls short on the hardest tasks. Only at 96 GB do you get a 70B model at full quality, and even that is still B+ tier versus API frontier at A+/S.

What 4-8 GB VRAM can run

Most consumer GPUs sold in the last five years have 4-8 GB of VRAM. Can they do anything useful with local LLMs?

4 GB (GTX 1050 Ti, RX 570, integrated graphics)

At 4 GB, the only models that fit are sub-3B at Q4:

Model Size at Q4 Speed (est.) Useful for
Qwen3 0.6B ~0.5 GB 30-50 t/s Text classification, simple extraction
Llama 3.2 1B ~0.8 GB 20-30 t/s Basic summarization, translation
TinyLlama 1.1B ~0.7 GB 25-35 t/s Research, experimentation
BitNet b1.58 2B 0.4 GB 10-20 t/s (CPU) Research only
Phi-4 Mini 3.8B ~2.3 GB 15-25 t/s Simple coding, Q&A

These models cannot write useful code, follow complex instructions, or maintain coherence past a few exchanges. A 1B model completing a sentence is not the same as a 70B model reasoning about a problem. The quality difference is categorical.

8 GB (RTX 3060 8GB, RTX 4060, RX 6600)

At 8 GB, a 7B/8B model at Q4 fits with room for ~32K context (Q8 KV cache):

Model Size at Q4 Speed (GPU) Max context (Q8 KV) Quality
Llama 3.1 8B ~4.8 GB ~40-50 t/s ~32K C+
Qwen3 8B ~5.2 GB ~35-45 t/s ~28K C+
Gemma 4 E4B ~5 GB ~35-45 t/s ~28K C
Mistral Nemo 12B ~7.4 GB ~25-35 t/s ~4-8K C+

An 8B model on an 8 GB GPU is the minimum setup that produces output comparable to budget APIs. The constraint is context: at Q4 KV cache, 64K tokens are feasible but tight. At Q8 KV (recommended for retrieval tasks), the ceiling is ~32K.

Mistral Nemo 12B fits at Q4 but leaves almost no room for KV cache. Short conversations only.

Memory bandwidth is the bottleneck

LLM token generation is memory-bandwidth-bound. The GPU reads model weights from VRAM for every single token. Read speed determines generation speed.

Platform Memory bandwidth Llama 3.3 70B Q4 tok/s Relative speed
RTX Pro 6000 1,792 GB/s ~34 1.0x
H100 SXM 3,352 GB/s ~40 1.2x
Strix Halo (128 GB unified) 215 GB/s ~4.5 0.13x
DDR5 desktop (dual channel) ~89 GB/s ~2.2 0.06x
DDR4-3200 desktop (dual channel) ~42 GB/s ~1 0.03x

The Pro 6000 and H100 are close on single-card workloads because both have enough bandwidth to keep a 70B model fed. The Pro 6000 compensates for its lower raw bandwidth with more aggressive quantization (Q4 vs FP8): fewer bytes per weight, similar throughput. The H100 pulls ahead with NVLink multi-GPU tensor parallelism, where its 900 GB/s interconnect leaves PCIe (128 GB/s) behind.

A Strix Halo system has 215 GB/s. Same model, same quantization, 8.3x less bandwidth, roughly 8x slower. A desktop CPU on DDR5 dual-channel has ~89 GB/s. 20x slower than the Pro 6000.

No amount of CPU cores or compiler flags changes this. Bandwidth is physics. The cheapest path to 1,792 GB/s in April 2026 is the RTX Pro 6000 at $8,500.

The rough formula: max tokens/sec = Memory Bandwidth (GB/s) / Model Size (GB). Actual throughput is about 40-70% of theoretical. An EPYC 9554 with ~500 GB/s measured bandwidth hits 50 t/s on an 8B Q4 model, about 45% of the theoretical ~111 t/s.
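That formula is easy to turn into a back-of-envelope estimator. A minimal sketch; the 0.5 efficiency factor and the ~4.7 GB Q4 model size are assumptions taken from the figures above, not measurements:

```python
# Back-of-envelope generation speed: bandwidth-bound inference reads every
# weight once per token, so t/s ~= bandwidth / model size, times an
# efficiency factor (assumed 0.5 here; real systems land around 40-70%).

def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                       efficiency: float = 0.5) -> float:
    return bandwidth_gb_s / model_size_gb * efficiency

MODEL_8B_Q4_GB = 4.7  # Llama 3.1 8B at Q4_K_M

print(est_tokens_per_sec(1792, MODEL_8B_Q4_GB))  # RTX Pro 6000: ~191 t/s
print(est_tokens_per_sec(500, MODEL_8B_Q4_GB))   # EPYC 9554: ~53 t/s
print(est_tokens_per_sec(42, MODEL_8B_Q4_GB))    # DDR4 desktop: ~4.5 t/s
```

The estimates land close to the measured numbers elsewhere in this article (~185, ~50, and ~4-5 t/s respectively), which is why 0.5 is the default here.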

Hardware comparison

Professional GPUs

The RTX Pro 6000 is the only GPU under $10,000 that runs 70B models on a single card without quantizing below Q4. Full specs and known issues are on the dedicated page.

Model Quantization Tokens/sec (generation)
Llama 3.1 8B Q4_K_M ~185
Mistral Nemo 12B Q4_K_M 163
Qwen3 30B-A3B (MoE) Q4_K_M 252
Llama 3.3 70B Q4_K_M 34

34 tokens per second on 70B is about 25 words per second. Fast enough that you are not waiting for the model to finish a paragraph.

With vLLM batched serving (concurrent requests), throughput goes well beyond single-stream: 8B at 8,990 t/s, 70B at 1,031 t/s. On models that fit on one card, the Pro 6000 matches the H100 SXM at one-third the cost.

The card has real problems. A virtualization reset bug causes unrecoverable GPU state after VM shutdown. Sustained vLLM inference triggers chip resets at temperatures as low as 28C. The SM120 kernel architecture breaks DeepSeek models. Only the open-source driver works on Blackwell; there is no proprietary option.

Consumer GPUs

GPU VRAM Bandwidth Llama 3.1 8B Q4 t/s Max model (Q4) Price (Apr 2026)
RTX 5090 32 GB GDDR7 1,790 GB/s ~300 ~32B $2,900-4,200
RTX 4090 24 GB GDDR6X 1,008 GB/s ~128 ~24B $1,600-2,200
RTX 3090 24 GB GDDR6X 936 GB/s ~87-112 ~32B tight $800-900 used
RTX 4000 SFF Ada 20 GB GDDR6 280 GB/s ~64 ~13B ~$1,250

The RTX 5090 has bandwidth matching the Pro 6000 (1,790 GB/s) but only 32 GB of VRAM, which limits it to 32B-class models at Q4. For sub-32B workloads, the 5090 is better price-performance than the Pro 6000.

The RTX 3090 is the last consumer card with NVLink. Two of them ($1,600-1,800 used) give 48 GB combined, enough for 32B models with generous context. For the price of a single RTX 5090, a dual-3090 setup gets similar bandwidth and 50% more VRAM.

Gemma 4 on consumer GPUs: dense vs MoE

Gemma 4 is available in two forms that fit on 24 GB: the 31B dense model and the 26B-A4B MoE. The MoE activates only 3.8B parameters per token, making it dramatically faster.

Variant Arena ELO RTX 4090 speed RTX 3090 speed VRAM (Q4) 256K context?
31B Dense ~1450 ~25-35 t/s (short ctx) ~20-35 t/s 19.6 GB No (KV cache fills 24 GB at ~32-64K)
26B-A4B MoE ~1441 ~50-129 t/s ~35-40 t/s ~15.6 GB Yes (8.4 GB headroom for KV)

The 31B dense model has a disproportionately large KV cache (~0.85 MB per token at BF16) because of its 16 KV-head architecture. Despite fitting at Q4, context expansion rapidly consumes remaining VRAM. At VRAM-saturated configurations, the RTX 4090 drops to 7.8 t/s on the 31B.

The 26B-A4B MoE uses 4x less KV cache (~0.21 MB/token), runs 2-4x faster, and scores within 9 ELO points. For interactive use on 24 GB hardware, the MoE is the correct choice. The dense model is worth choosing only for maximum coding quality at short context (<8K tokens).

Compact workstations

The Minisforum MS-02 Ultra won a CES 2026 Innovation Award. Compact 4.8L chassis, Intel Core Ultra 9 285HX, up to 256 GB DDR5 ECC, four NVMe slots. Looks great on paper.

The catch: the PCIe slot is low-profile dual-slot only. The best GPU that fits is an RTX 4000 SFF Ada with 20 GB VRAM and 280 GB/s bandwidth. That gets ~64 t/s on an 8B model and cannot run anything above 13B. The MS-02 Ultra is a homelab machine, not an inference server.

Strix Halo

AMD's Ryzen AI Max+ 395 (Strix Halo) puts 128 GB of unified LPDDR5x memory on a single chip. The iGPU can address up to 96 GB as VRAM, matching the RTX Pro 6000 on capacity.

Bandwidth tells the real story: 215 GB/s measured, versus the Pro 6000's 1,792 GB/s. Every model runs 4-8x slower. 70B at 4.5 tokens per second works, but it is painfully slow.

Where Strix Halo is useful: running large MoE models that would not fit on a 24-32 GB consumer GPU. Qwen3 30B-A3B at 66-72 t/s in 128 GB unified memory, or larger MoE models at 20+ t/s. The 128 GB pool is the point, not the speed.

At EUR 2,000-3,000 for a complete system consuming 120W, it costs 3-5x less than an RTX Pro 6000 setup. Same model capacity, 8x less speed.

CPU inference

Server-class CPUs with enough memory channels can run LLMs at usable speeds for small models:

CPU Memory bandwidth 8B Q4 t/s 70B Q4 t/s Usable for chat?
EPYC 9554 (64c, 8-ch DDR5) ~500 GB/s ~50 ~7 8B yes, 70B batch only
Dual Xeon Gold 5317 ~80 GB/s ~22 ~3 8B marginal, 70B no
Desktop DDR5 (dual-channel) ~89 GB/s ~20 ~2 8B marginal, 70B no
Desktop DDR4 (i5-7500T) ~38 GB/s ~4-5 n/a Basic completion only

CPU inference makes sense when you need to run a model that does not fit in any available VRAM and buying a GPU is not an option. An EPYC with 12-channel DDR5 and 512+ GB RAM can run 70B at 7 t/s. Batch processing, not interactive.

The optimized fork ik_llama.cpp gets 1.8-5x faster prompt processing than mainline llama.cpp on CPUs with AVX-512, though generation speed gains are more modest.

The smallest useful CPU models

For CPU-only machines without a GPU, the models worth considering are limited:

Model Parameters RAM needed Desktop CPU speed Quality
BitNet b1.58 2B4T 2.4B ~0.4 GB 10-20+ t/s Research-grade; MMLU 53
Llama 3.2 1B 1.3B ~0.8 GB 20-30 t/s Basic tasks only
Llama 3.2 3B 3.2B ~2 GB 10-15 t/s Simple Q&A, summarization
Phi-4 Mini 3.8B 3.8B ~2.3 GB 7-10 t/s Simple coding, reasoning
Llama 3.1 8B 8B ~4.8 GB 4-5 t/s (DDR4) / 15-20 t/s (DDR5) C+ tier; minimum useful

Models under 3B parameters exist primarily for research, experimentation, and edge deployment (phones, IoT). TinyLlama, SmolLM, and similar projects show what sub-3B models can do, but their output quality is far below what budget APIs provide for $0.04/MTok.

The 8B tier on CPU is the minimum for genuinely useful output. Below that, you are trading quality for the ability to run offline. That is a valid tradeoff for privacy-constrained or air-gapped environments, not a cost savings.

The context window problem

Real workloads routinely exceed 128K tokens. A medium codebase is 200K-500K tokens. A 50-page contract is 40K tokens. An agentic session with 30+ tool calls accumulates 100K-300K tokens. Document collection analysis, multi-session memory, and RAG over large corpora push past 500K easily.

Gemini 2.5 Pro, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 all handle 1 million tokens natively. These are the context windows where real work happens.

Open models top out at 128K. That is not a VRAM limitation — it is a training limitation. No open model available in April 2026 was trained on context beyond 128K-256K, and quality degrades well before the stated maximum. This is the single largest gap between self-hosted and API inference, and no amount of hardware closes it.

VRAM limits on context

Every token in context consumes VRAM for its KV cache entry. The formula: 2 x layers x KV_heads x head_dim x bytes_per_element. For Llama 3.1 8B (32 layers, 8 KV heads, 128 head dim), that is ~128 MB per 1,000 tokens at FP16. Q8 KV halves that to ~64 MB/1K with negligible quality loss. Q4 KV halves again but degrades retrieval accuracy.
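The formula can be checked directly. A sketch using the Llama 3.1 8B shape parameters given above:

```python
# KV cache size per token: a K vector and a V vector for every layer.
# bytes_per_element: 2 = FP16, 1 = Q8 KV (roughly; ignores quant metadata).

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_element

# Llama 3.1 8B: 32 layers, 8 KV heads, 128 head dim
fp16 = kv_bytes_per_token(32, 8, 128, 2)
print(fp16)  # 131072 bytes = 128 KB per token -> ~128 MB per 1,000 tokens
print(kv_bytes_per_token(32, 8, 128, 1))  # Q8 KV: 65536 bytes, half of FP16
```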

Trained context windows cap the usable range regardless of VRAM:

Model Trained context Notes
Llama 3.1 / 3.3 128K Best open 70B; hard ceiling at 128K
Qwen3 128K YaRN-extended; quality degrades past training range
Gemma 4 128K
Mistral Nemo 128K
Gemini 2.5 Pro (API) 1,000K 8x the best open model
Claude Opus 4.6 / Sonnet 4.6 (API) 1,000K 8x the best open model
GPT-4.1 (API) 1,000K 8x the best open model

Every open model stops at 128K. All three frontier API providers — Anthropic, Google, and OpenAI — now offer 1M token context windows. For workloads that routinely exceed 128K — codebase analysis, long document chains, agent sessions — this gap is not solvable with more VRAM. The model simply was not trained for it.

All context numbers below use Q8 KV cache (recommended — negligible quality loss vs FP16, half the VRAM). Q4 KV doubles these limits but degrades retrieval accuracy. Models at Q4_K_M weights throughout.

VRAM Model Max context (Q8 KV) Max context (Q4 KV) Notes
8 GB 8B ~36K ~73K Minimum useful setup
14B+ Does not fit
12 GB 8B ~112K 128K (trained limit) 128K with Q4 KV only
14B ~43K ~86K
16 GB 8B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
14B ~93K 128K (trained limit)
24 GB 8B 128K (trained limit) 128K (trained limit) Full context
14B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
32B ~35K ~70K Short context only
48 GB 14B 128K (trained limit) 128K (trained limit) Full context
32B 128K (trained limit) 128K (trained limit) Full context at Q8 KV
70B ~46K ~92K Short-to-medium context
80 GB 32B 128K (trained limit) 128K (trained limit) Full context
70B ~125K 128K (trained limit) Full context at Q4 KV; near-full at Q8
96 GB 70B 128K (trained limit) 128K (trained limit) Full context; best single-GPU for 70B
128 GB (Strix Halo / CPU) 70B 128K (trained limit) 128K (trained limit) Full context; ~4.5 t/s (slow)

KV cache per token at Q8: 8B = 64 MB/1K, 14B = 80 MB/1K, 32B = 128 MB/1K, 70B = 160 MB/1K. Q4 halves these. 1 GB headroom reserved for runtime overhead.

For CPU/RAM inference, the same arithmetic applies but "VRAM" becomes "available RAM after OS and model." A 64 GB DDR4 system running 8B Q4 (4.7 GB model + OS overhead) has roughly 55 GB available for KV cache — same capacity as a 48 GB GPU, but 10-15x slower due to DDR4 bandwidth (~42 GB/s vs GPU's 900+ GB/s).
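The table's arithmetic can be reproduced with a short helper. A sketch using the per-token KV figures and 1 GB overhead stated above; the Q4 model sizes are approximations back-derived from the table, so individual rows may differ by a few K:

```python
# Max context that fits: (VRAM - model weights - overhead) / KV per token.
# KV cache MB per 1K tokens at Q8, from the note above the table.
KV_MB_PER_1K_Q8 = {"8B": 64, "14B": 80, "32B": 128, "70B": 160}
MODEL_GB_Q4 = {"8B": 4.7, "14B": 7.5, "32B": 18.5, "70B": 40}  # approx Q4_K_M

def max_context_tokens(vram_gb: float, model: str, kv_scale: float = 1.0,
                       overhead_gb: float = 1.0) -> int:
    """kv_scale: 1.0 = Q8 KV, 0.5 = Q4 KV. Ignores the 128K trained-context cap."""
    free_mb = (vram_gb - MODEL_GB_Q4[model] - overhead_gb) * 1000
    return int(free_mb / (KV_MB_PER_1K_Q8[model] * kv_scale) * 1000)

print(max_context_tokens(8, "8B"))        # ~36K (matches the 8 GB row)
print(max_context_tokens(8, "8B", 0.5))   # ~72K at Q4 KV
print(max_context_tokens(16, "14B"))      # ~94K (table: ~93K)
```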

Context rot: accuracy degrades with length

Even when the VRAM fits, model accuracy drops as context grows. The Chroma 2025 study tested 18 models and found 20-50% accuracy degradation from 10K to 100K tokens across every model tested, including frontier APIs. Open models degrade faster.

The worst-case position is the middle of the context window. Beginning and end receive stronger attention (the "lost in the middle" effect, Liu et al. 2024). For retrieval tasks where the target information can appear anywhere, this creates systematic blind spots.

On harder retrieval tasks requiring ordered extraction across long context (Sequential-NIAH, arXiv 2504.04713, EMNLP 2025), the best model tested scored 63.5%. Standard "find the passkey" benchmarks show 90-100% at 128K; the realistic number for semantic retrieval is 50-65%. This is a model capability ceiling, not a hardware limitation.

Non-literal retrieval is worse still. The NoLiMa benchmark (ICML 2025) tested 13 LLMs on tasks requiring semantic similarity matching rather than exact text lookup. Open-source models showed inverted U-shaped performance curves beyond critical context thresholds — accuracy degrades substantially when the task requires reasoning about meaning rather than matching strings.

KV cache quantization compounds these losses. FP16 and Q8 KV cache produce no measurable retrieval degradation. Q4 KV cache adds detectable accuracy loss on top of context rot, particularly on retrieval tasks. For any workload where finding information in context matters, Q8 KV cache is the minimum — Q4 saves VRAM at the cost of making the context less reliable.

TurboQuant: KV cache compression (ICLR 2026)

TurboQuant (Google Research, ICLR 2026) compresses KV cache to 2.5-3.5 bits per channel with near-zero quality loss. This is not weight quantization — it is complementary to Q4_K_M model weights. You can run a model at Q4 weights AND TurboQuant 3-bit KV cache simultaneously.

KV cache method Bits/channel LongBench score Needle retrieval VRAM per 1K tokens (8B model)
FP16 (baseline) 16 50.06 0.997 128 MB
Q8 (current standard) 8 ~50.06 ~0.997 64 MB
TurboQuant 3.5 50.06 0.997 ~28 MB
TurboQuant 2.5 49.44 ~20 MB
Q4 (current) 4 degrades degrades 32 MB
KIVI 3-bit 3 48.50 0.981 ~24 MB

At 3.5 bits, TurboQuant matches FP16 quality exactly — identical LongBench score, identical needle retrieval. At 2.5 bits, it outperforms KIVI at 3 bits. The method is data-oblivious: no calibration data, no per-model tuning. It works by applying a random orthogonal rotation that makes vector coordinates near-independent, then using mathematically optimal scalar quantizers.

If TurboQuant 3-bit KV becomes standard, the VRAM context table above roughly doubles: an 8B model on 8 GB at 3-bit KV would reach ~80K context instead of ~36K at Q8. A 70B model on 96 GB at 3-bit KV would push past 200K context.

Status (April 2026): No official Google implementation. Community ports exist for llama.cpp (TQ3_0 format, not merged), vLLM (Triton kernels, not merged), and PyTorch/HuggingFace. None are in mainline frameworks yet. Official integration expected Q2-Q3 2026.

Hardware cost to match API context

API capability Best local equivalent Hardware cost What you actually get
GPT-4.1 at 128K (A- quality) 8B Q4 on RTX 4090 ~EUR 1,800 128K context but C+ quality — 2 tiers below
GPT-4.1 at 128K (A- quality) 70B Q4 on 2x RTX 4090 ~EUR 3,500 128K context at B+ quality — 1 tier below
Claude 1M (A+ quality) 70B Q4 on A100 80GB ~EUR 15,000 Capped at 128K (872K shorter), B+ quality
Claude 1M (A+ quality) Nothing No open model trained past 128K
Gemini 1M (A+ quality) Nothing Not achievable locally at any price

The gap is stark beyond 128K. No combination of hardware and open models reaches 200K context. All three frontier API providers now offer 1M at A+ quality. Self-hosted tops out at 128K at B+ quality. For workloads that need the full context — codebase analysis, long document processing, extended agent sessions — APIs are the only option.

Working around the 128K ceiling

Several techniques exist to process more data than fits in a single context window. None of them are transparent substitutes for native long context.

RAG (Retrieval-Augmented Generation) is the most production-ready approach. Split documents into chunks, embed them in a vector database, retrieve only the relevant chunks per query. Quality reaches 70-85% of native long context for retrieval tasks (EMNLP 2024), but fundamentally cannot do cross-document reasoning — connecting a clause on page 40 to a definition on page 3 requires both to be in context simultaneously. RAG turns the problem from "LLM reasons over all data" into "retrieval system selects data, LLM reasons over selection." The retrieval quality becomes the ceiling.

Context compression (LLMLingua, Microsoft) removes unimportant tokens to fit more into the same window. At 2-4x compression, quality loss is minimal. At 10-20x, degradation is measurable. A 128K window with 4x compression gives roughly 512K effective tokens — useful, but still below 1M native context and with quality loss on compressed portions.

Map-reduce / chunking processes each chunk independently, then combines results. Works for summarization (60-80% quality). Fails for reasoning that requires cross-chunk information.

Hierarchical summarization — summarize chunks, then summarize summaries — works for gist but fails for precision. Each summarization round is lossy, and hallucination amplification compounds: a fabricated fact in round 2 becomes "established fact" in round 3.

StreamingLLM solves a different problem. It enables continuous generation over arbitrarily long streams by preserving "attention sink" tokens, but the model can only see the current window. Information outside the window is gone. Useful for long chat sessions, not for document analysis.

Technique Production ready? Quality vs native long context Real substitute?
RAG Yes 70-85% (retrieval) / poor (reasoning) Partial — good for search, poor for synthesis
LLMLingua compression Yes 85-95% at 2-4x compression Partial — extends window modestly
Map-reduce / chunking Yes 60-80% depending on task No — loses cross-chunk connections
Hierarchical summarization Yes (for summaries) 50-70% for detail; good for gist No — lossy compression compounds
StreamingLLM Yes (for chat) N/A (different problem) No — does not extend reasoning context

The LaRA benchmark (ICML 2025) tested 2,326 cases across multiple tasks and concluded there is no silver bullet — the best approach depends on task type, model capabilities, and retrieval characteristics. One concrete data point: switching from chunked medical records to full-context patient histories improved diagnostic accuracy by 23% because the model could see temporal patterns across years of data.

The "lost in the middle" problem (Liu et al., TACL 2024) complicates even native long context: models perform best when relevant information is at the beginning or end of the context, with 30%+ performance degradation for information in the middle. This affects all models, including those designed for long context.

For self-hosters, the practical hierarchy is: if the workload fits within 128K, local models are competitive. If it needs 128K-512K, RAG or compression can partially bridge the gap with measurable quality penalties. Beyond 512K, APIs with native 1M context are the only reliable option.

API pricing (April 2026)

API pricing dropped roughly 80% since early 2025. Budget-tier models now cost $0.02-0.40 per million tokens.

Tier Model Input $/MTok Output $/MTok Context Quality
Budget Mistral Nemo $0.02 $0.04 128K C+
Budget GPT-4.1-nano $0.05 $0.20 1M C+
Budget GPT-4.1-mini $0.16 $0.64 1M B-
Mid DeepSeek V3.2 $0.28 $0.42 128K B+
Mid Gemini 2.5 Flash $0.15 $3.50 1M B+
Mid GPT-4.1 $2.00 $8.00 1M A-
Mid Grok 3 $3.00 $15.00 128K A-
Mid Claude Haiku 4.5 $1.00 $5.00 200K A-
Frontier Claude Sonnet 4.6 $3.00 $15.00 1M A+
Frontier Gemini 2.5 Pro $1.25 $10.00 1M A+
Frontier Claude Opus 4.6 $5.00 $25.00 1M S
Frontier OpenAI o3 $2.00 $8.00 200K S (reasoning)

Subscription tiers exist for heavy users: Claude Pro at $20/month, Claude Max at $100-200/month with higher rate limits. Google, OpenAI, and others have similar tiered access. The per-token rates above are the pay-as-you-go ceiling, not the effective cost for regular users.

Cached input pricing drops costs further. Claude cache hits cost 0.1x the base rate. Gemini 2.5 Flash cached input is $0.03/MTok. Workloads with repeated context (RAG, agent loops) benefit from caching.
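Caching changes the effective input rate considerably. A sketch of blended input cost under an assumed cache hit rate (the 0.1x multiplier is the Claude figure above; hit rates are workload-dependent):

```python
# Blended input $/MTok with prompt caching: cache hits are billed at a
# fraction of the base rate, misses at the full rate.

def blended_input_rate(base_per_mtok: float, hit_rate: float,
                       cache_multiplier: float = 0.1) -> float:
    return base_per_mtok * (hit_rate * cache_multiplier + (1.0 - hit_rate))

# Claude Sonnet input at $3.00/MTok, 80% of input tokens served from cache
print(blended_input_rate(3.00, 0.8))  # $0.84/MTok effective
```

Agent loops and RAG pipelines, where the same system prompt and context repeat on every call, sit at the high end of achievable hit rates.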

The cost comparison

Self-hosted at 100% utilization

Best case for self-hosting: a GPU running batch inference around the clock.

Setup Model Monthly cost (3yr amort.) Tokens/month EUR/MTok Quality
RTX Pro 6000 70B vLLM batched ~EUR 380 ~2.7B EUR 0.14 B/B+
RTX Pro 6000 8B vLLM batched ~EUR 380 ~23B EUR 0.017 C+
2x RTX 3090 8B single-stream ~EUR 130 ~0.27B EUR 0.48 C+
Strix Halo 70B single-stream ~EUR 90 ~13M EUR 6.90 B/B+

All costs in EUR. RTX Pro 6000 at ~$8,500 = ~EUR 7,800 at April 2026 exchange rates. 3-year amortization includes hardware + electricity at EUR 0.15/kWh.

At 100% utilization, self-hosted 8B on an RTX Pro 6000 (EUR 0.017/MTok) undercuts even Mistral Nemo API pricing (EUR 0.034/MTok) while keeping the model local. Self-hosted 70B at EUR 0.14/MTok undercuts DeepSeek V3.2 API (EUR 0.39/MTok) for comparable B+ output.

The utilization trap

Nobody runs a single-operator GPU at 100%. You ask questions during working hours. The GPU idles overnight, on weekends, during meetings. Realistic utilization for one person or a small team: 10-30%.

Utilization RTX Pro 6000 70B EUR/MTok vs DeepSeek V3.2 API (EUR 0.39)
100% EUR 0.14 Self-hosted 2.8x cheaper
50% EUR 0.28 Self-hosted 1.4x cheaper
20% EUR 0.70 API 1.8x cheaper
10% EUR 1.40 API 3.6x cheaper

At 20% utilization, self-hosted 70B costs EUR 0.70/MTok for B+ quality output. The same money buys Gemini 2.5 Flash at B+ quality via API, with no hardware to maintain and 1M token context.

The GPU does not know you went to lunch. It depreciates whether it generates tokens or not.
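The utilization table follows from a single division: the monthly cost is fixed, the tokens are not. A sketch using the article's RTX Pro 6000 70B figures (treating the full EUR 380 as fixed is a simplification; electricity drops slightly when the card idles):

```python
# Effective EUR/MTok at partial utilization for a self-hosted GPU.
MONTHLY_COST_EUR = 380      # 3-yr amortization + electricity (RTX Pro 6000)
TOKENS_PER_MONTH = 2.7e9    # 70B vLLM batched at 100% utilization

def eur_per_mtok(utilization: float) -> float:
    tokens_mtok = TOKENS_PER_MONTH * utilization / 1e6
    return MONTHLY_COST_EUR / tokens_mtok

for u in (1.0, 0.5, 0.2, 0.1):
    print(f"{u:>4.0%}  EUR {eur_per_mtok(u):.2f}/MTok")
```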

Hardware depreciation

The RTX Pro 6000 costs ~EUR 7,800 today. When NVIDIA ships Rubin (next-generation architecture, expected ~2027), the Pro 6000 will be worth roughly EUR 3,700-4,600. Three years out: EUR 2,300-2,800. GPU depreciation runs 40-60% over 2-3 years.

API subscriptions depreciate at 0%. Every month brings the latest models. No replacement planning, no resale.

The lease option (EUR 500/month for 18 months, EUR 9,000 total) costs more than buying outright.
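Depreciation folds into a simple effective monthly cost. A sketch using the resale figures above; the EUR 2,550 and EUR 4,100 resale values are midpoints of the stated ranges, an assumption:

```python
# Effective monthly hardware cost: purchase price minus resale value,
# spread over the ownership period. Excludes electricity.

def monthly_hw_cost(purchase_eur: float, resale_eur: float, months: int) -> float:
    return (purchase_eur - resale_eur) / months

print(monthly_hw_cost(7800, 2550, 36))  # ~EUR 146/month over 3 years
print(monthly_hw_cost(7800, 4100, 18))  # ~EUR 206/month over 18 months
```

Even the 18-month case (~EUR 206/month, assuming the Rubin-era resale midpoint) undercuts the EUR 500/month lease.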

Quantization

Quantization compresses model weights to use less memory and run faster, at some quality cost.

70B model

Quantization Bits/weight 70B model size Quality vs FP16 Speed vs FP16 Minimum GPU
FP16 (full precision) 16.0 ~140 GB 100% 1.0x (baseline) 2x H100 SXM
Q8_0 8.0 ~70 GB ~99% ~1.8-2.0x 1x RTX Pro 6000
Q6_K 6.5 ~54 GB ~99% ~2.3-2.5x 1x RTX Pro 6000
Q5_K_M 5.5 ~48 GB ~96-97% ~2.5-3.0x 1x RTX Pro 6000
Q4_K_M (sweet spot) 4.8 ~40 GB ~92-95% ~3.0-3.5x 1x RTX Pro 6000
Q3_K_M 3.4 ~33 GB ~85-90% ~3.5-4.0x 2x RTX 3090
Q3_K_S 3.0 ~28 GB ~82-88% ~4.0-4.5x 2x RTX 4060 Ti 16GB
Q2_K (emergency) 2.7 ~27 GB ~75-85% ~4.0-5.0x 1x RTX 5090
AQLM 2-bit (trained codebooks) 2.0 ~18 GB ~95-98% GPU-only 1x RTX 4090
BitNet 1.58-bit (hypothetical) 1.58 ~14 GB unknown at 70B CPU-native 1x RTX 3060 (no model exists)
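The size column follows directly from bits per weight. A sketch; effective bits (4.8 for Q4_K_M, and so on) already fold in scale and metadata overhead, so results are approximate:

```python
# Quantized model size: parameter count times effective bits per weight.

def model_size_gb(params_billions: float, effective_bits: float) -> float:
    return params_billions * effective_bits / 8

print(model_size_gb(70, 16.0))   # FP16: 140 GB
print(model_size_gb(70, 4.8))    # Q4_K_M: 42 GB (table: ~40 GB)
print(model_size_gb(70, 1.58))   # ternary: ~13.8 GB
```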

8B model (Llama 3.1 8B)

Quantization Bits/weight 8B model size Quality vs FP16 Minimum GPU
FP16 16.0 ~16 GB 100% 1x RTX 4090 (24 GB)
Q8_0 8.0 ~8.5 GB ~99% 1x RX 5700 XT (8 GB, tight)
Q6_K 6.5 ~6.3 GB ~99% 1x RX 5700 XT (8 GB)
Q5_K_M 5.5 ~5.7 GB ~98-99% 1x RX 5700 XT (8 GB)
Q4_K_M (sweet spot) 4.8 ~4.7 GB ~92-95% 1x GTX 1060 6GB
Q3_K_M 3.4 ~3.7 GB ~85-90% 1x GTX 1050 Ti (4 GB, tight)
Q2_K 2.7 ~3.1 GB ~75-85% 1x GTX 1050 Ti (4 GB)

Gemma 4 26B-A4B (MoE)

Mixture-of-experts models store all 26B parameters but only activate 4B per token. Storage follows total parameter count; inference speed follows active parameter count.

Quantization Total size (26B stored) Active per token (4B) Minimum GPU Effective speed
FP16 ~52 GB ~8 GB 1x RTX Pro 6000 Baseline
Q8_0 ~26 GB ~4 GB 1x RTX 5090 (32 GB) ~2x
Q4_K_M ~16 GB ~2.4 GB 1x RTX 4060 Ti 16GB ~3-3.5x
Q2_K ~8 GB ~1.3 GB 1x RX 5700 XT (8 GB) ~4-5x, quality loss

At Q4_K_M, the full Gemma 4 MoE fits on a 16 GB card with room for KV cache, while only reading ~2.4 GB of active weights per token. That makes it extremely fast: the RTX Pro 6000 hits 252 t/s on a similar MoE architecture (Qwen3 30B-A3B).
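MoE speed follows the same bandwidth formula as dense models, with active parameters substituted for total. A sketch; the 0.5 efficiency factor is the same assumption used for dense estimates, and measured MoE numbers often land lower because of routing overhead:

```python
# MoE generation speed estimate: only ACTIVE expert weights are read per
# token, so speed scales with active size; VRAM still scales with total size.

def moe_est_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float,
                           effective_bits: float,
                           efficiency: float = 0.5) -> float:
    active_gb = active_params_b * effective_bits / 8
    return bandwidth_gb_s / active_gb * efficiency

# Gemma 4 26B-A4B at Q4_K_M: ~4B active params -> ~2.4 GB read per token
print(moe_est_tokens_per_sec(1792, 4, 4.8))  # ~373 t/s upper-bound estimate
```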

Reading the tables

Q4_K_M is the standard tradeoff: 75% size reduction with 92-95% quality retention. Nearly every local LLM benchmark uses Q4_K_M as default.

Q6_K and Q5_K_M sit between Q8 and Q4 — negligible quality loss but smaller than Q8. Useful when you have the VRAM to spare but not enough for Q8.

Going below Q4 hurts. Q3_K_M is usable but noticeably worse on complex reasoning. Q2_K loses 15-25% of model quality. At that point the model is fighting quantization artifacts on top of its limitations versus frontier models. A 70B at Q2 produces worse output than a 32B at Q4.

AQLM 2-bit is the exception: trained codebooks retain far more quality than naive Q2_K at the same bit width. The tradeoff is GPU-only inference and limited model availability.

Extreme quantization: ternary weights and sub-2-bit

All of the above methods are post-training quantization, compressing an FP16 model after it has been trained. A different approach trains the model with quantized weights from the start.

BitNet b1.58 (Microsoft Research / Tsinghua University) constrains every weight to one of three values: {-1, 0, +1}. That is log2(3) = 1.58 bits per weight. Multiplications become additions. A 2B parameter model fits in 0.4 GB instead of ~4 GB at FP16, roughly 10x smaller.
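The 1.58 figure is the information content of a three-valued weight. A quick check of the footprint arithmetic:

```python
import math

# Ternary weights carry log2(3) bits of information each; packed storage
# approaches this bound.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # ~1.585

print(2e9 * BITS_PER_TERNARY_WEIGHT / 8 / 1e9)   # 2B params -> ~0.40 GB
print(70e9 * BITS_PER_TERNARY_WEIGHT / 8 / 1e9)  # hypothetical 70B -> ~13.9 GB
```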

The only production-quality open model is bitnet-b1.58-2B-4T (2 billion parameters, 4 trillion training tokens, Apache 2.0 license). Community models exist at 8B but are experimental.

Model MMLU (5-shot) Size Speed (i7-13800H) EUR/MTok (15W laptop)
BitNet b1.58 2B 53.17 0.4 GB ~34 t/s EUR 0.006
Llama 3.2 1B (FP16) 49.30 ~2.5 GB ~21 t/s EUR 0.010
Qwen2.5 1.5B (FP16) 60.25 ~3 GB ~15 t/s EUR 0.014

The energy savings are real: 71-82% less energy than FP16 on x86, 55-70% on ARM. These numbers come from Microsoft's bitnet.cpp paper and have been verified across multiple sources. The inference framework (bitnet.cpp) is production-ready on CPU, with GPU support added in May 2025.

There is one catch that matters for anyone evaluating this: speed and energy benefits only appear when using bitnet.cpp. Loading the model in standard HuggingFace transformers gives you the smaller memory footprint but none of the ternary kernel speedups. You cannot use Ollama, LM Studio, or any other llama.cpp frontend for native BitNet inference as of April 2026.

Post-training methods can also push below 2 bits. AQLM (Yandex/IST Austria, ICML 2024) compresses Llama 2 7B to 2.02 bits with a WikiText2 perplexity of 6.59, versus 5.47 at FP16, roughly 2-5% task quality degradation. QTIP (NeurIPS 2024 Spotlight) achieves similar quality at 2-bit with >3x inference speedup. These are GPU-only methods: they reduce VRAM usage, not power draw.

Does extreme quantization change the economics?

At first glance, 10x smaller models should flip the self-hosting math. A 2B model in 0.4 GB running on a 15W laptop at 25 tokens per second costs about EUR 0.006 per million tokens in electricity, 6x cheaper than Mistral Nemo API.

The problem is quality. BitNet 2B scores MMLU 53. Mistral Nemo 12B scores around 70. These are not the same tier of model. A 2B model, regardless of quantization method, cannot follow complex instructions, write useful code, or maintain coherence over long conversations. Saving EUR 0.03 per million tokens while getting dramatically worse output is not a savings.

The real promise is running larger models on cheaper hardware. A hypothetical 70B BitNet model would need ~14 GB instead of 140 GB, fitting on a single RTX 3090. But the model zoo has not caught up to the technique. Microsoft's verified model is 2B. PrismML released Bonsai 8B in March 2026, claiming a 1.15 GB footprint and 368 t/s on an RTX 4090, but these are vendor claims and independent benchmarks are still sparse. Until larger ternary models are independently verified, extreme quantization changes what is theoretically possible without changing what you can actually run today.

Software

Running a model locally in 2026 is straightforward. The tooling is good.

Ollama wraps llama.cpp behind a REST API with model management. Install, pull a model, start chatting.

llama.cpp is the foundation. CUDA, Vulkan, Metal, CPU with AVX-512/AMX. GGUF quantization format.

ik_llama.cpp is an optimized fork of llama.cpp with rewritten CPU kernels. 1.8-5.2x faster prompt processing on AVX2/AVX-512 CPUs. Same GGUF models, drop-in replacement. Use this instead of mainline llama.cpp for CPU-bound workloads. Benchmarks in the CPU comparison section.

vLLM handles concurrent serving with PagedAttention and continuous batching. Use this when serving multiple users from one GPU.

bitnet.cpp is required for native ternary inference. Separate from llama.cpp. CPU-primary, with CUDA support since May 2025. Requires Clang 18+ to build.

The cheapest hardware myth

A common misconception: "I already own the hardware, so inference is free." It is not. Electricity costs money, and consumer GPUs draw 230-415W under inference load. Even ignoring the purchase price entirely, the electricity to generate tokens locally costs more per token than calling a budget API.

Proof: idle desktop CPU

Take an Intel i5-7500T in an HP ProDesk 400 G3 Desktop Mini, a machine that costs EUR 150 used and sits idle most of the day. DDR4-2400 dual-channel gives ~38.4 GB/s theoretical memory bandwidth. At 50% efficiency with 4 threads, that translates to roughly 4-5 tokens per second on an 8B Q4 model.

The system draws about 22W idle and 45W under inference load. The marginal 23W for inference costs EUR 2.48 per month in electricity at Finnish rates (EUR 0.15/kWh).

| Model | Size on disk | Tokens/sec (est.) | Tokens/month (24/7) | EUR/MTok (electricity only) | vs Mistral Nemo API |
|---|---|---|---|---|---|
| Qwen3 0.6B Q4 | ~0.4 GB | ~40 | 103.7M | EUR 0.024 | 1.5x cheaper, but trivial tasks only |
| BitNet b1.58 2B (ternary) | 0.4 GB | ~15-25 | 51.8M | EUR 0.048 | 1.3x more expensive, D-tier quality |
| TinyLlama 1.1B Q4 | ~0.7 GB | ~28 | 72.6M | EUR 0.034 | Break-even, basic completion only |
| Bonsai 8B (1-bit, unverified) | 1.15 GB | ~15 | 38.9M | EUR 0.064 | 1.7x more expensive, claims 8B quality |
| Phi-4 Mini 3.8B Q4 | ~2.2 GB | ~7.6 | 19.7M | EUR 0.126 | 3.4x more expensive |
| Llama 3.1 8B Q4 | ~4.7 GB | ~4-5 | 11.7M | EUR 0.212 | 5.7x more expensive |

BitNet b1.58 2B runs via bitnet.cpp (not llama.cpp). The ternary weights reduce memory reads, but the i5-7500T's 4 cores with AVX2-only still cap throughput around 15-25 t/s. Bonsai 8B speed is estimated from memory bandwidth; PrismML's claimed 368 t/s is on an RTX 4090 GPU. Independent CPU benchmarks for Bonsai are sparse as of April 2026.

At the same C+ quality tier (8B model), the API costs EUR 0.037 per million tokens. The i5-7500T costs EUR 0.212. The API is 5.7 times cheaper on electricity alone, before counting the computer's purchase price.

The extreme quantization models do not change the math. BitNet 2B at EUR 0.048/MTok is 1.3x more expensive than the API and produces D-tier output (MMLU 53 vs Mistral Nemo's ~70). The only models where electricity beats the API are sub-1B parameter models at Q4, and their output quality is so far below Mistral Nemo that the comparison is meaningless.
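All of the EUR/MTok figures in this article reduce to one formula: watts divided by tokens per second gives joules per token, converted to kWh per million tokens and multiplied by the electricity rate. A sketch that reproduces the i5-7500T number:

```python
def eur_per_mtok(watts: float, tokens_per_sec: float,
                 eur_per_kwh: float = 0.15) -> float:
    """Electricity cost in EUR per million generated tokens."""
    joules_per_token = watts / tokens_per_sec
    kwh_per_mtok = joules_per_token * 1e6 / 3.6e6  # 3.6 MJ per kWh
    return kwh_per_mtok * eur_per_kwh

# i5-7500T: 23W marginal draw, ~4.5 t/s on Llama 3.1 8B Q4
print(round(eur_per_mtok(23, 4.5), 3))    # 0.213 -> the EUR 0.212/MTok row
# RTX 4090 system for comparison: 413W at ~128 t/s
print(round(eur_per_mtok(413, 128), 3))   # 0.134
```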

CPU comparison: DDR4 to DDR5

The i5-7500T is the cheapest case. How do faster CPUs compare? Memory bandwidth is the bottleneck — more channels, faster memory, more tokens per second.

| CPU | Memory | Bandwidth (theoretical) | Cores | System idle/load | Llama 3.1 8B Q4 t/s (est.) | EUR/MTok (marginal electricity) |
|---|---|---|---|---|---|---|
| Intel N100 | DDR5-4800 1-ch | 38.4 GB/s | 4E | 8W / 16W | ~3 | EUR 0.111 |
| Intel i3-N305 | DDR5-4800 1-ch | 38.4 GB/s | 8E | 10W / 22W | ~3.5 | EUR 0.143 |
| Intel i5-7500T | DDR4-2400 2-ch | 38.4 GB/s | 4C/4T | 22W / 45W | ~4-5 | EUR 0.212 |
| Intel i5-8500T | DDR4-2666 2-ch | 42.7 GB/s | 6C/6T | 22W / 48W | ~5-6 | EUR 0.180 |
| AMD Ryzen 5 5600X | DDR4-3200 2-ch | 51.2 GB/s | 6C/12T | 45W / 95W | ~6-7 | EUR 0.297 |
| AMD Ryzen 9 5900X | DDR4-3200 2-ch | 51.2 GB/s | 12C/24T | 55W / 120W | ~6-8 | EUR 0.338 |
| AMD Ryzen 9 7950X | DDR5-5200 2-ch | 83.2 GB/s | 16C/32T | 65W / 145W | ~10-12 | EUR 0.277 |

Electricity rate: EUR 0.15/kWh. EUR/MTok calculated from marginal power (load minus idle) at estimated token rate, 24/7 operation. N100 and i3-N305 are single-channel memory, which caps bandwidth despite DDR5 speeds. The Ryzen 7950X is 2-3x faster than the i5-7500T thanks to DDR5 dual-channel (83 GB/s vs 38 GB/s).

Every CPU in this table costs 3x to 9x more per token in electricity than the Mistral Nemo API (EUR 0.037/MTok). The fastest desktop CPU tested (Ryzen 7950X) still costs 7.5x more. More cores do not help — dual-channel DDR4 or DDR5 memory bandwidth is the wall.
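The table's token rates follow from the bandwidth-bound rule of thumb given in the references: each generated token streams the full active weight set from memory once, and real systems reach roughly 50-70% of theoretical bandwidth. A sketch, using 55% as an illustrative efficiency:

```python
def est_tokens_per_sec(bandwidth_gbs: float, model_size_gb: float,
                       efficiency: float = 0.55) -> float:
    """Estimated decode speed for memory-bandwidth-bound inference:
    theoretical max is one full pass over the weights per token."""
    return bandwidth_gbs / model_size_gb * efficiency

# i5-7500T: dual-channel DDR4-2400 (38.4 GB/s), Llama 3.1 8B Q4 (4.7 GB)
print(round(est_tokens_per_sec(38.4, 4.7), 1))  # 4.5 t/s, matching the table
# Ryzen 7950X: dual-channel DDR5-5200 (83.2 GB/s)
print(round(est_tokens_per_sec(83.2, 4.7), 1))  # 9.7 t/s at 55%; ~12 at 70%
```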

One exception to the "llama.cpp is llama.cpp" assumption: ik_llama.cpp, an optimized fork, achieves 1.8-5.2x faster prompt processing than mainline llama.cpp on the same hardware. Benchmarks on a Ryzen 7950X (16 threads, AVX2):

| Quantization | ik_llama.cpp (t/s) | llama.cpp (t/s) | Speedup |
|---|---|---|---|
| BF16 | 256.9 | 78.6 | 3.3x |
| Q8_0 | 268.2 | 147.9 | 1.8x |
| Q4_0 | 273.5 | 153.5 | 1.8x |
| IQ3_S | 156.5 | 30.2 | 5.2x |
| Q8_K_R8 | 370.1 | N/A | new format |

These are prompt processing (prefill) speeds, not generation. Generation improves a more modest 1.0-2.1x. The key finding: on the Ryzen 7950X, the slowest quantization type in ik_llama.cpp is faster than the fastest type in mainline llama.cpp for prompt processing. For CPU inference workloads that involve large prompts — document summarization, RAG extraction, batch classification — the fork makes a material difference. It does not change the electricity cost comparison (the watts are the same), but it reduces wall-clock time per job, which matters for throughput-limited batch work.

Proof: every consumer GPU loses

The pattern holds across dedicated GPUs. Inference power draw runs about 55-75% of the card's TDP (the decode phase is memory-bandwidth-bound, not compute-bound). Add 120W for the rest of the system.

| GPU | VRAM | Llama 3.1 8B Q4 tok/s | System watts | EUR/MTok (electricity) | vs API |
|---|---|---|---|---|---|
| RX 5700 XT | 8 GB | ~55 (Vulkan, no ROCm) | 266W | EUR 0.201 | 5.4x |
| RX 6700 XT | 12 GB | ~45 | 270W | EUR 0.250 | 6.8x |
| RX 6900 XT | 16 GB | ~60 | 315W | EUR 0.219 | 5.9x |
| RTX 3060 | 12 GB | ~42 | 230W | EUR 0.228 | 6.2x |
| RTX 3090 | 24 GB | ~87 | 345W | EUR 0.165 | 4.5x |
| RTX 4060 Ti 16GB | 16 GB | ~48 | 227W | EUR 0.197 | 5.3x |
| RTX 4090 | 24 GB | ~128 | 413W | EUR 0.134 | 3.6x |
| RX 7900 XTX | 24 GB | ~80 | 351W | EUR 0.183 | 4.9x |

API baseline: Mistral Nemo at EUR 0.037/MTok ($0.04/MTok output). Electricity rate: EUR 0.15/kWh. RX 5700 XT runs Vulkan only (ROCm dropped Navi 10 support); speed scaled from llama.cpp Vulkan scoreboard data.

Every consumer GPU costs 3.6 to 6.8 times more in electricity per token than the API. The RTX 4090, the fastest consumer card, still costs 3.6x more. At German electricity rates (EUR 0.30/kWh), it costs 7.2x more.

The ex-mining AMD cards are fast for their price but still lose to the API. The RX 5700 XT draws 266W system power for ~55 tokens per second — its wide 256-bit memory bus (448 GB/s) makes it the fastest sub-$200 used card. The RX 6900 XT draws 315W for ~60 t/s: 18% more power for 9% more speed over the 5700 XT, because both share a 256-bit bus and differ mainly in clock speed.

For the best consumer GPU (RTX 4090) to break even with Mistral Nemo API on electricity alone, the electricity rate would need to be EUR 0.041/kWh. That is below any residential rate in Europe.
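That EUR 0.041/kWh figure is the same per-token cost formula solved for the electricity rate:

```python
def breakeven_eur_per_kwh(watts: float, tokens_per_sec: float,
                          api_eur_per_mtok: float) -> float:
    """Electricity rate at which local generation matches the API per token."""
    kwh_per_mtok = watts / tokens_per_sec * 1e6 / 3.6e6
    return api_eur_per_mtok / kwh_per_mtok

# RTX 4090 system (413W, ~128 t/s) vs Mistral Nemo at EUR 0.037/MTok
print(round(breakeven_eur_per_kwh(413, 128, 0.037), 3))  # 0.041
```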

Enterprise GPUs: faster but still losing

Enterprise GPUs have HBM memory with 2-4x the bandwidth of consumer GDDR. They are faster per token but draw more power and sit in servers with higher base consumption. System power assumes +200W for a server chassis (vs +120W for a desktop).

| GPU | VRAM | Bandwidth | 8B Q4 tok/s (est.) | System watts | EUR/MTok (electricity) | vs API | Purchase price (Apr 2026 est.) |
|---|---|---|---|---|---|---|---|
| L40S | 48 GB GDDR6 | 864 GB/s | ~110 | 550W | EUR 0.208 | 5.6x | $7,500-9,000 |
| A100 40GB PCIe | 40 GB HBM2e | 1,555 GB/s | ~195 | 450W | EUR 0.096 | 2.6x | $5,000-8,000 (used) |
| A100 80GB SXM | 80 GB HBM2e | 2,039 GB/s | ~260 | 600W | EUR 0.096 | 2.6x | $12,000-18,000 |
| RTX Pro 6000 | 96 GB GDDR7 | 1,792 GB/s | ~279 (7B bench) | 720W | EUR 0.107 | 2.9x | $8,000-9,200 |
| H100 SXM | 80 GB HBM3 | 3,352 GB/s | ~400 | 900W | EUR 0.094 | 2.5x | $22,000-30,000 |

The H100, the fastest GPU in this comparison, still costs 2.5x more per token in electricity than the API. The L40S is worse than consumer GPUs per token because its 864 GB/s bandwidth is slower than the RTX 4090's 1,008 GB/s but draws more system power in a server chassis.

Enterprise GPUs close the gap but do not flip the math. Where they earn their price is batched serving: an H100 running vLLM with 32 concurrent users achieves aggregate throughput that amortizes the power draw across requests. Single-stream single-user inference, which is what this table measures, is the worst case for expensive hardware.
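The batching effect can be quantified with the aggregate throughput from the vLLM reference in this article (RTX Pro 6000, 8B model, 8,990 t/s across concurrent requests). A sketch, assuming the card sustains that aggregate rate at full system power:

```python
def eur_per_mtok(watts: float, tokens_per_sec: float,
                 eur_per_kwh: float = 0.15) -> float:
    """Electricity cost in EUR per million tokens."""
    return watts / tokens_per_sec * 1e6 / 3.6e6 * eur_per_kwh

single  = eur_per_mtok(720, 279)    # ~0.107: one user, single-stream decode
batched = eur_per_mtok(720, 8990)   # ~0.0033: vLLM continuous batching
# Batching spreads the same 720W over ~32x the tokens, dropping electricity
# cost well below the EUR 0.037/MTok API price -- the provider's side of the math.
```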

What this means

Hardware cost, depreciation, driver maintenance, cooling, and noise all come on top of electricity. If electricity alone already loses to the API, the total cost is not getting better.

Self-hosting has real advantages: privacy, zero network latency, offline capability, fine-tuning. Those are legitimate reasons to run local inference. "It is cheaper" is not one of them.

When self-hosting makes sense

If data cannot leave your premises (regulatory, air-gapped, classified), API access is off the table. Self-hosted is the only option and cost comparisons do not apply.

Bulk classification, embedding generation, and document processing with a 7-8B model on modest hardware is cheap and fast. A used RTX 3090 ($800-900) runs Llama 3.1 8B at 87-112 t/s depending on configuration. These tasks do not need frontier quality.

Genuine 24/7 batch workloads keeping a GPU above 80% utilization get real per-token savings. This applies to organizations running inference as a service, not individuals using LLMs as assistants.

Fine-tuning requires local hardware. You cannot fine-tune through a subscription. LoRA adapters for an 8B model train on a single RTX 3090 in hours. Full fine-tuning of a 70B model needs 2-4x 80 GB GPUs. If your use case requires a custom model trained on proprietary data, self-hosting is not optional.

Needle-in-a-haystack retrieval and semantic search over large private document collections are where local inference earns its cost. RAG pipelines that embed and search hundreds of thousands of documents generate millions of tokens per day in embedding and extraction queries. These are high-volume, low-quality-bar tasks where an 8B model is sufficient and API costs would compound. A single GPU running embeddings at 80-100% utilization processes the volume at a fraction of API pricing because the per-token overhead is amortized over sustained throughput.

Embedding models are often the strongest case for local deployment. At 22M-334M parameters, they are distinct from generative LLMs — they produce vector representations for semantic search and RAG, not text. They run on any CPU at hundreds to thousands of embeddings per second with negligible power draw and 1-2 GB of RAM total.

| Model | Parameters | Use case | Speed (CPU) | RAM |
|---|---|---|---|---|
| all-minilm-l6-v2 | 22M | fast filtering, similarity | ~5,000 embed/sec | <1 GB |
| nomic-embed-text | 274M | semantic search, RAG | ~1,000 embed/sec | ~1 GB |
| mxbai-embed-large | 334M | high-quality embeddings | ~500 embed/sec | ~1.5 GB |

A 22M embedding model running on any desktop CPU powers local semantic search over thousands of documents with zero API cost. For infrastructure use cases — log analysis, document similarity, RAG retrieval — embedding models deliver more production value than 1-4B generative models. They are the one category where "I already own the hardware" is genuinely true: the compute is trivial, the electricity is negligible, and the privacy benefit is real. API-based embedding (OpenAI ada-002 at $0.10/MTok) adds up fast at scale; local embedding is effectively free after the one-time setup.
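To put "effectively free" in numbers, compare embedding a corpus locally against an API. A sketch with a hypothetical corpus of 500,000 documents averaging 500 tokens each (both corpus size and per-document length are illustrative assumptions):

```python
docs, tokens_per_doc = 500_000, 500
total_mtok = docs * tokens_per_doc / 1e6              # 250 MTok to embed

api_cost_usd = total_mtok * 0.10                      # ada-002 at $0.10/MTok
# Local: all-minilm-l6-v2 at ~5,000 embeddings/sec on a desktop CPU
compute_hours = docs / 5_000 / 3600                   # ~100 seconds of work
local_cost_eur = compute_hours * 0.045 * 0.15         # 45W CPU, EUR 0.15/kWh

print(api_cost_usd, round(local_cost_eur, 5))         # 25.0 vs 0.00019
```

Five orders of magnitude in favor of local, and the gap widens every time the corpus is re-embedded after a model or chunking change.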

Document summarization at scale follows the same pattern. Summarizing 10,000 PDFs through an API costs real money at $0.04-3.00 per million tokens depending on the model tier. On local hardware, the marginal cost is electricity and the model runs as fast as bandwidth allows. A Ryzen 7950X with 64 GB RAM can summarize documents at ~10 t/s on an 8B model continuously without per-request billing. An RTX 3090 does the same at ~87 t/s.
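For batch summarization the constraint is wall-clock time rather than cost. A sketch, assuming a hypothetical job of 10,000 documents at ~400 output tokens per summary:

```python
def batch_hours(docs: int, tokens_per_summary: int, tokens_per_sec: float) -> float:
    """Hours of continuous generation to finish a summarization batch."""
    return docs * tokens_per_summary / tokens_per_sec / 3600

cpu_7950x = batch_hours(10_000, 400, 10)   # ~111 hours: nearly five days
rtx_3090  = batch_hours(10_000, 400, 87)   # ~12.8 hours: an overnight job
```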

When APIs win

For coding, multi-step reasoning, and agentic workflows, APIs win on both quality and effective cost.

A B+ local model looks adequate on simple prompts. The gap shows on hard problems: multi-file refactoring, agent coherence past 20 turns, judgment calls in ambiguous situations. The output compiles. It does the wrong thing. These are the tasks where LLMs are worth using, and where frontier models justify their pricing.

At 20-30% utilization, APIs often cost less per token than self-hosted inference. Add maintenance (driver bugs, cooling, kernel compatibility, GPU replacement) and the gap widens.

The DeepSeek V3 case study

DeepSeek V3 API costs about EUR 260/month for a heavy workload (130K queries/month at 74K token context). Self-hosting the same model requires 8x H100 SXM GPUs.

| Configuration | Upfront cost | Monthly operating (incl. 3-year amortization) | vs API (EUR 260/mo) |
|---|---|---|---|
| DeepSeek V3 on 8x H100 (new) | EUR 320K | EUR 9,600 | 37x more |
| DeepSeek V3 on 8x H100 (used) | EUR 200K | EUR 6,260 | 24x more |
| DeepSeek V3 on 2x MI300X | EUR 37K | EUR 1,230 | 5x more |
| Qwen3-30B-A3B on 2x RTX 4090 (substitute) | EUR 7,600 | EUR 350 | 35% more (lower quality) |

Even the cheapest alternative that might work — a smaller model on consumer GPUs — costs more per month than the API. Break-even for the Qwen3-30B-A3B option: EUR 7,600 hardware / EUR 105 monthly savings on electricity = 72 months (6 years). With 3-year hardware amortization included, break-even is never reached.
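The break-even arithmetic generalizes to any hardware-vs-API decision: capex divided by monthly electricity savings, compared against the amortization window. A sketch using the Qwen3-30B-A3B figures above (the EUR 105/month is the electricity-only saving versus the API):

```python
def breakeven_months(capex_eur: float, monthly_savings_eur: float) -> float:
    """Months until hardware pays for itself through per-token savings."""
    return capex_eur / monthly_savings_eur

months = breakeven_months(7_600, 105)      # ~72 months
amortization_window = 36                   # 3-year hardware life assumed above
pays_off = months <= amortization_window   # False: the cards age out first
```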

The API is almost certainly priced below marginal cost for individual users. DeepSeek benefits from massive scale, strategic pricing, and Chinese electricity rates. Competing on price with a subsidized API by buying your own hardware is a losing proposition at any scale short of running your own inference service.

API pricing is variable: you pay only for what you use. GPU hardware is a fixed cost that depreciates whether or not it runs. For workloads that vary week to week, the API wastes less money.

At Pulsed Media

Pulsed Media runs its own hardware in owned datacenter space in Finland. GPU inference cards like the RTX Pro 6000 go through the same depreciation, power cost, and rack space analysis as any other piece of infrastructure.

For most workloads in 2026, cloud APIs deliver better quality per euro than local inference. Bulk classification, embeddings, and privacy-constrained tasks are where the GPU math works. 16 years of hardware purchasing says: buy when the math supports it, use the API when it does not.

Seedboxes and dedicated servers from Pulsed Media run on the same Finnish datacenter infrastructure, with the same cost-per-unit discipline applied to every piece of hardware. See current plans.

References

Benchmarks and performance data

  • LMSYS Chatbot Arena ELO Leaderboard — model quality rankings used for Arena ELO scores throughout this article
  • llama.cpp Vulkan GPU Scoreboard — community-submitted GPU inference benchmarks (Llama 2 7B Q4_0, tg128). Source for RX 5700 XT (70.73 t/s), GTX 1050 Ti (20.96 t/s), RTX 3060 (75.94-80.59 t/s), and other consumer GPU speeds
  • llama.cpp — the inference engine behind Ollama and most local LLM benchmarks
  • vLLM — batched serving engine; source for Pro 6000 throughput numbers (8B at 8,990 t/s, 70B at 1,031 t/s)

Model cards and technical reports

  • Microsoft BitNet b1.58 2B — the verified ternary model discussed in the extreme quantization section (MMLU 53)
  • PrismML Bonsai 8B — 1-bit model released March 2026; footprint and speed figures are vendor claims pending independent benchmarks

Research papers

  • bitnet.cpp (Microsoft) — ternary inference framework paper; source for the 71-82% (x86) and 55-70% (ARM) energy reduction figures
  • AQLM (Yandex/IST Austria, ICML 2024) — 2.02-bit post-training quantization of Llama 2 7B
  • QTIP (NeurIPS 2024 Spotlight) — 2-bit quantization with >3x inference speedup

CPU inference and memory bandwidth

  • AMD EPYC 9554 llama.cpp benchmarks (ahelpme.com) — EPYC 9554 achieves 50 t/s on 8B Q4_K_M, confirming memory bandwidth as primary bottleneck
  • OpenBenchmarking.org llama.cpp results — 88 public CPU benchmark results, median 14.2 t/s on 8B Q8_0
  • ik_llama.cpp CPU performance — Ryzen 7950X benchmarks showing 1.8-5.2x speedup over mainline llama.cpp for prompt processing
  • ik_llama.cpp — optimized llama.cpp fork with improved CPU kernels
  • CPU token rate formula: theoretical max t/s ≈ memory bandwidth (GB/s) / model size (GB); actual ~50-70% of theoretical due to overhead

Hardware specifications

  • Speeds in this article use Q4_K_M quantization unless otherwise noted. Consumer GPU system power includes GPU + ~120W for the rest of the system; enterprise GPUs use +200W for server chassis. Electricity rate: EUR 0.15/kWh (Finnish residential average). Enterprise GPU purchase prices are estimated April 2026 market rates and shift rapidly
  • VRAM context limits use Llama 3.1 8B KV cache parameters (128 MB/1K tokens at FP16, 64 MB/1K at Q8, 32 MB/1K at Q4). Other model architectures vary; Qwen3 8B uses ~144 MB/1K and Gemma 4 31B dense uses ~850 MB/1K due to its 16 KV-head architecture

See also

  • NVIDIA RTX Pro 6000 (Blackwell) — full specs, benchmarks, and known issues for the 96 GB GPU referenced throughout this article
  • Seedbox — PM's primary product, where the same hardware cost economics apply
  • NVMe — storage interface used alongside GPU inference servers
  • RAID — storage redundancy in PM's infrastructure