NVIDIA RTX Pro 6000 (Blackwell)
The NVIDIA RTX Pro 6000 is a professional workstation GPU based on the Blackwell architecture (GB202). It shipped in April 2025 with 96 GB of GDDR7 ECC memory, making it the only GPU under $10,000 that can run 70B parameter language models on a single card without quantizing below Q4.
Specifications
| Specification | Value |
|---|---|
| Architecture | Blackwell (GB202), TSMC 4NP |
| CUDA Cores | 24,064 |
| Tensor Cores | 752 (5th gen, native FP4/FP8) |
| VRAM | 96 GB GDDR7 ECC |
| Memory Bus | 512-bit |
| Memory Bandwidth | 1,792 GB/s |
| PCIe | Gen 5 x16 |
| NVLink | Not available |
| TDP | 600W (Workstation), 300W (Max-Q), 450-600W (Server) |
| Compute Capability | 12.0 (SM120) |
| MIG | Up to 4 instances |
| Price (April 2026) | $7,999-9,200 |
The card comes in three editions. The Workstation edition runs at 600W with active dual-fan cooling. The Max-Q uses the same full GB202 die but is power-limited to 300W with a passive/blower cooler, intended for multi-GPU density builds. The Server edition runs passively at 450-600W depending on the power cable installed.
All three editions share identical silicon: same 24,064 CUDA cores, same 96 GB GDDR7, same memory bandwidth. Only the power limit and cooling differ.
What fits on 96 GB
The 96 GB of VRAM is what sets this card apart for LLM work.
| Model | Quantization | Weight Size | Fits? | Remaining for KV Cache |
|---|---|---|---|---|
| 7-8B | Q4_K_M | ~4-5 GB | Yes | ~91 GB |
| 14B | Q4_K_M | ~8 GB | Yes | ~88 GB |
| 32B | Q4_K_M | ~18 GB | Yes | ~78 GB |
| 70B | Q4_K_M | ~38 GB | Yes | ~58 GB |
| 70B | FP8 | ~70 GB | Yes | ~26 GB |
| 120B MoE | Q8_0 | ~60 GB | Yes | ~36 GB |
| 405B | IQ2_XXS | ~100 GB | Barely (partial CPU offload) | Minimal (2.68 tok/s) |
| 405B | Q4_K_M | ~200 GB | No | Needs 2+ GPUs |
70B models at Q4 quantization hit the practical optimum. At 38 GB for weights, 58 GB remains for KV cache, enough for long context windows. No consumer GPU comes close: the RTX 5090 tops out at 32 GB (max ~32B at Q4) and the RTX 4090 at 24 GB.
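The table is easy to reproduce for other models with a back-of-the-envelope budget: weights plus KV cache must fit under 96 GB. The sketch below is an estimate only; the layer count, KV-head count, and head dimension are illustrative values for a Llama-3.3-70B-class model and should be taken from the actual config.json of whatever model you plan to run.

```python
# Back-of-the-envelope VRAM budget: weights + KV cache on a single 96 GB card.
# Layer/head values below are illustrative (Llama-3.3-70B-class); take the
# real ones from the model's config.json.

def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB at a given quantization width."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_gb_per_token(layers: int, kv_heads: int, head_dim: int,
                    bytes_per_elem: int = 2) -> float:
    """KV cache per token: K and V tensors, per layer, per KV head (GQA), FP16."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem / 1e9

VRAM_GB = 96
weights = weight_size_gb(70, 4.5)        # ~39 GB at ~4.5 bits/weight (Q4_K_M-ish)
per_token = kv_gb_per_token(80, 8, 128)  # ~0.33 MB per cached token
kv_budget = VRAM_GB - weights - 2        # keep ~2 GB headroom for runtime/activations
print(f"weights ~{weights:.0f} GB, KV budget ~{kv_budget:.0f} GB, "
      f"~{kv_budget / per_token:,.0f} cacheable tokens")
```

Even with conservative headroom, the leftover budget covers context in the hundred-thousand-token range for a GQA 70B model, which is why the 70B-at-Q4 row is the practical optimum.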
LLM inference benchmarks
Single-stream generation (llama.cpp)
Tested on the Workstation edition at 600W, Ubuntu 24.04, CUDA 12.8:
| Model | Quantization | Tokens/sec (generation) |
|---|---|---|
| Llama 2 7B | Q4_0 | 279 |
| Mistral Nemo 12B | Q4_K_M | 163 |
| Qwen3 30B-A3B (MoE) | Q4_K_M | 252 |
| Llama 3.3 70B | Q4_K_M | 34 |
| Llama 3.1 405B | IQ2_XXS | 2.68 |
34 tokens per second on a 70B model is usable for interactive work: roughly 25 words per second, fast enough that you are not waiting for the model to finish a paragraph.
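The single-stream numbers line up with a simple memory-bandwidth argument: each generated token streams roughly the entire weight set from VRAM once, so bandwidth divided by weight size gives a hard ceiling on decode speed. The sketch below is an approximation, not a measurement; the weight sizes are the ones from the table above.

```python
# Decode-speed ceiling from memory bandwidth: each token re-reads ~all weights,
# so tok/s cannot exceed bandwidth / weight_bytes. Approximation only.

BANDWIDTH_GBPS = 1792  # RTX Pro 6000 GDDR7

def decode_ceiling(weight_gb: float) -> float:
    return BANDWIDTH_GBPS / weight_gb

for name, weight_gb, measured in [("Llama 3.3 70B Q4_K_M", 38, 34),
                                  ("Llama 2 7B Q4_0", 4.0, 279)]:
    ceiling = decode_ceiling(weight_gb)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured {measured} tok/s "
          f"({measured / ceiling:.0%} of the roofline)")
```

MoE models such as Qwen3 30B-A3B beat their dense size class here because only the active experts are read per token.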
Batched serving (vLLM)
With concurrent requests, throughput scales well beyond single-stream numbers:
| Model | GPU Config | Throughput (tok/s) |
|---|---|---|
| 30B (Qwen3-Coder) | 1x Pro 6000 | 8,425 |
| 70B (Llama 3.3) | 1x Pro 6000 | 1,031 |
| 96 GB model (GLM-4.5-Air) | 1x Pro 6000 | 3,169 |
| 96 GB model (GLM-4.5-Air) | 1x H100 SXM | 2,987 |
On models that fit on a single card, the Pro 6000 matches or beats the H100 SXM at one-third the hardware cost. The H100 pulls ahead only when you need multi-GPU tensor parallelism, where its NVLink (900 GB/s) leaves the Pro 6000's PCIe (128 GB/s) behind.
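For reference, a minimal vLLM offline-inference sketch for a single-card model. The model name, the on-the-fly FP8 quantization flag, and the limits are assumptions for illustration, not the configuration behind the numbers above; the point is that a 70B model fits on one card once its weights drop to ~70 GB.

```python
# Minimal single-GPU vLLM sketch (assumed configuration, not the benchmark setup).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # placeholder model name
    quantization="fp8",            # ~70 GB of weights, fits in 96 GB (see table above)
    gpu_memory_utilization=0.90,   # leave headroom for the runtime
    max_model_len=32768,           # cap context to bound the KV cache
)
params = SamplingParams(max_tokens=256, temperature=0.7)
out = llm.generate(["Summarize the RTX Pro 6000 in one sentence."], params)
print(out[0].outputs[0].text)
```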
Power consumption
The 600W TDP is the ceiling, not the norm. Real-world power depends on the workload.
LLM token generation is memory-bandwidth bound, not compute bound. The GPU spends most of its time waiting on memory reads rather than doing math, so the decode phase draws well below the rated TDP; only prompt processing and training push toward it.
| Workload | Typical GPU Draw | Notes |
|---|---|---|
| Idle | 17-20W | Comparable to consumer cards |
| Small model inference (<10B) | 80-150W | Memory bound, barely uses compute |
| Large model inference (70B decode) | 200-350W | Sustained generation |
| Prompt processing (prefill) | 350-500W | Compute-heavy phase |
| Training/fine-tuning | 450-530W | Sustained high compute |
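These draw figures are easy to check yourself. A small NVML polling loop (via the nvidia-ml-py package), such as the sketch below, reports instantaneous board power and temperature while a workload runs.

```python
# Poll board power and temperature once per second using NVML
# (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    for _ in range(10):
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # reported in mW
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"{power_w:6.1f} W  {temp_c} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```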
Running at 450W
Naveen Kulandaivelu tested multiple models at 300W, 450W, and 600W power limits. At 450W, inference retains 85-95% of full-power performance while cutting power draw by 25%.
| Model | 600W tok/s | 450W tok/s | Retention |
|---|---|---|---|
| Qwen3-32B Q4 | 34.3 | 31.1 | 91% |
| Llama-4-Scout Q4 (46GB) | 64.2 | 60.9 | 95% |
| Qwen3-14B Q4 | 76.5 | 65.9 | 86% |
| Qwen3-235B Q2 (46GB) | 32.4 | 28.9 | 89% |
For an always-on inference server, 450W is the obvious setting. Annual electricity at 450W sustained and EUR 0.15/kWh comes to roughly EUR 590, versus EUR 790 at full power.
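The electricity figures follow directly from sustained draw times hours times price; the limit itself is applied with `nvidia-smi -pl 450` (root required). A quick reproduction:

```python
# Annual electricity cost at a sustained power draw.
HOURS_PER_YEAR = 24 * 365
PRICE_EUR_PER_KWH = 0.15

for watts in (450, 600):
    kwh = watts / 1000 * HOURS_PER_YEAR
    print(f"{watts} W sustained: {kwh:,.0f} kWh/year, "
          f"~EUR {kwh * PRICE_EUR_PER_KWH:,.0f}")
```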
Multi-GPU limitations
The Pro 6000 has no NVLink. Multi-GPU communication goes over PCIe Gen 5 x16: about 128 GB/s bidirectional, versus 900 GB/s on H100 NVLink.
For data parallelism (running separate model copies on each GPU for more concurrent requests), this does not matter. Each GPU works independently.
For tensor parallelism (splitting a single large model across GPUs), the PCIe bottleneck is real. CloudRift benchmarks showed 8x RTX Pro 6000 reaching only about one-third the throughput of 8x H100 SXM on models that require 8-way tensor parallelism.
This card works best with models that fit on one or two cards. If you need to split across four or more GPUs, datacenter hardware with NVLink is worth the price premium.
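In vLLM terms, the two modes look like the sketch below. Model names are placeholders; the data-parallel variant is launched once per GPU, the tensor-parallel variant once for both.

```python
# Two scaling modes for a pair of Pro 6000s (placeholder model names).
import os
from vllm import LLM

MODE = os.environ.get("MODE", "tensor")  # "data" or "tensor"

if MODE == "data":
    # Data parallelism: run this process once per GPU, e.g. with
    # CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=1. Each replica serves
    # its own requests, so PCIe bandwidth never enters the picture.
    llm = LLM(model="Qwen/Qwen2.5-32B-Instruct")
else:
    # Tensor parallelism: one engine shards every layer across both GPUs.
    # Activations cross the ~128 GB/s PCIe link on every layer, which is the
    # bottleneck that NVLink-equipped datacenter parts avoid.
    llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=2)
```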
Known issues (April 2026)
The card shipped with several documented problems. Some have partial fixes, others remain open.
The open-source NVIDIA kernel modules are mandatory from Blackwell onward. The proprietary kernel modules do not support the architecture, and the GPU System Processor (GSP) firmware cannot be disabled.
SM120 (the Blackwell silicon in this card) is not binary compatible with SM100 (datacenter Blackwell) or SM90 (Hopper): kernels compiled for those targets fail on SM120. This breaks DeepSeek models in vLLM and causes ML framework compatibility issues across the board. You also cannot run Blackwell alongside NVIDIA GPUs older than Turing on the same host, since the required open kernel modules do not support them.
A virtualization reset bug causes the GPU to enter an unrecoverable state after VM shutdown or reassignment: PCI config space reads return all 0xFF, and nothing short of a full system power cycle recovers the card. A $1,000 bounty was posted. Partial fixes arrived in driver 580+.
Under sustained vLLM inference, the GPU can hit NV_ERR_GPU_IN_FULLCHIP_RESET, showing ERR! in nvidia-smi. This has been reported at 28C, so it is not thermal. It persists across all tested driver versions.
Killing hung NVIDIA processes can corrupt the vBIOS permanently. The GPU then reads as ??.??.??.?? with Xid 143 errors. No software recovery exists; only RMA works. NVIDIA does not provide public vBIOS downloads.
The Server Edition has a cable trap: the default Supermicro power cable (CBL-PWEX-0962Y-30) limits power to 450W via sense pins. The CBL-PWEX-1364Y-30 cable is needed for 600W.
Multi-Instance GPU (MIG) requires vBIOS version 98.02.55.00.00 or later. Many cards shipped below this version.
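A quick way to check a card against that threshold is to read the vBIOS string over NVML. The comparison below assumes the usual dotted NVIDIA version format and is a convenience sketch, not an official check.

```python
# Read the vBIOS version and compare it to the MIG minimum noted above.
import pynvml

MIG_MIN = "98.02.55.00.00"

def fields(version: str):
    return tuple(int(part, 16) for part in version.split("."))  # hex-style fields

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
vbios = pynvml.nvmlDeviceGetVbiosVersion(handle)
if isinstance(vbios, bytes):          # older pynvml builds return bytes
    vbios = vbios.decode()
status = "OK for MIG" if fields(vbios) >= fields(MIG_MIN) else "needs a vBIOS update"
print(f"vBIOS {vbios}: {status}")
pynvml.nvmlShutdown()
```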
The passive Server Edition reaches 100C within minutes under compute load and throttles at 104-105C. Without active airflow, it sustains about 125W. High-RPM fans in a directed shroud are mandatory for server deployment.
Comparison to other GPUs
| GPU | VRAM | Bandwidth | TDP | Price | Max Model (Q4) |
|---|---|---|---|---|---|
| RTX Pro 6000 | 96 GB GDDR7 | 1,792 GB/s | 600W | ~$8,500 | ~120-140B |
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 575W | ~$3,000 street | ~32B |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 450W | ~$1,600-2,000 | ~24B |
| L40S | 48 GB GDDR6 | 864 GB/s | 350W | ~$8,800 | ~48B |
| H100 SXM | 80 GB HBM3 | 3,352 GB/s | 700W | ~$27,500 | ~70B FP8 |
The Pro 6000 sits between consumer and datacenter: 3x the VRAM of the RTX 5090 at roughly 3x the price, with matching bandwidth. It matches H100 throughput on single-GPU workloads at a third of the cost. The tradeoffs are the missing NVLink, which matters only for multi-GPU scaling, and the lower bandwidth of GDDR7 versus HBM3, which mainly shows up in single-stream decode speed.
At $88.50 per GB of VRAM, it is the cheapest professional GPU by a wide margin. The H100 costs $343.75/GB.
Use cases
Where it works well:
- Running 70B+ models locally for privacy, latency, or regulatory reasons
- Development and testing before datacenter deployment
- Serving multiple smaller models via MIG partitioning
- High-utilization batch inference (>50% utilization) where break-even versus cloud is favorable
- Air-gapped environments
Where it does not:
- Models that fit on an RTX 5090 (32 GB), which offers better price-performance for sub-32B models
- Large-scale serving with 4+ GPU tensor parallelism, where H100/H200 NVLink is far superior
- Low utilization (<25%), where cloud rental or API subscriptions cost less
- Replacing frontier closed models for quality-sensitive work. The best open models that fit in 96 GB (Llama 3.3 70B or Qwen3-32B) score 70-120 Elo below Claude Sonnet 4.6 on the LMSYS Arena. The quality gap is in the model weights, not the hardware
At Pulsed Media
Pulsed Media runs its own hardware in owned datacenter space. GPU inference cards like the RTX Pro 6000 are evaluated against the same cost-per-unit economics that drive Seedbox and dedicated server pricing. For most PM operations, cloud API subscriptions currently deliver better quality-per-euro than local GPU inference — but bulk classification and privacy-constrained workloads remain candidates for on-premise hardware.
See also
- Self-Hosting LLMs vs API — cost and quality comparison of this card against cloud API subscriptions
- NVMe — the storage interface used alongside GPU inference servers
- Seedbox — PM's primary product, where hardware economics directly apply
- RAID — storage redundancy used in the same infrastructure