NVIDIA RTX Pro 6000 (Blackwell)
The NVIDIA RTX Pro 6000 is a professional workstation GPU based on the Blackwell architecture (GB202). It shipped in April 2025 with 96 GB of GDDR7 ECC memory, making it the only GPU under $10,000 that can run 70B parameter language models on a single card without quantizing below Q4.
Specifications
| Specification | Value |
|---|---|
| Architecture | Blackwell (GB202), TSMC 4NP |
| CUDA Cores | 24,064 |
| Tensor Cores | 752 (5th gen, native FP4/FP8) |
| VRAM | 96 GB GDDR7 ECC |
| Memory Bus | 512-bit |
| Memory Bandwidth | 1,792 GB/s |
| PCIe | Gen 5 x16 |
| NVLink | Not available |
| TDP | 600W (Workstation), 300W (Max-Q), 450-600W (Server) |
| Compute Capability | 12.0 (SM120) |
| MIG | Up to 4 instances |
| Price (April 2026) | $7,999-9,200 |
The card comes in three editions. The Workstation edition runs at 600W with active dual-fan cooling. The Max-Q uses the same full GB202 die but is power-limited to 300W with a passive/blower cooler, intended for multi-GPU density builds. The Server edition runs passively at 450-600W depending on the power cable installed.
All three editions share identical silicon: same 24,064 CUDA cores, same 96 GB GDDR7, same memory bandwidth. Only the power limit and cooling differ.
What fits on 96 GB
The 96 GB of VRAM is what sets this card apart for LLM work.
| Model | Quantization | Weight Size | Fits? | Remaining for KV Cache |
|---|---|---|---|---|
| 7-8B | Q4_K_M | ~4-5 GB | Yes | ~91 GB |
| 14B | Q4_K_M | ~8 GB | Yes | ~88 GB |
| 32B | Q4_K_M | ~18 GB | Yes | ~78 GB |
| 70B | Q4_K_M | ~38 GB | Yes | ~58 GB |
| 70B | FP8 | ~70 GB | Yes | ~26 GB |
| 120B MoE | Q8_0 | ~60 GB | Yes | ~36 GB |
| 405B | IQ2_XXS | ~100 GB | Barely (partial CPU offload) | Minimal (2.68 tok/s) |
| 405B | Q4_K_M | ~200 GB | No | Needs 2+ GPUs |
70B models at Q4 quantization hit the practical optimum. At 38 GB for weights, 58 GB remains for KV cache, enough for long context windows. No consumer GPU comes close: the RTX 5090 tops out at 32 GB (max ~32B at Q4) and the RTX 4090 at 24 GB.
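The table is easy to reproduce for other models with a back-of-the-envelope budget: weights plus KV cache must fit under 96 GB. The sketch below is an estimate only; the layer count, KV-head count, and head dimension are illustrative values for a Llama-3.3-70B-class model and should be taken from the actual config.json of whatever model you plan to run.

```python
# Back-of-the-envelope VRAM budget: weights + KV cache on a single 96 GB card.
# Layer/head values below are illustrative (Llama-3.3-70B-class); take the
# real ones from the model's config.json.

def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB at a given quantization width."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_gb_per_token(layers: int, kv_heads: int, head_dim: int,
                    bytes_per_elem: int = 2) -> float:
    """KV cache per token: K and V tensors, per layer, per KV head (GQA), FP16."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem / 1e9

VRAM_GB = 96
weights = weight_size_gb(70, 4.5)        # ~39 GB at ~4.5 bits/weight (Q4_K_M-ish)
per_token = kv_gb_per_token(80, 8, 128)  # ~0.33 MB per cached token
kv_budget = VRAM_GB - weights - 2        # keep ~2 GB headroom for runtime/activations
print(f"weights ~{weights:.0f} GB, KV budget ~{kv_budget:.0f} GB, "
      f"~{kv_budget / per_token:,.0f} cacheable tokens")
```

Even with conservative headroom, the leftover budget covers context in the hundred-thousand-token range for a GQA 70B model, which is why the 70B-at-Q4 row is the practical optimum.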
LLM inference benchmarks
Single-stream generation (llama.cpp)
Tested on the Workstation edition at 600W, Ubuntu 24.04, CUDA 12.8:
| Model | Quantization | Tokens/sec (generation) |
|---|---|---|
| Llama 2 7B | Q4_0 | 279 |
| Mistral Nemo 12B | Q4_K_M | 163 |
| Qwen3 30B-A3B (MoE) | Q4_K_M | 252 |
| Llama 3.3 70B | Q4_K_M | 34 |
| Llama 3.1 405B | IQ2_XXS | 2.68 |
34 tokens per second on a 70B model is usable for interactive work: roughly 25 words per second, fast enough that you are not waiting for the model to finish a paragraph.
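The single-stream numbers line up with a simple memory-bandwidth argument: each generated token streams roughly the entire weight set from VRAM once, so bandwidth divided by weight size gives a hard ceiling on decode speed. The sketch below is an approximation, not a measurement; the weight sizes are the ones from the table above.

```python
# Decode-speed ceiling from memory bandwidth: each token re-reads ~all weights,
# so tok/s cannot exceed bandwidth / weight_bytes. Approximation only.

BANDWIDTH_GBPS = 1792  # RTX Pro 6000 GDDR7

def decode_ceiling(weight_gb: float) -> float:
    return BANDWIDTH_GBPS / weight_gb

for name, weight_gb, measured in [("Llama 3.3 70B Q4_K_M", 38, 34),
                                  ("Llama 2 7B Q4_0", 4.0, 279)]:
    ceiling = decode_ceiling(weight_gb)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured {measured} tok/s "
          f"({measured / ceiling:.0%} of the roofline)")
```

MoE models such as Qwen3 30B-A3B beat their dense size class here because only the active experts are read per token.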
Batched serving (vLLM)
With concurrent requests, throughput scales well beyond single-stream numbers:
| Model | GPU Config | Throughput (tok/s) |
|---|---|---|
| 30B (Qwen3-Coder) | 1x Pro 6000 | 8,425 |
| 70B (Llama 3.3) | 1x Pro 6000 | 1,031 |
| 96 GB model (GLM-4.5-Air) | 1x Pro 6000 | 3,169 |
| 96 GB model (GLM-4.5-Air) | 1x H100 SXM | 2,987 |
On models that fit on a single card, the Pro 6000 matches or beats the H100 SXM at one-third the hardware cost. The H100 pulls ahead only when you need multi-GPU tensor parallelism, where its NVLink (900 GB/s) leaves the Pro 6000's PCIe (128 GB/s) behind.
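For reference, a minimal vLLM offline-inference sketch for a single-card model. The model name, the on-the-fly FP8 quantization flag, and the limits are assumptions for illustration, not the configuration behind the numbers above; the point is that a 70B model fits on one card once its weights drop to ~70 GB.

```python
# Minimal single-GPU vLLM sketch (assumed configuration, not the benchmark setup).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # placeholder model name
    quantization="fp8",            # ~70 GB of weights, fits in 96 GB (see table above)
    gpu_memory_utilization=0.90,   # leave headroom for the runtime
    max_model_len=32768,           # cap context to bound the KV cache
)
params = SamplingParams(max_tokens=256, temperature=0.7)
out = llm.generate(["Summarize the RTX Pro 6000 in one sentence."], params)
print(out[0].outputs[0].text)
```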
Power consumption
The 600W TDP is the ceiling, not the norm. Real-world power depends on the workload.
LLM token generation is memory-bandwidth bound, not compute bound. The GPU spends most of its time waiting on memory reads rather than doing math, so the decode phase draws well below the rated TDP; only prompt processing and training push toward it.
| Workload | Typical GPU Draw | Notes |
|---|---|---|
| Idle | 17-20W | Comparable to consumer cards |
| Small model inference (<10B) | 80-150W | Memory bound, barely uses compute |
| Large model inference (70B decode) | 200-350W | Sustained generation |
| Prompt processing (prefill) | 350-500W | Compute-heavy phase |
| Training/fine-tuning | 450-530W | Sustained high compute |
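These draw figures are easy to check yourself. A small NVML polling loop (via the nvidia-ml-py package), such as the sketch below, reports instantaneous board power and temperature while a workload runs.

```python
# Poll board power and temperature once per second using NVML
# (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    for _ in range(10):
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # reported in mW
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"{power_w:6.1f} W  {temp_c} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```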
Running at 450W
Naveen Kulandaivelu tested multiple models at 300W, 450W, and 600W power limits. At 450W, inference retains 85-95% of full-power performance while cutting power draw by 25%.
| Model | 600W tok/s | 450W tok/s | Retention |
|---|---|---|---|
| Qwen3-32B Q4 | 34.3 | 31.1 | 91% |
| Llama-4-Scout Q4 (46GB) | 64.2 | 60.9 | 95% |
| Qwen3-14B Q4 | 76.5 | 65.9 | 86% |
| Qwen3-235B Q2 (46GB) | 32.4 | 28.9 | 89% |
For an always-on inference server, 450W is the obvious setting. Annual electricity at 450W sustained and EUR 0.15/kWh comes to roughly EUR 590, versus EUR 790 at full power.
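The electricity figures follow directly from sustained draw times hours times price; the limit itself is applied with `nvidia-smi -pl 450` (root required). A quick reproduction:

```python
# Annual electricity cost at a sustained power draw.
HOURS_PER_YEAR = 24 * 365
PRICE_EUR_PER_KWH = 0.15

for watts in (450, 600):
    kwh = watts / 1000 * HOURS_PER_YEAR
    print(f"{watts} W sustained: {kwh:,.0f} kWh/year, "
          f"~EUR {kwh * PRICE_EUR_PER_KWH:,.0f}")
```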
Multi-GPU limitations
The Pro 6000 has no NVLink. Multi-GPU communication goes over PCIe Gen 5 x16: about 128 GB/s bidirectional, versus 900 GB/s on H100 NVLink.
For data parallelism (running separate model copies on each GPU for more concurrent requests), this does not matter. Each GPU works independently.
For tensor parallelism (splitting a single large model across GPUs), the PCIe bottleneck is real. CloudRift benchmarks showed 8x RTX Pro 6000 reaching only about one-third the throughput of 8x H100 SXM on models that require 8-way tensor parallelism.
This card works best with models that fit on one or two cards. If you need to split across four or more GPUs, datacenter hardware with NVLink is worth the price premium.
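In vLLM terms, the two modes look like the sketch below. Model names are placeholders; the data-parallel variant is launched once per GPU, the tensor-parallel variant once for both.

```python
# Two scaling modes for a pair of Pro 6000s (placeholder model names).
import os
from vllm import LLM

MODE = os.environ.get("MODE", "tensor")  # "data" or "tensor"

if MODE == "data":
    # Data parallelism: run this process once per GPU, e.g. with
    # CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=1. Each replica serves
    # its own requests, so PCIe bandwidth never enters the picture.
    llm = LLM(model="Qwen/Qwen2.5-32B-Instruct")
else:
    # Tensor parallelism: one engine shards every layer across both GPUs.
    # Activations cross the ~128 GB/s PCIe link on every layer, which is the
    # bottleneck that NVLink-equipped datacenter parts avoid.
    llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=2)
```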
Known issues (April 2026)
The card shipped with several documented problems. Some have partial fixes, others remain open.
The open-source NVIDIA kernel modules are mandatory from Blackwell onward. The proprietary kernel modules do not support the architecture, and the GPU System Processor (GSP) firmware cannot be disabled.
SM120 (the Blackwell silicon in this card) is not binary compatible with SM100 (datacenter Blackwell) or SM90 (Hopper): kernels compiled for those targets fail on SM120. This breaks DeepSeek models in vLLM and causes ML framework compatibility issues across the board. You also cannot run Blackwell alongside NVIDIA GPUs older than Turing on the same host, since the required open kernel modules do not support them.
A virtualization reset bug causes the GPU to enter an unrecoverable state after VM shutdown or reassignment: PCI config space reads return all 0xFF, and nothing short of a full system power cycle recovers the card. A $1,000 bounty was posted. Partial fixes arrived in driver 580+.
Under sustained vLLM inference, the GPU can hit NV_ERR_GPU_IN_FULLCHIP_RESET, showing ERR! in nvidia-smi. This has been reported at 28C, so it is not thermal. It persists across all tested driver versions.
Killing hung NVIDIA processes can corrupt the vBIOS permanently. The GPU then reads as ??.??.??.?? with Xid 143 errors. No software recovery exists; only RMA works. NVIDIA does not provide public vBIOS downloads.
The Server Edition has a cable trap: the default Supermicro power cable (CBL-PWEX-0962Y-30) limits power to 450W via sense pins. The CBL-PWEX-1364Y-30 cable is needed for 600W.
Multi-Instance GPU (MIG) requires vBIOS version 98.02.55.00.00 or later. Many cards shipped below this version.
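A quick way to check a card against that threshold is to read the vBIOS string over NVML. The comparison below assumes the usual dotted NVIDIA version format and is a convenience sketch, not an official check.

```python
# Read the vBIOS version and compare it to the MIG minimum noted above.
import pynvml

MIG_MIN = "98.02.55.00.00"

def fields(version: str):
    return tuple(int(part, 16) for part in version.split("."))  # hex-style fields

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
vbios = pynvml.nvmlDeviceGetVbiosVersion(handle)
if isinstance(vbios, bytes):          # older pynvml builds return bytes
    vbios = vbios.decode()
status = "OK for MIG" if fields(vbios) >= fields(MIG_MIN) else "needs a vBIOS update"
print(f"vBIOS {vbios}: {status}")
pynvml.nvmlShutdown()
```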
The passive Server Edition reaches 100C within minutes under compute load and throttles at 104-105C. Without active airflow, it sustains about 125W. High-RPM fans in a directed shroud are mandatory for server deployment.
Comparison to other GPUs
| GPU | VRAM | Bandwidth | TDP | Price | Max Model (Q4) |
|---|---|---|---|---|---|
| RTX Pro 6000 | 96 GB GDDR7 | 1,792 GB/s | 600W | ~$8,500 | ~120-140B |
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 575W | ~$3,000 street | ~32B |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 450W | ~$1,600-2,000 | ~24B |
| L40S | 48 GB GDDR6 | 864 GB/s | 350W | ~$8,800 | ~48B |
| H100 SXM | 80 GB HBM3 | 3,352 GB/s | 700W | ~$27,500 | ~70B FP8 |
The Pro 6000 sits between consumer and datacenter: 3x the VRAM of the RTX 5090 at roughly 3x the price, with matching bandwidth. It matches H100 throughput on single-GPU workloads at a third of the cost. The tradeoffs are the missing NVLink, which matters only for multi-GPU scaling, and the lower bandwidth of GDDR7 versus HBM3, which mainly shows up in single-stream decode speed.
At $88.50 per GB of VRAM, it is the cheapest professional GPU by a wide margin. The H100 costs $343.75/GB.
Use cases
Where it works well:
- Running 70B+ models locally for privacy, latency, or regulatory reasons
- Development and testing before datacenter deployment
- Serving multiple smaller models via MIG partitioning
- High-utilization batch inference (>50% utilization) where break-even versus cloud is favorable
- Air-gapped environments
Where it does not:
- Models that fit on an RTX 5090 (32 GB), which offers better price-performance for sub-32B models
- Large-scale serving with 4+ GPU tensor parallelism, where H100/H200 NVLink is far superior
- Low utilization (<25%), where cloud rental or API subscriptions cost less
- Replacing frontier closed models for quality-sensitive work. The best open models that fit in 96 GB (Llama 3.3 70B or Qwen3-32B) score 70-120 Elo below Claude Sonnet 4.6 on the LMSYS Arena. The quality gap is in the model weights, not the hardware
At Pulsed Media
Pulsed Media runs its own hardware in owned datacenter space. GPU inference cards like the RTX Pro 6000 are evaluated against the same cost-per-unit economics that drive Seedbox and dedicated server pricing. For most PM operations, cloud API subscriptions currently deliver better quality-per-euro than local GPU inference — but bulk classification and privacy-constrained workloads remain candidates for on-premise hardware.
See also
- Self-Hosting LLMs vs API — cost and quality comparison of this card against cloud API subscriptions
- NVMe — the storage interface used alongside GPU inference servers
- Seedbox — PM's primary product, where hardware economics directly apply
- RAID — storage redundancy used in the same infrastructure