Test Overview
The test environment was standardized across all models to ensure fairness and reproducibility.
GPU: NVIDIA RTX Pro 5000 (Blackwell architecture)
- Microarchitecture: Blackwell
- Compute capability: 12.0
- CUDA Cores: 14,080
- Tensor Cores: 440
- GPU Memory: 48GB GDDR7
- FP32 Performance: 66.94 TFLOPS
Backend Environment
- Inference framework: Ollama 0.13.5
- Quantization: 4-bit
- Execution mode: Single-model, single-GPU
- Prompt: Identical for all models
Although the prompt itself is simple, the length, structure, and detail of the generated answers vary significantly between models, which naturally affects token counts and generation time. This variation is considered part of the model’s real inference behavior rather than noise.
Input Settings
All models were asked the same simple question:
“What is GPU?”
This intentionally lightweight prompt helps isolate inference characteristics while avoiding complex multi-step reasoning that could obscure raw performance differences.
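A run like this could be issued through Ollama's REST API. The following is a minimal sketch, assuming a local Ollama server on its default port (11434); the helper names `build_request` and `run_once` are illustrative and not part of the benchmark.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"
# The single prompt used for every model in this benchmark.
PROMPT = "What is GPU?"


def build_request(model: str, prompt: str = PROMPT) -> dict:
    """Build a non-streaming request so the timing fields arrive in one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}


def run_once(model: str) -> dict:
    """Send the prompt to the server and return the parsed JSON response.

    The response carries total_duration, load_duration, prompt_eval_count,
    prompt_eval_duration, eval_count and eval_duration (durations in nanoseconds).
    """
    body = json.dumps(build_request(model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling `run_once("gpt-oss:20b")` would run the prompt once against that model; the tag shown is an assumption, so substitute whichever tag is installed locally.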
Models Evaluated
The benchmark covers a wide range of model sizes and families, all running in 4-bit quantized form:
- GPT-OSS 20B
- DeepSeek-R1 (14B, 32B, 70B)
- Gemma 3 27B
- LLaMA 3.3 70B
- Qwen 3 32B
- Qwen 2.5 72B
Model sizes range from 9 GB to 47 GB, allowing evaluation of both mid-size and large models within the memory constraints of a single RTX Pro 5000.
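To make the memory constraint concrete, here is a small sketch that subtracts each model's listed size from the card's 48 GB. The helper name `headroom_gb` is illustrative, and the result is only a rough upper bound on what is left for the KV cache and runtime buffers.

```python
GPU_MEMORY_GB = 48  # RTX Pro 5000 memory, from the test overview

# Model sizes in GB, as listed in the benchmark results table.
MODEL_SIZES_GB = {
    "gpt-oss 20b": 14,
    "deepseek-r1 14b": 9,
    "deepseek-r1 32b": 20,
    "deepseek-r1 70b": 43,
    "gemma3 27b": 17,
    "llama3.3 70b": 43,
    "qwen3 32b": 20,
    "qwen2.5 72b": 47,
}


def headroom_gb(size_gb: float, gpu_gb: float = GPU_MEMORY_GB) -> float:
    """Memory left after the weights are loaded (KV cache and buffers must fit here)."""
    return gpu_gb - size_gb


# Print models from smallest to largest with their remaining headroom.
for name, size in sorted(MODEL_SIZES_GB.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {size:3d} GB  headroom {headroom_gb(size):3.0f} GB")
```

By this estimate the largest model, Qwen 2.5 72B, leaves about 1 GB free, so long contexts may not fit alongside it.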
RTX Pro 5000 Ollama Benchmark Results
| Models | gpt-oss | deepseek-r1 | deepseek-r1 | deepseek-r1 | gemma3 | llama3.3 | qwen3 | qwen2.5 |
|---|---|---|---|---|---|---|---|---|
| Parameters | 20b | 14b | 32b | 70b | 27b | 70b | 32b | 72b |
| Size (GB) | 14 | 9 | 20 | 43 | 17 | 43 | 20 | 47 |
| Quantization (bits) | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Running on | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 |
| Download Speed (mb/s) | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 |
| CPU Utilization | 8.1% | 3.4% | 4.5% | 4.5% | 6.2% | 4.3% | 5.8% | 4.3% |
| RAM Utilization | 5.8% | 3.5% | 4.3% | 3.4% | 6.1% | 3.5% | 4.5% | 3.5% |
| GPU Utilization | 61% | 75% | 85% | 93% | 78% | 91% | 90% | 93% |
| Total Duration (s) | 3.3 | 25.98 | 8.29 | 16.34 | 51.18 | 24.16 | 42.87 | 25.50 |
| Load Duration (ms) | 458.72 | 213.21 | 233.49 | 270.20 | 525.45 | 296.82 | 265.99 | 204.39 |
| Prompt Eval Count (tokens) | 70 | 6 | 6 | 6 | 12 | 13 | 13 | 32 |
| Prompt Eval Duration (ms) | 118.95 | 19.88 | 97.77 | 80.97 | 19.33 | 60.52 | 6.16 | 74.24 |
| Prompt Eval Rate (tokens/s) | 588.48 | 301.81 | 61.36 | 74.10 | 620.80 | 214.80 | 2110.39 | 431.01 |
| Eval Count (tokens) | 364 | 492 | 366 | 386 | 1477 | 581 | 1656 | 558 |
| Eval Duration (s) | 2.07 | 4.47 | 6.83 | 14.80 | 28.90 | 22.30 | 34.80 | 23.52 |
| Eval Rate (tokens/s) | 175.26 | 110.01 | 53.53 | 26.07 | 51.10 | 26.05 | 47.58 | 23.72 |
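Both rate rows follow from the count and duration rows: a rate is tokens divided by elapsed seconds. The sketch below recomputes them from the table's counts and durations as a quick consistency check; the function names are illustrative.

```python
# (prompt tokens, prompt eval ms, generated tokens, eval seconds), copied from the table.
ROWS = {
    "gpt-oss 20b":     (70, 118.95, 364, 2.07),
    "deepseek-r1 14b": (6, 19.88, 492, 4.47),
    "deepseek-r1 32b": (6, 97.77, 366, 6.83),
    "deepseek-r1 70b": (6, 80.97, 386, 14.80),
    "gemma3 27b":      (12, 19.33, 1477, 28.90),
    "llama3.3 70b":    (13, 60.52, 581, 22.30),
    "qwen3 32b":       (13, 6.16, 1656, 34.80),
    "qwen2.5 72b":     (32, 74.24, 558, 23.52),
}


def prompt_rate(tokens: int, duration_ms: float) -> float:
    """Prompt evaluation rate in tokens/s; the table gives this duration in milliseconds."""
    return tokens / (duration_ms / 1000.0)


def eval_rate(tokens: int, duration_s: float) -> float:
    """Generation rate in tokens/s; the table gives this duration in seconds."""
    return tokens / duration_s


for name, (p_tok, p_ms, e_tok, e_s) in ROWS.items():
    print(f"{name:16s} prompt {prompt_rate(p_tok, p_ms):8.2f} tok/s   "
          f"eval {eval_rate(e_tok, e_s):7.2f} tok/s")
```

Small differences from the table's eval rates are expected, because the durations shown there are rounded to two decimals.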
GPU Utilization and Resource Efficiency
One of the most notable results is the high GPU utilization achieved by large models, even under Ollama’s relatively simple execution model:
- GPT-OSS 20B maintains ~61% GPU utilization.
- 32B–70B class models consistently reach 85%–93% GPU utilization.
- Qwen 2.5 72B and DeepSeek-R1 70B both sustain ~93% GPU usage.
This indicates that the RTX Pro 5000 stays busy rather than idle, even when running very large 4-bit quantized models. CPU and RAM utilization remain low across all tests (at most 8.1% and 6.1% respectively), confirming that inference is GPU-dominated.
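The benchmark does not say how GPU utilization was sampled. One common approach, sketched here as an assumption, is to poll `nvidia-smi` while a model is generating and average the readings; the helper names are illustrative.

```python
import subprocess


def parse_utilization(csv_text: str) -> list[int]:
    """Parse one utilization percentage per line (one line per GPU)."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_gpu_utilization() -> list[int]:
    """Query current GPU utilization via nvidia-smi (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)


def mean(values: list[int]) -> float:
    """Average a series of samples collected during generation."""
    return sum(values) / len(values)
```

Calling `sample_gpu_utilization()` at a fixed interval during generation and passing the collected values to `mean` gives a single figure comparable to the GPU Utilization row.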
Model Loading and Prompt Evaluation
Load Duration
Model load times do not track model size in these runs:
- The 43–47 GB models (DeepSeek-R1 70B, LLaMA 3.3 70B, Qwen 2.5 72B) load in ~204–297 ms.
- Models of 20 GB or less load in ~213–525 ms, with GPT-OSS 20B (~459 ms) and Gemma 3 27B (~525 ms) the slowest.
Every model loads in roughly half a second or less, which is quick enough for practical use and suggests that the RTX Pro 5000 and Ollama’s loading mechanism work well together.
Prompt Evaluation
Prompt evaluation latency varies with model architecture and with the number of prompt tokens each model reports (6 to 70 for the same question):
- GPT-OSS 20B evaluates the largest prompt, 70 tokens, in ~119 ms (~588 tokens/s).
- DeepSeek-R1 32B and 70B take ~98 ms and ~81 ms for only 6 prompt tokens, the highest per-token prompt evaluation cost in the table.
- Prompt evaluation rates, computed as prompt tokens divided by prompt evaluation duration, range from ~61 tokens/s (DeepSeek-R1 32B) to over 2,000 tokens/s (Qwen 3 32B). With prompts this short, the spread likely reflects fixed per-request overhead and architectural differences rather than hardware limitations.
Token Generation Performance
Why Eval Rate (tokens/s) Matters Most
Ollama reports multiple performance metrics, including:
- Prompt evaluation time and rate (prefill stage)
- Model load time
- Evaluation duration
- Eval rate (tokens per second) during generation
For practical AI applications such as chatbots, APIs, internal assistants, or agent-based systems, the generation phase dominates the user experience. Once the prompt is processed, the perceived responsiveness is almost entirely determined by how quickly tokens are produced.
For this reason, eval rate (tokens/s) is the most meaningful indicator in this benchmark and serves as the primary comparison metric in the analysis below.
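The claim that generation dominates can be checked by splitting a run's total duration into its phases. The sketch below assumes a response dictionary with the nanosecond duration fields that Ollama's API returns, and fills it with the DeepSeek-R1 70B figures from the results table; `phase_shares` is an illustrative name.

```python
def phase_shares(resp: dict) -> dict:
    """Return each phase's fraction of the total duration.

    The shares may sum to less than 1 because the total also includes
    time outside these three phases.
    """
    total = resp["total_duration"]
    return {
        "load": resp["load_duration"] / total,
        "prefill": resp["prompt_eval_duration"] / total,
        "generation": resp["eval_duration"] / total,
    }


# DeepSeek-R1 70B row from the table, converted to nanoseconds.
deepseek_r1_70b = {
    "total_duration": 16_340_000_000,      # 16.34 s
    "load_duration": 270_200_000,          # 270.20 ms
    "prompt_eval_duration": 80_970_000,    # 80.97 ms
    "eval_duration": 14_800_000_000,       # 14.80 s
}

for phase, share in phase_shares(deepseek_r1_70b).items():
    print(f"{phase:10s} {share:6.1%}")
```

For this row, generation accounts for roughly nine tenths of the total, while load and prefill together stay in the low single digits of a percent.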
Evaluation Token Rate
The most relevant metric for user-facing inference is sustained token generation speed:
- GPT-OSS 20B leads with ~175 tokens/s. Every model here runs at 4-bit, so quantization alone does not explain the lead; its mixture-of-experts design, which activates only a subset of parameters per token, is the likelier reason.
- DeepSeek-R1 14B achieves ~110 tokens/s, showing strong performance relative to size.
- 27B–32B-class models stabilize around 45–55 tokens/s.
- 70B–72B models operate in the 23–26 tokens/s range.
These results demonstrate that RTX Pro 5000 can comfortably run 70B-class INT4 models, albeit with reduced generation speed compared to smaller models.
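To translate these rates into waiting time, the sketch below estimates how long each model would need to generate a reply of a given length. The 500-token reply is a hypothetical figure chosen for illustration, and the estimate ignores load and prompt evaluation time.

```python
# Eval rates in tokens/s, copied from the results table.
EVAL_RATE_TOKENS_PER_S = {
    "gpt-oss 20b": 175.26,
    "deepseek-r1 14b": 110.01,
    "deepseek-r1 32b": 53.53,
    "deepseek-r1 70b": 26.07,
    "gemma3 27b": 51.10,
    "llama3.3 70b": 26.05,
    "qwen3 32b": 47.58,
    "qwen2.5 72b": 23.72,
}


def generation_seconds(num_tokens: int, tokens_per_s: float) -> float:
    """Time to generate num_tokens at a steady rate (generation phase only)."""
    return num_tokens / tokens_per_s


REPLY_TOKENS = 500  # hypothetical reply length, not a benchmark value

for name, rate in EVAL_RATE_TOKENS_PER_S.items():
    print(f"{name:16s} {generation_seconds(REPLY_TOKENS, rate):6.1f} s")
```

At these rates a 500-token reply takes under 3 seconds on GPT-OSS 20B and about 21 seconds on Qwen 2.5 72B.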
End-to-End Inference Duration
Total inference duration depends on both model size and the number of tokens generated:
- GPT-OSS 20B completes a full inference cycle in just over 3 seconds.
- Mid-size models (27B–32B) range from about 8 seconds (DeepSeek-R1 32B, 366 tokens) to about 51 seconds (Gemma 3 27B, 1,477 tokens).
- Large models (70B+) take between 16 and 26 seconds, depending on output length and internal reasoning behavior.
Despite longer runtimes, inference remains stable, with no evidence of throttling or resource starvation.
Key Findings
- RTX Pro 5000 Blackwell is capable of running up to 70B-class models in INT4 quantization on a single GPU.
- GPU utilization remains consistently high, indicating efficient hardware usage under Ollama.
- Quantization plays a critical role in enabling large-model inference within practical latency bounds.
- GPT-OSS 20B offers the best overall performance, combining high token throughput with low latency.
- Larger reasoning-focused models trade speed for capability but remain operationally viable.
Conclusion
The NVIDIA RTX Pro 5000 Blackwell Server GPU demonstrates strong and reliable inference performance when paired with Ollama, particularly for quantized large language models. It strikes a practical balance between compute capability, memory capacity, and efficiency, making it suitable for private inference, development environments, and small-scale production deployments.
For users prioritizing speed and responsiveness, 20B–32B INT4 models represent an optimal choice. For advanced reasoning tasks, 70B-class models are fully usable, provided that lower token generation speed is acceptable.
Overall, RTX Pro 5000 positions itself as a capable single-GPU inference solution for modern LLM workloads in Ollama-based environments.
