NVIDIA RTX Pro 5000 Blackwell Server Ollama Inference Benchmark Report

As local and self-hosted large language models continue to gain adoption, lightweight inference frameworks such as Ollama have become increasingly popular for rapid deployment, testing, and private inference. In these scenarios, GPU efficiency, model loading behavior, and steady-state token generation speed are often more important than peak theoretical performance.

This report evaluates the NVIDIA RTX Pro 5000 Blackwell Server GPU as an inference platform using Ollama 0.13.5, focusing on real-world, single-GPU inference performance across a range of quantized large language models. The goal is to assess how well RTX Pro 5000 handles different model sizes, parameter counts, and architectural characteristics under consistent runtime conditions.

Test Overview

The test environment was standardized across all models to ensure fairness and reproducibility.

GPU: NVIDIA RTX Pro 5000 (Blackwell architecture)

  • Microarchitecture: Blackwell
  • Compute capability: 12.0
  • CUDA Cores: 14,080
  • Tensor Cores: 440
  • GPU Memory: 48GB GDDR7
  • FP32 Performance: 66.94 TFLOPS

Backend Environment

  • Inference framework: Ollama 0.13.5
  • Quantization: 4-bit
  • Execution mode: Single-model, single-GPU
  • Prompt: Identical for all models

Although the prompt itself is simple, the length, structure, and detail of the generated answers vary significantly between models, which naturally affects token counts and generation time. This variation is considered part of the model’s real inference behavior rather than noise.

Input Settings

All models were asked the same simple question:

“What is GPU?”

This intentionally lightweight prompt helps isolate inference characteristics while avoiding complex multi-step reasoning that could obscure raw performance differences.
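A run like this could be issued against Ollama's REST API. The sketch below is a minimal illustration, not the harness used for this report: it assumes a local Ollama server on the default port 11434, the helper names `build_request` and `run_once` are hypothetical, and model tags such as `gpt-oss:20b` are assumed from the model names and parameter counts listed in the results.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
PROMPT = "What is GPU?"  # the prompt used throughout this report


def build_request(model: str, prompt: str = PROMPT) -> dict:
    """Build a non-streaming generate request for one model."""
    return {"model": model, "prompt": prompt, "stream": False}


def run_once(model: str, url: str = OLLAMA_URL) -> dict:
    """Send the prompt once and return the decoded JSON response.

    With "stream": False the reply is a single JSON object that also
    carries the timing fields (total_duration, eval_count, ...).
    """
    body = json.dumps(build_request(model)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example (needs a running Ollama server with the model already pulled):
# stats = run_once("gpt-oss:20b")
# print(stats["eval_count"], stats["eval_duration"])
```

On the command line, `ollama run <model> --verbose` prints the same timing fields after each response.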

Models Evaluated

The benchmark covers a wide range of model sizes and families, all running in 4-bit quantized form:

  • GPT-OSS 20B
  • DeepSeek-R1 (14B, 32B, 70B)
  • Gemma 3 27B
  • LLaMA 3.3 70B
  • Qwen 3 32B
  • Qwen 2.5 72B

Model sizes range from 9 GB to 47 GB, allowing evaluation of both mid-size and large models within the memory constraints of a single RTX Pro 5000.

Pro 5000 Ollama Benchmark Data Display

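| Models | gpt-oss | deepseek-r1 | deepseek-r1 | deepseek-r1 | gemma3 | llama3.3 | qwen3 | qwen2.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parameters | 20b | 14b | 32b | 70b | 27b | 70b | 32b | 72b |
| Size (GB) | 14 | 9 | 20 | 43 | 17 | 43 | 20 | 47 |
| Quantization (bits) | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Running on | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 |
| Downloading Speed (mb/s) | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 |
| CPU Utilization | 8.1% | 3.4% | 4.5% | 4.5% | 6.2% | 4.3% | 5.8% | 4.3% |
| RAM Utilization | 5.8% | 3.5% | 4.3% | 3.4% | 6.1% | 3.5% | 4.5% | 3.5% |
| GPU Utilization | 61% | 75% | 85% | 93% | 78% | 91% | 90% | 93% |
| Total Duration (s) | 3.32 | 5.98 | 8.29 | 16.34 | 51.18 | 24.16 | 42.87 | 25.50 |
| Load Duration (ms) | 458.72 | 213.21 | 233.49 | 270.20 | 525.45 | 296.82 | 265.99 | 204.39 |
| Prompt Eval Count (tokens) | 70 | 6 | 6 | 6 | 12 | 13 | 13 | 32 |
| Prompt Eval Duration (ms) | 118.95 | 19.88 | 97.77 | 80.97 | 19.33 | 60.52 | 6.16 | 74.24 |
| Prompt Eval Rate (tokens/s) | 588.48 | 0.30 | 61.36 | 74.10 | 0.62 | 214.80 | 2.11 | 431.01 |
| Eval Count (tokens) | 364 | 492 | 366 | 386 | 1477 | 581 | 1656 | 558 |
| Eval Duration (s) | 2.07 | 4.47 | 6.83 | 14.80 | 28.90 | 22.30 | 34.80 | 23.52 |
| Eval Rate (tokens/s) | 175.26 | 110.01 | 53.53 | 26.07 | 51.10 | 26.05 | 47.58 | 23.72 |

[Figure: real-time resource consumption of the Pro 5000 GPU server recorded during the tests]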

GPU Utilization and Resource Efficiency

One of the most notable results is the high GPU utilization achieved by large models, even under Ollama’s relatively simple execution model:

  • GPT-OSS 20B maintains ~61% GPU utilization.
  • 32B–70B class models consistently reach 85%–93% GPU utilization.
  • Qwen 2.5 72B and DeepSeek-R1 70B both sustain ~93% GPU usage.

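This indicates that the RTX Pro 5000 stays busy rather than idle, even when running very large INT4-quantized models. Note that GPU utilization measures how much of the time the GPU is executing kernels; on its own it does not show whether a workload is limited by compute or by memory bandwidth. CPU and RAM utilization remain low across all tests (at most 8.1% and 6.1%), confirming that inference runs almost entirely on the GPU.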
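The report does not state how utilization was sampled. One common approach, shown here only as an assumption, is to poll `nvidia-smi` while the model is generating. The helper names `parse_sample` and `sample_gpu` are hypothetical, and the sample line in the docstring is made up for illustration.

```python
import subprocess

# Query GPU utilization (%) and used memory (MiB) as bare CSV values.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str) -> tuple[int, int]:
    """Parse one CSV line such as '93, 45012' into (util %, memory MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample_gpu() -> list[tuple[int, int]]:
    """Query every visible GPU once; requires nvidia-smi on PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]
```

Calling `sample_gpu()` in a loop during generation and averaging the first field would give a figure comparable to the GPU utilization row, under the assumption that the original numbers were collected in a similar way.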


Model Loading and Prompt Evaluation

Load Duration

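Load durations do not track model size in this data set:

  • The shortest load is Qwen 2.5 72B (47 GB) at ~204 ms; the longest is Gemma 3 27B (17 GB) at ~525 ms.
  • GPT-OSS 20B (14 GB) takes ~459 ms, while both 43 GB models load in ~270–297 ms.
  • The remaining models fall between ~213 ms and ~266 ms.

Every model reports a load duration under 0.6 s. Figures this small for 40+ GB models suggest the weights were already resident in memory when the timed run started, so these values likely reflect warm-start overhead rather than a cold load from disk.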

Prompt Evaluation

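Prompt evaluation (prefill) covers very few tokens in this test, so its latency is small and noisy:

  • GPT-OSS 20B evaluates the largest prompt, 70 tokens, in ~119 ms (~588 tokens/s). The other models see only 6–32 prompt tokens for the same question, presumably because tokenizers and prompt templates differ.
  • The DeepSeek-R1 32B and 70B models take ~98 ms and ~81 ms for 6 tokens; the 14B model takes ~20 ms. Prefill cost depends on the prompt, not on the length of the reasoning the model generates afterwards.
  • The reported rates of 0.30 (DeepSeek-R1 14B), 0.62 (Gemma 3 27B), and 2.11 (Qwen 3 32B) do not match their token counts and durations, which work out to roughly 302, 621, and 2,110 tokens/s. These three entries appear to be expressed per millisecond rather than per second and should not be read as slow prefill.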
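Prompt evaluation rate is simply the prompt token count divided by the prompt evaluation duration. The sketch below recomputes it from the benchmark figures and lists the entries whose reported rate disagrees with the recomputed one. The function names and the 1% tolerance are illustrative choices, not part of the original test setup.

```python
# (model, prompt tokens, prompt eval duration in ms, reported rate)
rows = [
    ("gpt-oss 20b", 70, 118.95, 588.48),
    ("deepseek-r1 14b", 6, 19.88, 0.30),
    ("deepseek-r1 32b", 6, 97.77, 61.36),
    ("deepseek-r1 70b", 6, 80.97, 74.10),
    ("gemma3 27b", 12, 19.33, 0.62),
    ("llama3.3 70b", 13, 60.52, 214.80),
    ("qwen3 32b", 13, 6.16, 2.11),
    ("qwen2.5 72b", 32, 74.24, 431.01),
]


def recomputed_rate(tokens: int, duration_ms: float) -> float:
    """Tokens per second from a token count and a duration in milliseconds."""
    return tokens / (duration_ms / 1000.0)


def mismatches(rows, tolerance=0.01):
    """Return rows whose reported rate is off by more than `tolerance` (relative)."""
    out = []
    for model, tokens, ms, reported in rows:
        rate = recomputed_rate(tokens, ms)
        if abs(rate - reported) / rate > tolerance:
            out.append((model, round(rate, 1), reported))
    return out


for model, rate, reported in mismatches(rows):
    print(f"{model}: recomputed {rate} tokens/s, reported {reported}")
```

Three rows are flagged (DeepSeek-R1 14B, Gemma 3 27B, Qwen 3 32B); in each case the reported value is about a thousandth of the recomputed rate.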

Token Generation Performance

Why Eval Rate (tokens/s) Matters Most

Ollama reports multiple performance metrics, including:

  • Prompt evaluation time and rate (prefill stage)
  • Model load time
  • Evaluation duration
  • Eval rate (tokens per second) during generation
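Ollama's API reports these timings as raw counts and nanosecond durations, so the rates are derived values. As a minimal sketch, the function below converts a response into report-style metrics. The name `summarize` and the output keys are hypothetical, and the sample values are reconstructed from the rounded GPT-OSS 20B figures in this report rather than taken from a captured response.

```python
NS_PER_S = 1_000_000_000
NS_PER_MS = 1_000_000


def summarize(resp: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into report-style metrics."""
    prompt_s = resp["prompt_eval_duration"] / NS_PER_S
    eval_s = resp["eval_duration"] / NS_PER_S
    return {
        "total_s": resp["total_duration"] / NS_PER_S,
        "load_ms": resp["load_duration"] / NS_PER_MS,
        "prompt_rate": resp["prompt_eval_count"] / prompt_s,  # prefill tokens/s
        "eval_rate": resp["eval_count"] / eval_s,  # generation tokens/s
    }


# Illustrative values rebuilt from the rounded GPT-OSS 20B results.
sample = {
    "total_duration": 3_320_000_000,
    "load_duration": 458_720_000,
    "prompt_eval_count": 70,
    "prompt_eval_duration": 118_950_000,
    "eval_count": 364,
    "eval_duration": 2_070_000_000,
}
metrics = summarize(sample)
print(f"{metrics['eval_rate']:.1f} tokens/s")
```

Because the durations here are rounded to two decimals, the recomputed eval rate (about 175.8 tokens/s) differs slightly from the reported 175.26 tokens/s.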

For practical AI applications such as chatbots, APIs, internal assistants, or agent-based systems, the generation phase dominates the user experience. Once the prompt is processed, the perceived responsiveness is almost entirely determined by how quickly tokens are produced.

For this reason, eval rate (tokens/s) is the most meaningful indicator in this benchmark and serves as the primary comparison metric in the analysis below.

Evaluation Token Rate

The most relevant metric for user-facing inference is sustained token generation speed:

  • GPT-OSS 20B leads with ~175 tokens/s. All models here are 4-bit, so the lead comes from its mixture-of-experts architecture, which activates only a fraction of its parameters per token, rather than from quantization.
  • DeepSeek-R1 14B achieves ~110 tokens/s, showing strong performance relative to size.
  • 27B–32B models land at ~48–54 tokens/s.
  • 70B–72B models operate in the ~24–26 tokens/s range.

These results demonstrate that RTX Pro 5000 can comfortably run 70B-class INT4 models, albeit with reduced generation speed compared to smaller models.
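One way to read the drop from ~110 to ~24 tokens/s is a back-of-envelope weight-streaming estimate. The sketch assumes, as a simplification not stated in this report, that a dense model reads all of its weights once per generated token, so model size times eval rate approximates the weight traffic per second. GPT-OSS 20B is left out because a mixture-of-experts model does not read every weight for each token. The function name is hypothetical.

```python
# (model, size in GB, eval rate in tokens/s) from the benchmark results
dense_models = [
    ("deepseek-r1 14b", 9, 110.01),
    ("deepseek-r1 32b", 20, 53.53),
    ("deepseek-r1 70b", 43, 26.07),
    ("gemma3 27b", 17, 51.10),
    ("llama3.3 70b", 43, 26.05),
    ("qwen3 32b", 20, 47.58),
    ("qwen2.5 72b", 47, 23.72),
]


def weight_traffic_gb_s(size_gb: float, tokens_per_s: float) -> float:
    """Approximate GB of weights read per second, assuming one full pass per token."""
    return size_gb * tokens_per_s


for name, size_gb, rate in dense_models:
    print(f"{name:16s} {weight_traffic_gb_s(size_gb, rate):7.0f} GB/s")
```

Under this assumption the products fall in a fairly narrow band, roughly 870 to 1,120 GB/s, while eval rate varies by more than 4x. That is consistent with generation speed being roughly inversely related to model size for the dense models tested.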


End-to-End Inference Duration

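Total inference duration depends on both generation speed and the number of tokens each model chooses to generate, so it does not increase strictly with model size:

  • GPT-OSS 20B completes a full inference cycle in just over 3 seconds; DeepSeek-R1 14B takes ~6 seconds.
  • Mid-size models (27B–32B) span the widest range: DeepSeek-R1 32B finishes in ~8 seconds with 366 tokens, while Qwen 3 32B (1,656 tokens) and Gemma 3 27B (1,477 tokens) take ~43 and ~51 seconds.
  • Large models (70B+) take 16–26 seconds for 386–581 tokens.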

Despite longer runtimes, inference remains stable, with no evidence of throttling or resource starvation.
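To see how much of each end-to-end run is spent generating tokens, the reported eval duration can be divided by the total duration. This is a small illustrative calculation on the benchmark figures; the name `generation_share` is hypothetical.

```python
# (model, total duration s, eval duration s, eval count) from the benchmark results
runs = [
    ("gpt-oss 20b", 3.32, 2.07, 364),
    ("deepseek-r1 14b", 5.98, 4.47, 492),
    ("deepseek-r1 32b", 8.29, 6.83, 366),
    ("deepseek-r1 70b", 16.34, 14.80, 386),
    ("gemma3 27b", 51.18, 28.90, 1477),
    ("llama3.3 70b", 24.16, 22.30, 581),
    ("qwen3 32b", 42.87, 34.80, 1656),
    ("qwen2.5 72b", 25.50, 23.52, 558),
]


def generation_share(total_s: float, eval_s: float) -> float:
    """Fraction of the end-to-end time spent generating tokens."""
    return eval_s / total_s


# Sort by total duration to show that the longest runs have the longest outputs.
for name, total_s, eval_s, tokens in sorted(runs, key=lambda r: r[1]):
    share = generation_share(total_s, eval_s)
    print(f"{name:16s} {total_s:6.2f} s  {tokens:5d} tokens  {share:5.0%} in generation")
```

Sorting by total duration puts Qwen 3 32B and Gemma 3 27B last, the two runs with the longest outputs. The generation share ranges from about 56% (Gemma 3 27B) to about 92% (LLaMA 3.3 70B and Qwen 2.5 72B); the reported metrics do not break down the remainder beyond load and prompt evaluation time.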


Key Findings

  • RTX Pro 5000 Blackwell is capable of running up to 70B-class models in INT4 quantization on a single GPU.
  • GPU utilization remains consistently high, indicating efficient hardware usage under Ollama.
  • Quantization plays a critical role in enabling large-model inference within practical latency bounds.
  • GPT-OSS 20B offers the best overall performance, combining high token throughput with low latency.
  • Larger reasoning-focused models trade speed for capability but remain operationally viable.

Conclusion

The NVIDIA RTX Pro 5000 Blackwell Server GPU demonstrates strong and reliable inference performance when paired with Ollama, particularly for quantized large language models. It strikes a practical balance between compute capability, memory capacity, and efficiency, making it suitable for private inference, development environments, and small-scale production deployments.

For users prioritizing speed and responsiveness, 20B–32B INT4 models represent an optimal choice. For advanced reasoning tasks, 70B-class models are fully usable, provided that lower token generation speed is acceptable.

Overall, RTX Pro 5000 positions itself as a capable single-GPU inference solution for modern LLM workloads in Ollama-based environments.
