Ollama Benchmark of LLM Inference on NVIDIA Pro 6000

With the arrival of NVIDIA’s Blackwell architecture in professional GPUs, the RTX Pro 6000 has become a compelling option for on-premise and hosted AI inference. It combines large VRAM capacity, strong compute throughput, and server-grade stability, making it suitable for running modern large language models on a single GPU.

This article presents a comparative inference benchmark of several widely used open-source large language models running on Ollama 0.13.5, tested on a single NVIDIA RTX Pro 6000 Blackwell server. The goal of this evaluation is not to compare model intelligence or reasoning depth, but rather to observe real-world inference efficiency, especially output speed, under consistent conditions.

Test Overview

The test environment was standardized across all models to ensure fairness and reproducibility.

GPU: NVIDIA RTX Pro 6000 (Blackwell architecture)

  • Microarchitecture: Blackwell
  • Compute capability: 12.0
  • CUDA Cores: 24,064
  • Tensor Cores: 752
  • GPU Memory: 96GB GDDR7

Backend Environment

  • Inference framework: Ollama 0.13.5
  • Quantization: 4-bit
  • Execution mode: Single-model, single-GPU
  • Prompt: Identical for all models

Although the prompt itself is simple, the length, structure, and detail of the generated answers vary significantly between models, which naturally affects token counts and generation time. This variation is considered part of the model’s real inference behavior rather than noise.

Input Settings

All models were asked the same simple question:

“What is GPU?”

This intentionally lightweight prompt helps isolate inference characteristics while avoiding complex multi-step reasoning that could obscure raw performance differences.
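A run like this can be reproduced against a local Ollama server. The sketch below is only an illustration: the article does not say how the prompt was submitted, so the use of the HTTP API, the default endpoint `http://localhost:11434/api/generate`, and the helper names `build_payload` and `run_once` are all assumptions.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"
# The prompt used for every model in this benchmark.
PROMPT = "What is GPU?"


def build_payload(model: str, prompt: str = PROMPT) -> dict:
    """Build a non-streaming request so the timing fields arrive in one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}


def run_once(model: str, url: str = OLLAMA_URL) -> dict:
    """Send the prompt to the server and return the parsed JSON response."""
    data = json.dumps(build_payload(model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With a server running and the model already pulled, `run_once("gpt-oss:20b")` would return a dictionary that includes the generated text along with the token counts and durations discussed later in this article.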

Models Included in the Benchmark

The following models were tested:

  • GPT-OSS (20B, 120B)
  • DeepSeek-R1 (32B, 70B)
  • Gemma 3 (27B)
  • LLaMA 3.3 (70B)
  • Qwen 3 (32B)
  • Qwen 2.5 (72B)

All models were loaded via Ollama using their official 4-bit quantized variants.

Pro 6000 Ollama Benchmark Data Display

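| Metric | gpt-oss | gpt-oss | deepseek-r1 | deepseek-r1 | gemma3 | llama3.3 | qwen3 | qwen2.5 |
|---|---|---|---|---|---|---|---|---|
| Parameters | 20b | 120b | 32b | 70b | 27b | 70b | 32b | 72b |
| Size (GB) | 14 | 65 | 20 | 43 | 17 | 43 | 20 | 47 |
| Quantization (bits) | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Running on | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 | Ollama 0.13.5 |
| Download speed (MB/s) | 120 | 120 | 120 | 120 | 120 | 120 | 120 | 120 |
| CPU utilization | 3% | 6.5% | 3% | 3.2% | 5.5% | 3.3% | 4.9% | 3.3% |
| RAM utilization | 7% | 9.5% | 10% | 3.4% | 5.7% | 3.2% | 4.1% | 3.6% |
| GPU utilization | 65% | 60% | 87% | 94% | 83% | 94% | 90% | 93% |
| GPU memory | 33% | 77% | 98% | 41% | 18% | 41% | 20% | 45% |
| Total duration (s) | 23.618 | 24.111 | 18.40 | 15.33 | 34.728 | 19.49 | 31.05 | 13.29 |
| Load duration (ms) | 373.35 | 377.14 | 194.26 | 254.36 | 434.61 | 251.80 | 219.52 | 218.40 |
| Prompt eval count (tokens) | 70 | 70 | 6 | 6 | 12 | 13 | 13 | 32 |
| Prompt eval duration (ms) | 14.92 | 134.92 | 14.215 | 57.67 | 13.33 | 56.72 | 4.24 | 66.61 |
| Prompt eval rate (tokens/s) | 4,690 | 518.83 | 420 | 104.04 | 900 | 229.17 | 3,060 | 480.40 |
| Eval count (tokens) | 1246 | 2729 | 2234 | 541 | 1188 | 579 | 1436 | 356 |
| Eval duration (s) | 6.73 | 20.32 | 33.46 | 14.168 | 19.32 | 18.11 | 25.66 | 12.21 |
| Eval rate (tokens/s) | 185.09 | 134.28 | 64.31 | 32.04 | 61.49 | 31.96 | 55.96 | 29.15 |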
[Figure: real-time resource consumption recorded on the Pro 6000 GPU server]

Why Eval Rate (tokens/s) Matters Most

Ollama reports multiple performance metrics, including:

  • Prompt evaluation time and rate (prefill stage)
  • Model load time
  • Evaluation duration
  • Eval rate (tokens per second) during generation

For practical AI applications such as chatbots, APIs, internal assistants, or agent-based systems, the generation phase dominates the user experience. Once the prompt is processed, the perceived responsiveness is almost entirely determined by how quickly tokens are produced.

For this reason, eval rate (tokens/s) is the most meaningful indicator in this benchmark and serves as the primary comparison metric in the analysis below.
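The rates in the table follow directly from the token counts and durations. As a minimal sketch, assuming the field names of Ollama's generate response (`eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`, `load_duration`, with durations in nanoseconds), they can be derived as follows. The helper names and the sample dictionary are illustrative; the sample values are the GPT-OSS 20B column converted to nanoseconds.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(token_count: int, duration_ns: int) -> float:
    """Tokens divided by elapsed seconds, with the duration given in nanoseconds."""
    if duration_ns <= 0:
        return 0.0
    return token_count / (duration_ns / NANOSECONDS_PER_SECOND)


def summarize(response: dict) -> dict:
    """Derive load time, prefill rate, and generation rate from one response."""
    return {
        "load_ms": response["load_duration"] / 1_000_000,
        "prompt_eval_rate": tokens_per_second(
            response["prompt_eval_count"], response["prompt_eval_duration"]
        ),
        "eval_rate": tokens_per_second(
            response["eval_count"], response["eval_duration"]
        ),
    }


# GPT-OSS 20B figures from the table, expressed in nanoseconds.
sample = {
    "load_duration": 373_350_000,
    "prompt_eval_count": 70,
    "prompt_eval_duration": 14_920_000,
    "eval_count": 1246,
    "eval_duration": 6_730_000_000,
}
stats = summarize(sample)
print(f"prefill: {stats['prompt_eval_rate']:.0f} tokens/s")
print(f"generation: {stats['eval_rate']:.1f} tokens/s")
```

Because the table rounds the durations, the recomputed generation rate lands close to, but not exactly on, the reported 185.09 tokens/s.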

Eval Rate Overview (tokens/s)

  • GPT-OSS 20B: ~185 tokens/s
  • GPT-OSS 120B: ~134 tokens/s
  • DeepSeek-R1 32B: ~64 tokens/s
  • Gemma 3 27B: ~61 tokens/s
  • Qwen 3 32B: ~56 tokens/s
  • DeepSeek-R1 70B: ~32 tokens/s
  • LLaMA 3.3 70B: ~32 tokens/s
  • Qwen 2.5 72B: ~29 tokens/s

These results show that parameter count alone does not determine inference speed. GPT-OSS 120B generates tokens roughly four times faster than the 70B-class models, and GPT-OSS 20B is about three times faster than the 27B to 32B models.
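To make the gaps concrete, the short sketch below ranks the measured eval rates and converts them into relative speedups and response times. The rates are copied from the benchmark table; the 500-token response length is a hypothetical figure chosen only for illustration, and the function names are not part of any tool used in the benchmark.

```python
# Eval rates in tokens/s, copied from the benchmark table.
EVAL_RATES = {
    "GPT-OSS 20B": 185.09,
    "GPT-OSS 120B": 134.28,
    "DeepSeek-R1 32B": 64.31,
    "DeepSeek-R1 70B": 32.04,
    "Gemma 3 27B": 61.49,
    "LLaMA 3.3 70B": 31.96,
    "Qwen 3 32B": 55.96,
    "Qwen 2.5 72B": 29.15,
}


def speedup(model: str, baseline: str) -> float:
    """How many times faster `model` generates tokens than `baseline`."""
    return EVAL_RATES[model] / EVAL_RATES[baseline]


def seconds_for(model: str, tokens: int) -> float:
    """Generation time for a response of `tokens` tokens, ignoring load and prefill."""
    return tokens / EVAL_RATES[model]


# Fastest first.
ranked = sorted(EVAL_RATES.items(), key=lambda item: item[1], reverse=True)

for name, rate in ranked:
    # 500 tokens is a hypothetical response length, not a measured value.
    print(f"{name:16s} {rate:7.2f} tok/s  {seconds_for(name, 500):5.1f} s per 500 tokens")

print(f"GPT-OSS 120B vs LLaMA 3.3 70B: {speedup('GPT-OSS 120B', 'LLaMA 3.3 70B'):.1f}x")
```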

Observations and Analysis

1. GPT-OSS Shows Exceptional Inference Efficiency

The GPT-OSS models stand out in this benchmark. Even the 120B variant maintains a high eval rate, significantly outperforming many smaller or similarly sized models.

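This behavior likely results from a combination of factors, including model architecture, quantization friendliness, and how well the model aligns with Ollama’s underlying inference backend. In particular, GPT-OSS is a mixture-of-experts design that activates only a few billion parameters per token (about 3.6B for the 20B model and 5.1B for the 120B model), so each generated token touches far fewer weights than in a dense model of the same total size. From an infrastructure perspective, GPT-OSS demonstrates that very large models can still deliver excellent throughput when properly optimized.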

This makes GPT-OSS particularly suitable for high-concurrency inference services, interactive chat applications, and cost-sensitive deployments where output speed is a priority.

2. DeepSeek-R1 Prioritizes Computation Over Throughput

DeepSeek-R1 exhibits very high GPU utilization, reaching close to full usage on the RTX Pro 6000. However, its eval rate remains relatively modest, especially in the 70B configuration.

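This suggests that each generated token requires more computation. The 32B and 70B variants distributed through Ollama are dense distilled models built on Qwen 2.5 and Llama bases, so every token requires a pass through all of the weights; DeepSeek-R1 70B and LLaMA 3.3 70B accordingly post nearly identical eval rates (32.04 vs. 31.96 tokens/s). In other words, the GPU is busy, but it is busy doing more work per token than GPT-OSS, which activates only part of its weights for each token.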

As a result, DeepSeek-R1 is better suited for reasoning-heavy workloads, research-oriented tasks, or low-concurrency scenarios where output quality is valued more than output speed.

3. LLaMA 3.3 and Qwen Large Models Deliver Stable but Moderate Performance

LLaMA 3.3 70B and Qwen 2.5 72B show consistent and predictable behavior. GPU utilization is high, memory usage is well managed, and inference proceeds steadily, but output speed remains in the lower range compared to GPT-OSS.

These models benefit from mature ecosystems and strong tooling support, which makes them reliable choices for standardized deployments. However, on Blackwell GPUs, their inference performance does not yet fully capitalize on the available hardware capabilities.

4. GPU Memory Is Not the Primary Bottleneck

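An important observation from this test is that GPU memory usage rarely reaches saturation, even for 70B and 120B models under 4-bit quantization: GPT-OSS 120B used 77% of VRAM and the 70B-class models used 41% to 45%. Only the DeepSeek-R1 32B run recorded near-full usage (98%).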

This indicates that on the RTX Pro 6000 Blackwell, single-GPU inference capacity is no longer constrained by VRAM size for most mainstream open-source models. Instead, performance differences are driven primarily by:

  • Model architecture
  • Token-level compute complexity
  • Backend inference efficiency
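The memory headroom can be checked with simple arithmetic on the model sizes from the table. The sketch below assumes the listed sizes are in decimal gigabytes and uses the nominal parameter counts, so the results are rough estimates. It covers the weights only: KV cache and runtime buffers are excluded, which is one reason the measured GPU memory figures differ from these shares.

```python
VRAM_GB = 96  # RTX Pro 6000 memory capacity, from the test overview

# (nominal parameters in billions, model size in GB), copied from the table.
MODELS = {
    "gpt-oss 20b": (20, 14),
    "gpt-oss 120b": (120, 65),
    "deepseek-r1 32b": (32, 20),
    "deepseek-r1 70b": (70, 43),
    "gemma3 27b": (27, 17),
    "llama3.3 70b": (70, 43),
    "qwen3 32b": (32, 20),
    "qwen2.5 72b": (72, 47),
}


def weight_share(size_gb: float, vram_gb: float = VRAM_GB) -> float:
    """Fraction of VRAM occupied by the model weights alone."""
    return size_gb / vram_gb


def bits_per_parameter(params_billion: float, size_gb: float) -> float:
    """Approximate storage cost per parameter, assuming 1 GB = 10^9 bytes."""
    return size_gb * 8 / params_billion


for name, (params_b, size_gb) in MODELS.items():
    print(
        f"{name:16s} weights use {weight_share(size_gb):5.1%} of VRAM, "
        f"about {bits_per_parameter(params_b, size_gb):.1f} bits per parameter"
    )
```

Even the largest model in the test leaves roughly a third of the card free after loading its weights, which is consistent with the observation that VRAM capacity was not the limiting factor here.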

Infrastructure-Level Implications

From an infrastructure and deployment perspective, these results lead to several practical conclusions:

  • Choosing the right model can have a larger impact than upgrading hardware.
  • Blackwell-class GPUs already provide ample headroom for single-card inference.
  • Mid-sized models (20B–32B) often deliver the best balance between speed and capability.
  • Very large models are viable on a single GPU, but efficiency varies widely by implementation.

For AI hosting providers and private deployments, this means that model selection is now a key performance optimization lever, not just GPU selection.

Conclusion: Pro6000 Is The Best GPU for Ollama Inference

This benchmark demonstrates that on a single NVIDIA RTX Pro 6000 Blackwell GPU, modern open-source LLMs deliver widely varying inference performance under identical conditions, from about 29 to 185 tokens per second.

The most important takeaway is that eval rate (tokens per second) is the clearest indicator of real-world responsiveness. High parameter counts do not guarantee better throughput, and some models are significantly more inference-efficient than others.

As GPU architectures continue to advance, the competitive advantage in AI inference will increasingly depend on how well models, inference frameworks, and hardware architectures align, rather than on raw compute alone.

Future testing with multi-user concurrency, longer contexts, and alternative backends such as vLLM or TensorRT-LLM will further clarify how these models scale in production environments.
