RTX 5090 Ollama Benchmark: A New Consumer GPU with Extreme Performance

If you're searching for a powerful, cost-effective GPU for running 32B large language models (LLMs) on Ollama, look no further than the Nvidia RTX 5090. With its cutting-edge architecture, the RTX 5090 delivers performance on par with the Nvidia H100—yet at a much more affordable price. In this benchmark report, we’ll show why the RTX 5090 is the best single-GPU option for 32B LLM inference, and how it compares against other GPUs like the A100, H100, and RTX 4090.

Test Overview

Server Configs:

  • CPU: Dual Intel Xeon Gold 6148
  • RAM: 256GB
  • Storage: 240GB SSD + 2TB NVMe + 8TB SATA
  • Network: 1Gbps
  • OS: Ubuntu 22.04

Single RTX 5090 GPU Details:

  • GPU: Nvidia GeForce RTX 5090
  • Microarchitecture: Blackwell
  • Compute Capability: 12.0
  • CUDA Cores: 21,760
  • Tensor Cores: 680
  • Memory: 32 GB GDDR7
  • FP32 Performance: 104.8 TFLOPS

Framework:

  • Ollama 0.6.5

This configuration makes it an ideal RTX 5090 hosting solution for deep learning, LLM inference, and AI model training.
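If you want to reproduce numbers like the ones below, eval rates can be collected programmatically through Ollama's REST API. The following is a minimal sketch, assuming a local Ollama instance listening on the default port 11434; the prompt and model list are illustrative choices, not part of the original test script:

```python
import requests  # third-party: pip install requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def eval_rate(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return the eval rate in tokens/s."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # The final response includes eval_count (generated tokens)
    # and eval_duration (nanoseconds).
    return data["eval_count"] / data["eval_duration"] * 1e9

if __name__ == "__main__":
    for model in ("deepseek-r1:32b", "qwen2.5:32b", "gemma3:27b"):
        rate = eval_rate(model, "Explain quantization in one paragraph.")
        print(f"{model}: {rate:.2f} tokens/s")
```

The same metric is also printed as "eval rate" when you run `ollama run <model> --verbose` from the command line.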

Ollama Benchmark Results on the Nvidia RTX 5090

| Models | gemma3 | gemma3 | llama3.1 | deepseek-r1 | deepseek-r1 | qwen2.5 | qwen2.5 | qwq |
|---|---|---|---|---|---|---|---|---|
| Parameters | 12b | 27b | 8b | 14b | 32b | 14b | 32b | 32b |
| Size (GB) | 8.1 | 17 | 4.9 | 9.0 | 20 | 9.0 | 20 | 20 |
| Quantization | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit |
| Running on | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 |
| Download Speed (MB/s) | 113 | 113 | 113 | 113 | 113 | 113 | 113 | 113 |
| CPU Rate | 6.9% | 7.0% | 0.2% | 1.0% | 1.7% | 1.5% | 1.4% | 1.4% |
| RAM Rate | 2.8% | 3.4% | 3.5% | 3.7% | 3.6% | 3.6% | 3.6% | 3.1% |
| GPU Memory | 32.8% | 82% | 82% | 66.3% | 95% | 66.5% | 95% | 94% |
| GPU Utilization | 53% | 66% | 15% | 65% | 75% | 68% | 80% | 88% |
| Eval Rate (tokens/s) | 70.37 | 47.33 | 149.95 | 89.13 | 45.51 | 89.93 | 45.07 | 57.17 |
(Figure: real-time RTX 5090 GPU server resource consumption recorded during testing.)
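To capture comparable resource data yourself, a simple poller around nvidia-smi is enough. The sketch below uses standard nvidia-smi query fields and assumes a single GPU (nvidia-smi prints one line per GPU); the one-second interval, duration, and output filename are arbitrary choices:

```python
import csv
import subprocess
import time

# Standard nvidia-smi query fields.
QUERY = "timestamp,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw"

def log_gpu(interval_s: float = 1.0, duration_s: float = 60.0,
            path: str = "gpu_log.csv") -> None:
    """Poll nvidia-smi once per interval and write one CSV row per reading."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(QUERY.split(","))
        end = time.time() + duration_s
        while time.time() < end:
            out = subprocess.run(
                ["nvidia-smi", f"--query-gpu={QUERY}",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
            writer.writerow([v.strip() for v in out.split(",")])
            time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu()
```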

Analysis & Insights

1. Next-Generation Extreme Performance

The RTX 5090 posts the fastest single-GPU evaluation speed for 32B models in this test, at a far lower cost than H100 or A100 configurations. It handles the gemma3, qwen2.5, deepseek-r1, and llama3.1 families efficiently, making it an excellent fit for Ollama-based LLM inference setups.

2. 32GB VRAM Limitation

While the RTX 5090 excels at running 32B models, it cannot run 70B or 110B models in single-GPU mode due to its 32GB memory cap; you'll need 2× RTX 5090s to run models like llama3:70b, as the rough sizing sketch below shows.
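As a back-of-envelope check: 4-bit weights take about half a gigabyte per billion parameters, plus headroom for the KV cache and runtime buffers. The sketch below uses an assumed 20% overhead factor, which is a crude rule of thumb rather than an exact figure:

```python
def vram_estimate_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Approximate VRAM for quantized weights plus ~20% for KV cache/buffers."""
    weights_gb = params_b * bits / 8  # billions of params * bytes per param
    return weights_gb * overhead

for p in (14, 32, 70):
    print(f"{p}B @ 4-bit ≈ {vram_estimate_gb(p):.1f} GB")
# 14B ≈ 8.4 GB and 32B ≈ 19.2 GB fit in 32 GB of VRAM; 70B ≈ 42.0 GB does not.
```

The 32B estimate lines up with the ~20 GB model sizes in the benchmark table above.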

3. Pay Attention to Cooling for Longer Hardware Life

Keep an eye on GPU temperatures: during testing, some models pushed the GPU beyond 80°C. For long sessions, make sure case airflow and fan speeds are adequate; the logging sketch above records temperature alongside utilization.

RTX 5090 vs. H100 vs. A100 vs. RTX 4090 vs. A6000 for 32B LLMs on Ollama

When comparing the performance of the deepseek-r1:32b model on Ollama across 5 high-end GPU configurations, the results may surprise you:
| GPU | Nvidia RTX 5090 | Nvidia H100 | Nvidia A100 40GB | Nvidia RTX 4090 | Nvidia RTX A6000 |
|---|---|---|---|---|---|
| Model | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b |
| Eval Rate (tokens/s) | 45.51 | 45.36 | 35.01 | 34.22 | 26.23 |
The RTX 5090 outperforms the A100 and even slightly edges out the H100 in single-LLM evaluation speed for 32B models—all while being significantly cheaper.

RTX 5090 GPU Hosting for 32B LLMs

Our dedicated RTX 5090 GPU server is optimized for LLM inference, fine-tuning, and deep learning workloads. With 32GB of VRAM, a single card can efficiently handle Ollama models up to roughly 40B parameters at 4-bit quantization (see the sizing sketch above).

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.

Enterprise GPU Dedicated Server - RTX A6000

$329.00/mo
40% OFF Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimal for AI, deep learning, data visualization, HPC, etc.

Enterprise GPU Dedicated Server - A100

$469.00/mo
41% OFF Recurring (Was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Multi-GPU Dedicated Server - 2×RTX 5090

$999.00/mo
  • 256GB RAM
  • Dual Gold 6148
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x GeForce RTX 5090
  • Microarchitecture: Blackwell
  • CUDA Cores: 21,760 (per GPU)
  • Tensor Cores: 680 (per GPU)
  • GPU Memory: 32 GB GDDR7 (per GPU)
  • FP32 Performance: 104.8 TFLOPS (per GPU)

Conclusion: The RTX 5090 Is the Best Single GPU for Ollama LLMs Under 40B

The RTX 5090 is best suited for LLMs up to 32B, such as deepseek-r1, qwen2.5, gemma3, and llama3.1. 70B-class models can be served by splitting them across dual cards. Choose the RTX 5090 for the highest Ollama performance at a competitive price.

Tags:

Nvidia RTX 5090 Hosting, RTX 5090 Ollama benchmark, RTX 5090 for 32B LLMs, best GPU for 32B inference, ollama RTX 5090, single-GPU LLM hosting, cheap GPU for LLMs, H100 vs RTX 5090, A100 vs RTX 5090, RTX 5090 LLM inference