Test Overview
Server Configuration:
- CPU: Dual Intel Xeon Gold 6148
- RAM: 256 GB
- Storage: 240GB SSD + 2TB NVMe + 8TB SATA
- Network: 1Gbps
- OS: Ubuntu 22.04
Single RTX 5090 Details:
- GPU: Nvidia GeForce RTX 5090
- Microarchitecture: Blackwell
- Compute Capability: 12.0
- CUDA Cores: 21,760
- Tensor Cores: 680
- Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Framework:
- Ollama 0.6.5
This configuration makes it an ideal RTX 5090 hosting solution for deep learning, LLM inference, and AI model training.
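To reproduce a single run, the eval rate reported below can be measured directly from Ollama's REST API, which returns the number of generated tokens (`eval_count`) and the generation time in nanoseconds (`eval_duration`) for each request. The following is only a minimal Python sketch; the localhost endpoint, model name, and prompt are illustrative assumptions rather than the exact test harness used for these benchmarks.

```python
# Minimal sketch of one benchmark run against a local Ollama 0.6.5 instance.
# The endpoint and response fields are Ollama's /api/generate API; the model
# name and prompt below are examples, not the exact ones used in this test.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port (assumed)

def eval_rate(model: str, prompt: str) -> float:
    """Return generation speed in tokens/s for one non-streamed request."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    speed = eval_rate("deepseek-r1:32b", "Explain the KV cache in one paragraph.")
    print(f"deepseek-r1:32b -> {speed:.2f} tokens/s")
```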
Ollama Benchmark Results on the Nvidia RTX 5090
Models | gemma3 | gemma3 | llama3.1 | deepseek-r1 | deepseek-r1 | qwen2.5 | qwen2.5 | qwq |
---|---|---|---|---|---|---|---|---|
Parameters | 12b | 27b | 8b | 14b | 32b | 14b | 32b | 32b |
Size (GB) | 8.1 | 17 | 4.9 | 9.0 | 20 | 9.0 | 20 | 20 |
Quantization | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit |
Running on | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 |
Download Speed (MB/s) | 113 | 113 | 113 | 113 | 113 | 113 | 113 | 113 |
CPU Utilization | 6.9% | 7.0% | 0.2% | 1.0% | 1.7% | 1.5% | 1.4% | 1.4% |
RAM Utilization | 2.8% | 3.4% | 3.5% | 3.7% | 3.6% | 3.6% | 3.6% | 3.1% |
GPU Memory Utilization | 32.8% | 82% | 82% | 66.3% | 95% | 66.5% | 95% | 94% |
GPU Utilization | 53% | 66% | 15% | 65% | 75% | 68% | 80% | 88% |
Eval Rate (tokens/s) | 70.37 | 47.33 | 149.95 | 89.13 | 45.51 | 89.93 | 45.07 | 57.17 |
Real-time RTX 5090 GPU server resource consumption was recorded during each run; the utilization figures in the table above reflect those measurements, and a minimal monitoring sketch is shown below.
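The per-run figures for GPU utilization, VRAM usage, and temperature can be logged programmatically. The sketch below uses the nvidia-ml-py (pynvml) NVML bindings as one possible way to sample the GPU once per second while a model is generating; this is an assumed tooling choice, not necessarily the monitoring stack used for this test.

```python
# Minimal sketch (assumed tooling): sample GPU utilization, VRAM usage, and
# temperature once per second via NVML while an Ollama model is generating.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single RTX 5090 -> GPU index 0

try:
    for _ in range(30):  # sample once per second for 30 s
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {util.gpu:3d}% | VRAM {mem.used / mem.total:6.1%} | {temp} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```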
Analysis & Insights
1. Next-Generation Extreme Performance
The RTX 5090 delivers the fastest single-GPU evaluation speed for 32B models in this comparison, is more economical than H100 or A100 configurations, and efficiently handles gemma3, qwen2.5, deepseek-r1, and llama3.1 models, making it very well suited to Ollama-based LLM inference setups.
2. 32GB VRAM Limitation
While the RTX 5090 excels at running 32B models, it cannot run 70B or 110B models in single-GPU mode due to its 32 GB memory cap (a rough estimate follows below). You’ll need 2× RTX 5090s to run models like llama3:70b.
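As a rough back-of-the-envelope check (assuming ~0.5 bytes per parameter for 4-bit weights and ignoring KV cache and runtime overhead, which only add to the total), the weights of a 70B model already exceed a single card's 32 GB on their own:

```python
# Rough VRAM estimate for 4-bit quantized weights (assumption: ~0.5 bytes per
# parameter; KV cache, activations, and runtime overhead come on top of this).
def weight_gb(params_billion: float, bytes_per_param: float = 0.5) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9

print(f"70B @ 4-bit: ~{weight_gb(70):.0f} GB of weights alone (> 32 GB VRAM)")  # ~35 GB
print(f"32B @ 4-bit: ~{weight_gb(32):.0f} GB of weights (fits in 32 GB VRAM)")  # ~16 GB
```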
3. Pay Attention to Cooling for a Longer Lifespan
Keep an eye on GPU temperatures. During testing, models may push the GPU beyond 80°C, so make sure cooling fans are running when needed for long sessions.
RTX 5090 vs. H100 vs. A100 vs. RTX 4090 vs. A6000 for 32B LLMs on Ollama
When comparing the performance of the deepseek-r1:32b model on Ollama across 5 high-end GPU configurations, the results may surprise you:
GPU | Nvidia RTX 5090 | Nvidia H100 | Nvidia A100 40GB | Nvidia RTX 4090 | Nvidia RTX A6000 |
---|---|---|---|---|---|
Model | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b |
Eval Rate (tokens/s) | 45.51 | 45.36 | 35.01 | 34.22 | 26.23 |
The RTX 5090 outperforms the A100 and even slightly edges out the H100 in single-GPU evaluation speed for this 32B model, all while being significantly cheaper.
RTX 5090 GPU Hosting for 32B LLMs
Our dedicated RTX 5090 GPU server is optimized for LLM inference, fine-tuning, and deep learning workloads. With 32 GB of VRAM, it can efficiently handle 4-bit quantized Ollama models up to roughly 40B parameters.
Enterprise GPU Dedicated Server - RTX 4090
$ 409.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: GeForce RTX 4090
- Microarchitecture: Ada Lovelace
- CUDA Cores: 16,384
- Tensor Cores: 512
- GPU Memory: 24 GB GDDR6X
- FP32 Performance: 82.6 TFLOPS
- Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.
Flash Sale to May 27
Enterprise GPU Dedicated Server - RTX A6000
$ 329.00/mo
40% OFF Recurring (Was $549.00)
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
- Optimized for AI, deep learning, data visualization, HPC, and more.
Flash Sale to May 27
Enterprise GPU Dedicated Server - A100
$ 469.00/mo
41% OFF Recurring (Was $799.00)
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation for large-scale inference, AI training, and ML workloads.
New Arrival
Multi-GPU Dedicated Server - 2x RTX 5090
$ 999.00/mo
- 256GB RAM
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 5090
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Conclusion: The RTX 5090 Is the Best Single GPU for Ollama LLMs Under 40B
The RTX 5090 is best suited for LLMs up to 32B, such as deepseek-r1, qwen2.5, gemma3, and llama3.1. Models of 70B and above can be run for inference using dual cards. Choose the RTX 5090 to get the highest Ollama performance at a low price.
Tags:
Nvidia RTX 5090 Hosting, RTX 5090 Ollama benchmark, RTX 5090 for 32B LLMs, best GPU for 32B inference, ollama RTX 5090, single-GPU LLM hosting, cheap GPU for LLMs, H100 vs RTX 5090, A100 vs RTX 5090, RTX 5090 LLM inference