Test Overview
Server Configuration:
- CPU: Dual Intel Xeon Gold 6148
- RAM: 256 GB
- Storage: 240GB SSD + 2TB NVMe + 8TB SATA
- Network: 1Gbps
- OS: Ubuntu 22.04
Single RTX 5090 Details:
- GPU: Nvidia GeForce RTX 5090
- Microarchitecture: Blackwell
- Compute Capability: 12.0
- CUDA Cores: 21,760
- Tensor Cores: 680
- Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Framework:
- Ollama 0.6.5
This configuration makes it an ideal RTX 5090 hosting solution for deep learning, LLM inference, and AI model training.
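To reproduce a single run, the eval rate reported below can be measured directly from Ollama's REST API, which returns the number of generated tokens (`eval_count`) and the generation time in nanoseconds (`eval_duration`) for each request. The following is only a minimal Python sketch; the localhost endpoint, model name, and prompt are illustrative assumptions rather than the exact test harness used for these benchmarks.

```python
# Minimal sketch of one benchmark run against a local Ollama 0.6.5 instance.
# The endpoint and response fields are Ollama's /api/generate API; the model
# name and prompt below are examples, not the exact ones used in this test.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port (assumed)

def eval_rate(model: str, prompt: str) -> float:
    """Return generation speed in tokens/s for one non-streamed request."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    speed = eval_rate("deepseek-r1:32b", "Explain the KV cache in one paragraph.")
    print(f"deepseek-r1:32b -> {speed:.2f} tokens/s")
```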
Ollama Benchmark Results on the Nvidia RTX 5090
Models | gemma3 | gemma3 | llama3.1 | deepseek-r1 | deepseek-r1 | qwen2.5 | qwen2.5 | qwq |
---|---|---|---|---|---|---|---|---|
Parameters | 12b | 27b | 8b | 14b | 32b | 14b | 32b | 32b |
Size (GB) | 8.1 | 17 | 4.9 | 9.0 | 20 | 9.0 | 20 | 20 |
Quantization | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit |
Running on | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 |
Download Speed (MB/s) | 113 | 113 | 113 | 113 | 113 | 113 | 113 | 113 |
CPU Utilization | 6.9% | 7.0% | 0.2% | 1.0% | 1.7% | 1.5% | 1.4% | 1.4% |
RAM Utilization | 2.8% | 3.4% | 3.5% | 3.7% | 3.6% | 3.6% | 3.6% | 3.1% |
GPU Memory Utilization | 32.8% | 82% | 82% | 66.3% | 95% | 66.5% | 95% | 94% |
GPU Utilization | 53% | 66% | 15% | 65% | 75% | 68% | 80% | 88% |
Eval Rate (tokens/s) | 70.37 | 47.33 | 149.95 | 89.13 | 45.51 | 89.93 | 45.07 | 57.17 |
Real-time RTX 5090 GPU server resource consumption was recorded during each run; the utilization figures in the table above reflect those measurements, and a minimal monitoring sketch is shown below.
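The per-run figures for GPU utilization, VRAM usage, and temperature can be logged programmatically. The sketch below uses the nvidia-ml-py (pynvml) NVML bindings as one possible way to sample the GPU once per second while a model is generating; this is an assumed tooling choice, not necessarily the monitoring stack used for this test.

```python
# Minimal sketch (assumed tooling): sample GPU utilization, VRAM usage, and
# temperature once per second via NVML while an Ollama model is generating.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single RTX 5090 -> GPU index 0

try:
    for _ in range(30):  # sample once per second for 30 s
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {util.gpu:3d}% | VRAM {mem.used / mem.total:6.1%} | {temp} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```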
Analysis & Insights
1. Next-Generation Extreme Performance
The RTX 5090 delivers the fastest single-GPU evaluation speed for 32B models in this comparison, is more economical than H100 or A100 configurations, and efficiently handles gemma3, qwen2.5, deepseek-r1, and llama3.1 models, making it very well suited to Ollama-based LLM inference setups.
2. 32GB VRAM Limitation
While the RTX 5090 excels at running 32B models, it cannot run 70B or 110B models in single-GPU mode due to its 32 GB memory cap (a rough estimate follows below). You’ll need 2× RTX 5090s to run models like llama3:70b.
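As a rough back-of-the-envelope check (assuming ~0.5 bytes per parameter for 4-bit weights and ignoring KV cache and runtime overhead, which only add to the total), the weights of a 70B model already exceed a single card's 32 GB on their own:

```python
# Rough VRAM estimate for 4-bit quantized weights (assumption: ~0.5 bytes per
# parameter; KV cache, activations, and runtime overhead come on top of this).
def weight_gb(params_billion: float, bytes_per_param: float = 0.5) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9

print(f"70B @ 4-bit: ~{weight_gb(70):.0f} GB of weights alone (> 32 GB VRAM)")  # ~35 GB
print(f"32B @ 4-bit: ~{weight_gb(32):.0f} GB of weights (fits in 32 GB VRAM)")  # ~16 GB
```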
3. Pay Attention to Cooling for a Longer Lifespan
Keep an eye on GPU temperatures. During testing, models may push the GPU beyond 80°C, so make sure cooling fans are running when needed for long sessions.
RTX 5090 vs. H100 vs. A100 vs. RTX 4090 vs. A6000 for 32B LLMs on Ollama
When comparing the performance of the deepseek-r1:32b model on Ollama across 5 high-end GPU configurations, the results may surprise you:
GPU | Nvidia RTX 5090 | Nvidia H100 | Nvidia A100 40GB | Nvidia RTX 4090 | Nvidia RTX A6000 |
---|---|---|---|---|---|
Model | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b | deepseek-r1:32b |
Eval Rate (tokens/s) | 45.51 | 45.36 | 35.01 | 34.22 | 26.23 |
The RTX 5090 outperforms the A100 and even slightly edges out the H100 in single-GPU evaluation speed for this 32B model, all while being significantly cheaper.
RTX 5090 GPU Hosting for 32B LLMs
Our dedicated RTX 5090 GPU server is optimized for LLM inference, fine-tuning, and deep learning workloads. With 32 GB of VRAM, it can efficiently handle 4-bit quantized Ollama models up to roughly 40B parameters.
Enterprise GPU Dedicated Server - RTX 4090
$ 409.00/mo
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: GeForce RTX 4090
- Microarchitecture: Ada Lovelace
- CUDA Cores: 16,384
- Tensor Cores: 512
- GPU Memory: 24 GB GDDR6X
- FP32 Performance: 82.6 TFLOPS
- Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.
Flash Sale to May 27
Enterprise GPU Dedicated Server - RTX A6000
$ 329.00/mo
40% OFF Recurring (Was $549.00)
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
- Optimized for AI, deep learning, data visualization, HPC, and more.
Flash Sale to May 27
Enterprise GPU Dedicated Server - A100
$ 469.00/mo
41% OFF Recurring (Was $799.00)
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation for large-scale inference, AI training, and ML workloads.
New Arrival
Multi-GPU Dedicated Server - 2x RTX 5090
$ 999.00/mo
- 256GB RAM
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 5090
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Conclusion: The RTX 5090 Is the Best Single GPU for Ollama LLMs Under 40B
The RTX 5090 is best suited for LLMs up to 32B, such as deepseek-r1, qwen2.5, gemma3, and llama3.1. Models of 70B and above can be run for inference using dual cards. Choose the RTX 5090 to get the highest Ollama performance at a low price.
Tags:
Nvidia RTX 5090 Hosting, RTX 5090 Ollama benchmark, RTX 5090 for 32B LLMs, best GPU for 32B inference, ollama RTX 5090, single-GPU LLM hosting, cheap GPU for LLMs, H100 vs RTX 5090, A100 vs RTX 5090, RTX 5090 LLM inference