Test Overview
Server Configs:
- Price: $999.00/month
- CPU: Dual Gold 6148
- RAM: 256GB
- Storage: 240GB SSD + 2TB NVMe + 8TB SATA
- Network: 1Gbps
- OS: Ubuntu 22.04
Single RTX 5090 GPU Details:
- GPU: Nvidia GeForce RTX 5090
- Microarchitecture: Blackwell
- Compute Capability: 12.0
- CUDA Cores: 21,760
- Tensor Cores: 680
- Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Framework:
- Ollama 0.6.5
This configuration makes it an ideal 2*RTX 5090 hosting solution for deep learning, LLM inference, and AI model training.
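Before running any benchmarks, it helps to confirm that both cards and their full VRAM are visible to the driver. The short Python sketch below does this with the pynvml (nvidia-ml-py) bindings; the package choice and the two-GPU expectation are assumptions about this particular setup, not part of the original test procedure.

```python
"""Sanity check: list the GPUs the driver exposes before benchmarking.
Assumes the nvidia-ml-py (pynvml) package is installed; adjust as needed."""
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print(f"GPUs detected: {count}")  # expect 2 on the 2*RTX 5090 server

for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    name = name.decode() if isinstance(name, bytes) else name  # older pynvml returns bytes
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # mem.total is in bytes; each RTX 5090 should report roughly 32 GB.
    print(f"GPU {i}: {name}, {mem.total / 1024**3:.1f} GiB VRAM")

pynvml.nvmlShutdown()
```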
Ollama Benchmark Results on Nvidia 2*RTX 5090
Models | deepseek-r1 | llama3.3 | qwen2.5 | qwen |
---|---|---|---|---|
Parameters | 70b | 70b | 72b | 110b |
Size (GB) | 43 | 43 | 47 | 63 |
Quantization | 4-bit | 4-bit | 4-bit | 4-bit |
Running on | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 |
Download Speed (MB/s) | 113 | 113 | 113 | 113 |
CPU Utilization | 1.3% | 1.3% | 1.3% | 33-35% |
RAM Utilization | 2.1% | 2.1% | 2.1% | 2.1% |
GPU Memory (2 cards) | 70.9%, 70.4% | 71%, 75% | 77.9%, 77.6% | 94%, 91% |
GPU Utilization (2 cards) | 45%, 48% | 47%, 45% | 45%, 48% | 20%, 20% |
Eval Rate (tokens/s) | 27.03 | 26.85 | 24.15 | 7.22 |
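The eval rate figures above are the tokens-per-second numbers Ollama itself reports. If you want to reproduce them, a minimal sketch against Ollama's local REST API is shown below; the prompt and model tag are placeholders, and the server is assumed to be listening on the default port 11434.

```python
"""Minimal sketch: measure Ollama's eval rate (tokens/s) via its REST API.
Assumes Ollama 0.6.5 is running locally on the default port and the model
tag below has already been pulled."""
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.3:70b"  # swap in deepseek-r1:70b, qwen2.5:72b, qwen:110b, ...

payload = {
    "model": MODEL,
    "prompt": "Summarize the benefits of multi-GPU inference in three sentences.",
    "stream": False,  # return a single JSON object that includes timing fields
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
resp.raise_for_status()
data = resp.json()

# eval_duration is reported in nanoseconds; convert to tokens per second.
eval_rate = data["eval_count"] / data["eval_duration"] * 1e9
print(f"{MODEL}: {data['eval_count']} tokens, {eval_rate:.2f} tokens/s")
```

The same statistics are also printed by `ollama run <model> --verbose` at the end of each response.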
Analysis & Insights
1. Best Performance per Dollar
- At $999/month, the 2*RTX 5090 server delivered the highest eval rate in this test: 26.85 tokens/s on llama3.3:70b, ahead of both the H100 and the 2*A100 40GB.
2. More Affordable Than H100
- The dual RTX 5090 setup edged out the H100 on the same workload (26.85 vs. 24.34 tokens/s) while using consumer GPUs, which keeps the monthly price well below typical H100 hosting.
3. 64GB VRAM Limitation
- While the 70B and 72B models fit comfortably in VRAM and run entirely on the GPUs, the 110B Qwen model struggled: part of the workload spilled onto the CPU (usage jumped to 33-35%), GPU utilization capped at around 20%, and the eval rate dropped to just 7.22 tokens/s.
- This highlights that 64GB of total VRAM is not enough for smooth inference of 110B+ LLMs, even at 4-bit quantization; a rough estimate of why is sketched below.
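As a rough illustration of why the 110B model spills out of VRAM, this sketch estimates the memory footprint of a 4-bit quantized model; the 1.2x overhead factor (covering KV cache and runtime buffers) is an illustrative assumption, not a measured value.

```python
"""Back-of-the-envelope VRAM estimate for 4-bit quantized models.
The 1.2x overhead factor (KV cache + runtime buffers) is an assumption."""

TOTAL_VRAM_GB = 64  # 2 x 32 GB RTX 5090

def estimate_vram_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Weights take params * bits / 8 bytes; scale by an overhead factor."""
    weights_gb = params_billion * bits / 8  # e.g. 70B @ 4-bit ~= 35 GB of weights
    return weights_gb * overhead

for params in (70, 72, 110):
    need = estimate_vram_gb(params)
    verdict = "fits within" if need <= TOTAL_VRAM_GB else "exceeds"
    print(f"{params}B @ 4-bit: ~{need:.0f} GB -> {verdict} {TOTAL_VRAM_GB} GB")
```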
2*RTX 5090 vs. 2*A100 vs. H100 for 70B LLMs on Ollama
Metric | Nvidia 2*RTX 5090 | Nvidia H100 | Nvidia 2*A100 40GB |
---|---|---|---|
Model | llama3.3:70b | llama3.3:70b | llama3.3:70b |
Eval Rate (tokens/s) | 26.85 | 24.34 | 18.91 |
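To put the performance-per-dollar claim in concrete terms, the sketch below divides each server's measured eval rate by a monthly price. Only the $999/month figure for the 2*RTX 5090 plan comes from this article; the H100 and 2*A100 prices are placeholder assumptions and should be replaced with real quotes.

```python
"""Illustrative tokens/s-per-dollar comparison for llama3.3:70b.
Only the 2*RTX 5090 price is from this article; the others are placeholders."""

servers = {
    # name: (measured eval rate in tokens/s, monthly price in USD)
    "2*RTX 5090":  (26.85, 999.0),    # price from the plan below
    "H100":        (24.34, 2500.0),   # placeholder price -- replace with a real quote
    "2*A100 40GB": (18.91, 1500.0),   # placeholder price -- replace with a real quote
}

for name, (tps, price) in servers.items():
    print(f"{name}: {tps / price * 1000:.1f} tokens/s per $1,000/month")
```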
2*RTX 5090 GPU Hosting for LLMs
Enterprise GPU Dedicated Server - RTX A6000
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
- Optimized for AI, deep learning, data visualization, HPC, and more.
Multi-GPU Dedicated Server - 2xRTX 5090
- 256GB RAM
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 5090
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7 per card (64 GB total)
- FP32 Performance: 109.7 TFLOPS
Multi-GPU Dedicated Server - 2xA100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2 per card (80GB total)
- FP32 Performance: 19.5 TFLOPS
- Free NVLink Included
- A powerful dual-GPU solution for demanding AI workloads, large-scale inference, ML training, etc. A cost-effective alternative to the A100 80GB and H100, delivering exceptional performance at a competitive price.
Enterprise GPU Dedicated Server - H100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia H100
- Microarchitecture: Hopper
- CUDA Cores: 14,592
- Tensor Cores: 456
- GPU Memory: 80GB HBM2e
- FP32 Performance: 183 TFLOPS
Conclusion: 2*RTX 5090 is the Ideal Choice for 70B LLMs
Whether you're looking for the best GPU for LLaMA 3.3 70B, the cheapest setup to run DeepSeek-R1 70B, or Ollama 5090 hosting benchmarks, the verdict is clear: 👉 2× RTX 5090 is the new sweet spot for LLM hosting up to 72B.