Test Overview
Server Configs:
- Price: $999.00/month
- CPU: Dual Gold 6148
- RAM: 256GB
- Storage: 240GB SSD + 2TB NVMe + 8TB SATA
- Network: 1Gbps
- OS: Ubuntu 22.04
Single RTX 5090 GPU Details:
- GPU: Nvidia GeForce RTX 5090
- Microarchitecture: Blackwell
- Compute Capability: 12.0
- CUDA Cores: 21,760
- Tensor Cores: 680
- Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Framework:
- Ollama 0.6.5
This configuration makes it an ideal 2*RTX 5090 hosting solution for deep learning, LLM inference, and AI model training.
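Before running any benchmarks, it helps to confirm that both cards and their full VRAM are visible to the driver. The short Python sketch below does this with the pynvml (nvidia-ml-py) bindings; the package choice and the two-GPU expectation are assumptions about this particular setup, not part of the original test procedure.

```python
"""Sanity check: list the GPUs the driver exposes before benchmarking.
Assumes the nvidia-ml-py (pynvml) package is installed; adjust as needed."""
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print(f"GPUs detected: {count}")  # expect 2 on the 2*RTX 5090 server

for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    name = name.decode() if isinstance(name, bytes) else name  # older pynvml returns bytes
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # mem.total is in bytes; each RTX 5090 should report roughly 32 GB.
    print(f"GPU {i}: {name}, {mem.total / 1024**3:.1f} GiB VRAM")

pynvml.nvmlShutdown()
```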
Ollama Benchmark Results on Nvidia 2*RTX 5090
Models | deepseek-r1 | llama3.3 | qwen2.5 | qwen |
---|---|---|---|---|
Parameters | 70b | 70b | 72b | 110b |
Size (GB) | 43 | 43 | 47 | 63 |
Quantization | 4-bit | 4-bit | 4-bit | 4-bit |
Running on | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 |
Download Speed (MB/s) | 113 | 113 | 113 | 113 |
CPU Utilization | 1.3% | 1.3% | 1.3% | 33-35% |
RAM Utilization | 2.1% | 2.1% | 2.1% | 2.1% |
GPU Memory (2 cards) | 70.9%, 70.4% | 71%, 75% | 77.9%, 77.6% | 94%, 91% |
GPU Utilization (2 cards) | 45%, 48% | 47%, 45% | 45%, 48% | 20%, 20% |
Eval Rate (tokens/s) | 27.03 | 26.85 | 24.15 | 7.22 |
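The eval rate figures above are the tokens-per-second numbers Ollama itself reports. If you want to reproduce them, a minimal sketch against Ollama's local REST API is shown below; the prompt and model tag are placeholders, and the server is assumed to be listening on the default port 11434.

```python
"""Minimal sketch: measure Ollama's eval rate (tokens/s) via its REST API.
Assumes Ollama 0.6.5 is running locally on the default port and the model
tag below has already been pulled."""
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.3:70b"  # swap in deepseek-r1:70b, qwen2.5:72b, qwen:110b, ...

payload = {
    "model": MODEL,
    "prompt": "Summarize the benefits of multi-GPU inference in three sentences.",
    "stream": False,  # return a single JSON object that includes timing fields
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
resp.raise_for_status()
data = resp.json()

# eval_duration is reported in nanoseconds; convert to tokens per second.
eval_rate = data["eval_count"] / data["eval_duration"] * 1e9
print(f"{MODEL}: {data['eval_count']} tokens, {eval_rate:.2f} tokens/s")
```

The same statistics are also printed by `ollama run <model> --verbose` at the end of each response.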
Analysis & Insights
1. Best Performance per Dollar
- At $999/month, the 2*RTX 5090 server delivered the highest eval rate in this test: 26.85 tokens/s on llama3.3:70b, ahead of both the H100 and the 2*A100 40GB.
2. More Affordable Than H100
- The dual RTX 5090 setup edged out the H100 on the same workload (26.85 vs. 24.34 tokens/s) while using consumer GPUs, which keeps the monthly price well below typical H100 hosting.
3. 64GB VRAM Limitation
- While the 70B and 72B models fit comfortably in VRAM and run entirely on the GPUs, the 110B Qwen model struggled: part of the workload spilled onto the CPU (usage jumped to 33-35%), GPU utilization capped at around 20%, and the eval rate dropped to just 7.22 tokens/s.
- This highlights that 64GB of total VRAM is not enough for smooth inference of 110B+ LLMs, even at 4-bit quantization; a rough estimate of why is sketched below.
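As a rough illustration of why the 110B model spills out of VRAM, this sketch estimates the memory footprint of a 4-bit quantized model; the 1.2x overhead factor (covering KV cache and runtime buffers) is an illustrative assumption, not a measured value.

```python
"""Back-of-the-envelope VRAM estimate for 4-bit quantized models.
The 1.2x overhead factor (KV cache + runtime buffers) is an assumption."""

TOTAL_VRAM_GB = 64  # 2 x 32 GB RTX 5090

def estimate_vram_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Weights take params * bits / 8 bytes; scale by an overhead factor."""
    weights_gb = params_billion * bits / 8  # e.g. 70B @ 4-bit ~= 35 GB of weights
    return weights_gb * overhead

for params in (70, 72, 110):
    need = estimate_vram_gb(params)
    verdict = "fits within" if need <= TOTAL_VRAM_GB else "exceeds"
    print(f"{params}B @ 4-bit: ~{need:.0f} GB -> {verdict} {TOTAL_VRAM_GB} GB")
```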
2*RTX 5090 vs. 2*A100 vs. H100 for 70B LLMs on Ollama
Metric | Nvidia 2*RTX 5090 | Nvidia H100 | Nvidia 2*A100 40GB |
---|---|---|---|
Model | llama3.3:70b | llama3.3:70b | llama3.3:70b |
Eval Rate (tokens/s) | 26.85 | 24.34 | 18.91 |
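To put the performance-per-dollar claim in concrete terms, the sketch below divides each server's measured eval rate by a monthly price. Only the $999/month figure for the 2*RTX 5090 plan comes from this article; the H100 and 2*A100 prices are placeholder assumptions and should be replaced with real quotes.

```python
"""Illustrative tokens/s-per-dollar comparison for llama3.3:70b.
Only the 2*RTX 5090 price is from this article; the others are placeholders."""

servers = {
    # name: (measured eval rate in tokens/s, monthly price in USD)
    "2*RTX 5090":  (26.85, 999.0),    # price from the plan below
    "H100":        (24.34, 2500.0),   # placeholder price -- replace with a real quote
    "2*A100 40GB": (18.91, 1500.0),   # placeholder price -- replace with a real quote
}

for name, (tps, price) in servers.items():
    print(f"{name}: {tps / price * 1000:.1f} tokens/s per $1,000/month")
```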
2*RTX 5090 GPU Hosting for LLMs
Enterprise GPU Dedicated Server - RTX A6000
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
- Optimized for AI, deep learning, data visualization, HPC, and more.
Multi-GPU Dedicated Server - 2xRTX 5090
- 256GB RAM
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 5090
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7 per card (64 GB total)
- FP32 Performance: 109.7 TFLOPS
Multi-GPU Dedicated Server - 2xA100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2 per card (80GB total)
- FP32 Performance: 19.5 TFLOPS
- Free NVLink Included
- A powerful dual-GPU solution for demanding AI workloads, large-scale inference, ML training, etc. A cost-effective alternative to the A100 80GB and H100, delivering exceptional performance at a competitive price.
Enterprise GPU Dedicated Server - H100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia H100
- Microarchitecture: Hopper
- CUDA Cores: 14,592
- Tensor Cores: 456
- GPU Memory: 80GB HBM2e
- FP32 Performance: 183 TFLOPS
Conclusion: 2*RTX 5090 is the Ideal Choice for 70B LLMs
Whether you're looking for the best GPU for LLaMA 3.3 70B, the cheapest setup to run DeepSeek-R1 70B, or Ollama 5090 hosting benchmarks, the verdict is clear: 👉 2× RTX 5090 is the new sweet spot for LLM hosting up to 72B.