Benchmarking LLMs on NVIDIA RTX 4090 GPU Server with Ollama: An In-Depth Analysis

In the race to optimize Large Language Model (LLM) performance, hardware efficiency plays a pivotal role. The NVIDIA RTX 4090, a powerhouse GPU featuring 24GB GDDR6X memory, paired with Ollama, a popular platform for running LLMs locally, provides a compelling solution for developers and enterprises. This article dives into the RTX 4090 benchmark and Ollama benchmark, evaluating the setup's capability to host and run various LLMs (DeepSeek-R1, LLaMA, Qwen, Gemma, etc.) on a GPU server.

Server Specifications

Here’s the detailed specification of the RTX 4090 hosting server used in our tests:

Server Configuration:

  • Price: $409.00/month
  • CPU: Dual 18-Core E5-2697v4 (36 cores, 72 threads)
  • RAM: 256GB
  • Storage: 240GB SSD + 2TB NVMe + 8TB SATA
  • Network: 100Mbps-1Gbps connection
  • OS: Windows 11 Pro
  • Software: Ollama 0.5.4 / 0.5.7

GPU Details:

  • GPU: Nvidia GeForce RTX 4090
  • Compute Capability: 8.9
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS

With 82.6 TFLOPS FP32 performance, 16,384 CUDA cores, and 512 Tensor cores, the NVIDIA RTX 4090 outshines most consumer-grade GPUs in both compute capability and cost-efficiency.

LLMs Tested on Ollama with the RTX 4090

The models range from 8B to 40B parameters, covering lightweight to medium-sized LLMs and providing a diverse test scope. The evaluation was conducted with Ollama 0.5.4 (0.5.7 for the deepseek-r1 runs) and assessed the following language models (a script for pre-downloading them follows the list):
  • LLaMA Series: LLaMA 2 (13B), LLaMA 3.1 (8B)
  • Qwen Series: Qwen (14B, 32B)
  • Phi Series: Phi4 (14B)
  • Mistral Models: Mistral-small (22B)
  • Falcon Series: Falcon (40B)
  • Gemma and LLaVA: Gemma2 (27B), LLaVA (34B)
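
For anyone reproducing this test matrix, the models can be pre-downloaded before benchmarking. The sketch below simply loops over the tags used in this article and calls the ollama CLI via subprocess; it assumes ollama is installed and on the PATH, and the default library tags are typically 4-bit quantized, matching the quantization column in the results table.

```python
import subprocess

# Model tags benchmarked in this article (as used with "ollama run ...").
MODELS = [
    "llama2:13b", "llama3.1:8b",
    "qwen2.5:14b", "qwen2.5:32b",
    "phi4:14b", "mistral-small:22b",
    "falcon:40b", "gemma2:27b", "llava:34b",
    "deepseek-r1:14b", "deepseek-r1:32b", "qwq:32b",
]

for tag in MODELS:
    # "ollama pull" downloads the model weights ahead of time.
    subprocess.run(["ollama", "pull", tag], check=True)
```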

Benchmark Results: Ollama GPU RTX 4090 Performance Metrics

The RTX 4090 performed exceptionally well, particularly with small to medium-sized models. Key metrics are summarized below:
| Model | Parameters | Size (GB) | Quantization | Ollama Version | Download Speed (MB/s) | CPU Rate | RAM Rate | GPU vRAM | GPU Util | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-r1 | 14b | 9 | 4-bit | 0.5.7 | 12 | 2% | 3% | 45% | 95% | 58.62 |
| deepseek-r1 | 32b | 20 | 4-bit | 0.5.7 | 12 | 3% | 3% | 90% | 98% | 34.22 |
| llama2 | 13b | 7.4 | 4-bit | 0.5.4 | 12 | 1% | 3% | 41% | 92% | 70.90 |
| llama3.1 | 8b | 4.9 | 4-bit | 0.5.4 | 12 | 2% | 3% | 65% | 94% | 95.51 |
| qwen2.5 | 14b | 9 | 4-bit | 0.5.4 | 12 | 3% | 3% | 45% | 96% | 63.92 |
| qwen2.5 | 32b | 20 | 4-bit | 0.5.4 | 12 | 3% | 3% | 90% | 97% | 34.39 |
| gemma2 | 27b | 16 | 4-bit | 0.5.4 | 12 | 2% | 3% | 78% | 96% | 37.97 |
| phi4 | 14b | 9.1 | 4-bit | 0.5.4 | 12 | 3% | 3% | 47% | 97% | 68.62 |
| qwq | 32b | 20 | 4-bit | 0.5.4 | 12 | 2% | 3% | 90% | 99% | 31.80 |
| llava | 34b | 19 | 4-bit | 0.5.4 | 12 | 2% | 3% | 92% | 97% | 36.67 |
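
One way to reproduce the Eval Rate column is to call Ollama's local HTTP API, which reports the number of generated tokens and the decode time for each request. The sketch below is a minimal example, assuming a local Ollama instance on its default port (11434); the prompt and model tags are illustrative, and rates vary with prompt and context length.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def eval_rate(model: str, prompt: str = "Explain GPU memory bandwidth in one paragraph.") -> float:
    """Run a single non-streaming generation and return decode speed in tokens/s."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # Ollama reports eval_count (generated tokens) and eval_duration (nanoseconds).
    return data["eval_count"] / data["eval_duration"] * 1e9

if __name__ == "__main__":
    for tag in ["llama3.1:8b", "qwen2.5:14b", "deepseek-r1:32b"]:
        print(f"{tag}: {eval_rate(tag):.2f} tokens/s")
```
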
A screen recording of real-time RTX 4090 GPU server resource consumption and screenshots were captured for each run: ollama run deepseek-r1:32b, ollama run llama2:13b, ollama run llama3.1:8b, ollama run qwen2.5:14b, ollama run qwen2.5:32b, ollama run gemma2:27b, ollama run phi4:14b, ollama run qwq:32b, ollama run llava:34b, ollama run qwen:14b, and ollama run qwen:32b.
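
The GPU vRAM and GPU utilization figures come from live resource monitoring during each run. A simple way to log comparable numbers is to poll nvidia-smi once per second, as in the sketch below; it assumes a single-GPU machine, and the output filename is just illustrative.

```python
import csv
import subprocess
import time

# nvidia-smi query fields to record for each sample.
QUERY = "timestamp,utilization.gpu,memory.used,memory.total,power.draw"

def sample() -> list[str]:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [field.strip() for field in out.stdout.strip().split(",")]

if __name__ == "__main__":
    with open("gpu_usage.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(QUERY.split(","))
        for _ in range(300):  # ~5 minutes of one-second samples
            writer.writerow(sample())
            time.sleep(1)
```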

Key Insights

1. Small to Mid-Sized Models (8B-34B)

The RTX 4090 excels at hosting lightweight and mid-range LLMs, with evaluation speeds of up to roughly 95 tokens/s. For models such as LLaMA 2 (13B) and deepseek-r1 (32B), it consistently ran at 92%-99% GPU utilization, demonstrating high efficiency without overloading the CPU (1%-3% utilization).

2. Not Enough for 40B Models

For larger models like Falcon (40B), the GPU showed signs of strain, with evaluation speeds dropping to 8.61 tokens/s. Despite Ollama's 4-bit quantization, the 24GB VRAM limit made it difficult for the RTX 4090 to handle these workloads at optimal speeds.
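
A rough rule of thumb explains this ceiling: a 4-bit (Q4) model needs on the order of 0.5-0.6 GB of VRAM per billion parameters for the weights alone, plus several GB for KV cache and runtime overhead. The sketch below uses assumed constants (0.56 GB per billion parameters and 3 GB of overhead) purely for illustration, not as an exact formula:

```python
def estimate_q4_vram_gb(params_b: float, overhead_gb: float = 3.0) -> float:
    """Very rough VRAM estimate for a 4-bit quantized model: ~0.56 GB per
    billion parameters for weights, plus KV cache / runtime overhead."""
    return params_b * 0.56 + overhead_gb

for size in (13, 32, 40):
    print(f"{size}B  ->  ~{estimate_q4_vram_gb(size):.1f} GB")
# 13B -> ~10.3 GB, 32B -> ~20.9 GB, 40B -> ~25.4 GB (exceeds 24 GB, forcing offload)
```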

3. Cost-Effective

The RTX 4090 is ideal for LLM workloads involving models up to roughly 36B parameters, delivering performance comparable to or better than pricier data-center cards for small-to-medium-scale inference tasks. The key metrics are summarized below:
| Metric | Value Across Models |
|---|---|
| Download speed | 12 MB/s for all models; ~118 MB/s with the 1Gbps bandwidth add-on |
| CPU utilization | Stays at 1%-3% |
| RAM utilization | Stays at 2%-4% |
| GPU vRAM utilization | 41%-92%; the larger the model, the higher the utilization |
| GPU utilization | 92%+, consistently high |
| Evaluation speed | 30+ tokens/s; models below ~36B parameters are recommended |
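
The bandwidth difference is worth quantifying: at 12 MB/s a 20 GB 32B model takes close to half an hour to download, while at ~118 MB/s it takes about three minutes. A quick illustrative calculation using the model sizes from the results table:

```python
def download_minutes(size_gb: float, speed_mb_s: float) -> float:
    """Approximate download time in minutes (1 GB ~ 1024 MB)."""
    return size_gb * 1024 / speed_mb_s / 60

for size in (4.9, 9, 20):
    print(f"{size} GB: {download_minutes(size, 12):.0f} min at 12 MB/s, "
          f"{download_minutes(size, 118):.1f} min at 118 MB/s")
```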

Performance Comparison with Other Graphics Cards

Among 24GB GPUs, the RTX 4090 dominates in FP32 performance and CUDA core count, making it the most cost-effective choice for inference tasks. Although the 48GB A6000 can run 36B-70B models, its compute throughput on 8B-36B models is significantly lower than the RTX 4090's. Despite its strengths, the RTX 4090 cannot efficiently handle models larger than 40B parameters; such models typically require 48GB of VRAM or more (e.g., NVIDIA A6000, A100, or H100), highlighting its hardware limitations for ultra-large LLMs.
AI Servers, Smarter Deals!

Enterprise GPU Dedicated Server - RTX 4090

$302.00/mo
44% Off Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.

Enterprise GPU Dedicated Server - RTX A6000

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimal for running AI, deep learning, data visualization, HPC, etc.

Enterprise GPU Dedicated Server - A100

$469.00/mo
41% OFF Recurring (Was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Multi-GPU Dedicated Server - 2x RTX 4090

$729.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS

Summary and Recommendations

The NVIDIA RTX 4090 benchmark showcases its exceptional ability to handle LLM inference workloads, particularly for small-to-medium models. Paired with Ollama, this setup provides a robust, cost-effective solution for developers and enterprises seeking high performance without breaking the bank.

However, for larger models and extended scalability, GPUs like the NVIDIA A100 or multi-GPU configurations might be necessary. Nonetheless, for most practical use cases, the RTX 4090 hosting solution is a game-changer in the world of LLM benchmarks.

Tags:

RTX 4090, Ollama, LLMs, AI benchmarks, GPU performance, NVIDIA GPU, AI hosting, 4-bit quantization, Falcon 40B, LLaMA 2, GPU server, machine learning, RTX 4090 benchmark, Ollama benchmark, LLM performance, GPU hosting, AI model hosting, GPU server performance