Benchmarking LLMs on Ollama: Performance of Nvidia RTX 4060 GPU Server

As large language models (LLMs) continue to gain traction, running them efficiently on consumer-grade GPUs has become a hot topic. In this benchmark, we test Ollama on a dedicated Nvidia RTX 4060 server to evaluate its performance in LLM inference. If you're looking for a high-performance yet affordable LLM hosting solution, this RTX 4060 benchmark will help you decide whether it's the right choice for your AI workload.

Test Server Configuration

Before diving into the Ollama 4060 benchmark, let's take a look at the server specs:

Server Configuration:

  • Price: $149/month
  • CPU: Intel Eight-Core E5-2690
  • RAM: 64GB
  • Storage: 120GB SSD + 960GB SSD
  • Network: 100Mbps Unmetered
  • OS: Windows 11 Pro

GPU Details:

  • GPU: Nvidia GeForce RTX 4060
  • Compute Capability: 8.9
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 3072
  • Tensor Cores: 96
  • Memory: 8GB GDDR6
  • FP32 Performance: 15.11 TFLOPS

This setup makes Nvidia RTX 4060 hosting a viable option for running LLM inference workloads efficiently while keeping costs in check.

Ollama Benchmark: Testing LLMs on the Nvidia RTX 4060 Server

For this Ollama RTX 4060 benchmark, we used the latest Ollama 0.5.11 runtime and tested various popular LLMs with 4-bit quantization to optimize for the 8GB VRAM constraint.
All models were tested with 4-bit quantization on Ollama 0.5.11; download speed was a steady 12 MB/s over the 100Mbps line.

Model          | Parameters | Size (GB) | Quantization | CPU Rate | RAM Rate | GPU UTL | Eval Rate (tokens/s)
---------------|------------|-----------|--------------|----------|----------|---------|---------------------
deepseek-r1    | 7b         | 4.7       | 4-bit        | 7%       | 8%       | 83%     | 42.60
deepseek-r1    | 8b         | 4.9       | 4-bit        | 7%       | 8%       | 81%     | 41.27
deepseek-coder | 6.7b       | 3.8       | 4-bit        | 8%       | 8%       | 91%     | 52.93
llama2         | 13b        | 7.4       | 4-bit        | 37%      | 12%      | 25-42%  | 8.03
llama3.1       | 8b         | 4.9       | 4-bit        | 8%       | 8%       | 88%     | 41.70
codellama      | 7b         | 3.8       | 4-bit        | 7%       | 7%       | 92%     | 52.69
mistral        | 7b         | 4.1       | 4-bit        | 8%       | 8%       | 90%     | 50.91
gemma          | 7b         | 5.0       | 4-bit        | 32%      | 9%       | 42%     | 22.39
gemma2         | 9b         | 5.4       | 4-bit        | 30%      | 10%      | 44%     | 18.0
codegemma      | 7b         | 5.0       | 4-bit        | 32%      | 9%       | 46%     | 23.18
qwen2.5        | 7b         | 4.7       | 4-bit        | 7%       | 8%       | 72%     | 42.23
codeqwen       | 7b         | 4.2       | 4-bit        | 9%       | 8%       | 93%     | 43.12
Video: real-time resource consumption of the RTX 4060 GPU server during the benchmark.
Screenshots: benchmarking LLMs on Ollama with the Nvidia RTX 4060 GPU server.
Commands tested: ollama run deepseek-r1:7b, ollama run deepseek-r1:8b, ollama run deepseek-coder:6.7b, ollama run llama2:13b, ollama run llama3.1:8b, ollama run codellama:7b, ollama run mistral:7b, ollama run gemma:7b, ollama run gemma2:9b, ollama run codegemma:7b, ollama run qwen2.5:7b, ollama run codeqwen:7b
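To reproduce the eval-rate column on your own server, you can read the per-request statistics Ollama reports. The minimal Python sketch below assumes Ollama 0.5.x is listening on its default local port 11434 and that the model has already been pulled; it computes tokens per second from the eval_count and eval_duration fields returned by the /api/generate endpoint (the same figure that ollama run <model> --verbose prints as "eval rate").

    import json
    import urllib.request

    OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
    MODEL = "deepseek-r1:7b"                             # any model already pulled locally

    payload = json.dumps({
        "model": MODEL,
        "prompt": "Explain the difference between a process and a thread.",
        "stream": False,  # return a single JSON object instead of a token stream
    }).encode("utf-8")

    request = urllib.request.Request(OLLAMA_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read())

    # eval_count    = number of generated tokens
    # eval_duration = generation time in nanoseconds
    tokens_per_second = result["eval_count"] / result["eval_duration"] * 1e9
    print(f"{MODEL}: {result['eval_count']} tokens at {tokens_per_second:.2f} tokens/s")

Looping this over the models in the table reproduces the eval rates above; the first request after loading a model also includes the load time in its total_duration, so it is worth discarding a warm-up run.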

Performance Analysis from the Benchmark

1️⃣ Best choice for models under 5.0GB

After 4-bit quantization, models under 5.0GB achieve inference speeds of 40+ tokens/s, with DeepSeek-Coder (52.93 tokens/s), CodeLlama (52.69 tokens/s), and Mistral (50.91 tokens/s) leading the Ollama benchmark, making them ideal for high-speed inference.

2️⃣ Not suitable for models of 13B and above

Due to the 8GB VRAM limit, LLaMA 2 (13B) performs poorly on the RTX 4060 server: GPU utilization stays at 25-42%, CPU usage climbs to 37%, and throughput falls to about 8 tokens/s. The RTX 4060 is therefore not practical for inference on models of 13B and above.
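A rough back-of-the-envelope estimate shows why: the 4-bit llama2:13b weights alone take 7.4GB (see the table above), and once a KV cache and CUDA runtime overhead are added the total exceeds the card's 8GB, so Ollama offloads part of the model to system RAM and the CPU, which is why CPU usage jumps to 37%. A small sketch of that check in Python, where the overhead figures are rough assumptions rather than measured values:

    # Rough VRAM feasibility check: quantized weight size (the GGUF file size from
    # the table above) plus assumed KV-cache and runtime overheads (estimates only).
    def vram_needed_gb(weights_gb, kv_cache_gb=0.8, runtime_overhead_gb=0.5):
        return weights_gb + kv_cache_gb + runtime_overhead_gb

    VRAM_GB = 8.0  # RTX 4060
    for name, weights_gb in [("llama2:13b", 7.4), ("gemma2:9b", 5.4), ("llama3.1:8b", 4.9)]:
        needed = vram_needed_gb(weights_gb)
        verdict = "fits on the GPU" if needed <= VRAM_GB else "exceeds 8GB -> layers offloaded to CPU"
        print(f"{name}: ~{needed:.1f} GB needed, {verdict}")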

3️⃣ Once the model file reaches 5.0GB, speed drops from 40+ to roughly 20 tokens/s

Gemma, Gemma 2, and CodeGemma show high CPU utilization (30-32%) and GPU utilization of only 42-46%, which suggests part of the workload is spilling off the GPU. Their model files are at or above 5.0GB, leaving little VRAM headroom on an 8GB card, and inference speed drops to 18-23 tokens/s as a result.
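To confirm where the bottleneck is on your own server, GPU utilization and VRAM usage can be sampled with nvidia-smi while a prompt is being generated. A minimal sketch, assuming a single-GPU machine with nvidia-smi on the PATH:

    import subprocess
    import time

    # Sample GPU utilization and VRAM usage once per second while Ollama is generating.
    QUERY = ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"]

    for _ in range(30):  # sample for roughly 30 seconds
        util, used, total = subprocess.check_output(QUERY, text=True).strip().split(", ")
        print(f"GPU {util}% | VRAM {used}/{total} MiB")
        time.sleep(1)

A model that fits entirely in VRAM should show GPU utilization in the 80-90% range during generation, matching the table above, while an offloaded model like llama2:13b will hover much lower.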

4️⃣ Cost-effective choice for models of 8B and below

Most models of 8B and below run at high speed, with GPU utilization of 70%-90% and stable performance at 40+ tokens/s.
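For a sense of what "cost-effective" means in practice, the $149/month price of the test configuration can be converted into a cost per million generated tokens. The sketch below assumes continuous single-stream generation at 40 tokens/s, so it is an upper bound on throughput rather than a realistic duty cycle:

    # Back-of-the-envelope cost per token, assuming continuous single-stream
    # generation at 40 tokens/s and the $149/month price of the test server.
    TOKENS_PER_SECOND = 40
    MONTHLY_PRICE_USD = 149
    SECONDS_PER_MONTH = 30 * 24 * 3600

    tokens_per_month = TOKENS_PER_SECOND * SECONDS_PER_MONTH  # ~104M tokens
    usd_per_million_tokens = MONTHLY_PRICE_USD / (tokens_per_month / 1e6)
    print(f"~{tokens_per_month / 1e6:.0f}M tokens/month, ${usd_per_million_tokens:.2f} per million tokens")

At full utilization that works out to roughly $1.44 per million tokens; real-world cost per token will be higher in proportion to idle time.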

Is the Nvidia RTX 4060 Server Good for LLM Inference?

✅ Pros of Using Nvidia RTX 4060 for Ollama

  • Affordable $149/month RTX 4060 hosting
  • Good performance for 7B-8B models
  • DeepSeek-Coder & Mistral run efficiently at 50+ tokens/s

❌ Limitations of Nvidia RTX 4060 Server for Ollama

  • Limited VRAM (8GB) struggles with 13B models
  • LLaMA 2 (13B) underutilizes the GPU (25-42% utilization)

If you're running 7B-8B models, the RTX 4060 is an excellent budget-friendly option for LLM inference. However, for 13B+ models, you might need Nvidia V100 or A4000 hosting with more VRAM.

Get Started with RTX 4060 Hosting for LLMs

For those deploying LLMs on Ollama, choosing the right Nvidia RTX 4060 hosting solution can significantly impact performance and cost. If you're working with models of 9B and below, the RTX 4060 is a solid choice for AI inference at an affordable price.
Flash Sale until Mar. 16

Basic GPU Dedicated Server - RTX 4060

$106.00/mo
40% OFF Recurring (Was $179.00)
  • 64GB RAM
  • Eight-Core E5-2690
  • 120GB SSD + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia GeForce RTX 4060
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 3072
  • Tensor Cores: 96
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 15.11 TFLOPS
  • Ideal for video editing, rendering, Android emulators, gaming, and light AI tasks.
Flash Sale until Mar. 16

Professional GPU VPS - A4000

$102.00/mo
43% OFF Recurring (Was $179.00)
  • 32GB RAM
  • 24 CPU Cores
  • 320GB SSD
  • 300Mbps Unmetered Bandwidth
  • Once per 2 Weeks Backup
  • OS: Linux / Windows 10
  • Dedicated GPU: Quadro RTX A4000
  • CUDA Cores: 6,144
  • Tensor Cores: 192
  • GPU Memory: 16GB GDDR6
  • FP32 Performance: 19.2 TFLOPS
  • Available for Rendering, AI/Deep Learning, Data Science, CAD/CGI/DCC.

Advanced GPU Dedicated Server - V100

$229.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2690v3
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia V100
  • Microarchitecture: Volta
  • CUDA Cores: 5,120
  • Tensor Cores: 640
  • GPU Memory: 16GB HBM2
  • FP32 Performance: 14 TFLOPS
  • Cost-effective for AI, deep learning, data visualization, HPC, etc.

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.

Summary

The Nvidia RTX 4060 is a cost-effective choice for models of 9B and below. Most models of 8B and below run smoothly, with GPU utilization of 70%-90% and inference speeds stable at 40+ tokens/s, making it a budget-friendly solution for LLM inference.

Tags:

Ollama 4060, Ollama RTX4060, Nvidia RTX4060 hosting, Benchmark RTX4060, Ollama benchmark, RTX4060 for LLMs inference, Nvidia RTX4060 rental