Benchmarking LLMs on Ollama: Performance of Nvidia RTX 4060 GPU Server

As large language models (LLMs) continue to gain traction, running them efficiently on consumer-grade GPUs has become a hot topic. In this benchmark, we test Ollama on a dedicated Nvidia RTX 4060 server to evaluate its performance in LLM inference. If you're looking for a high-performance yet affordable LLM hosting solution, this RTX 4060 benchmark will help you decide whether it's the right choice for your AI workload.

Test Server Configuration

Before diving into the Ollama 4060 benchmark, let's take a look at the server specs:

Server Configuration:

  • Price: $149/month
  • CPU: Intel Eight-Core E5-2690
  • RAM: 64GB
  • Storage: 120GB SSD + 960GB SSD
  • Network: 100Mbps Unmetered
  • OS: Windows 11 Pro

GPU Details:

  • GPU: Nvidia GeForce RTX 4060
  • Compute Capability: 8.9
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 3072
  • Tensor Cores: 96
  • Memory: 8GB GDDR6
  • FP32 Performance: 15.11 TFLOPS

This setup makes Nvidia RTX 4060 hosting a viable option for running LLM inference workloads efficiently while keeping costs in check.

Ollama Benchmark: Testing LLMs on the Nvidia RTX 4060 Server

For this Ollama RTX 4060 benchmark, we used the latest Ollama 0.5.11 runtime and tested various popular LLMs with 4-bit quantization to optimize for the 8GB VRAM constraint.
All models were tested with 4-bit quantization on Ollama 0.5.11; download speed was a steady 12 MB/s over the 100Mbps line.

Model          | Parameters | Size (GB) | Quantization | CPU Rate | RAM Rate | GPU UTL | Eval Rate (tokens/s)
---------------|------------|-----------|--------------|----------|----------|---------|---------------------
deepseek-r1    | 7b         | 4.7       | 4-bit        | 7%       | 8%       | 83%     | 42.60
deepseek-r1    | 8b         | 4.9       | 4-bit        | 7%       | 8%       | 81%     | 41.27
deepseek-coder | 6.7b       | 3.8       | 4-bit        | 8%       | 8%       | 91%     | 52.93
llama2         | 13b        | 7.4       | 4-bit        | 37%      | 12%      | 25-42%  | 8.03
llama3.1       | 8b         | 4.9       | 4-bit        | 8%       | 8%       | 88%     | 41.70
codellama      | 7b         | 3.8       | 4-bit        | 7%       | 7%       | 92%     | 52.69
mistral        | 7b         | 4.1       | 4-bit        | 8%       | 8%       | 90%     | 50.91
gemma          | 7b         | 5.0       | 4-bit        | 32%      | 9%       | 42%     | 22.39
gemma2         | 9b         | 5.4       | 4-bit        | 30%      | 10%      | 44%     | 18.0
codegemma      | 7b         | 5.0       | 4-bit        | 32%      | 9%       | 46%     | 23.18
qwen2.5        | 7b         | 4.7       | 4-bit        | 7%       | 8%       | 72%     | 42.23
codeqwen       | 7b         | 4.2       | 4-bit        | 9%       | 8%       | 93%     | 43.12
Video: real-time resource consumption of the RTX 4060 GPU server during the benchmark.
Screenshots: benchmarking LLMs on Ollama with the Nvidia RTX 4060 GPU server.
Commands tested: ollama run deepseek-r1:7b, ollama run deepseek-r1:8b, ollama run deepseek-coder:6.7b, ollama run llama2:13b, ollama run llama3.1:8b, ollama run codellama:7b, ollama run mistral:7b, ollama run gemma:7b, ollama run gemma2:9b, ollama run codegemma:7b, ollama run qwen2.5:7b, ollama run codeqwen:7b
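To reproduce the eval-rate column on your own server, you can read the per-request statistics Ollama reports. The minimal Python sketch below assumes Ollama 0.5.x is listening on its default local port 11434 and that the model has already been pulled; it computes tokens per second from the eval_count and eval_duration fields returned by the /api/generate endpoint (the same figure that ollama run <model> --verbose prints as "eval rate").

    import json
    import urllib.request

    OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
    MODEL = "deepseek-r1:7b"                             # any model already pulled locally

    payload = json.dumps({
        "model": MODEL,
        "prompt": "Explain the difference between a process and a thread.",
        "stream": False,  # return a single JSON object instead of a token stream
    }).encode("utf-8")

    request = urllib.request.Request(OLLAMA_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read())

    # eval_count    = number of generated tokens
    # eval_duration = generation time in nanoseconds
    tokens_per_second = result["eval_count"] / result["eval_duration"] * 1e9
    print(f"{MODEL}: {result['eval_count']} tokens at {tokens_per_second:.2f} tokens/s")

Looping this over the models in the table reproduces the eval rates above; the first request after loading a model also includes the load time in its total_duration, so it is worth discarding a warm-up run.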

Performance Analysis from the Benchmark

1️⃣ Best choice for models under 5.0GB

After 4-bit quantization, models under 5.0GB achieve inference speeds of 40+ tokens/s, with DeepSeek-Coder (52.93 tokens/s), CodeLlama (52.69 tokens/s), and Mistral (50.91 tokens/s) leading the Ollama benchmark, making them ideal for high-speed inference.

2️⃣ Not suitable for models of 13B and above

Due to the 8GB VRAM limit, LLaMA 2 (13B) performs poorly on the RTX 4060 server: GPU utilization stays at 25-42%, CPU usage climbs to 37%, and throughput falls to about 8 tokens/s. The RTX 4060 is therefore not practical for inference on models of 13B and above.
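A rough back-of-the-envelope estimate shows why: the 4-bit llama2:13b weights alone take 7.4GB (see the table above), and once a KV cache and CUDA runtime overhead are added the total exceeds the card's 8GB, so Ollama offloads part of the model to system RAM and the CPU, which is why CPU usage jumps to 37%. A small sketch of that check in Python, where the overhead figures are rough assumptions rather than measured values:

    # Rough VRAM feasibility check: quantized weight size (the GGUF file size from
    # the table above) plus assumed KV-cache and runtime overheads (estimates only).
    def vram_needed_gb(weights_gb, kv_cache_gb=0.8, runtime_overhead_gb=0.5):
        return weights_gb + kv_cache_gb + runtime_overhead_gb

    VRAM_GB = 8.0  # RTX 4060
    for name, weights_gb in [("llama2:13b", 7.4), ("gemma2:9b", 5.4), ("llama3.1:8b", 4.9)]:
        needed = vram_needed_gb(weights_gb)
        verdict = "fits on the GPU" if needed <= VRAM_GB else "exceeds 8GB -> layers offloaded to CPU"
        print(f"{name}: ~{needed:.1f} GB needed, {verdict}")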

3️⃣ Once the model file reaches 5.0GB, speed drops from 40+ to roughly 20 tokens/s

Gemma, Gemma 2, and CodeGemma show high CPU utilization (30-32%) and GPU utilization of only 42-46%, which suggests part of the workload is spilling off the GPU. Their model files are at or above 5.0GB, leaving little VRAM headroom on an 8GB card, and inference speed drops to 18-23 tokens/s as a result.
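To confirm where the bottleneck is on your own server, GPU utilization and VRAM usage can be sampled with nvidia-smi while a prompt is being generated. A minimal sketch, assuming a single-GPU machine with nvidia-smi on the PATH:

    import subprocess
    import time

    # Sample GPU utilization and VRAM usage once per second while Ollama is generating.
    QUERY = ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"]

    for _ in range(30):  # sample for roughly 30 seconds
        util, used, total = subprocess.check_output(QUERY, text=True).strip().split(", ")
        print(f"GPU {util}% | VRAM {used}/{total} MiB")
        time.sleep(1)

A model that fits entirely in VRAM should show GPU utilization in the 80-90% range during generation, matching the table above, while an offloaded model like llama2:13b will hover much lower.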

4️⃣ Cost-effective choice for models of 8B and below

Most models of 8B and below run at high speed, with GPU utilization of 70%-90% and stable performance at 40+ tokens/s.
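For a sense of what "cost-effective" means in practice, the $149/month price of the test configuration can be converted into a cost per million generated tokens. The sketch below assumes continuous single-stream generation at 40 tokens/s, so it is an upper bound on throughput rather than a realistic duty cycle:

    # Back-of-the-envelope cost per token, assuming continuous single-stream
    # generation at 40 tokens/s and the $149/month price of the test server.
    TOKENS_PER_SECOND = 40
    MONTHLY_PRICE_USD = 149
    SECONDS_PER_MONTH = 30 * 24 * 3600

    tokens_per_month = TOKENS_PER_SECOND * SECONDS_PER_MONTH  # ~104M tokens
    usd_per_million_tokens = MONTHLY_PRICE_USD / (tokens_per_month / 1e6)
    print(f"~{tokens_per_month / 1e6:.0f}M tokens/month, ${usd_per_million_tokens:.2f} per million tokens")

At full utilization that works out to roughly $1.44 per million tokens; real-world cost per token will be higher in proportion to idle time.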

Is the Nvidia RTX 4060 Server Good for LLM Inference?

✅ Pros of Using Nvidia RTX 4060 for Ollama

  • Affordable $149/month RTX 4060 hosting
  • Good performance for 7B-8B models
  • DeepSeek-Coder & Mistral run efficiently at 50+ tokens/s

❌ Limitations of Nvidia RTX 4060 Server for Ollama

  • Limited VRAM (8GB) struggles with 13B models
  • LLaMA 2 (13B) underutilizes the GPU (25-42% utilization)

If you're running 7B-8B models, the RTX 4060 is an excellent budget-friendly option for LLM inference. However, for 13B+ models, you might need Nvidia V100 or A4000 hosting with more VRAM.

Get Started with RTX 4060 Hosting for LLMs

For those deploying LLMs on Ollama, choosing the right Nvidia RTX 4060 hosting solution can significantly impact performance and cost. If you're working with models of 9B and below, the RTX 4060 is a solid choice for AI inference at an affordable price.
Flash Sale until Mar. 16

Basic GPU Dedicated Server - RTX 4060

$106.00/mo
40% OFF Recurring (Was $179.00)
  • 64GB RAM
  • Eight-Core E5-2690
  • 120GB SSD + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia GeForce RTX 4060
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 3072
  • Tensor Cores: 96
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 15.11 TFLOPS
  • Ideal for video editing, rendering, Android emulators, gaming, and light AI tasks.
Flash Sale until Mar. 16

Professional GPU VPS - A4000

$102.00/mo
43% OFF Recurring (Was $179.00)
  • 32GB RAM
  • 24 CPU Cores
  • 320GB SSD
  • 300Mbps Unmetered Bandwidth
  • Once per 2 Weeks Backup
  • OS: Linux / Windows 10
  • Dedicated GPU: Quadro RTX A4000
  • CUDA Cores: 6,144
  • Tensor Cores: 192
  • GPU Memory: 16GB GDDR6
  • FP32 Performance: 19.2 TFLOPS
  • Available for Rendering, AI/Deep Learning, Data Science, CAD/CGI/DCC.

Advanced GPU Dedicated Server - V100

$229.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2690v3
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia V100
  • Microarchitecture: Volta
  • CUDA Cores: 5,120
  • Tensor Cores: 640
  • GPU Memory: 16GB HBM2
  • FP32 Performance: 14 TFLOPS
  • Cost-effective for AI, deep learning, data visualization, HPC, etc.

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.

Summary

The Nvidia RTX 4060 is a cost-effective choice for models of 9B and below. Most models of 8B and below run smoothly, with GPU utilization of 70%-90% and inference speeds stable at 40+ tokens/s, making it a budget-friendly solution for LLM inference.

Tags:

Ollama 4060, Ollama RTX4060, Nvidia RTX4060 hosting, Benchmark RTX4060, Ollama benchmark, RTX4060 for LLMs inference, Nvidia RTX4060 rental