RTX2060 Ollama Benchmark: Best GPU for 3B LLM Inference

With the rise of local large language model (LLM) inference, many AI enthusiasts and developers are looking for cost-effective solutions. One such popular choice is running models using Ollama on an Nvidia RTX2060 GPU. In this benchmark, we evaluate the performance of various LLMs on a dedicated RTX2060 server, analyzing their inference speed, GPU utilization, and overall feasibility for small-scale deployment.

This RTX2060 Ollama benchmark aims to answer the question: Can an Nvidia RTX2060 effectively handle LLMs like DeepSeek, Llama 3, Mistral, and Qwen? And if so, which models provide the best trade-off between performance and resource consumption?

Test Server Configuration

Before diving into the Ollama RTX 2060 benchmark, let's take a look at the server specs:

Server Configuration:

  • Price: $199/month
  • CPU: Intel Dual 10-Core E5-2660 v2
  • RAM: 128GB
  • Storage: 120GB + 960GB SSD
  • Network: 100Mbps Unmetered
  • OS: Windows 11 Pro

GPU Details:

  • GPU: Nvidia GeForce RTX 2060
  • Compute Capability: 7.5
  • Microarchitecture: Turing
  • CUDA Cores: 1920
  • Tensor Cores: 240
  • Memory: 6GB GDDR6
  • FP32 Performance: 6.5 TFLOPS

This setup allows us to explore RTX2060 for small LLM inference, focusing on models up to 3B parameters due to the 6GB VRAM limitation.
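A quick way to predict whether a given quantized model will fit entirely in that 6GB is to add a rough allowance for the KV cache and CUDA runtime on top of the model file size. The snippet below is only a back-of-the-envelope sketch: the overhead figures are assumptions for illustration, and the real footprint varies with context length, quantization, and Ollama version.

```python
# Back-of-the-envelope check: will a quantized model fit in 6 GB of VRAM?
# kv_cache_gb and runtime_overhead_gb are assumed values for illustration only;
# the real overhead depends on context length, quantization, and Ollama version.

def fits_in_vram(model_size_gb: float,
                 vram_gb: float = 6.0,
                 kv_cache_gb: float = 0.6,
                 runtime_overhead_gb: float = 1.0) -> bool:
    """True if the model file plus estimated overhead should fit on the GPU."""
    return model_size_gb + kv_cache_gb + runtime_overhead_gb <= vram_gb

# Model sizes in GB, as reported by `ollama list` for the tags tested below.
sizes = {
    "llama3.2:3b": 2.0,
    "qwen2.5:3b": 1.9,
    "mistral:7b": 4.1,
    "deepseek-r1:7b": 4.7,
    "llama3.1:8b": 4.9,
}

for name, size_gb in sizes.items():
    verdict = "full GPU offload likely" if fits_in_vram(size_gb) else "expect partial CPU offload"
    print(f"{name:>16} ({size_gb} GB): {verdict}")
```

Under these assumed overheads, anything around 4GB or smaller should load fully onto the GPU, while the ~4.7GB+ downloads spill over to the CPU, which is consistent with the utilization numbers measured below.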

Benchmark Results: Ollama on Nvidia RTX2060

For testing, we utilized Ollama 0.5.11 to benchmark a variety of LLMs on the Nvidia RTX 2060 GPU. The results provide valuable insight into the GPU's performance when tasked with smaller language models.
All models were pulled at 4-bit quantization and run on Ollama 0.5.11; the download speed held steady at about 12 MB/s on the 100Mbps line.

| Model | Parameters | Size (GB) | CPU Rate | RAM Rate | GPU UTL | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|
| deepseek-r1:1.5b | 1.5B | 1.1 | 7% | 5% | 39% | 43.12 |
| deepseek-r1:7b | 7B | 4.7 | 46% | 6% | 35% | 8.84 |
| deepseek-r1:8b | 8B | 4.9 | 46% | 6% | 32% | 7.52 |
| deepseek-coder:6.7b | 6.7B | 3.8 | 42% | 5% | 35% | 13.62 |
| llama3.2:3b | 3B | 2.0 | 7% | 5% | 56% | 50.41 |
| llama3.1:8b | 8B | 4.9 | 51% | 6% | 31% | 7.39 |
| codellama:7b | 7B | 3.8 | 41% | 5% | 35% | 13.21 |
| mistral:7b | 7B | 4.1 | 7% | 5% | 21% | 48.57 |
| gemma:7b | 7B | 5.0 | 51% | 7% | 12% | 3.70 |
| codegemma:7b | 7B | 5.0 | 53% | 7% | 11% | 3.69 |
| qwen2.5:3b | 3B | 1.9 | 7% | 5% | 43% | 36.02 |
| qwen2.5:7b | 7B | 4.7 | 45% | 6% | 36% | 8.98 |
A video recording the RTX 2060 GPU server's real-time resource consumption during the benchmark accompanies this article.
Screenshots of Benchmarking LLMs on Ollama with the Nvidia RTX 2060 GPU Server

Each model was launched with the standard Ollama CLI:

  • ollama run deepseek-r1:1.5b
  • ollama run deepseek-r1:7b
  • ollama run deepseek-r1:8b
  • ollama run deepseek-coder:6.7b
  • ollama run llama3.2:3b
  • ollama run llama3.1:8b
  • ollama run codellama:7b
  • ollama run mistral:7b
  • ollama run gemma:7b
  • ollama run codegemma:7b
  • ollama run qwen2.5:3b
  • ollama run qwen2.5:7b
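To reproduce the eval-rate column yourself, `ollama run <model> --verbose` prints the eval rate after each response. Alternatively, the same counters can be pulled from Ollama's local HTTP API; the sketch below is a minimal example assuming a default Ollama install listening on localhost:11434 and the Python `requests` package (the prompt text is arbitrary).

```python
# Minimal sketch: query a local Ollama instance and compute tokens/s from the
# eval_count and eval_duration fields returned by /api/generate.
import requests

def eval_rate(model: str, prompt: str = "Explain GPU inference in two sentences.") -> float:
    """Return generation speed in tokens/s as reported by Ollama itself."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    for model in ["llama3.2:3b", "qwen2.5:3b", "mistral:7b"]:
        print(f"{model:>12}: {eval_rate(model):6.2f} tokens/s")
```

A single short prompt gives a noisy estimate; averaging over several prompts of realistic length gets closer to the figures in the table.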

Key Findings from the Benchmark

1️⃣. RTX2060 performs well with 3B parameter models

Llama3.2 (3B) delivered the fastest inference speed (50.41 tokens/s), making it the best choice for RTX2060 small LLM inference. Qwen2.5 (3B) also ran well at 36.02 tokens/s, though slightly slower than Llama3.2.

2️⃣. RTX2060 struggles with most 7B+ models

Models like DeepSeek-R1 (7B/8B), Llama 3.1 (8B), and Qwen2.5 (7B) dropped to 7-9 tokens/s with near 80% VRAM usage and a large share of the work offloaded to the CPU (40-53% CPU rate), while Gemma and CodeGemma (7B) fell below 4 tokens/s. Mistral 7B was the exception: its 4.1GB download still fits in 6GB of VRAM, and it reached 48.57 tokens/s. For the rest, while technically runnable, the performance is too slow for real-time applications.

3️⃣. Efficient resource utilization with 3B models

GPU utilization stayed below 60% for the ≤3B models (39-56%), and because these models fit entirely in VRAM, CPU and RAM usage remained low (about 7% CPU, 5% RAM).
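The CPU, RAM, and GPU figures above were read from the server's monitoring tools during each run. If you want to log the GPU side yourself while a model is generating, it can be polled through NVML; below is a minimal sketch assuming the pynvml bindings (e.g. installed via `pip install nvidia-ml-py`) and a single GPU at index 0.

```python
# Poll GPU utilization and VRAM usage once per second via NVML (Ctrl+C to stop).
# Assumes a single GPU at index 0 and the nvidia-ml-py (pynvml) package.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # .used/.total in bytes
        print(f"GPU util {util.gpu:3d}% | VRAM {mem.used / 2**30:.2f} / {mem.total / 2**30:.2f} GiB")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```

Running this alongside `ollama run` makes it easy to see whether a model is saturating VRAM and spilling work onto the CPU.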

Get Started with RTX2060 Hosting for Small LLMs

For those deploying LLMs on Ollama, choosing the right NVIDIA RTX2060 hosting solution can significantly impact performance and costs. If you're working with 0.5B-3B models, the RTX2060 is a solid choice for AI inference at an affordable price.
Flash Sale ends Mar. 16

Professional GPU Dedicated Server - RTX 2060

$109.00/mo
45% OFF Recurring (Was $199.00)
  • 128GB RAM
  • Dual 10-Core E5-2660v2
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia GeForce RTX 2060
  • Microarchitecture: Turing
  • CUDA Cores: 1920
  • Tensor Cores: 240
  • GPU Memory: 6GB GDDR6
  • FP32 Performance: 6.5 TFLOPS
  • Powerful for Gaming, OBS Streaming, Video Editing, Android Emulators, 3D Rendering, etc
Flash Sale ends Mar. 16

Basic GPU Dedicated Server - RTX 4060

$106.00/mo
40% OFF Recurring (Was $179.00)
  • 64GB RAM
  • Eight-Core E5-2690
  • 120GB SSD + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia GeForce RTX 4060
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 3072
  • Tensor Cores: 96
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 15.11 TFLOPS
  • Ideal for video editing, rendering, Android emulators, gaming, and light AI tasks.

Advanced GPU Dedicated Server - RTX 3060 Ti

$179.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 3060 Ti
  • Microarchitecture: Ampere
  • CUDA Cores: 4864
  • Tensor Cores: 152
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 16.2 TFLOPS
Flash Sale ends Mar. 16

Professional GPU VPS - A4000

$102.00/mo
43% OFF Recurring (Was $179.00)
  • 32GB RAM
  • 24 CPU Cores
  • 320GB SSD
  • 300Mbps Unmetered Bandwidth
  • Once per 2 Weeks Backup
  • OS: Linux / Windows 10
  • Dedicated GPU: Quadro RTX A4000
  • CUDA Cores: 6,144
  • Tensor Cores: 192
  • GPU Memory: 16GB GDDR6
  • FP32 Performance: 19.2 TFLOPS
  • Available for Rendering, AI/Deep Learning, Data Science, CAD/CGI/DCC.

Conclusion: RTX2060 is Best for 3B Models

If you're looking for a budget-friendly LLM server using Ollama on an RTX2060, your best bet is 3B parameter models like Llama3.2 and Qwen2.5.

Final Recommendations

  • For fast inference → Llama3.2 (3B)
  • For alternative choices → Qwen2.5 (3B)
  • Avoid most 7B and larger models due to low speed and high VRAM use

This RTX2060 Ollama benchmark shows that while Nvidia RTX2060 hosting is viable for small LLM inference, it is generally not suitable for models larger than 3B parameters. If you require solid 7B+ performance, consider a higher-end GPU such as an RTX 3060 Ti or A4000 server.

Tags:

RTX2060 Ollama benchmark, RTX2060 AI inference, best LLM for RTX2060, Nvidia RTX2060 hosting, Llama 3 RTX2060, Qwen RTX2060, Mistral AI benchmark, DeepSeek AI, small LLM inference, budget AI GPU