This setup lets us evaluate the RTX 2060 for small LLM inference, focusing on models up to 3B parameters due to the 6GB VRAM limit.
| Models | deepseek-r1 | deepseek-r1 | deepseek-r1 | deepseek-coder | llama3.2 | llama3.1 | codellama | mistral | gemma | codegemma | qwen2.5 | qwen2.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Parameters | 1.5b | 7b | 8b | 6.7b | 3b | 8b | 7b | 7b | 7b | 7b | 3b | 7b |
| Size (GB) | 1.1 | 4.7 | 4.9 | 3.8 | 2.0 | 4.9 | 3.8 | 4.1 | 5.0 | 5.0 | 1.9 | 4.7 |
| Quantization | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit |
| Running on | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 |
| Download Speed (MB/s) | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| CPU Utilization | 7% | 46% | 46% | 42% | 7% | 51% | 41% | 7% | 51% | 53% | 7% | 45% |
| RAM Utilization | 5% | 6% | 6% | 5% | 5% | 6% | 5% | 5% | 7% | 7% | 5% | 6% |
| GPU Utilization | 39% | 35% | 32% | 35% | 56% | 31% | 35% | 21% | 12% | 11% | 43% | 36% |
| Eval Rate (tokens/s) | 43.12 | 8.84 | 7.52 | 13.62 | 50.41 | 7.39 | 13.21 | 48.57 | 3.70 | 3.69 | 36.02 | 8.98 |
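The eval rates above are Ollama's own timing statistics. They can be reproduced programmatically by querying Ollama's REST API (default port 11434) and deriving tokens/s from the `eval_count` and `eval_duration` fields of the response. The sketch below assumes a local Ollama instance; the model tags and prompt are placeholders.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def measure_eval_rate(model: str, prompt: str) -> float:
    """Run one non-streaming generation and compute tokens/s from
    Ollama's eval_count / eval_duration response fields."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    for model in ["llama3.2:3b", "qwen2.5:3b"]:  # example tags from the table
        rate = measure_eval_rate(model, "Explain GPU VRAM in one paragraph.")
        print(f"{model}: {rate:.2f} tokens/s")
```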
Models such as DeepSeek-R1 (7B/8B), Llama 3.1 (8B), and Qwen2.5 (7B) ran at only 7-9 tokens/s with VRAM usage near 80%. While technically runnable, that is too slow for real-time applications. (Mistral 7B is the outlier: at only 4.1 GB it still fits in VRAM and reaches 48.57 tokens/s.)
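VRAM pressure like this is easy to confirm while a model is loaded. A minimal sketch that polls nvidia-smi's CSV query mode (assuming `nvidia-smi` is on PATH; `ollama ps` gives a similar CPU/GPU split view):

```python
import subprocess

def vram_usage() -> tuple[int, int]:
    """Return (used_MiB, total_MiB) for GPU 0 via nvidia-smi's CSV query mode."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    used, total = (int(x) for x in out.strip().splitlines()[0].split(","))
    return used, total

used, total = vram_usage()
print(f"VRAM: {used}/{total} MiB ({100 * used / total:.0f}%)")
```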
GPU server plans in this range include:

- Professional GPU Dedicated Server - RTX 2060
- Basic GPU Dedicated Server - RTX 4060
- Advanced GPU Dedicated Server - RTX 3060 Ti
- Professional GPU VPS - A4000
If you're looking for a budget-friendly LLM server using Ollama on an RTX 2060, your best bet is 3B-parameter models such as Llama 3.2 (3B) and Qwen2.5 (3B), which deliver 36-50 tokens/s while fitting comfortably in the 6GB of VRAM.
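Whether a given model reaches full speed comes down to whether its quantized weights plus KV cache fit inside the 6GB; once Ollama spills layers to system RAM, the eval rate collapses (compare Mistral 7B at 4.1 GB vs. Llama 3.1 8B at 4.9 GB in the table). A back-of-the-envelope check, where the fixed 1.5 GB allowance for KV cache and CUDA context is an assumed figure, not a measured one:

```python
def fits_in_vram(model_file_gb: float, vram_gb: float = 6.0,
                 overhead_gb: float = 1.5) -> bool:
    """Rough check: the quantized weights file must fit alongside the
    KV cache and CUDA context (lumped into a fixed overhead allowance)."""
    return model_file_gb + overhead_gb <= vram_gb

# Model sizes (GB) taken from the benchmark table above
for name, size in [("llama3.2:3b", 2.0), ("qwen2.5:3b", 1.9),
                   ("mistral:7b", 4.1), ("llama3.1:8b", 4.9)]:
    verdict = "fits" if fits_in_vram(size) else "likely spills to CPU"
    print(f"{name} ({size} GB): {verdict} in 6GB VRAM")
```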
This RTX 2060 Ollama benchmark shows that while Nvidia RTX 2060 hosting is viable for small LLM inference, it is not well suited to models larger than 3B parameters. If you need usable 7B+ performance, consider a higher-end GPU such as an RTX 3060 Ti or A4000 server.