This setup makes Nvidia RTX 4060 hosting a viable option for running LLM inference workloads efficiently while keeping costs in check.
| Models | deepseek-r1 | deepseek-r1 | deepseek-coder | llama2 | llama3.1 | codellama | mistral | gemma | gemma2 | codegemma | qwen2.5 | codeqwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Parameters | 7b | 8b | 6.7b | 13b | 8b | 7b | 7b | 7b | 9b | 7b | 7b | 7b |
| Size (GB) | 4.7 | 4.9 | 3.8 | 7.4 | 4.9 | 3.8 | 4.1 | 5.0 | 5.4 | 5.0 | 4.7 | 4.2 |
| Quantization | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit |
| Running on | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 | Ollama 0.5.11 |
| Download Speed (MB/s) | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| CPU Utilization | 7% | 7% | 8% | 37% | 8% | 7% | 8% | 32% | 30% | 32% | 7% | 9% |
| RAM Utilization | 8% | 8% | 8% | 12% | 8% | 7% | 8% | 9% | 10% | 9% | 8% | 8% |
| GPU Utilization | 83% | 81% | 91% | 25-42% | 88% | 92% | 90% | 42% | 44% | 46% | 72% | 93% |
| Eval Rate (tokens/s) | 42.60 | 41.27 | 52.93 | 8.03 | 41.70 | 52.69 | 50.91 | 22.39 | 18.0 | 23.18 | 42.23 | 43.12 |
Due to the 8 GB VRAM limit, LLaMA 2 (13B) performs poorly on the RTX 4060 server: GPU utilization stays at only 25-42% and throughput drops to about 8 tokens/s, indicating that the RTX 4060 is not suitable for inferencing models of 13B parameters and above.
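As a minimal sketch of how the eval rate column can be reproduced (assuming a local Ollama 0.5.x server listening on its default port 11434, and that the models have already been pulled), the snippet below calls Ollama's `/api/generate` endpoint and computes tokens per second from the `eval_count` and `eval_duration` fields in the response. The prompt and model list are illustrative only.

```python
import requests

# Default Ollama REST endpoint (assumes Ollama is running on the same host)
OLLAMA_URL = "http://localhost:11434/api/generate"


def benchmark(model: str, prompt: str = "Explain quantization in one paragraph.") -> float:
    """Run one non-streaming generation and return the eval rate in tokens/s."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / data["eval_duration"] * 1e9


if __name__ == "__main__":
    for model in ["deepseek-r1:7b", "llama3.1:8b", "mistral:7b"]:
        print(f"{model}: {benchmark(model):.2f} tokens/s")
```

The same numbers are printed by `ollama run <model> --verbose` as "eval rate"; the API approach simply makes it easier to loop over many models and log the results.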
Recommended GPU server plans for Ollama hosting:

- Basic GPU Dedicated Server - RTX 4060
- Professional GPU VPS - A4000
- Advanced GPU Dedicated Server - V100
- Enterprise GPU Dedicated Server - RTX 4090
The Nvidia RTX 4060 is a cost-effective choice for models of 9B parameters and below. Most 7B-8B models run smoothly, with GPU utilization of 70%-90% and inference speed stable above 40 tokens/s, making it an economical LLM inference solution.