Large Language Models (LLMs) require substantial GPU power for efficient inference and fine-tuning. If you're running models on the Ollama platform, selecting the right NVIDIA GPU is crucial for performance and cost-effectiveness. This guide explores the relationship between model sizes and GPU memory requirements and recommends the best NVIDIA GPUs for different workloads.
The size of an LLM is typically measured in parameters, which can range from hundreds of millions to hundreds of billions. The VRAM (Video Random Access Memory) required to run these models efficiently depends on the model's size and the precision of the computations (e.g., FP32, FP16, or INT8).
Ollama is designed to make deploying and running large language models efficient, especially on consumer-grade hardware. While not every model in the Ollama library is strictly 4-bit quantized, most are distributed with quantization techniques, including 4-bit quantization, to reduce their memory footprint and computational requirements.
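As a rough illustration of how precision drives memory use (a back-of-the-envelope sketch, not an Ollama measurement), the raw weight footprint is approximately the parameter count times the bytes per parameter, so a 7B model shrinks from about 14GB at FP16 to about 3.5GB at 4-bit. The ~4.7GB download listed for 7B models below is a little larger because quantized files keep some tensors at higher precision and include metadata.

```python
# Back-of-the-envelope weight footprint per precision.
# This estimates only the stored weights; it ignores the KV cache,
# activations, and runtime overhead (see the ~1.2x rule of thumb below).

BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "Q4 (4-bit)": 4}

def weight_size_gb(num_params: float, bits: int) -> float:
    """Approximate size of the model weights in gigabytes."""
    return num_params * bits / 8 / 1e9

if __name__ == "__main__":
    params = 7e9  # e.g., a 7B-parameter model
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name:>10}: ~{weight_size_gb(params, bits):.1f} GB")
    # FP32 ~28.0 GB, FP16 ~14.0 GB, INT8 ~7.0 GB, Q4 ~3.5 GB
```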
General Rule of Thumb:

- Tiny Models (100M - 2B parameters): These models can often run on consumer-grade GPUs with 2-4GB of VRAM.
- Small Models (2B - 10B parameters): These models can often run on consumer-grade GPUs with 6-16GB of VRAM.
- Medium Models (10B - 20B parameters): These models typically require 16-24GB of VRAM.
- Large Models (20B - 70B parameters): These models need high-end GPUs with 24-48GB of VRAM.
- Very Large Models (70B - 110B parameters): These models need high-end GPUs with 80GB+ of VRAM.
- Super Large Models (110B+ parameters): These models often require multiple high-end GPUs with 80GB+ of VRAM each.
Note: To run LLMs efficiently, plan for somewhat more GPU memory than the model file size (roughly 1.2x), because additional memory is needed for intermediate activations, the KV cache for the context window, optimizer state (if training), and input data.
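To make that rule of thumb concrete, here is a minimal sketch. The 1.2x factor is the rough estimate from the note above, and the card list is just a small sample drawn from the table below; actual usage also varies with context length, batch size, and runtime.

```python
# Minimal sketch of the ~1.2x rule of thumb from the note above.
# Real VRAM usage also depends on context length, batch size, and runtime.

OVERHEAD = 1.2  # extra room for activations, KV cache, and input data

# A few example cards from the table below (VRAM in GB).
GPUS = {"GTX 1660": 6, "RTX A4000": 16, "RTX 4090": 24, "RTX A6000": 48, "A100 80GB": 80}

def required_vram_gb(model_size_gb: float, overhead: float = OVERHEAD) -> float:
    """VRAM to budget for a model of the given on-disk size."""
    return model_size_gb * overhead

def cards_that_fit(model_size_gb: float) -> list[str]:
    """Return the example cards with enough VRAM for the model."""
    need = required_vram_gb(model_size_gb)
    return [name for name, vram in GPUS.items() if vram >= need]

if __name__ == "__main__":
    for size in (4.7, 9.0, 20.0, 63.0):
        print(f"{size:5.1f} GB model -> budget ~{required_vram_gb(size):.1f} GB "
              f"-> fits on: {', '.join(cards_that_fit(size)) or 'multi-GPU needed'}")
```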
Model Name | Params | Model Size | Recommended GPU Cards
---|---|---|---
DeepSeek R1 | 1.5B | 1.1GB | K620 2GB or higher |
DeepSeek R1 | 7B | 4.7GB | GTX 1660 6GB or higher |
DeepSeek R1 | 8B | 4.9GB | GTX 1660 6GB or higher |
DeepSeek R1 | 14B | 9.0GB | RTX A4000 16GB or higher |
DeepSeek R1 | 32B | 20GB | RTX 4090, RTX A5000 24GB, A100 40GB |
DeepSeek R1 | 70B | 43GB | RTX A6000, A40 48GB, 2xRTX 4090 |
DeepSeek R1 | 671B | 404GB | Multi-GPU setup required; contact us or leave a message below
Deepseek-coder-v2 | 16B | 8.9GB | RTX A4000 16GB or higher |
Deepseek-coder-v2 | 236B | 133GB | 2xA100 80GB, 4xA6000 48GB |
Deepseek-coder | 33B | 19GB | RTX 4090 24GB, RTX A5000 24GB |
Deepseek-coder | 6.7B | 3.8GB | GTX 1660 6GB or higher |
Qwen2.5 | 0.5B | 398MB | K620 2GB |
Qwen2.5 | 1.5B | 986MB | K620 2GB |
Qwen2.5 | 3B | 1.9GB | Quadro P1000 4GB or higher |
Qwen2.5 | 7B | 4.7GB | GTX 1660 6GB or higher |
Qwen2.5 | 14B | 9GB | RTX A4000 16GB or higher |
Qwen2.5 | 32B | 20GB | RTX 4090 24GB, RTX A5000 24GB |
Qwen2.5 | 72B | 47GB | 3xRTX A5000, A100 80GB, H100 |
Qwen 2.5 Coder | 7B | 4.7GB | GTX 1660 6GB or higher |
Qwen 2.5 Coder | 14B | 9.0GB | RTX A4000 16GB or higher |
Qwen 2.5 Coder | 32B | 20GB | RTX 4090 24GB, RTX A5000 24GB or higher |
Qwen 2 | 72B | 41GB | RTX A6000 48GB, A40 48GB or higher |
Qwen 2 | 7B | 4.4GB | GTX 1660 6GB or higher |
Qwen 1.5 | 7B | 4.5GB | GTX 1660 6GB or higher |
Qwen 1.5 | 14B | 8.2GB | RTX A4000 16GB or higher |
Qwen 1.5 | 32B | 18GB | RTX 4090 24GB, A5000 24GB |
Qwen 1.5 | 72B | 41GB | RTX A6000 48GB, A40 48GB |
Qwen 1.5 | 110B | 63GB | A100 80GB, H100 |
Gemma 2 | 2B | 1.6GB | Quadro P1000 4GB or higher |
Gemma 2 | 9B | 5.4GB | RTX 3060 Ti 8GB or higher |
Gemma 2 | 27B | 16GB | RTX 4090, A5000 or higher |
Phi-4 | 14B | 9.1GB | RTX A4000 16GB or higher |
Phi-3 | 3.8B | 2.2GB | Quadro P1000 4GB or higher |
Phi-3 | 14B | 7.9GB | RTX A4000 16GB or higher |
Llama 3.3 | 70B | 43GB | A6000 48GB, A40 48GB, or higher |
Llama 3.2 | 3B | 2GB | Quadro P1000 4GB or higher |
Llama 3.1 | 8B | 4.9GB | GTX 1660 6GB or higher |
Llama 3.1 | 70B | 43GB | A6000 48GB, A40 48GB, or higher |
Llama 3.1 | 405B | 243GB | 4xA100 80GB, or higher |
Llama 3 | 8B | 4.7GB | GTX 1660 6GB or higher |
Llama 3 | 70B | 40GB | A6000 48GB, A40 48GB, or higher |
Mistral | 7B | 4.1GB | GTX 1660 6GB or higher |
Mixtral | 8x7B | 26GB | A6000 48GB, A40 48GB, or higher |
Mixtral | 8x22B | 80GB | 2xA6000, 2xA100 80GB, or higher |
LLaVA | 7B | 4.7GB | GTX 1660 6GB or higher |
LLaVA | 13B | 8.0GB | RTX A4000 16GB or higher |
LLaVA | 34B | 20GB | RTX 4090 24GB, A5000 24GB |
Code Llama | 7B | 3.8GB | GTX 1660 6GB or higher |
Code Llama | 13B | 7.4GB | RTX A4000 16GB or higher |
Code Llama | 34B | 19GB | RTX 4090 24GB, A5000 24GB |
Code Llama | 70B | 39GB | A6000 48GB, A40 48GB, or higher |
Choosing the right GPU for LLMs on Ollama depends on your model size, VRAM requirements, and budget. Workstation and consumer cards like the RTX A4000 and RTX 4090 are powerful and cost-effective for small to mid-sized models, while data-center GPUs like the A100 and H100 deliver the capacity and performance needed for massive models. Ensure your GPU choice aligns with your specific use case to optimize efficiency and cost.
- Professional GPU VPS - A4000
- Advanced GPU Dedicated Server - A5000
- Enterprise GPU Dedicated Server - RTX 4090
- Enterprise GPU Dedicated Server - RTX A6000
- Multi-GPU Dedicated Server - 4xRTX 5090
- Enterprise GPU Dedicated Server - A100
- Enterprise GPU Dedicated Server - A100 (80GB)
- Multi-GPU Dedicated Server - 4xA100
If you can't find a suitable GPU plan, need a customized GPU server, or have ideas for cooperation, please leave us a message. We will get back to you within 36 hours.