Introduction
Large Language Models (LLMs) require substantial GPU memory and compute for efficient inference and fine-tuning. If you're running models on the Ollama platform, selecting the right NVIDIA GPU is crucial for performance and cost-effectiveness. This guide explores the relationship between model sizes and GPU memory requirements and recommends the best NVIDIA GPUs for different workloads.
Understanding Model Size and VRAM Requirements
The size of an LLM is typically measured in parameters, which can range from hundreds of millions to hundreds of billions. The VRAM (Video Random Access Memory) required to run these models efficiently depends on the model's size and the precision of the computations (e.g., FP32, FP16, or INT8).
The Ollama model library is designed to make deploying and running large language models (LLMs) efficient, especially on consumer-grade hardware. While not every model in the library is strictly 4-bit quantized, many are distributed with quantization (often 4-bit) to reduce their memory footprint and computational requirements.
General Rule of Thumb:
Tiny Models (100M - 2B parameters): These models can often run on consumer-grade GPUs with 2-4GB of VRAM.
Small Models (2B - 10B parameters): These models can often run on consumer-grade GPUs with 6-16GB of VRAM.
Medium Models (10B - 20B parameters): These models typically require 16-24GB of VRAM.
Large Models (20B - 70B parameters): These models need high-end GPUs with 24-48GB of VRAM.
Very Large Models (70B - 110B parameters): These models need high-end GPUs with 80GB+ of VRAM.
Super Large Models (110B+ parameters): These models often require multiple high-end GPUs with 80GB+ of VRAM each.
Note: To run LLMs efficiently, plan for GPU memory somewhat higher than the model file size (roughly 1.2x), because additional memory is needed for intermediate calculation results such as activations and the KV cache, the optimizer state (if training), and input data. A rough way to estimate this is sketched below.
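As a quick illustration of the rule of thumb above (not Ollama's internal accounting), the weight footprint can be approximated as parameters × bits-per-parameter ÷ 8, then multiplied by an overhead factor for activations and runtime buffers. The helper name and the 1.2x factor below are illustrative assumptions:

```python
# Rough VRAM estimator for quantized LLM inference.
# Assumptions (illustrative, not official Ollama figures):
#   - weights dominate memory: params * bits_per_param / 8 bytes
#   - ~1.2x overhead for activations, KV cache, and runtime buffers

def estimate_vram_gb(params_billion: float, bits_per_param: int = 4,
                     overhead: float = 1.2) -> float:
    """Return an approximate VRAM requirement in GB."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

if __name__ == "__main__":
    for params, bits in [(7, 4), (7, 16), (70, 4)]:
        print(f"{params}B @ {bits}-bit ~= {estimate_vram_gb(params, bits):.1f} GB")
```

For example, a 7B model at 4-bit works out to roughly 4 GB and a 70B model to roughly 42 GB, which lines up with the sizes listed in the table below.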
Popular LLMs and Their GPU Recommendations
Model Name | Params | Model Size | Recommended GPU Cards
---|---|---|---
DeepSeek R1 | 1.5B | 1.1GB | K620 2GB or higher |
DeepSeek R1 | 7B | 4.7GB | GTX 1660 6GB or higher |
DeepSeek R1 | 8B | 4.9GB | GTX 1660 6GB or higher |
DeepSeek R1 | 14B | 9.0GB | RTX A4000 16GB or higher |
DeepSeek R1 | 32B | 20GB | RTX 4090, RTX A5000 24GB, A100 40GB |
DeepSeek R1 | 70B | 43GB | RTX A6000, A40 48GB, 2xRTX 4090 |
DeepSeek R1 | 671B | 404GB | Contact us, or leave a message below |
Deepseek-coder-v2 | 16B | 8.9GB | RTX A4000 16GB or higher |
Deepseek-coder-v2 | 236B | 133GB | 2xA100 80GB, 4xA6000 48GB |
Deepseek-coder | 33B | 19GB | RTX 4090 24GB, RTX A5000 24GB |
Deepseek-coder | 6.7B | 3.8GB | GTX 1660 6GB or higher |
Qwen2.5 | 0.5B | 398MB | K620 2GB |
Qwen2.5 | 1.5B | 986MB | K620 2GB |
Qwen2.5 | 3B | 1.9GB | Quadro P1000 4GB or higher |
Qwen2.5 | 7B | 4.7GB | GTX 1660 6GB or higher |
Qwen2.5 | 14B | 9GB | RTX A4000 16GB or higher |
Qwen2.5 | 32B | 20GB | RTX 4090 24GB, RTX A5000 24GB |
Qwen2.5 | 72B | 47GB | 3xRTX A5000, A100 80GB, H100 |
Qwen 2.5 Coder | 7B | 4.7GB | GTX 1660 6GB or higher |
Qwen 2.5 Coder | 14B | 9.0GB | RTX A4000 16GB or higher |
Qwen 2.5 Coder | 32B | 20GB | RTX 4090 24GB, RTX A5000 24GB or higher |
Qwen 2 | 72B | 41GB | RTX A6000 48GB, A40 48GB or higher |
Qwen 2 | 7B | 4.4GB | GTX 1660 6GB or higher |
Qwen 1.5 | 7B | 4.5GB | GTX 1660 6GB or higher |
Qwen 1.5 | 14B | 8.2GB | RTX A4000 16GB or higher |
Qwen 1.5 | 32B | 18GB | RTX 4090 24GB, A5000 24GB |
Qwen 1.5 | 72B | 41GB | RTX A6000 48GB, A40 48GB |
Qwen 1.5 | 110B | 63GB | A100 80GB, H100 |
Gemma 2 | 2B | 1.6GB | Quadro P1000 4GB or higher |
Gemma 2 | 9B | 5.4GB | RTX 3060 Ti 8GB or higher |
Gemma 2 | 27B | 16GB | RTX 4090, A5000 or higher |
Phi-4 | 14B | 9.1GB | RTX A4000 16GB or higher |
Phi-3 | 3.8B | 2.2GB | Quadro P1000 4GB or higher |
Phi-3 | 14B | 7.9GB | RTX A4000 16GB or higher |
Llama 3.3 | 70B | 43GB | A6000 48GB, A40 48GB, or higher |
Llama 3.2 | 3B | 2GB | Quadro P1000 4GB or higher |
Llama 3.1 | 8B | 4.9GB | GTX 1660 6GB or higher |
Llama 3.1 | 70B | 43GB | A6000 48GB, A40 48GB, or higher |
Llama 3.1 | 405B | 243GB | 4xA100 80GB, or higher |
Llama 3 | 8B | 4.7GB | GTX 1660 6GB or higher |
Llama 3 | 70B | 40GB | A6000 48GB, A40 48GB, or higher |
Mistral | 7B | 4.1GB | GTX 1660 6GB or higher |
Mixtral | 8x7B | 26GB | A6000 48GB, A40 48GB, or higher |
Mixtral | 8x22B | 80GB | 2xA6000, 2xA100 80GB, or higher |
LLaVA | 7B | 4.7GB | GTX 1660 6GB or higher |
LLaVA | 13B | 8.0GB | RTX A4000 16GB or higher |
LLaVA | 34B | 20GB | RTX 4090 24GB, A5000 24GB |
Code Llama | 7B | 3.8GB | GTX 1660 6GB or higher |
Code Llama | 13B | 7.4GB | RTX A4000 16GB or higher |
Code Llama | 34B | 19GB | RTX 4090 24GB, A5000 24GB |
Code Llama | 70B | 39GB | A6000 48GB, A40 48GB, or higher |
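To sanity-check your own setup against the table, you can compare the sizes of the models you have pulled locally with your GPU's total VRAM. The sketch below assumes a default local Ollama install (REST API at http://localhost:11434, whose /api/tags endpoint lists pulled models and their sizes in bytes) and that nvidia-smi is on the PATH; the 1.2x headroom factor is the same illustrative assumption used in the note above.

```python
# Compare locally pulled Ollama model sizes against available GPU VRAM.
# Assumes a default Ollama install (http://localhost:11434) and nvidia-smi on PATH.
import json
import subprocess
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # lists pulled models
HEADROOM = 1.2  # illustrative overhead factor from the note above

def gpu_vram_gb() -> float:
    """Total memory of GPU 0 in GB, read from nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"], text=True)
    return float(out.splitlines()[0]) / 1024  # MiB -> GiB (approx. GB)

def pulled_models() -> list[dict]:
    """Models known to the local Ollama server, with sizes in bytes."""
    with urllib.request.urlopen(OLLAMA_TAGS_URL) as resp:
        return json.load(resp)["models"]

if __name__ == "__main__":
    vram = gpu_vram_gb()
    print(f"GPU 0 VRAM: {vram:.1f} GB")
    for m in pulled_models():
        need = m["size"] / 1e9 * HEADROOM
        verdict = "fits" if need <= vram else "needs a larger or second GPU"
        print(f'{m["name"]:<30} {m["size"] / 1e9:5.1f} GB -> {verdict}')
```

If a model does not fit, the table above suggests the next card up, or a multi-GPU configuration for the largest models.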
Conclusion
Choosing the right GPU for LLMs on Ollama depends on your model size, VRAM requirements, and budget. Workstation and consumer GPUs like the RTX A4000 and RTX 4090 are powerful and cost-effective for small to mid-sized models, while data-center GPUs like the A100 and H100 offer unmatched performance for massive models. Ensure your GPU choice aligns with your specific use case to optimize efficiency and cost.
GPU Server Recommendation
Professional GPU VPS - A4000
- 32GB RAM
- 24 CPU Cores
- 320GB SSD
- 300Mbps Unmetered Bandwidth
- Once per 2 Weeks Backup
- OS: Linux / Windows 10/ Windows 11
- Dedicated GPU: Nvidia RTX A4000
- CUDA Cores: 6,144
- Tensor Cores: 192
- GPU Memory: 16GB GDDR6
- FP32 Performance: 19.2 TFLOPS
- Available for Rendering, AI/Deep Learning, Data Science, CAD/CGI/DCC.
Advanced GPU Dedicated Server - A5000
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia RTX A5000
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Enterprise GPU Dedicated Server - RTX 4090
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: GeForce RTX 4090
- Microarchitecture: Ada Lovelace
- CUDA Cores: 16,384
- Tensor Cores: 512
- GPU Memory: 24 GB GDDR6X
- FP32 Performance: 82.6 TFLOPS
- Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.
Enterprise GPU Dedicated Server - RTX A6000
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
- Optimized for running AI, deep learning, data visualization, HPC, and more.
Multi-GPU Dedicated Server - 2xRTX 5090
- 256GB RAM
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 5090
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 104.8 TFLOPS
Enterprise GPU Dedicated Server - A100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation for large-scale inference, AI training, ML, etc.
Enterprise GPU Dedicated Server - A100 (80GB)
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 80GB HBM2e
- FP32 Performance: 19.5 TFLOPS
Multi-GPU Dedicated Server - 4xA100
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 4 x Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
If you can't find a suitable GPU plan, need a customized GPU server, or have ideas for cooperation, please leave us a message. We will get back to you within 36 hours.