Large Language Models (LLMs) require substantial GPU power for efficient inference and fine-tuning. If you're running models on the Ollama platform, selecting the right NVIDIA GPU is crucial for performance and cost-effectiveness. This guide explores the relationship between model sizes and GPU memory requirements and recommends the best NVIDIA GPUs for different workloads.
The size of an LLM is typically measured in parameters, which can range from hundreds of millions to hundreds of billions. The VRAM (Video Random Access Memory) required to run these models efficiently depends on the model's size and the precision of the computations (e.g., FP32, FP16, or INT8).
Ollama is designed to make deploying and running large language models efficient, especially on consumer-grade hardware. While not every model in the Ollama library is strictly 4-bit quantized, most are distributed with quantization techniques, including 4-bit quantization, to reduce their memory footprint and computational requirements.
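As a rough illustration of how precision drives memory use (a back-of-the-envelope sketch, not an Ollama measurement), the raw weight footprint is approximately the parameter count times the bytes per parameter, so a 7B model shrinks from about 14GB at FP16 to about 3.5GB at 4-bit. The ~4.7GB download listed for 7B models below is a little larger because quantized files keep some tensors at higher precision and include metadata.

```python
# Back-of-the-envelope weight footprint per precision.
# This estimates only the stored weights; it ignores the KV cache,
# activations, and runtime overhead (see the ~1.2x rule of thumb below).

BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "Q4 (4-bit)": 4}

def weight_size_gb(num_params: float, bits: int) -> float:
    """Approximate size of the model weights in gigabytes."""
    return num_params * bits / 8 / 1e9

if __name__ == "__main__":
    params = 7e9  # e.g., a 7B-parameter model
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name:>10}: ~{weight_size_gb(params, bits):.1f} GB")
    # FP32 ~28.0 GB, FP16 ~14.0 GB, INT8 ~7.0 GB, Q4 ~3.5 GB
```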
General Rule of Thumb:

- Tiny Models (100M - 2B parameters): These models can often run on consumer-grade GPUs with 2-4GB of VRAM.
- Small Models (2B - 10B parameters): These models can often run on consumer-grade GPUs with 6-16GB of VRAM.
- Medium Models (10B - 20B parameters): These models typically require 16-24GB of VRAM.
- Large Models (20B - 70B parameters): These models need high-end GPUs with 24-48GB of VRAM.
- Very Large Models (70B - 110B parameters): These models need high-end GPUs with 80GB+ of VRAM.
- Super Large Models (110B+ parameters): These models often require multiple high-end GPUs with 80GB+ of VRAM each.
Note: To run LLMs efficiently, plan for somewhat more GPU memory than the model file size (roughly 1.2x), because additional memory is needed for intermediate activations, the KV cache for the context window, optimizer state (if training), and input data.
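To make that rule of thumb concrete, here is a minimal sketch. The 1.2x factor is the rough estimate from the note above, and the card list is just a small sample drawn from the table below; actual usage also varies with context length, batch size, and runtime.

```python
# Minimal sketch of the ~1.2x rule of thumb from the note above.
# Real VRAM usage also depends on context length, batch size, and runtime.

OVERHEAD = 1.2  # extra room for activations, KV cache, and input data

# A few example cards from the table below (VRAM in GB).
GPUS = {"GTX 1660": 6, "RTX A4000": 16, "RTX 4090": 24, "RTX A6000": 48, "A100 80GB": 80}

def required_vram_gb(model_size_gb: float, overhead: float = OVERHEAD) -> float:
    """VRAM to budget for a model of the given on-disk size."""
    return model_size_gb * overhead

def cards_that_fit(model_size_gb: float) -> list[str]:
    """Return the example cards with enough VRAM for the model."""
    need = required_vram_gb(model_size_gb)
    return [name for name, vram in GPUS.items() if vram >= need]

if __name__ == "__main__":
    for size in (4.7, 9.0, 20.0, 63.0):
        print(f"{size:5.1f} GB model -> budget ~{required_vram_gb(size):.1f} GB "
              f"-> fits on: {', '.join(cards_that_fit(size)) or 'multi-GPU needed'}")
```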
Model Name | Params | Model Size | Recommended GPU Cards
---|---|---|---
DeepSeek R1 | 1.5B | 1.1GB | K620 2GB or higher |
DeepSeek R1 | 7B | 4.7GB | GTX 1660 6GB or higher |
DeepSeek R1 | 8B | 4.9GB | GTX 1660 6GB or higher |
DeepSeek R1 | 14B | 9.0GB | RTX A4000 16GB or higher |
DeepSeek R1 | 32B | 20GB | RTX 4090, RTX A5000 24GB, A100 40GB |
DeepSeek R1 | 70B | 43GB | RTX A6000, A40 48GB, 2xRTX 4090 |
DeepSeek R1 | 671B | 404GB | Multi-GPU setup required; contact us or leave a message below
Deepseek-coder-v2 | 16B | 8.9GB | RTX A4000 16GB or higher |
Deepseek-coder-v2 | 236B | 133GB | 2xA100 80GB, 4xA6000 48GB |
Deepseek-coder | 33B | 19GB | RTX 4090 24GB, RTX A5000 24GB |
Deepseek-coder | 6.7B | 3.8GB | GTX 1660 6GB or higher |
Qwen2.5 | 0.5B | 398MB | K620 2GB |
Qwen2.5 | 1.5B | 986MB | K620 2GB |
Qwen2.5 | 3B | 1.9GB | Quadro P1000 4GB or higher |
Qwen2.5 | 7B | 4.7GB | GTX 1660 6GB or higher |
Qwen2.5 | 14B | 9GB | RTX A4000 16GB or higher |
Qwen2.5 | 32B | 20GB | RTX 4090 24GB, RTX A5000 24GB |
Qwen2.5 | 72B | 47GB | 3xRTX A5000, A100 80GB, H100 |
Qwen 2.5 Coder | 7B | 4.7GB | GTX 1660 6GB or higher |
Qwen 2.5 Coder | 14B | 9.0GB | RTX A4000 16GB or higher |
Qwen 2.5 Coder | 32B | 20GB | RTX 4090 24GB, RTX A5000 24GB or higher |
Qwen 2 | 72B | 41GB | RTX A6000 48GB, A40 48GB or higher |
Qwen 2 | 7B | 4.4GB | GTX 1660 6GB or higher |
Qwen 1.5 | 7B | 4.5GB | GTX 1660 6GB or higher |
Qwen 1.5 | 14B | 8.2GB | RTX A4000 16GB or higher |
Qwen 1.5 | 32B | 18GB | RTX 4090 24GB, A5000 24GB |
Qwen 1.5 | 72B | 41GB | RTX A6000 48GB, A40 48GB |
Qwen 1.5 | 110B | 63GB | A100 80GB, H100 |
Gemma 2 | 2B | 1.6GB | Quadro P1000 4GB or higher |
Gemma 2 | 9B | 5.4GB | RTX 3060 Ti 8GB or higher |
Gemma 2 | 27B | 16GB | RTX 4090, A5000 or higher |
Phi-4 | 14B | 9.1GB | RTX A4000 16GB or higher |
Phi-3 | 3.8B | 2.2GB | Quadro P1000 4GB or higher |
Phi-3 | 14B | 7.9GB | RTX A4000 16GB or higher |
Llama 3.3 | 70B | 43GB | A6000 48GB, A40 48GB, or higher |
Llama 3.2 | 3B | 2GB | Quadro P1000 4GB or higher |
Llama 3.1 | 8B | 4.9GB | GTX 1660 6GB or higher |
Llama 3.1 | 70B | 43GB | A6000 48GB, A40 48GB, or higher |
Llama 3.1 | 405B | 243GB | 4xA100 80GB, or higher |
Llama 3 | 8B | 4.7GB | GTX 1660 6GB or higher |
Llama 3 | 70B | 40GB | A6000 48GB, A40 48GB, or higher |
Mistral | 7B | 4.1GB | GTX 1660 6GB or higher |
Mixtral | 8x7B | 26GB | A6000 48GB, A40 48GB, or higher |
Mixtral | 8x22B | 80GB | 2xA6000, 2xA100 80GB, or higher |
LLaVA | 7B | 4.7GB | GTX 1660 6GB or higher |
LLaVA | 13B | 8.0GB | RTX A4000 16GB or higher |
LLaVA | 34B | 20GB | RTX 4090 24GB, A5000 24GB |
Code Llama | 7B | 3.8GB | GTX 1660 6GB or higher |
Code Llama | 13B | 7.4GB | RTX A4000 16GB or higher |
Code Llama | 34B | 19GB | RTX 4090 24GB, A5000 24GB |
Code Llama | 70B | 39GB | A6000 48GB, A40 48GB, or higher |
Choosing the right GPU for LLMs on Ollama depends on your model size, VRAM requirements, and budget. Workstation and consumer cards like the RTX A4000 and RTX 4090 are powerful and cost-effective for small to mid-sized models, while data-center GPUs like the A100 and H100 deliver the capacity and performance needed for massive models. Ensure your GPU choice aligns with your specific use case to optimize efficiency and cost.
- Professional GPU VPS - A4000
- Advanced GPU Dedicated Server - A5000
- Enterprise GPU Dedicated Server - RTX 4090
- Enterprise GPU Dedicated Server - RTX A6000
- Multi-GPU Dedicated Server - 4xRTX 5090
- Enterprise GPU Dedicated Server - A100
- Enterprise GPU Dedicated Server - A100 (80GB)
- Multi-GPU Dedicated Server - 4xA100
If you can't find a suitable GPU plan, need a customized GPU server, or have ideas for cooperation, please leave us a message. We will get back to you within 36 hours.