Choosing the Right NVIDIA GPU for LLMs on the Ollama Platform

Explore the best NVIDIA GPUs for LLMs on the Ollama Platform. Our detailed insights will help you make an informed decision for superior performance.

Introduction

Large Language Models (LLMs) require substantial GPU power for efficient inference and fine-tuning. If you're running models on the Ollama platform, selecting the right NVIDIA GPU is crucial for performance and cost-effectiveness. This guide explores the relationship between model sizes and GPU memory requirements and recommends the best NVIDIA GPUs for different workloads.

Understanding Model Size and VRAM Requirements

The size of an LLM is typically measured in parameters, which can range from hundreds of millions to hundreds of billions. The VRAM (Video Random Access Memory) required to run these models efficiently depends on the model's size and the precision of the computations (e.g., FP32, FP16, or INT8).

The Ollama library is designed to deploy and run LLMs efficiently, especially on consumer-grade hardware. While not all models in the Ollama library are strictly 4-bit quantized, many are optimized with quantization techniques, including 4-bit quantization, to reduce their memory footprint and computational requirements.
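To make the precision comparison concrete, here is a minimal back-of-the-envelope sketch: weight size is roughly parameters × bytes per parameter. The byte counts are the standard ones for each precision; actual Ollama downloads (GGUF files such as Q4_K_M) mix precisions and carry metadata, so treat the results as rough lower bounds, not exact figures.

```python
# Back-of-the-envelope size of the model weights alone, by precision.
# Rough estimate only: real quantized files (e.g. GGUF Q4_K_M) mix
# precisions and include metadata, so actual downloads run larger.

BYTES_PER_PARAM = {
    "fp32": 4.0,  # 32-bit full precision
    "fp16": 2.0,  # 16-bit half precision
    "int8": 1.0,  # 8-bit quantization
    "q4": 0.5,    # 4-bit quantization (the common Ollama default)
}

def weight_size_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the raw weights in GB."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1024**3

for precision in BYTES_PER_PARAM:
    print(f"7B model @ {precision}: {weight_size_gb(7, precision):5.1f} GB")
# fp32 ~26 GB, fp16 ~13 GB, int8 ~6.5 GB, q4 ~3.3 GB
```

This is why a 7B model that would need roughly 26GB at FP32 appears in the table below at only 4-5GB once 4-bit quantized.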

General Rule of Thumb:

Tiny Models (100M - 2B parameters): These models can often run on consumer-grade GPUs with 2-4GB of VRAM.

Small Models (2B - 10B parameters): These models can often run on consumer-grade GPUs with 6-16GB of VRAM.

Medium Models (10B - 20B parameters): These models typically require 16-24GB of VRAM.

Large Models (20B - 70B parameters): These models need high-end GPUs with 24-48GB of VRAM.

Very Large Models (70B - 110B parameters): These models need high-end GPUs with 80GB+ of VRAM.

Super Large Models (110B+ parameters): These models often require multiple high-end GPUs with 80GB+ of VRAM each.


Note: To run LLMs efficiently, the GPU memory required is slightly higher than the model size (roughly 1.2x), because additional memory is needed to store intermediate calculation results, optimizer state (if training), and input data.
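As a quick illustration of that rule of thumb (the 1.2x factor is an approximation; real overhead grows with context length and batch size), a fit check might look like this:

```python
# Minimal sketch of the ~1.2x rule of thumb above: multiply the model's
# on-disk size by an overhead factor to approximate the VRAM needed at
# inference time (KV cache, activations, runtime buffers).

OVERHEAD_FACTOR = 1.2  # assumed multiplier; varies with context/batch size

def fits_in_vram(model_size_gb: float, gpu_vram_gb: float) -> bool:
    """True if the model plus runtime overhead should fit on the GPU."""
    return model_size_gb * OVERHEAD_FACTOR <= gpu_vram_gb

# A 4.7GB 7B model on a 6GB GTX 1660: 4.7 * 1.2 = 5.64GB -> fits
print(fits_in_vram(4.7, 6.0))   # True
# A 9.0GB 14B model on an 8GB card: 9.0 * 1.2 = 10.8GB -> does not fit
print(fits_in_vram(9.0, 8.0))   # False
```

These two checks line up with the table below: 7B models are paired with 6GB cards, while 14B models need a 16GB card such as the RTX A4000.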

Popular LLMs and Their GPU Recommendations

| Model Name | Params | Model Size | Recommended GPU cards |
|---|---|---|---|
| DeepSeek R1 | 1.5B | 1.1GB | K620 2GB or higher |
| DeepSeek R1 | 7B | 4.7GB | GTX 1660 6GB or higher |
| DeepSeek R1 | 8B | 4.9GB | GTX 1660 6GB or higher |
| DeepSeek R1 | 14B | 9.0GB | RTX A4000 16GB or higher |
| DeepSeek R1 | 32B | 20GB | RTX 4090, RTX A5000 24GB, A100 40GB |
| DeepSeek R1 | 70B | 43GB | RTX A6000, A40 48GB, 2xRTX 4090 |
| DeepSeek R1 | 671B | 404GB | Contact us, or leave a message below |
| Deepseek-coder-v2 | 16B | 8.9GB | RTX A4000 16GB or higher |
| Deepseek-coder-v2 | 236B | 133GB | 2xA100 80GB, 4xA6000 48GB |
| Deepseek-coder | 33B | 19GB | RTX 4090 24GB, RTX A5000 24GB |
| Deepseek-coder | 6.7B | 3.8GB | GTX 1660 6GB or higher |
| Qwen2.5 | 0.5B | 398MB | K620 2GB |
| Qwen2.5 | 1.5B | 986MB | K620 2GB |
| Qwen2.5 | 3B | 1.9GB | Quadro P1000 4GB or higher |
| Qwen2.5 | 7B | 4.7GB | GTX 1660 6GB or higher |
| Qwen2.5 | 14B | 9GB | RTX A4000 16GB or higher |
| Qwen2.5 | 32B | 20GB | RTX 4090 24GB, RTX A5000 24GB |
| Qwen2.5 | 72B | 47GB | 3xRTX A5000, A100 80GB, H100 |
| Qwen 2.5 Coder | 7B | 4.7GB | GTX 1660 6GB or higher |
| Qwen 2.5 Coder | 14B | 9.0GB | RTX A4000 16GB or higher |
| Qwen 2.5 Coder | 32B | 20GB | RTX 4090 24GB, RTX A5000 24GB or higher |
| Qwen 2 | 72B | 41GB | RTX A6000 48GB, A40 48GB or higher |
| Qwen 2 | 7B | 4.4GB | GTX 1660 6GB or higher |
| Qwen 1.5 | 7B | 4.5GB | GTX 1660 6GB or higher |
| Qwen 1.5 | 14B | 8.2GB | RTX A4000 16GB or higher |
| Qwen 1.5 | 32B | 18GB | RTX 4090 24GB, A5000 24GB |
| Qwen 1.5 | 72B | 41GB | RTX A6000 48GB, A40 48GB |
| Qwen 1.5 | 110B | 63GB | A100 80GB, H100 |
| Gemma 2 | 2B | 1.6GB | Quadro P1000 4GB or higher |
| Gemma 2 | 9B | 5.4GB | RTX 3060 Ti 8GB or higher |
| Gemma 2 | 27B | 16GB | RTX 4090, A5000 or higher |
| Phi-4 | 14B | 9.1GB | RTX A4000 16GB or higher |
| Phi-3 | 3.8B | 2.2GB | Quadro P1000 4GB or higher |
| Phi-3 | 14B | 7.9GB | RTX A4000 16GB or higher |
| Llama 3.3 | 70B | 43GB | A6000 48GB, A40 48GB, or higher |
| Llama 3.2 | 3B | 2GB | Quadro P1000 4GB or higher |
| Llama 3.1 | 8B | 4.9GB | GTX 1660 6GB or higher |
| Llama 3.1 | 70B | 43GB | A6000 48GB, A40 48GB, or higher |
| Llama 3.1 | 405B | 243GB | 4xA100 80GB, or higher |
| Llama 3 | 8B | 4.7GB | GTX 1660 6GB or higher |
| Llama 3 | 70B | 40GB | A6000 48GB, A40 48GB, or higher |
| Mistral | 7B | 4.1GB | GTX 1660 6GB or higher |
| Mixtral | 8x7B | 26GB | A6000 48GB, A40 48GB, or higher |
| Mixtral | 8x22B | 80GB | 2xA6000, 2xA100 80GB, or higher |
| LLaVA | 7B | 4.7GB | GTX 1660 6GB or higher |
| LLaVA | 13B | 8.0GB | RTX A4000 16GB or higher |
| LLaVA | 34B | 20GB | RTX 4090 24GB, A5000 24GB |
| Code Llama | 7B | 3.8GB | GTX 1660 6GB or higher |
| Code Llama | 13B | 7.4GB | RTX A4000 16GB or higher |
| Code Llama | 34B | 19GB | RTX 4090 24GB, A5000 24GB |
| Code Llama | 70B | 39GB | A6000 48GB, A40 48GB, or higher |
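If you already have Ollama running, you can compare the models you have pulled against your card's actual VRAM instead of reading sizes off the table. The sketch below assumes a default local Ollama install (REST API on port 11434, whose GET /api/tags endpoint reports each model's size in bytes) and an NVIDIA driver with nvidia-smi on the PATH; the 1.2x overhead factor is the same rule-of-thumb assumption used above.

```python
# Check which locally pulled Ollama models should fit in this machine's
# VRAM. Assumes a default Ollama install (port 11434) and nvidia-smi.
import json
import subprocess
import urllib.request

OVERHEAD_FACTOR = 1.2  # rule-of-thumb runtime overhead, as above

def gpu_vram_gb() -> float:
    """Total memory of the first GPU in GB, read via nvidia-smi (MiB)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.splitlines()[0]) / 1024

def ollama_models() -> list:
    """Models pulled locally, from Ollama's /api/tags endpoint."""
    with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
        return json.load(resp)["models"]

if __name__ == "__main__":
    vram = gpu_vram_gb()
    print(f"Detected {vram:.1f} GB VRAM")
    for model in ollama_models():
        size_gb = model["size"] / 1024**3
        verdict = "fits" if size_gb * OVERHEAD_FACTOR <= vram else "too big"
        print(f"{model['name']:<32} {size_gb:6.1f} GB  {verdict}")
```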

Conclusion

Choosing the right GPU for LLMs on Ollama depends on your model size, VRAM requirements, and budget. Workstation and consumer cards like the RTX A4000 and RTX 4090 are powerful and cost-effective for small-to-medium models, while data-center GPUs like the A100 and H100 offer unmatched performance for massive models. Ensure your GPU choice aligns with your specific use case to optimize efficiency and cost.

GPU Server Recommendation

Flash Sale to Mar. 16

Professional GPU VPS - A4000

$102.00/mo
43% OFF Recurring (Was $179.00)
  • 32GB RAM
  • 24 CPU Cores
  • 320GB SSD
  • 300Mbps Unmetered Bandwidth
  • Once per 2 Weeks Backup
  • OS: Linux / Windows 10
  • Dedicated GPU: Quadro RTX A4000
  • CUDA Cores: 6,144
  • Tensor Cores: 192
  • GPU Memory: 16GB GDDR6
  • FP32 Performance: 19.2 TFLOPS
  • Available for Rendering, AI/Deep Learning, Data Science, CAD/CGI/DCC.

Advanced GPU Dedicated Server - A5000

$349.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A5000
  • Microarchitecture: Ampere
  • CUDA Cores: 8192
  • Tensor Cores: 256
  • GPU Memory: 24GB GDDR6
  • FP32 Performance: 27.8 TFLOPS
  • $174.50 for the first month, then a 20% recurring discount on renewals.

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.
Flash Sale to Mar. 16

Enterprise GPU Dedicated Server - RTX A6000

$384.00/mo
30% OFF Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimal for running AI, deep learning, data visualization, HPC, etc.
New Arrival

Multi-GPU Dedicated Server - 4xRTX 5090

$999.00/mo
  • 512GB RAM
  • Dual 22-Core E5-2699v4
  • 240GB SSD + 4TB NVMe + 16TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 4 x GeForce RTX 5090
  • Microarchitecture: Blackwell
  • CUDA Cores: 21,760
  • Tensor Cores: 680
  • GPU Memory: 32 GB GDDR7
  • FP32 Performance: 109.7 TFLOPS

Enterprise GPU Dedicated Server - A100

$639.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation for large-scale inference, AI training, ML, etc.
New Arrival

Enterprise GPU Dedicated Server - A100 (80GB)

$1,559.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS

Multi-GPU Dedicated Server - 4xA100

$1,899.00/mo
  • 512GB RAM
  • Dual 22-Core E5-2699v4
  • 240GB SSD + 4TB NVMe + 16TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 4 x Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
Let us get back to you

If you can't find a suitable GPU plan, need a customized GPU server, or have ideas for cooperation, please leave us a message. We will get back to you within 36 hours.
