DeepSeek Hosting with Ollama — GPU Recommendation
Model Name | Size (4-bit Quantization) | Recommended GPUs | Tokens/s |
---|---|---|---|
deepseek-coder:1.3b | 776MB | P1000 < T1000 < GTX1650 < GTX1660 < RTX2060 | 28.9-50.32 |
deepseek-r1:1.5b | 1.1GB | P1000 < T1000 < GTX1650 < GTX1660 < RTX2060 | 25.3-43.12 |
deepseek-coder:6.7b | 3.8GB | T1000 < RTX3060 Ti < RTX4060 < A4000 < RTX5060 < V100 | 26.55-90.02 |
deepseek-r1:7b | 4.7GB | T1000 < RTX3060 Ti < RTX4060 < A4000 < RTX5060 < V100 | 26.70-87.10 |
deepseek-r1:8b | 5.2GB | T1000 < RTX3060 Ti < RTX4060 < A4000 < RTX5060 < V100 | 21.51-87.03 |
deepseek-r1:14b | 9.0GB | A4000 < A5000 < V100 | 30.2-48.63 |
deepseek-v2:16b | 8.9GB | A4000 < A5000 < V100 | 22.89-69.16 |
deepseek-r1:32b | 20GB | A5000 < RTX4090 < A100-40GB < RTX5090 | 24.21-45.51 |
deepseek-coder:33b | 19GB | A5000 < RTX4090 < A100-40GB < RTX5090 | 25.05-46.71 |
deepseek-r1:70b | 43GB | A40 < A6000 < 2*A100-40GB < A100-80GB < H100 < 2*RTX5090 | 13.65-27.03 |
deepseek-v2:236b | 133GB | 2*A100-80GB < 2*H100 | -- |
deepseek-r1:671b | 404GB | 6*A100-80GB < 6*H100 | -- |
deepseek-v3:671b | 404GB | 6*A100-80GB < 6*H100 | -- |
DeepSeek Hosting with vLLM + Hugging Face — GPU Recommendation
Model Name | Size (16-bit, FP16/BF16) | Recommended GPU(s) | Concurrent Requests | Tokens/s |
---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | ~3GB | T1000 < RTX3060 < RTX4060 < 2*RTX3060 < 2*RTX4060 < A4000 < V100 | 50 | 1500-5000 |
deepseek-ai/deepseek-coder-6.7b-instruct | ~13.4GB | A5000 < RTX4090 | 50 | 1375-4120 |
deepseek-ai/Janus-Pro-7B | ~14GB | A5000 < RTX4090 | 50 | 1333-4009 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | ~14GB | A5000 < RTX4090 | 50 | 1333-4009 |
deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ~16GB | 2*A4000 < 2*V100 < A5000 < RTX4090 | 50 | 1450-2769 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | ~28GB | 3*V100 < 2*A5000 < A40 < A6000 < A100-40GB < 2*RTX4090 | 50 | 449-861 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | ~65GB | A100-80GB < 2*A100-40GB < 2*A6000 < H100 | 50 | 577-1480 |
deepseek-ai/deepseek-coder-33b-instruct | ~66GB | A100-80GB < 2*A100-40GB < 2*A6000 < H100 | 50 | 570-1470 |
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ~135GB | 4*A6000 | 50 | 466 |
deepseek-ai/DeepSeek-Prover-V2-671B | ~1350GB | -- | -- | -- |
deepseek-ai/DeepSeek-V3 | ~1350GB | -- | -- | -- |
deepseek-ai/DeepSeek-R1 | ~1350GB | -- | -- | -- |
deepseek-ai/DeepSeek-R1-0528 | ~1350GB | -- | -- | -- |
deepseek-ai/DeepSeek-V3-0324 | ~1350GB | -- | -- | -- |
- Recommended GPUs: listed left to right in ascending order of performance.
- Tokens/s: figures come from our benchmark runs.
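The "Size" columns above follow a simple rule of thumb: the weights need roughly parameters × bytes-per-weight of VRAM, plus extra headroom for the KV cache, activations, and runtime buffers. A minimal sketch of that estimate (the 20% overhead factor is an assumption for illustration, not a measured value):

```python
def estimated_vram_gb(params_billion: float, bits_per_weight: int,
                      overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight size (params * bits / 8) scaled by an
    assumed ~20% overhead for KV cache, activations, and buffers."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8-bit ~ 1 GB
    return round(weight_gb * overhead, 1)

# 7B model at 4-bit (Ollama default): ~4.2 GB, in line with the 4.7 GB table entry
print(estimated_vram_gb(7, 4))    # 4.2
# Same model at 16-bit (vLLM + Hugging Face): ~16.8 GB, i.e. a 24 GB-class GPU
print(estimated_vram_gb(7, 16))   # 16.8
```

This is why the same 7B model that fits a T1000 under Ollama needs an A5000 or RTX 4090 under vLLM: the 16-bit weights are four times larger before any batching headroom.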
Choose The Best GPU Plans for DeepSeek R1/V2/V3/Distill Hosting
Express GPU Dedicated Server - P1000
- 32GB RAM
- GPU: Nvidia Quadro P1000
- Eight-Core Xeon E5-2690
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Pascal
- CUDA Cores: 640
- GPU Memory: 4GB GDDR5
- FP32 Performance: 1.894 TFLOPS
Basic GPU Dedicated Server - T1000
- 64GB RAM
- GPU: Nvidia Quadro T1000
- Eight-Core Xeon E5-2690
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Turing
- CUDA Cores: 896
- GPU Memory: 8GB GDDR6
- FP32 Performance: 2.5 TFLOPS
Basic GPU Dedicated Server - GTX 1650
- 64GB RAM
- GPU: Nvidia GeForce GTX 1650
- Eight-Core Xeon E5-2667v3
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Turing
- CUDA Cores: 896
- GPU Memory: 4GB GDDR5
- FP32 Performance: 3.0 TFLOPS
Basic GPU Dedicated Server - GTX 1660
- 64GB RAM
- GPU: Nvidia GeForce GTX 1660
- Dual 10-Core Xeon E5-2660v2
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Turing
- CUDA Cores: 1408
- GPU Memory: 6GB GDDR6
- FP32 Performance: 5.0 TFLOPS
Advanced GPU Dedicated Server - V100
- 128GB RAM
- GPU: Nvidia V100
- Dual 12-Core E5-2690v3
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Volta
- CUDA Cores: 5,120
- Tensor Cores: 640
- GPU Memory: 16GB HBM2
- FP32 Performance: 14 TFLOPS
Professional GPU Dedicated Server - RTX 2060
- 128GB RAM
- GPU: Nvidia GeForce RTX 2060
- Dual 10-Core E5-2660v2
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Turing
- CUDA Cores: 1920
- Tensor Cores: 240
- GPU Memory: 6GB GDDR6
- FP32 Performance: 6.5 TFLOPS
Advanced GPU Dedicated Server - RTX 2060
- 128GB RAM
- GPU: Nvidia GeForce RTX 2060
- Dual 20-Core Gold 6148
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Turing
- CUDA Cores: 1920
- Tensor Cores: 240
- GPU Memory: 6GB GDDR6
- FP32 Performance: 6.5 TFLOPS
Advanced GPU Dedicated Server - RTX 3060 Ti
- 128GB RAM
- GPU: GeForce RTX 3060 Ti
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 4864
- Tensor Cores: 152
- GPU Memory: 8GB GDDR6
- FP32 Performance: 16.2 TFLOPS
Professional GPU VPS - A4000
- 32GB RAM
- 24 CPU Cores
- 320GB SSD
- 300Mbps Unmetered Bandwidth
- Once per 2 Weeks Backup
- OS: Linux / Windows 10/ Windows 11
- Dedicated GPU: Quadro RTX A4000
- CUDA Cores: 6,144
- Tensor Cores: 192
- GPU Memory: 16GB GDDR6
- FP32 Performance: 19.2 TFLOPS
Advanced GPU Dedicated Server - A4000
- 128GB RAM
- GPU: Nvidia Quadro RTX A4000
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 6144
- Tensor Cores: 192
- GPU Memory: 16GB GDDR6
- FP32 Performance: 19.2 TFLOPS
Advanced GPU Dedicated Server - A5000
- 128GB RAM
- GPU: Nvidia Quadro RTX A5000
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Enterprise GPU Dedicated Server - A40
- 256GB RAM
- GPU: Nvidia A40
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 37.48 TFLOPS
Basic GPU Dedicated Server - RTX 5060
- 64GB RAM
- GPU: Nvidia GeForce RTX 5060
- 24-Core Platinum 8160
- 120GB SSD + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Blackwell 2.0
- CUDA Cores: 4608
- Tensor Cores: 144
- GPU Memory: 8GB GDDR7
- FP32 Performance: 23.22 TFLOPS
Enterprise GPU Dedicated Server - RTX 5090
- 256GB RAM
- GPU: GeForce RTX 5090
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 104.8 TFLOPS
Enterprise GPU Dedicated Server - A100
- 256GB RAM
- GPU: Nvidia A100
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
Enterprise GPU Dedicated Server - A100(80GB)
- 256GB RAM
- GPU: Nvidia A100
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 80GB HBM2e
- FP32 Performance: 19.5 TFLOPS
Enterprise GPU Dedicated Server - H100
- 256GB RAM
- GPU: Nvidia H100
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Hopper
- CUDA Cores: 14,592
- Tensor Cores: 456
- GPU Memory: 80GB HBM2e
- FP32 Performance: 183 TFLOPS
Multi-GPU Dedicated Server - 2xRTX 4090
- 256GB RAM
- GPU: 2 x GeForce RTX 4090
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ada Lovelace
- CUDA Cores: 16,384
- Tensor Cores: 512
- GPU Memory: 24 GB GDDR6X
- FP32 Performance: 82.6 TFLOPS
Multi-GPU Dedicated Server - 2xRTX 5090
- 256GB RAM
- GPU: 2 x GeForce RTX 5090
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 104.8 TFLOPS
Multi-GPU Dedicated Server - 2xA100
- 256GB RAM
- GPU: 2 x Nvidia A100
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- Free NVLink Included
Multi-GPU Dedicated Server - 2xRTX 3060 Ti
- 128GB RAM
- GPU: 2 x GeForce RTX 3060 Ti
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 4864
- Tensor Cores: 152
- GPU Memory: 8GB GDDR6
- FP32 Performance: 16.2 TFLOPS
Multi-GPU Dedicated Server - 2xRTX 4060
- 64GB RAM
- GPU: 2 x Nvidia GeForce RTX 4060
- Eight-Core E5-2690
- 120GB SSD + 960GB SSD
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ada Lovelace
- CUDA Cores: 3072
- Tensor Cores: 96
- GPU Memory: 8GB GDDR6
- FP32 Performance: 15.11 TFLOPS
Multi-GPU Dedicated Server - 2xRTX A5000
- 128GB RAM
- GPU: 2 x Quadro RTX A5000
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Multi-GPU Dedicated Server - 2xRTX A4000
- 128GB RAM
- GPU: 2 x Nvidia RTX A4000
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 6144
- Tensor Cores: 192
- GPU Memory: 16GB GDDR6
- FP32 Performance: 19.2 TFLOPS
Multi-GPU Dedicated Server - 3xRTX 3060 Ti
- 256GB RAM
- GPU: 3 x GeForce RTX 3060 Ti
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 4864
- Tensor Cores: 152
- GPU Memory: 8GB GDDR6
- FP32 Performance: 16.2 TFLOPS
Multi-GPU Dedicated Server - 3xV100
- 256GB RAM
- GPU: 3 x Nvidia V100
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Volta
- CUDA Cores: 5,120
- Tensor Cores: 640
- GPU Memory: 16GB HBM2
- FP32 Performance: 14 TFLOPS
Multi-GPU Dedicated Server - 3xRTX A5000
- 256GB RAM
- GPU: 3 x Quadro RTX A5000
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Multi-GPU Dedicated Server - 3xRTX A6000
- 256GB RAM
- GPU: 3 x Quadro RTX A6000
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
Multi-GPU Dedicated Server - 4xA100
- 512GB RAM
- GPU: 4 x Nvidia A100
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
Multi-GPU Dedicated Server - 4xRTX A6000
- 512GB RAM
- GPU: 4 x Quadro RTX A6000
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
Multi-GPU Dedicated Server - 8xV100
- 512GB RAM
- GPU: 8 x Nvidia Tesla V100
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Volta
- CUDA Cores: 5,120
- Tensor Cores: 640
- GPU Memory: 16GB HBM2
- FP32 Performance: 14 TFLOPS
Multi-GPU Dedicated Server - 8xRTX A6000
- 512GB RAM
- GPU: 8 x Quadro RTX A6000
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- Single GPU Specifications:
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
What is DeepSeek Hosting?
DeepSeek Hosting means serving, running inference on, or fine-tuning DeepSeek models (such as R1, V2, V3, and the Distill variants), either in a self-hosted environment or through a cloud API. There are two hosting types: self-hosted deployment and LLM-as-a-Service (LLMaaS).
✅ Self-hosted deployment runs the models on your own GPU servers (e.g., A100, RTX 4090, H100) with an inference engine such as vLLM, TGI, or Ollama; you control the model files, batching, memory usage, and API logic.
✅ LLM-as-a-Service (LLMaaS) consumes DeepSeek models through an API provider: no deployment, just API calls.
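With LLMaaS, "just calling API" usually means an OpenAI-compatible HTTP request. A minimal sketch, assuming a provider that exposes a `/v1/chat/completions` endpoint; the endpoint URL and model id below are placeholders, so check your provider's documentation:

```python
import json
from urllib import request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("deepseek-r1", "Explain the KV cache in one sentence.")

# Sending it (commented out: needs a live endpoint and an API key):
# req = request.Request(
#     "https://api.example-provider.com/v1/chat/completions",  # placeholder URL
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": "Bearer <API_KEY>",
#              "Content-Type": "application/json"},
# )
# body = json.loads(request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
print(payload["model"])
```

The same request shape works against a self-hosted vLLM server, which is what makes switching between LLMaaS and self-hosting largely a matter of changing the base URL.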
LLM Benchmark Test Results for DeepSeek R1, V2, V3, and Distill Hosting
vLLM Benchmark for DeepSeek
How to Deploy DeepSeek LLMs with Ollama/vLLM
Install and Run DeepSeek-R1 Locally with Ollama
Install and Run DeepSeek-R1 Locally with vLLM v1
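Both quickstart paths boil down to a single command. A sketch with the defaults at the time of writing; verify the model tags, flags, and port against the current Ollama and vLLM documentation:

```
# Ollama: pulls the 4-bit quantized model on first use and opens an interactive prompt
ollama run deepseek-r1:7b

# vLLM: serves the Hugging Face model behind an OpenAI-compatible API (port 8000 by default)
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --max-model-len 8192
```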
What Does DeepSeek Hosting Stack Include?
Model Backend (Inference Engine)
Model Format
Serving Infrastructure
Hardware (GPU Servers)
Why DeepSeek Hosting Needs a Specialized Hardware + Software Stack
DeepSeek Models Are Large and Compute-Intensive
Powerful GPUs Are Required
Efficient Inference Engines Are Critical
Scalable Infrastructure Is a Must
Self-hosted DeepSeek Hosting vs. DeepSeek LLM as a Service
Feature / Aspect | 🖥️ Self-hosted DeepSeek Hosting | ☁️ DeepSeek LLM as a Service (LLMaaS) |
---|---|---|
Deployment Location | On your own GPU server (e.g., A100, 4090, H100) | Cloud-based, via API platforms |
Model Control | ✅ Full control over weights, versions, updates | ❌ Limited to the models the provider exposes |
Customization | Full — supports fine-tuning, LoRA, quantization | None or minimal customization allowed |
Privacy & Data Security | ✅ Data stays local — ideal for sensitive data | ❌ Data sent to third-party cloud API |
Performance Tuning | Full control: batch size, concurrency, caching | Predefined, limited tuning |
Supported Models | Any DeepSeek model (R1, V2, V3, Distill, etc.) | Only what the provider offers |
Inference Engine Options | vLLM, TGI, Ollama, llama.cpp, custom stacks | Hidden — provider chooses backend |
Startup Time | Slower — requires setup and deployment | Instant — API ready to use |
Scalability | Requires infrastructure management | Scales automatically with provider's backend |
Cost Model | Higher upfront (hardware), lower at scale | Pay-per-call or token-based — predictable, but expensive at scale |
Use Case Fit | Ideal for R&D, private deployment, large workloads | Best for prototypes, demos, or small-scale usage |
Example Platforms | Dedicated GPU servers, on-premise clusters | DBM, Together.ai, OpenRouter.ai, Fireworks.ai, Groq |