DeepSeek Hosting with Ollama — GPU Recommendation
Model Name | Size (4-bit Quantization) | Recommended GPUs | Tokens/s |
---|---|---|---|
deepseek-coder:1.3b | 776MB | P1000 < T1000 < GTX1650 < GTX1660 < RTX2060 | 28.9-50.32 |
deepseek-r1:1.5b | 1.1GB | P1000 < T1000 < GTX1650 < GTX1660 < RTX2060 | 25.3-43.12 |
deepseek-coder:6.7b | 3.8GB | T1000 < RTX3060 Ti < RTX4060 < A4000 < RTX5060 < V100 | 26.55-90.02 |
deepseek-r1:7b | 4.7GB | T1000 < RTX3060 Ti < RTX4060 < A4000 < RTX5060 < V100 | 26.70-87.10 |
deepseek-r1:8b | 5.2GB | T1000 < RTX3060 Ti < RTX4060 < A4000 < RTX5060 < V100 | 21.51-87.03 |
deepseek-r1:14b | 9.0GB | A4000 < A5000 < V100 | 30.2-48.63 |
deepseek-v2:16b | 8.9GB | A4000 < A5000 < V100 | 22.89-69.16 |
deepseek-r1:32b | 20GB | A5000 < RTX4090 < A100-40GB < RTX5090 | 24.21-45.51 |
deepseek-coder:33b | 19GB | A5000 < RTX4090 < A100-40GB < RTX5090 | 25.05-46.71 |
deepseek-r1:70b | 43GB | A40 < A6000 < 2*A100-40GB < A100-80GB < H100 < 2*RTX5090 | 13.65-27.03 |
deepseek-v2:236b | 133GB | 2*A100-80GB < 2*H100 | -- |
deepseek-r1:671b | 404GB | 6*A100-80GB < 6*H100 | -- |
deepseek-v3:671b | 404GB | 6*A100-80GB < 6*H100 | -- |
DeepSeek Hosting with vLLM + Hugging Face — GPU Recommendation
Model Name | Size (16-bit Precision) | Recommended GPU(s) | Concurrent Requests | Tokens/s |
---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | ~3GB | T1000 < RTX3060 < RTX4060 < 2*RTX3060 < 2*RTX4060 < A4000 < V100 | 50 | 1500-5000 |
deepseek-ai/deepseek-coder-6.7b-instruct | ~13.4GB | A5000 < RTX4090 | 50 | 1375-4120 |
deepseek-ai/Janus-Pro-7B | ~14GB | A5000 < RTX4090 | 50 | 1333-4009 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | ~14GB | A5000 < RTX4090 | 50 | 1333-4009 |
deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ~16GB | 2*A4000 < 2*V100 < A5000 < RTX4090 | 50 | 1450-2769 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | ~28GB | 3*V100 < 2*A5000 < A40 < A6000 < A100-40GB < 2*RTX4090 | 50 | 449-861 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | ~65GB | A100-80GB < 2*A100-40GB < 2*A6000 < H100 | 50 | 577-1480 |
deepseek-ai/deepseek-coder-33b-instruct | ~66GB | A100-80GB < 2*A100-40GB < 2*A6000 < H100 | 50 | 570-1470 |
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ~135GB | 4*A6000 | 50 | 466 |
deepseek-ai/DeepSeek-Prover-V2-671B | ~1350GB | -- | -- | -- |
deepseek-ai/DeepSeek-V3 | ~1350GB | -- | -- | -- |
deepseek-ai/DeepSeek-R1 | ~1350GB | -- | -- | -- |
deepseek-ai/DeepSeek-R1-0528 | ~1350GB | -- | -- | -- |
deepseek-ai/DeepSeek-V3-0324 | ~1350GB | -- | -- | -- |
- Recommended GPUs are ordered from lowest to highest performance, left to right.
- Tokens/s figures come from our benchmark tests.
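As a quick sanity check after deployment, you can reproduce Tokens/s figures like those in the Ollama table yourself. The sketch below is a minimal example, assuming Ollama is running on its default port (11434) and the model tag has already been pulled; it uses Ollama's standard /api/generate endpoint, which reports generated-token counts and timings.

```python
# Minimal sketch: query a locally hosted DeepSeek model through Ollama's
# REST API and derive tokens/s the same way the table above does.
# Assumes `ollama pull deepseek-r1:7b` has already been run.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",   # any tag from the table above
        "prompt": "Write a binary search in Python.",
        "stream": False,             # return a single JSON object when done
    },
    timeout=600,
)
resp.raise_for_status()
data = resp.json()

# eval_count = number of generated tokens; eval_duration is in nanoseconds.
tokens_per_sec = data["eval_count"] / data["eval_duration"] * 1e9
print(data["response"][:200])
print(f"throughput: {tokens_per_sec:.1f} tokens/s")
```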
Choose The Best GPU Plans for DeepSeek R1/V2/V3/Distill Hosting
Express GPU Dedicated Server - P1000
- 32GB RAM
- Eight-Core Xeon E5-2690
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro P1000
- Microarchitecture: Pascal
- CUDA Cores: 640
- GPU Memory: 4GB GDDR5
- FP32 Performance: 1.894 TFLOPS
Basic GPU Dedicated Server - T1000
- 64GB RAM
- Eight-Core Xeon E5-2690
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro T1000
- Microarchitecture: Turing
- CUDA Cores: 896
- GPU Memory: 8GB GDDR6
- FP32 Performance: 2.5 TFLOPS
Basic GPU Dedicated Server - GTX 1650
- 64GB RAM
- Eight-Core Xeon E5-2667v3
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia GeForce GTX 1650
- Microarchitecture: Turing
- CUDA Cores: 896
- GPU Memory: 4GB GDDR5
- FP32 Performance: 3.0 TFLOPS
Basic GPU Dedicated Server - GTX 1660
- 64GB RAM
- Dual 10-Core Xeon E5-2660v2
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia GeForce GTX 1660
- Microarchitecture: Turing
- CUDA Cores: 1408
- GPU Memory: 6GB GDDR6
- FP32 Performance: 5.0 TFLOPS
Advanced GPU Dedicated Server - V100
- 128GB RAM
- Dual 12-Core E5-2690v3
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia V100
- Microarchitecture: Volta
- CUDA Cores: 5,120
- Tensor Cores: 640
- GPU Memory: 16GB HBM2
- FP32 Performance: 14 TFLOPS
Professional GPU Dedicated Server - RTX 2060
- 128GB RAM
- Dual 10-Core E5-2660v2
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia GeForce RTX 2060
- Microarchitecture: Turing
- CUDA Cores: 1920
- Tensor Cores: 240
- GPU Memory: 6GB GDDR6
- FP32 Performance: 6.5 TFLOPS
Advanced GPU Dedicated Server - RTX 2060
- 128GB RAM
- Dual 20-Core Gold 6148
- 120GB + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia GeForce RTX 2060
- Microarchitecture: Turing
- CUDA Cores: 1920
- Tensor Cores: 240
- GPU Memory: 6GB GDDR6
- FP32 Performance: 6.5 TFLOPS
Advanced GPU Dedicated Server - RTX 3060 Ti
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: GeForce RTX 3060 Ti
- Microarchitecture: Ampere
- CUDA Cores: 4864
- Tensor Cores: 152
- GPU Memory: 8GB GDDR6
- FP32 Performance: 16.2 TFLOPS
Professional GPU VPS - A4000
- 32GB RAM
- 24 CPU Cores
- 320GB SSD
- 300Mbps Unmetered Bandwidth
- Once per 2 Weeks Backup
- OS: Linux / Windows 10/ Windows 11
- Dedicated GPU: Quadro RTX A4000
- CUDA Cores: 6,144
- Tensor Cores: 192
- GPU Memory: 16GB GDDR6
- FP32 Performance: 19.2 TFLOPS
Advanced GPU Dedicated Server - A4000
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro RTX A4000
- Microarchitecture: Ampere
- CUDA Cores: 6144
- Tensor Cores: 192
- GPU Memory: 16GB GDDR6
- FP32 Performance: 19.2 TFLOPS
Advanced GPU Dedicated Server - A5000
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia Quadro RTX A5000
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Enterprise GPU Dedicated Server - A40
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A40
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 37.48 TFLOPS
Basic GPU Dedicated Server - RTX 5060
- 64GB RAM
- 24-Core Platinum 8160
- 120GB SSD + 960GB SSD
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia GeForce RTX 5060
- Microarchitecture: Blackwell 2.0
- CUDA Cores: 4608
- Tensor Cores: 144
- GPU Memory: 8GB GDDR7
- FP32 Performance: 23.22 TFLOPS
Enterprise GPU Dedicated Server - RTX 5090
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: GeForce RTX 5090
- Microarchitecture: Blackwell 2.0
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Enterprise GPU Dedicated Server - A100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
Enterprise GPU Dedicated Server - A100(80GB)
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 80GB HBM2e
- FP32 Performance: 19.5 TFLOPS
Enterprise GPU Dedicated Server - H100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia H100
- Microarchitecture: Hopper
- CUDA Cores: 14,592
- Tensor Cores: 456
- GPU Memory: 80GB HBM2e
- FP32 Performance: 183 TFLOPS
Multi-GPU Dedicated Server - 2xRTX 4090
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 4090
- Microarchitecture: Ada Lovelace
- CUDA Cores: 16,384
- Tensor Cores: 512
- GPU Memory: 24 GB GDDR6X
- FP32 Performance: 82.6 TFLOPS
Multi-GPU Dedicated Server - 2xRTX 5090
- 256GB RAM
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 5090
- Microarchitecture: Blackwell 2.0
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Multi-GPU Dedicated Server - 2xA100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- Free NVLink Included
Multi-GPU Dedicated Server - 2xRTX 3060 Ti
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 3060 Ti
- Microarchitecture: Ampere
- CUDA Cores: 4864
- Tensor Cores: 152
- GPU Memory: 8GB GDDR6
- FP32 Performance: 16.2 TFLOPS
Multi-GPU Dedicated Server - 2xRTX 4060
- 64GB RAM
- Eight-Core E5-2690
- 120GB SSD + 960GB SSD
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x Nvidia GeForce RTX 4060
- Microarchitecture: Ada Lovelace
- CUDA Cores: 3072
- Tensor Cores: 96
- GPU Memory: 8GB GDDR6
- FP32 Performance: 15.11 TFLOPS
Multi-GPU Dedicated Server - 2xRTX A5000
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x Quadro RTX A5000
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Multi-GPU Dedicated Server - 2xRTX A4000
- 128GB RAM
- Dual 12-Core E5-2697v2
- 240GB SSD + 2TB SSD
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x Nvidia RTX A4000
- Microarchitecture: Ampere
- CUDA Cores: 6144
- Tensor Cores: 192
- GPU Memory: 16GB GDDR6
- FP32 Performance: 19.2 TFLOPS
Multi-GPU Dedicated Server - 3xRTX 3060 Ti
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 3 x GeForce RTX 3060 Ti
- Microarchitecture: Ampere
- CUDA Cores: 4864
- Tensor Cores: 152
- GPU Memory: 8GB GDDR6
- FP32 Performance: 16.2 TFLOPS
Multi-GPU Dedicated Server - 3xV100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 3 x Nvidia V100
- Microarchitecture: Volta
- CUDA Cores: 5,120
- Tensor Cores: 640
- GPU Memory: 16GB HBM2
- FP32 Performance: 14 TFLOPS
Multi-GPU Dedicated Server - 3xRTX A5000
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 3 x Quadro RTX A5000
- Microarchitecture: Ampere
- CUDA Cores: 8192
- Tensor Cores: 256
- GPU Memory: 24GB GDDR6
- FP32 Performance: 27.8 TFLOPS
Multi-GPU Dedicated Server - 3xRTX A6000
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 3 x Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
Multi-GPU Dedicated Server - 4xA100
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 4 x Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
Multi-GPU Dedicated Server - 4xRTX A6000
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 4 x Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
Multi-GPU Dedicated Server - 8xV100
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 8 x Nvidia Tesla V100
- Microarchitecture: Volta
- CUDA Cores: 5,120
- Tensor Cores: 640
- GPU Memory: 16GB HBM2
- FP32 Performance: 14 TFLOPS
Multi-GPU Dedicated Server - 8xRTX A6000
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 8 x Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
What is DeepSeek Hosting?
DeepSeek Hosting lets you serve, run inference on, or fine-tune DeepSeek models (such as R1, V2, V3, or the Distill variants) in either a self-hosted environment or through a cloud-based API. The two hosting types are self-hosted deployment and LLM-as-a-Service (LLMaaS).
✅ Self-hosted deployment runs the models on your own GPU servers (e.g., A100, RTX 4090, H100) with an inference engine such as vLLM, TGI, or Ollama. You keep full control of the model files, batching, memory usage, and API logic.
✅ LLM-as-a-Service (LLMaaS) consumes DeepSeek models through an API provider: there is nothing to deploy, you simply call the API.
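One practical consequence: with an OpenAI-compatible client, the two hosting types look almost identical in code. The sketch below is illustrative only; the base URLs, API key, and model name are placeholders, not real endpoints or credentials.

```python
# Switching between self-hosted vLLM and a hosted LLMaaS endpoint is
# essentially a change of base_url and api_key with the OpenAI client.
from openai import OpenAI

# Self-hosted: vLLM's OpenAI-compatible server on your own GPU box.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# LLMaaS would instead look like (placeholder URL/key):
# client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="sk-...")

reply = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(reply.choices[0].message.content)
```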
LLM Benchmark Test Results for DeepSeek R1, V2, V3, and Distill Hosting
vLLM Benchmark for DeepSeek
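For context on how throughput numbers like those in the vLLM table are gathered, here is a rough, illustrative probe: it fires 50 concurrent requests (matching the Concurrent Requests column above) at an OpenAI-compatible vLLM endpoint and reports aggregate tokens/s. The URL and model name are assumptions; for rigorous results, use vLLM's own benchmark scripts.

```python
# Rough aggregate-throughput probe against an OpenAI-compatible endpoint.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

def one_request(_):
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Explain KV caching briefly."}],
        max_tokens=256,
    )
    return r.usage.completion_tokens  # generated tokens for this request

start = time.time()
with ThreadPoolExecutor(max_workers=50) as pool:  # 50 concurrent requests
    total_tokens = sum(pool.map(one_request, range(50)))
print(f"aggregate throughput: {total_tokens / (time.time() - start):.0f} tokens/s")
```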
How to Deploy DeepSeek LLMs with Ollama/vLLM
Install and Run DeepSeek-R1 Locally with Ollama >
Install and Run DeepSeek-R1 Locally with vLLM v1 >
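For a taste of the vLLM path, a minimal offline-inference sketch (no HTTP server) looks like the following, assuming vLLM is installed and the chosen model fits in GPU memory per the tables above:

```python
# Offline inference with vLLM's Python API; the model and sampling
# settings are examples, not fixed requirements.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = llm.generate(["What is 17 * 24?"], params)
for out in outputs:
    print(out.outputs[0].text)
```

To expose the same model over an OpenAI-compatible HTTP API instead, run `vllm serve <model-name>`, which listens on port 8000 by default.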
What Does DeepSeek Hosting Stack Include?
- Model Backend (Inference Engine): vLLM, TGI, Ollama, llama.cpp, or a custom stack
- Model Format: e.g., Hugging Face safetensors for vLLM/TGI, GGUF for Ollama/llama.cpp
- Serving Infrastructure: the API layer, request batching, concurrency, and caching
- Hardware (GPU Servers): single- or multi-GPU machines sized to the model, as listed above
Why DeepSeek Hosting Needs a Specialized Hardware + Software Stack
- DeepSeek Models Are Large and Compute-Intensive: the full R1/V3 models have 671B parameters and need roughly 404GB of VRAM even at 4-bit quantization (see the sizing sketch below)
- Powerful GPUs Are Required: a single RTX 4090 or A5000 covers the distilled models, while the full models call for multi-A100/H100 setups
- Efficient Inference Engines Are Critical: engines such as vLLM, TGI, and Ollama provide batching, KV caching, and quantization support
- Scalable Infrastructure Is a Must: sustaining dozens of concurrent requests (50 in the benchmarks above) requires careful batch and memory management
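As a back-of-the-envelope illustration of the first point, model weights alone take roughly parameter count x bytes per weight, plus overhead for the KV cache and activations. The helper below is a rough sketch, assuming ~25% overhead; it is not an exact sizing tool.

```python
# Rough VRAM estimate: weights = params (billions) * bits / 8 gigabytes,
# scaled by an assumed ~25% overhead for KV cache and activations.
def estimate_vram_gb(params_b: float, bits: int = 16, overhead: float = 1.25) -> float:
    """params_b: parameters in billions; bits: weight precision (16, 8, 4)."""
    weights_gb = params_b * bits / 8  # e.g. 7B at 16-bit -> ~14 GB of weights
    return weights_gb * overhead

# Roughly matches the tables above: a 7B model needs ~17.5 GB at 16-bit,
# while 4-bit quantization brings it down to ~4.4 GB.
print(f"{estimate_vram_gb(7):.1f} GB")          # 16-bit
print(f"{estimate_vram_gb(7, bits=4):.1f} GB")  # 4-bit
```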
Self-hosted DeepSeek Hosting vs. DeepSeek LLM as a Service
Feature / Aspect | 🖥️ Self-hosted DeepSeek Hosting | ☁️ DeepSeek LLM as a Service (LLMaaS) |
---|---|---|
Deployment Location | On your own GPU server (e.g., A100, 4090, H100) | Cloud-based, via API platforms |
Model Control | ✅ Full control over weights, versions, updates | ❌ Limited — only exposed models via provider |
Customization | Full — supports fine-tuning, LoRA, quantization | None or minimal customization allowed |
Privacy & Data Security | ✅ Data stays local — ideal for sensitive data | ❌ Data sent to third-party cloud API |
Performance Tuning | Full control: batch size, concurrency, caching | Predefined, limited tuning |
Supported Models | Any DeepSeek model (R1, V2, V3, Distill, etc.) | Only what the provider offers |
Inference Engine Options | vLLM, TGI, Ollama, llama.cpp, custom stacks | Hidden — provider chooses backend |
Startup Time | Slower — requires setup and deployment | Instant — API ready to use |
Scalability | Requires infrastructure management | Scales automatically with provider's backend |
Cost Model | Higher upfront (hardware), lower at scale | Pay-per-call or token-based — predictable, but expensive at scale |
Use Case Fit | Ideal for R&D, private deployment, large workloads | Best for prototypes, demos, or small-scale usage |
Example Platforms | Dedicated GPU servers, on-premise clusters | DBM, Together.ai, OpenRouter.ai, Fireworks.ai, Groq |