Models | Qwen/Qwen2.5-VL-72B-Instruct | meta-llama/Meta-Llama-3-70B-Instruct | meta-llama/Llama-3.1-70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Qwen/Qwen3-32B |
---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 137 | 132 | 132 | 132 | 65 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
Tensor Parallelism | 4 | 4 | 4 | 4 | 4 |
Number of Requests | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 68.59 | 63.96 | 65.50 | 66.46 | 33.11 |
Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 |
Total Generated Tokens | 25834 | 22252 | 24527 | 25994 | 22406 |
Request Throughput (req/s) | 0.73 | 0.78 | 0.76 | 0.75 | 1.51 |
Input Throughput (tokens/s) | 72.89 | 78.17 | 76.34 | 75.22 | 151.02 |
Output Throughput (tokens/s) | 376.62 | 347.93 | 374.45 | 391.10 | 676.72 |
Total Throughput (tokens/s) | 449.51 | 426.10 | 450.79 | 466.32 | 827.74 |
Median TTFT (ms) | 4134.85 | 4028.61 | 3945.48 | 4055.27 | 1958.54 |
P99 TTFT (ms) | 5269.88 | 5126.21 | 5019.25 | 5157.44 | 2495.06 |
Median TPOT (ms) | 107.41 | 102.25 | 102.61 | 104.10 | 52.33 |
P99 TPOT (ms) | 113.69 | 1743.43 | 1900.50 | 136.96 | 96.08 |
Median Eval Rate (tokens/s) | 9.31 | 9.78 | 9.74 | 9.61 | 19.10 |
P99 Eval Rate (tokens/s) | 8.79 | 0.57 | 0.56 | 7.30 | 10.41 |
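For context, here is a minimal sketch of how a comparable run can be set up with vLLM's offline Python API on a 4-GPU box. The model name, prompt, and sampling settings are illustrative assumptions; the published numbers come from a serving benchmark, so this sketch will not reproduce them exactly.

```python
# Hedged sketch: offline throughput check with vLLM's Python API across 4 GPUs.
# Model, prompts, and sampling values are illustrative, not the exact harness
# used for the table above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # any 70B/72B model from the table
    tensor_parallel_size=4,                        # shard weights across 4x A6000
    dtype="float16",                               # 16-bit weights, as benchmarked
)

# 50 requests with ~100 input tokens each (about 5,000 input tokens total).
prompts = ["Summarize the benefits of tensor parallelism for LLM inference."] * 50
params = SamplingParams(temperature=0.8, max_tokens=512)

outputs = llm.generate(prompts, params)
for out in outputs[:2]:
    print(out.outputs[0].text[:200])
```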
GPU Setup | Total VRAM | Market Price | Can Run 70B+? | Typical Use Case |
---|---|---|---|---
4×A6000 48GB | 192GB | ~$8,000–10,000 | ✅ Yes | Cost-effective inference |
2×A100 80GB | 160GB | ~$16,000–20,000 | ✅ Yes (tight fit) | High-end LLM deployment |
2×H100 80GB | 160GB | ~$25,000–30,000 | ✅ Yes | Ultra-performance research |
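As a sanity check on the "tight fit" note, a rough back-of-the-envelope estimate is parameters × 2 bytes for FP16 weights, with whatever remains available for KV cache. The helper below is a sketch under that assumption; it ignores activation memory and runtime overhead, so treat the headroom figures as approximate.

```python
# Rough FP16 VRAM estimate; KV-cache headroom is an approximation,
# not exact vLLM memory accounting.
def fp16_weight_gb(num_params_billion: float) -> float:
    return num_params_billion * 1e9 * 2 / 1024**3  # 2 bytes per parameter

for name, params_b, total_vram_gb in [
    ("Meta-Llama-3-70B on 4x A6000", 70, 192),
    ("Meta-Llama-3-70B on 2x A100", 70, 160),
    ("Qwen2.5-VL-72B on 4x A6000", 72, 192),
]:
    weights = fp16_weight_gb(params_b)
    headroom = total_vram_gb - weights
    print(f"{name}: ~{weights:.0f} GB weights, ~{headroom:.0f} GB left for KV cache")
```

A 70B model needs roughly 130 GB for FP16 weights alone, which is why 160 GB (2×A100 80GB) is workable but leaves little KV-cache headroom, while 192 GB (4×A6000) is comfortable.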
Model | Total Throughput (tokens/s) | Output Throughput (tokens/s) | Median TTFT (ms)
---|---|---|---
Qwen3-32B | 827.74 | 676.72 | 1958.54 |
Interested in optimizing your vLLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference.
Multi-GPU Dedicated Server - 4xRTX A6000
Multi-GPU Dedicated Server - 8xRTX A6000
Multi-GPU Dedicated Server - 2xRTX 5090
Multi-GPU Dedicated Server - 4xA100
If you’re searching for the cheapest GPU configuration to run Hugging Face 70B or 72B models, 4×A6000 48GB (192GB total) delivers excellent price-to-performance. Whether you’re running Meta-Llama-3-70B, Qwen2.5-VL-72B, or DeepSeek-R1-Distill-Llama-70B, this setup runs all of them under vLLM, with throughput in the same class as flagship GPUs at a fraction of the cost.
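Once a model is served this way (for example, with `vllm serve meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4`), clients talk to it like any OpenAI-compatible endpoint. The snippet below is a hedged example; the base URL, API key, and model name are assumptions for illustration.

```python
# Hedged example: querying a vLLM OpenAI-compatible endpoint with the
# official openai client. Base URL, API key, and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain tensor parallelism in one paragraph."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```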
Keywords: 4×A6000 vLLM benchmark, vLLM multi-GPU inference, cheapest GPU for 70B model inference, A6000 vs A100 vs H100 for LLM, best GPU for running llama3 70B, Qwen 72B inference setup, hugging face 72B model GPU requirements, vLLM performance 70B model, affordable large language model deployment