Test Overview
1. Test Project Code Source:
- We built the test environment from the vLLM project on GitHub (https://github.com/vllm-project/vllm).
2. The Following Models from Hugging Face Were Tested:
- Qwen/Qwen2.5-VL-72B-Instruct
- meta-llama/Meta-Llama-3-70B-Instruct
- meta-llama/Llama-3.1-70B
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- Qwen/Qwen3-32B
3. The Online Test Parameters Are Preset as Follows (a configuration sketch appears after this list):
- --random-input-len 100
- --random-output-len 600
- --num-prompts 50
- --tensor-parallel-size 4
- --gpu-memory-utilization 0.95
- --max-model-len 2048
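For reference, below is a minimal Python sketch of how these engine settings map onto vLLM's offline API. The actual benchmark ran against vLLM's online serving endpoint with the client flags listed above, so the model name, prompt, and temperature here are illustrative placeholders only.

```python
# Minimal sketch: mapping the preset parameters onto vLLM's Python API.
# Assumption: vLLM is installed and 4 GPUs are visible; the real runs used
# the online server rather than this offline entry point.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",           # any of the tested Hugging Face models
    tensor_parallel_size=4,           # --tensor-parallel-size 4
    gpu_memory_utilization=0.95,      # --gpu-memory-utilization 0.95
    max_model_len=2048,               # --max-model-len 2048
)

# Roughly mirrors --random-output-len 600 for a single request.
sampling = SamplingParams(max_tokens=600, temperature=0.8)

outputs = llm.generate(["Describe tensor parallelism in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```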
4×A100 vLLM Benchmark for 50 Concurrent Requests
Models | Qwen/Qwen2.5-VL-72B-Instruct | meta-llama/Meta-Llama-3-70B-Instruct | meta-llama/Llama-3.1-70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Qwen/Qwen3-32B |
---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 137 | 132 | 132 | 132 | 65 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
Tensor Parallelism | 4 | 4 | 4 | 4 | 4 |
Request Numbers | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 183.21 | 95.56 | 86.93 | 93.62 | 25.92 |
Total Input Tokens | 5000 | 4950 | 4950 | 4950 | 5000 |
Total Generated Tokens | 23317 | 23290 | 24628 | 27379 | 23687 |
Request Rate (req/s) | 0.27 | 0.52 | 0.58 | 0.53 | 1.93 |
Input Throughput (tokens/s) | 27.29 | 51.8 | 56.94 | 52.88 | 192.88 |
Output Throughput (tokens/s) | 127.27 | 243.72 | 283.31 | 292.43 | 913.73 |
Total Throughput (tokens/s) | 154.56 | 295.52 | 340.25 | 345.31 | 1106.61 |
Median TTFT (ms) | 3077.27 | 2663.16 | 2662.94 | 2426.34 | 1352.90 |
P99 TTFT (ms) | 124922.31 | 3213.19 | 3215.87 | 3283.82 | 1627.84 |
Median TPOT (ms) | 92.95 | 106.44 | 104.99 | 105.87 | 41.92 |
P99 TPOT (ms) | 543.76 | 754.05 | 912.88 | 220.63 | 52.77 |
Median Eval Rate (tokens/s) | 10.76 | 9.39 | 9.52 | 9.44 | 23.85 |
P99 Eval Rate (tokens/s) | 1.84 | 1.35 | 1.10 | 4.53 | 18.95 |
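As a sanity check on how the aggregate rates above relate to the raw counts, the short sketch below recomputes them for the Qwen3-32B column. The formulas are the straightforward duration-based definitions, which we assume match how the benchmark script reports them.

```python
# Recomputing the aggregate metrics for the Qwen3-32B column from the table.
# Assumption: reported rates are simple totals divided by benchmark duration.
duration_s = 25.92        # Benchmark Duration (s)
num_requests = 50         # Request Numbers
input_tokens = 5000       # Total Input Tokens
output_tokens = 23687     # Total Generated Tokens

request_rate = num_requests / duration_s                  # ~1.93 req/s
input_tps = input_tokens / duration_s                     # ~192.9 tokens/s
output_tps = output_tokens / duration_s                   # ~913.9 tokens/s
total_tps = (input_tokens + output_tokens) / duration_s   # ~1106.8 tokens/s

print(f"{request_rate:.2f} req/s, {output_tps:.2f} out tok/s, {total_tps:.2f} total tok/s")
```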
4×A6000 Outperforms 4×A100 Across All 70B Models
GPU Setup | Total VRAM | Market Price | Total Throughput (tokens/s) | Can Run 70-72B? |
---|---|---|---|---|
4×A6000 48GB | 192GB | ~$8,000–10,000 | ~450 | ✅ Yes |
4×A100 40GB | 160GB | ~$16,000–20,000 | ~150 | ✅ Yes (not recommended) |
Despite the A100’s stronger FP16 compute power, the 160GB VRAM ceiling severely bottlenecks performance. The inference throughput for Qwen2.5-VL-72B, for example, is nearly 3× faster on A6000s (449.51 vs. 154.56 tokens/s). Similar trends are seen for Meta-Llama-3 and DeepSeek models.
- Reason: The 72B model (~137GB of FP16 weights) barely fits into the A100 setup's 160GB of total VRAM, leaving minimal headroom for the KV cache. Median TPOT for Qwen2.5 on the A100 is actually lower (92.95 ms vs. 107.41 ms on the A6000), but P99 latency explodes on the A100 because the KV cache runs out under concurrent load.
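To make the memory argument concrete, here is a rough back-of-the-envelope estimate of the KV-cache budget on each setup, assuming a Llama-3-70B-class architecture (80 layers, 8 KV heads of dimension 128, FP16 cache). The exact figures depend on the model config and vLLM's internal overheads, so treat this as an estimate only.

```python
# Rough KV-cache budget estimate, assuming a Llama-3-70B-class config:
# 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes) for both K and V.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES   # K + V
print(f"KV cache per token: {kv_bytes_per_token / 1e6:.2f} MB")  # ~0.33 MB

def kv_budget_tokens(total_vram_gb, weights_gb, util=0.95):
    """Tokens of KV cache that fit after weights, at --gpu-memory-utilization."""
    free_bytes = (total_vram_gb * util - weights_gb) * 1e9
    return int(free_bytes / kv_bytes_per_token)

weights_gb = 137  # Qwen2.5-VL-72B at FP16, from the table above
print("4xA100 40GB :", kv_budget_tokens(160, weights_gb), "tokens")   # ~45k
print("4xA6000 48GB:", kv_budget_tokens(192, weights_gb), "tokens")   # ~140k
```

Under these assumptions, 50 concurrent requests at --max-model-len 2048 can demand up to roughly 50 × 2048 ≈ 102k cached tokens in the worst case, which fits comfortably on the A6000 setup but not on the A100 setup, consistent with the queueing and tail-latency blow-up seen in the P99 numbers.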
4×A100 Performs Better with Smaller Models (32B)
When switching to a smaller model like Qwen3-32B (65GB), the memory pressure eases on A100s, and the compute advantage begins to show:
GPU | Model | Total Throughput (tokens/s) | Median TPOT (ms) | Eval Rate (tokens/s) |
---|---|---|---|---|
4×A6000 48GB | Qwen3-32B (65GB) | 827.74 | 52.33 | 19.10 |
4×A100 40GB | Qwen3-32B (65GB) | 1106.61 | 41.92 | 23.85 |
This reversal confirms that the A100's bottleneck in the 70B tests is VRAM capacity, not compute. Once the model leaves enough headroom for the KV cache, the A100's stronger compute pulls ahead.
Get Started with Multi-GPU Server Hosting
Interested in optimizing your vLLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference.
Multi-GPU Dedicated Server - 4xRTX A6000
$ 1199.00/mo
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 4 x Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
Multi-GPU Dedicated Server - 8xRTX A6000
$ 2099.00/mo
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 8 x Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
New Arrival
Multi-GPU Dedicated Server - 2xRTX 5090
$ 999.00/mo
- 256GB RAM
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 5090
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Multi-GPU Dedicated Server - 4xA100
$ 1899.00/mo
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 4 x Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
Conclusion
The choice between A100 and A6000 GPUs should be guided by model size:
- For 70B+ models (~132–137GB at FP16): Go with 4×A6000. The extra 32GB of total VRAM unlocks significantly better throughput and far lower tail latency.
- For models around 32B and smaller: 4×A100s can outperform A6000s thanks to their superior compute, as long as there is enough VRAM headroom left for the KV cache.
If you're deploying large Hugging Face models like Qwen2.5-VL-72B or Meta-Llama-3, don't be misled by raw TFLOPS alone; in real-world inference, memory capacity and the KV-cache headroom it provides matter just as much.
Attachment: Video Recording of the 4×A100 vLLM Benchmark
Attachment: Video Recording of the 4×A6000 vLLM Benchmark
Tags:
LLM inference benchmark, 72B model GPU test, 4xA100 vs 4xA6000, Qwen2.5-VL-72B, Meta-Llama-3 70B, GPU memory limits, Hugging Face large models, vLLM inference speed