Test Overview
1. Test Project Code Source:
- We built the test environment from the vLLM project on GitHub (https://github.com/vllm-project/vllm).
2. The Following Models from Hugging Face Were Tested:
- Qwen/Qwen2.5-VL-72B-Instruct
- meta-llama/Meta-Llama-3-70B-Instruct
- meta-llama/Llama-3.1-70B
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- Qwen/Qwen3-32B
3. The Online Test Parameters Are Preset as Follows (a configuration sketch appears after this list):
- --random-input-len 100
- --random-output-len 600
- --num-prompts 50
- --tensor-parallel-size 4
- --gpu-memory-utilization 0.95
- --max-model-len 2048
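For reference, below is a minimal Python sketch of how these engine settings map onto vLLM's offline API. The actual benchmark ran against vLLM's online serving endpoint with the client flags listed above, so the model name, prompt, and temperature here are illustrative placeholders only.

```python
# Minimal sketch: mapping the preset parameters onto vLLM's Python API.
# Assumption: vLLM is installed and 4 GPUs are visible; the real runs used
# the online server rather than this offline entry point.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",           # any of the tested Hugging Face models
    tensor_parallel_size=4,           # --tensor-parallel-size 4
    gpu_memory_utilization=0.95,      # --gpu-memory-utilization 0.95
    max_model_len=2048,               # --max-model-len 2048
)

# Roughly mirrors --random-output-len 600 for a single request.
sampling = SamplingParams(max_tokens=600, temperature=0.8)

outputs = llm.generate(["Describe tensor parallelism in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```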
4×A100 vLLM Benchmark for 50 Concurrent Requests
Models | Qwen/Qwen2.5-VL-72B-Instruct | meta-llama/Meta-Llama-3-70B-Instruct | meta-llama/Llama-3.1-70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Qwen/Qwen3-32B |
---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 137 | 132 | 132 | 132 | 65 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
Tensor Parallelism | 4 | 4 | 4 | 4 | 4 |
Request Numbers | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 183.21 | 95.56 | 86.93 | 93.62 | 25.92 |
Total Input Tokens | 5000 | 4950 | 4950 | 4950 | 5000 |
Total Generated Tokens | 23317 | 23290 | 24628 | 27379 | 23687 |
Request Rate (req/s) | 0.27 | 0.52 | 0.58 | 0.53 | 1.93 |
Input Throughput (tokens/s) | 27.29 | 51.8 | 56.94 | 52.88 | 192.88 |
Output Throughput (tokens/s) | 127.27 | 243.72 | 283.31 | 292.43 | 913.73 |
Total Throughput (tokens/s) | 154.56 | 295.52 | 340.25 | 345.31 | 1106.61 |
Median TTFT (ms) | 3077.27 | 2663.16 | 2662.94 | 2426.34 | 1352.90 |
P99 TTFT (ms) | 124922.31 | 3213.19 | 3215.87 | 3283.82 | 1627.84 |
Median TPOT (ms) | 92.95 | 106.44 | 104.99 | 105.87 | 41.92 |
P99 TPOT (ms) | 543.76 | 754.05 | 912.88 | 220.63 | 52.77 |
Median Eval Rate (tokens/s) | 10.76 | 9.39 | 9.52 | 9.44 | 23.85 |
P99 Eval Rate (tokens/s) | 1.84 | 1.35 | 1.10 | 4.53 | 18.95 |
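As a sanity check on how the aggregate rates above relate to the raw counts, the short sketch below recomputes them for the Qwen3-32B column. The formulas are the straightforward duration-based definitions, which we assume match how the benchmark script reports them.

```python
# Recomputing the aggregate metrics for the Qwen3-32B column from the table.
# Assumption: reported rates are simple totals divided by benchmark duration.
duration_s = 25.92        # Benchmark Duration (s)
num_requests = 50         # Request Numbers
input_tokens = 5000       # Total Input Tokens
output_tokens = 23687     # Total Generated Tokens

request_rate = num_requests / duration_s                  # ~1.93 req/s
input_tps = input_tokens / duration_s                     # ~192.9 tokens/s
output_tps = output_tokens / duration_s                   # ~913.9 tokens/s
total_tps = (input_tokens + output_tokens) / duration_s   # ~1106.8 tokens/s

print(f"{request_rate:.2f} req/s, {output_tps:.2f} out tok/s, {total_tps:.2f} total tok/s")
```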
4×A6000 Outperforms 4×A100 Across All 70B Models
GPU Setup | Total VRAM | Market Price | Total Throughput (tokens/s) | Can Run 70-72B? |
---|---|---|---|---|
4×A6000 48GB | 192GB | ~$8,000–10,000 | ~450 | ✅ Yes |
4×A100 40GB | 160GB | ~$16,000–20,000 | ~150 | ✅ Yes (not recommended) |
Despite the A100’s stronger FP16 compute power, the 160GB VRAM ceiling severely bottlenecks performance. The inference throughput for Qwen2.5-VL-72B, for example, is nearly 3× faster on A6000s (449.51 vs. 154.56 tokens/s). Similar trends are seen for Meta-Llama-3 and DeepSeek models.
- Reason: The 72B model (~137GB of FP16 weights) barely fits into the A100 setup's 160GB of total VRAM, leaving minimal headroom for the KV cache. Median TPOT for Qwen2.5 on the A100 is actually lower (92.95 ms vs. 107.41 ms on the A6000), but P99 latency explodes on the A100 because the KV cache runs out under concurrent load.
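To make the memory argument concrete, here is a rough back-of-the-envelope estimate of the KV-cache budget on each setup, assuming a Llama-3-70B-class architecture (80 layers, 8 KV heads of dimension 128, FP16 cache). The exact figures depend on the model config and vLLM's internal overheads, so treat this as an estimate only.

```python
# Rough KV-cache budget estimate, assuming a Llama-3-70B-class config:
# 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes) for both K and V.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES   # K + V
print(f"KV cache per token: {kv_bytes_per_token / 1e6:.2f} MB")  # ~0.33 MB

def kv_budget_tokens(total_vram_gb, weights_gb, util=0.95):
    """Tokens of KV cache that fit after weights, at --gpu-memory-utilization."""
    free_bytes = (total_vram_gb * util - weights_gb) * 1e9
    return int(free_bytes / kv_bytes_per_token)

weights_gb = 137  # Qwen2.5-VL-72B at FP16, from the table above
print("4xA100 40GB :", kv_budget_tokens(160, weights_gb), "tokens")   # ~45k
print("4xA6000 48GB:", kv_budget_tokens(192, weights_gb), "tokens")   # ~140k
```

Under these assumptions, 50 concurrent requests at --max-model-len 2048 can demand up to roughly 50 × 2048 ≈ 102k cached tokens in the worst case, which fits comfortably on the A6000 setup but not on the A100 setup, consistent with the queueing and tail-latency blow-up seen in the P99 numbers.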
4×A100 Performs Better with Smaller Models (32B)
When switching to a smaller model like Qwen3-32B (65GB), the memory pressure eases on A100s, and the compute advantage begins to show:
GPU | Model | Total Throughput (tokens/s) | Median TPOT (ms) | Eval Rate (tokens/s) |
---|---|---|---|---|
4×A6000 48GB | Qwen3-32B (65GB) | 827.74 | 52.33 | 19.10 |
4×A100 40GB | Qwen3-32B (65GB) | 1106.61 | 41.92 | 23.85 |
This reversal confirms that the A100's bottleneck in the 70B tests is VRAM capacity, not compute. Once the model leaves enough headroom for the KV cache, the A100's stronger compute pulls ahead.
Get Started with Multi-GPU Server Hosting
Interested in optimizing your vLLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference.
Multi-GPU Dedicated Server - 4xRTX A6000
$ 1199.00/mo
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 4 x Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
Multi-GPU Dedicated Server - 8xRTX A6000
$ 2099.00/mo
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 8 x Quadro RTX A6000
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 38.71 TFLOPS
New Arrival
Multi-GPU Dedicated Server - 2xRTX 5090
$ 999.00/mo
- 256GB RAM
- Dual Gold 6148
- 240GB SSD + 2TB NVMe + 8TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 2 x GeForce RTX 5090
- Microarchitecture: Blackwell
- CUDA Cores: 21,760
- Tensor Cores: 680
- GPU Memory: 32 GB GDDR7
- FP32 Performance: 109.7 TFLOPS
Multi-GPU Dedicated Server - 4xA100
$ 1899.00/mo
- 512GB RAM
- Dual 22-Core E5-2699v4
- 240GB SSD + 4TB NVMe + 16TB SATA
- 1Gbps
- OS: Windows / Linux
- GPU: 4 x Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
Conclusion
The choice between A100 and A6000 GPUs should be guided by model size:
- For 70B+ models (~132–137GB at FP16): Go with 4×A6000. The extra 32GB of total VRAM unlocks significantly better throughput and far lower tail latency.
- For models around 32B and smaller: 4×A100s can outperform A6000s thanks to their superior compute, as long as there is enough VRAM headroom left for the KV cache.
If you're deploying large Hugging Face models like Qwen2.5-VL-72B or Meta-Llama-3, don't be misled by raw TFLOPS alone; in real-world inference, memory capacity and the KV-cache headroom it provides matter just as much.
Attachment: Video Recording of the 4×A100 vLLM Benchmark
Attachment: Video Recording of the 4×A6000 vLLM Benchmark
Tags:
LLM inference benchmark, 72B model GPU test, 4xA100 vs 4xA6000, Qwen2.5-VL-72B, Meta-Llama-3 70B, GPU memory limits, Hugging Face large models, vLLM inference speed