4×A100 vs. 4×A6000 for 72B LLM Inference: Hugging Face Benchmarks Reveal the Limits of GPU Memory

As large language models (LLMs) like Qwen2.5-VL-72B and Meta-Llama-3-70B grow in size and complexity, the hardware required to serve them efficiently is becoming increasingly demanding. A natural comparison arises between two widely available GPU setups: 4×A100 (40GB each, 160GB total) vs. 4×A6000 (48GB each, 192GB total). In this benchmark, we evaluate their performance on 70-72B models from Hugging Face (~132-137GB of 16-bit weights) and assess how memory availability impacts speed and responsiveness.

Test Overview

1. Test Project Code Source: the vLLM project's serving benchmark script (benchmark_serving.py).

2. The Following Models from Hugging Face were Tested:

  • Qwen/Qwen2.5-VL-72B-Instruct
  • meta-llama/Meta-Llama-3-70B-Instruct
  • meta-llama/Llama-3.1-70B
  • deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  • Qwen/Qwen3-32B

3. The Online Test Parameters are Preset as Follows (see the example invocation after this list):

  • --random-input-len 100
  • --random-output-len 600
  • --num-prompts 50
  • --tensor-parallel-size 4
  • --gpu-memory-utilization 0.95
  • --max-model-len 2048
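
Taken together, these flags presumably split between the vLLM server (engine and parallelism settings) and the benchmark client that fires the 50 prompts. The sketch below is a reconstruction of that split, assuming vLLM's standard benchmarks/benchmark_serving.py script; exact paths and flag spellings may differ across vLLM versions.

```python
# Hypothetical reconstruction of the test invocation (names/paths assumed):
# engine flags go to the vLLM server, workload flags go to the benchmark
# client shipped in the vLLM repository (benchmarks/benchmark_serving.py).
import shlex

MODEL = "Qwen/Qwen2.5-VL-72B-Instruct"  # any of the five models under test

server_cmd = [
    "vllm", "serve", MODEL,
    "--tensor-parallel-size", "4",       # shard the weights across 4 GPUs
    "--gpu-memory-utilization", "0.95",  # let vLLM use 95% of VRAM (weights + KV cache)
    "--max-model-len", "2048",           # context cap used for this test
]

client_cmd = [
    "python", "benchmarks/benchmark_serving.py",
    "--backend", "vllm",
    "--model", MODEL,
    "--dataset-name", "random",          # synthetic prompts
    "--random-input-len", "100",
    "--random-output-len", "600",
    "--num-prompts", "50",               # 50 concurrent requests
]

print("server:", shlex.join(server_cmd))
print("client:", shlex.join(client_cmd))
# Start the server first, wait until it is ready, then run the printed client command.
```

In this layout, only --tensor-parallel-size, --gpu-memory-utilization, and --max-model-len shape how the GPUs are used; the remaining flags describe the synthetic workload.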

4×A100 vLLM Benchmark for 50 Concurrent Requests

| Models | Qwen/Qwen2.5-VL-72B-Instruct | meta-llama/Meta-Llama-3-70B-Instruct | meta-llama/Llama-3.1-70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Qwen/Qwen3-32B |
|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 137 | 132 | 132 | 132 | 65 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
| Tensor Parallelism | 4 | 4 | 4 | 4 | 4 |
| Request Numbers | 50 | 50 | 50 | 50 | 50 |
| Benchmark Duration (s) | 183.21 | 95.56 | 86.93 | 93.62 | 25.92 |
| Total Input Tokens | 5000 | 4950 | 4950 | 4950 | 5000 |
| Total Generated Tokens | 23317 | 23290 | 24628 | 27379 | 23687 |
| Request (req/s) | 0.27 | 0.52 | 0.58 | 0.53 | 1.93 |
| Input (tokens/s) | 27.29 | 51.8 | 56.94 | 52.88 | 192.88 |
| Output (tokens/s) | 127.27 | 243.72 | 283.31 | 292.43 | 913.73 |
| Total Throughput (tokens/s) | 154.56 | 295.52 | 340.25 | 345.31 | 1106.61 |
| Median TTFT (ms) | 3077.27 | 2663.16 | 2662.94 | 2426.34 | 1352.90 |
| P99 TTFT (ms) | 124922.31 | 3213.19 | 3215.87 | 3283.82 | 1627.84 |
| Median TPOT (ms) | 92.95 | 106.44 | 104.99 | 105.87 | 41.92 |
| P99 TPOT (ms) | 543.76 | 754.05 | 912.88 | 220.63 | 52.77 |
| Median Eval Rate (tokens/s) | 10.76 | 9.39 | 9.52 | 9.44 | 23.85 |
| P99 Eval Rate (tokens/s) | 1.84 | 1.35 | 1.10 | 4.53 | 18.95 |
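
The derived rows are simple ratios of the raw counts over the benchmark duration. A quick sanity check for the Qwen2.5-VL-72B column:

```python
# Re-derive the throughput rows of the 4xA100 table from its raw counts,
# using the Qwen/Qwen2.5-VL-72B-Instruct column as a worked example.
duration_s = 183.21       # Benchmark Duration (s)
num_requests = 50
input_tokens = 5000       # Total Input Tokens
output_tokens = 23317     # Total Generated Tokens

req_rate = num_requests / duration_s                      # ~0.27 req/s
input_tps = input_tokens / duration_s                     # ~27.29 tokens/s
output_tps = output_tokens / duration_s                   # ~127.27 tokens/s
total_tps = (input_tokens + output_tokens) / duration_s   # ~154.56 tokens/s

print(f"{req_rate:.2f} req/s | {input_tps:.2f} in | {output_tps:.2f} out | {total_tps:.2f} total tok/s")
```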

4xA6000 Outperforms 4xA100 Across All 70B Models

| GPU Setup | Total VRAM | Market Price | Total Throughput (tokens/s) | Can Run 70-72B? |
|---|---|---|---|---|
| 4×A6000 48GB | 192GB | ~$8,000–10,000 | ~450 | ✅ Yes |
| 4×A100 40GB | 160GB | ~$16,000–20,000 | ~150 | ✅ Yes (not recommended) |

Despite the A100’s stronger FP16 compute power, the 160GB VRAM ceiling severely bottlenecks performance. The inference throughput for Qwen2.5-VL-72B, for example, is nearly 3× faster on A6000s (449.51 vs. 154.56 tokens/s). Similar trends are seen for Meta-Llama-3 and DeepSeek models.
  • Reason: The 72B model (~137GB) just fits into the A100’s 160GB total memory — leaving minimal room for KV cache and resulting in slow response times and high latency. Median TPOT for Qwen2.5 on A100 is 92.95 ms vs. 107.41 ms on A6000, but P99 latency explodes on A100 due to memory exhaustion.
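
A rough back-of-envelope estimate makes the capacity argument concrete. The sketch below assumes Llama-3-70B-style attention geometry (80 layers, 8 KV heads, 128-dim heads under GQA, 16-bit KV values) and treats the listed sizes as GiB; exact figures vary per model, but the gap between the two setups is the point.

```python
# Estimate how much room is left for vLLM's KV cache after loading ~137 GB
# of 16-bit weights, and how many tokens that cache can hold.
GIB = 1024**3

# Per-token KV footprint: 2 bytes * 2 (K and V) * layers * kv_heads * head_dim.
# Assumed Llama-3-70B-like geometry; Qwen2.5-72B is in the same ballpark.
kv_bytes_per_token = 2 * 2 * 80 * 8 * 128   # ~320 KiB per token

for setup, total_vram_gib in [("4xA100 40GB", 160), ("4xA6000 48GB", 192)]:
    usable = total_vram_gib * 0.95          # --gpu-memory-utilization 0.95
    headroom_gib = usable - 137             # what remains after the weights
    cache_tokens = headroom_gib * GIB / kv_bytes_per_token
    full_seqs = cache_tokens / 2048         # sequences at --max-model-len 2048
    print(f"{setup}: ~{headroom_gib:.0f} GiB KV headroom "
          f"-> ~{cache_tokens/1000:.0f}k tokens (~{full_seqs:.0f} full-length sequences)")
```

On this estimate, the A100 box can keep only about half of the 50 concurrent requests' KV state resident at full length, which is consistent with the queueing behind the 124.9 s P99 TTFT, while the A6000 box holds all 50 with room to spare.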

4xA100 Performs Better with Smaller Models (32B)

When switching to a smaller model like Qwen3-32B (65GB), the memory pressure eases on A100s, and the compute advantage begins to show:

| GPU | Model | Total Throughput (tokens/s) | Median TPOT (ms) | Eval Rate (tokens/s) |
|---|---|---|---|---|
| 4×A6000 48GB | Qwen3-32B (65GB) | 827.74 | 52.33 | 19.10 |
| 4×A100 40GB | Qwen3-32B (65GB) | 1106.61 | 41.92 | 23.85 |

This flip confirms that the A100's bottleneck with 70B+ models is VRAM capacity, not compute. Once the model fits comfortably, the A100's full potential is unleashed.
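
A rough roofline-style check with public per-GPU datasheet figures (not measured here) points the same way: once the weights and KV cache fit comfortably, each A100 brings roughly twice the memory bandwidth and FP16 tensor throughput of an A6000, so decode speed should favor the A100s. A minimal sketch:

```python
# Compare aggregate bandwidth/compute of the two 4-GPU setups and derive a
# loose upper bound on single-sequence decode speed for a 65 GB model
# (decode is roughly bandwidth bound: every weight byte is read once per token).
specs = {
    # per-GPU datasheet values: (memory bandwidth GB/s, dense FP16 tensor TFLOPS, approx.)
    "A100 40GB": (1555, 312),
    "RTX A6000": (768, 155),
}
model_gb = 65  # Qwen3-32B weights at 16-bit

for gpu, (bw_gbs, fp16_tflops) in specs.items():
    agg_bw = 4 * bw_gbs
    max_decode_tps = agg_bw / model_gb   # upper bound; ignores KV reads and overhead
    print(f"4x{gpu}: ~{agg_bw} GB/s, ~{4 * fp16_tflops} FP16 TFLOPS, "
          f"decode bound ~{max_decode_tps:.0f} tokens/s per sequence")
```

Measured per-request eval rates (23.85 vs. 19.10 tokens/s) sit well below both bounds because 50 requests are batched together, but the direction of the gap matches the table above.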

Get Started with Multi-GPU Server Hosting

Interested in optimizing your vLLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference.

Multi-GPU Dedicated Server - 4xRTX A6000

$1,199.00/mo
  • 512GB RAM
  • Dual 22-Core E5-2699v4
  • 240GB SSD + 4TB NVMe + 16TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 4 x Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS

Multi-GPU Dedicated Server - 8xRTX A6000

$2,099.00/mo
  • 512GB RAM
  • Dual 22-Core E5-2699v4
  • 240GB SSD + 4TB NVMe + 16TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 8 x Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS

Multi-GPU Dedicated Server - 2xRTX 5090

$999.00/mo
  • 256GB RAM
  • Dual Gold 6148
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x GeForce RTX 5090
  • Microarchitecture: Blackwell
  • CUDA Cores: 21,760
  • Tensor Cores: 680
  • GPU Memory: 32 GB GDDR7
  • FP32 Performance: 109.7 TFLOPS

Multi-GPU Dedicated Server - 4xA100

$1,899.00/mo
  • 512GB RAM
  • Dual 22-Core E5-2699v4
  • 240GB SSD + 4TB NVMe + 16TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 4 x Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS

Conclusion

The choice between A100 and A6000 GPUs should be guided by model size:

  • For 70B+ models (132-137GB at 16-bit precision): Go with 4×A6000. The extra 32GB of VRAM unlocks significantly better throughput and lower latency.
  • For 32B-class and smaller models: 4×A100s can outperform A6000s thanks to their superior compute and memory bandwidth, as long as everything fits in VRAM.
If you're deploying large Hugging Face models like Qwen2.5-VL-72B or Meta-Llama-3, don't be misled by raw TFLOPS: in real-world inference, memory capacity is king.

Attachment: Video Recording of 4*A100 vLLM Benchmark

Attachment: Video Recording of 4*A6000 vLLM Benchmark

Tags:

LLM inference benchmark, 72B model GPU test, 4xA100 vs 4xA6000, Qwen2.5 VL-72B, Meta-Llama-3 70B, GPU memory limits, Hugging Face large models, vLLM inference speed