4×A6000 vLLM Benchmark: The Cheapest GPU for 70B–72B Hugging Face Model Inference

As large language models (LLMs) like Qwen-72B, Meta-Llama-3-70B, and DeepSeek-70B dominate NLP workloads, deploying them efficiently has become a top concern for researchers, startups, and infrastructure engineers. These models require over 130GB of VRAM, which pushes many users toward expensive GPU setups such as dual A100 80GB or dual H100 80GB—both of which cost tens of thousands of dollars.

However, there’s a highly cost-effective alternative: the 4×NVIDIA A6000 (48GB) setup. This configuration delivers 192GB of total VRAM, enough to cover all current 70–72B Hugging Face models using vLLM, a fast inference engine for transformer models.

Test Overview

1. Specifications of a Single A6000 GPU:

  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • Compute capability: 8.6
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • Memory: 48GB GDDR6
  • FP32 performance: 38.71 TFLOPS

2. Test Project Code Source:

3. The Following Models from Hugging Face were Tested:

  • Qwen/Qwen2.5-VL-72B-Instruct
  • meta-llama/Meta-Llama-3-70B-Instruct
  • meta-llama/Llama-3.1-70B
  • deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  • Qwen/Qwen3-32B

4. The Online Test Parameters are Preset as Follows:

  • Input length: 100 tokens
  • Output length: 600 tokens
  • 50 concurrent requests
  • --tensor-parallel-size 4 (a minimal reproduction sketch follows this list)
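For readers who want to approximate this setup themselves, the sketch below expresses the same settings through vLLM's offline Python API. The exact benchmark client used for this test is not reproduced here, so treat the model choice, prompt text, and sampling settings as placeholders; only the tensor parallelism, dtype, request count, and token lengths mirror the parameters above.

```python
# Minimal sketch (not the article's benchmark script): run one of the tested
# 70B-class models across 4 GPUs with vLLM's offline Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # any of the tested checkpoints
    tensor_parallel_size=4,                        # matches --tensor-parallel-size 4
    dtype="float16",                               # 16-bit weights ("Quantization: 16")
)

# 50 requests, roughly 100 input tokens each, up to 600 output tokens each.
prompt = "Explain the trade-offs of tensor parallelism for large language model inference. " * 8
prompts = [prompt] * 50
sampling = SamplingParams(max_tokens=600, temperature=0.7)

outputs = llm.generate(prompts, sampling)  # vLLM batches and schedules the requests internally
print(f"Completed {len(outputs)} requests")
```

The article's numbers come from an online (served) test rather than this offline API, but the model, dtype, and tensor-parallelism settings carry over unchanged.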

4×A6000 vLLM Benchmark for 50 Concurrent Requests

In this benchmark, the 4×A6000 configuration handles 70B+ models with strong throughput and acceptable latency. Here’s how the numbers break down.
| Models | Qwen/Qwen2.5-VL-72B-Instruct | meta-llama/Meta-Llama-3-70B-Instruct | meta-llama/Llama-3.1-70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Qwen/Qwen3-32B |
|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 137 | 132 | 132 | 132 | 65 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
| Tensor Parallelism | 4 | 4 | 4 | 4 | 4 |
| Request Numbers | 50 | 50 | 50 | 50 | 50 |
| Benchmark Duration (s) | 68.59 | 63.96 | 65.50 | 66.46 | 33.11 |
| Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 |
| Total Generated Tokens | 25834 | 22252 | 24527 | 25994 | 22406 |
| Request (req/s) | 0.73 | 0.78 | 0.76 | 0.75 | 1.51 |
| Input (tokens/s) | 72.89 | 78.17 | 76.34 | 75.22 | 151.02 |
| Output (tokens/s) | 376.62 | 347.93 | 374.45 | 391.10 | 676.72 |
| Total Throughput (tokens/s) | 449.51 | 426.10 | 450.79 | 466.32 | 827.74 |
| Median TTFT (ms) | 4134.85 | 4028.61 | 3945.48 | 4055.27 | 1958.54 |
| P99 TTFT (ms) | 5269.88 | 5126.21 | 5019.25 | 5157.44 | 2495.06 |
| Median TPOT (ms) | 107.41 | 102.25 | 102.61 | 104.10 | 52.33 |
| P99 TPOT (ms) | 113.69 | 1743.43 | 1900.50 | 136.96 | 96.08 |
| Median Eval Rate (tokens/s) | 9.31 | 9.78 | 9.74 | 9.61 | 19.10 |
| P99 Eval Rate (tokens/s) | 8.79 | 0.57 | 0.56 | 7.30 | 10.41 |

✅ Key observations:

  • Total throughput for the 70–72B models sits consistently in the 426–466 tokens/s range, with Qwen2.5-VL-72B-Instruct reaching 449.51 tokens/s (a quick consistency check follows this list).
  • Median TTFT (time to first token) stays under 4200 ms for every model, which is reasonable responsiveness under 50 concurrent requests.
  • Each full benchmark run completes in under 70 seconds, even with 5,000 input tokens and 22K–26K generated tokens across the 50 requests.
  • All five models fit within the 192GB of total VRAM, so the runs avoid memory swapping and fragmentation issues.
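As a quick sanity check on the derived columns, the per-second figures in the Qwen2.5-VL-72B-Instruct column can be recomputed directly from its raw counts and duration (small differences from the table come from rounding in the reported duration):

```python
# Recompute the Qwen2.5-VL-72B-Instruct column's derived metrics from its raw counts.
duration_s = 68.59
requests = 50
total_input_tokens = 5000
total_generated_tokens = 25834

print(requests / duration_s)                                       # ~0.73 req/s
print(total_input_tokens / duration_s)                             # ~72.9 input tokens/s
print(total_generated_tokens / duration_s)                         # ~376.6 output tokens/s
print((total_input_tokens + total_generated_tokens) / duration_s)  # ~449.5 total tokens/s
```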

A6000 vs A100 vs H100: What’s the Real Cost?

| GPU Setup | Total VRAM | Market Price | Can Run 70B+? | Typical Use Case |
|---|---|---|---|---|
| 4×A6000 48GB | 192GB | ~$8,000–10,000 | ✅ Yes | Cost-effective inference |
| 2×A100 80GB | 160GB | ~$16,000–20,000 | ✅ Yes (tight fit) | High-end LLM deployment |
| 2×H100 80GB | 160GB | ~$25,000–30,000 | ✅ Yes | Ultra-performance research |
  • The 4×A6000 setup wins on cost per unit of throughput, offering more total VRAM than dual A100s or dual H100s at roughly half the price or less (a rough back-of-the-envelope calculation follows this list).
  • For startups and academic labs, it is the most budget-friendly inference option that still runs full-scale 70B+ models.
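To make the cost argument concrete, here is a rough comparison using the midpoints of the price ranges in the table above (approximate market prices, not quotes). Throughput per hardware dollar is shown only for the 4×A6000, since this article does not benchmark the A100 or H100 configurations.

```python
# Hardware dollars per GB of VRAM, using midpoints of the price ranges above.
setups = {
    "4 x A6000 48GB": {"vram_gb": 192, "price_usd": 9_000},   # midpoint of $8k-$10k
    "2 x A100 80GB":  {"vram_gb": 160, "price_usd": 18_000},  # midpoint of $16k-$20k
    "2 x H100 80GB":  {"vram_gb": 160, "price_usd": 27_500},  # midpoint of $25k-$30k
}
for name, spec in setups.items():
    print(f"{name}: ~${spec['price_usd'] / spec['vram_gb']:.0f} per GB of VRAM")

# The 4 x A6000 measured roughly 450 total tokens/s on the 70-72B models above.
print(f"4 x A6000: ~${9_000 / 450:.0f} of hardware per (token/s) of throughput")
```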

Can It Run Smaller Models Like Qwen3-32B?

Absolutely! In fact, performance with Qwen3-32B on 4×A6000 is exceptionally fast:
| Model | Total Throughput (tokens/s) | Output Tokens/s | Median TTFT (ms) |
|---|---|---|---|
| Qwen3-32B | 827.74 | 676.72 | 1958.54 |
  • That’s nearly 2× the total throughput of Qwen2.5-VL-72B, with less than half the median TTFT and TPOT.
  • So if your workload involves mixed-size inference, 4×A6000 is even more attractive.

Get Started with 4×A6000 GPU Server Hosting

Interested in optimizing your vLLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference.

Multi-GPU Dedicated Server - 4xRTX A6000

$1199.00/mo
  • 512GB RAM
  • Dual 22-Core E5-2699v4
  • 240GB SSD + 4TB NVMe + 16TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 4 x Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS

Multi-GPU Dedicated Server - 8xRTX A6000

$2099.00/mo
  • 512GB RAM
  • Dual 22-Core E5-2699v4
  • 240GB SSD + 4TB NVMe + 16TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 8 x Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS

Multi-GPU Dedicated Server - 2xRTX 5090

$999.00/mo
  • 256GB RAM
  • Dual Gold 6148
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x GeForce RTX 5090
  • Microarchitecture: Blackwell
  • CUDA Cores: 21,760
  • Tensor Cores: 680
  • GPU Memory: 32 GB GDDR7
  • FP32 Performance: 109.7 TFLOPS

Multi-GPU Dedicated Server - 4xA100

$1899.00/mo
  • 512GB RAM
  • Dual 22-Core E5-2699v4
  • 240GB SSD + 4TB NVMe + 16TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 4 x Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS

Conclusion: 4×A6000 Is a Cost-Effective Choice for Hugging Face 70–72B LLMs

If you’re searching for the cheapest GPU configuration that can run Hugging Face 70B or 72B models, 4×A6000 48GB (192GB total) delivers excellent price-to-performance. Whether you’re running Meta-Llama-3-70B, Qwen-72B, or DeepSeek, this setup runs all of them under vLLM with solid throughput at a fraction of the cost of flagship data-center GPUs.

Attachment: Video Recording of the 4×A6000 vLLM Benchmark

Data Item Explanation in the Table:

  • Quantization: The number of quantization bits. This test uses 16 bits, i.e., the full-precision model with no quantization applied.
  • Size(GB): Model size in GB.
  • Backend: The inference backend used. In this test, vLLM is used.
  • --tensor-parallel-size: Specifies the number of GPUs across which each layer's weight tensors are split (intra-layer parallelism). This shortens per-layer computation but requires high GPU interconnect bandwidth (e.g., --tensor-parallel-size 2 splits each layer's workload across 2 GPUs).
  • --pipeline-parallel-size: Specifies the number of pipeline stages; groups of consecutive layers are placed on different GPUs (inter-layer parallelism). This lets larger models fit in aggregate memory but introduces inter-stage communication overhead (e.g., --pipeline-parallel-size 3 divides the layers sequentially across 3 GPUs).
  • --dtype: Defines the numerical precision (e.g., --dtype float16 for FP16) to balance memory usage and computational accuracy during inference. Lower precision (e.g., FP16) speeds up inference but may slightly reduce output quality.
  • Successful Requests: The number of requests processed.
  • Benchmark duration(s): The total time to complete all requests.
  • Total input tokens: The total number of input tokens across all requests.
  • Total generated tokens: The total number of output tokens generated across all requests.
  • Request (req/s): The number of requests processed per second.
  • Input (tokens/s): The number of input tokens processed per second.
  • Output (tokens/s): The number of output tokens generated per second.
  • Total Throughput (tokens/s): The total number of tokens processed per second (input + output).
  • Median TTFT(ms): The time from when the request is made to when the first token is received, in milliseconds. A lower TTFT means that the user is able to get a response faster.
  • P99 TTFT (ms): The 99th percentile Time to First Token, representing the worst-case latency for 99% of requests—lower is better to ensure consistent performance.
  • Median TPOT(ms): The time required to generate each output token, in milliseconds. A lower TPOT indicates that the system is able to generate a complete response faster.
  • P99 TPOT (ms): The 99th percentile Time Per Output Token, showing the worst-case delay in token generation—lower is better to minimize response variability.
  • Median Eval Rate(tokens/s): The number of tokens evaluated per second per user. A high evaluation rate indicates that the system is able to serve each user efficiently.
  • P99 Eval Rate(tokens/s): The per-user token generation rate at the 99th percentile, representing the worst-case user experience (a short sketch of how these percentile metrics can be computed follows this list).
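For reference, the sketch below shows, with hypothetical per-request data rather than the measurements behind the table above, how the median and P99 TTFT/TPOT values and the per-user eval rate can be derived. Note that the eval-rate rows in the benchmark table are consistent with 1000 / TPOT (e.g., 1000 / 102.25 ms ≈ 9.78 tokens/s).

```python
# Illustrative only: derive median/P99 latency metrics from per-request measurements.
# The data here is randomly generated, not the article's measurements.
import math
import random
import statistics

random.seed(0)
ttft_ms = [random.uniform(3500, 5300) for _ in range(50)]  # time to first token per request
tpot_ms = [random.uniform(95, 140) for _ in range(50)]     # time per output token per request

def p99(values):
    """99th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

print(f"Median TTFT: {statistics.median(ttft_ms):.2f} ms, P99 TTFT: {p99(ttft_ms):.2f} ms")
print(f"Median TPOT: {statistics.median(tpot_ms):.2f} ms, P99 TPOT: {p99(tpot_ms):.2f} ms")

# Per-user eval rate (tokens/s) is the reciprocal of TPOT expressed in seconds.
print(f"Median eval rate: {1000.0 / statistics.median(tpot_ms):.2f} tokens/s, "
      f"P99 eval rate: {1000.0 / p99(tpot_ms):.2f} tokens/s")
```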
Tags:

4*A6000 vLLM benchmark, vLLM multi-GPU inference, cheapest GPU for 70B model inference, A6000 vs A100 vs H100 for LLM, best GPU for running llama3 70B, Qwen 72B inference setup, hugging face 72B model GPU requirements, vLLM performance 70B model, affordable large language model deployment