Models | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Llama-3.1-8B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B |
---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 15 | 28 | 15 | 15 | 28 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 17.81 | 36.25 | 18.26 | 19.37 | 31.04 |
Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 |
Total Generated Tokens | 23817 | 25213 | 22247 | 26839 | 17572 |
Request Throughput (req/s) | 2.81 | 1.38 | 2.74 | 2.58 | 1.61 |
Input Throughput (tokens/s) | 280.74 | 137.93 | 273.89 | 258.16 | 161.07 |
Output Throughput (tokens/s) | 1337.24 | 695.51 | 1218.63 | 1385.77 | 566.08 |
Total Throughput (tokens/s) | 1617.98 | 833.44 | 1492.52 | 1643.93 | 727.15 |
Median TTFT (ms) | 583.54 | 1095.08 | 593.26 | 652.56 | 474.60 |
P99 TTFT (ms) | 761.02 | 1407.86 | 810.46 | 857.21 | 637.96 |
Median TPOT (ms) | 28.67 | 58.59 | 29.40 | 31.15 | 51.18 |
P99 TPOT (ms) | 160.33 | 63.10 | 39.80 | 38.20 | 64.46 |
Median Eval Rate (tokens/s) | 34.88 | 17.07 | 34.01 | 32.10 | 19.54 |
P99 Eval Rate (tokens/s) | 6.23 | 15.85 | 25.13 | 26.18 | 15.51 |
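The throughput rows above are simple ratios of token counts to wall-clock duration, and the per-request eval rate is roughly the reciprocal of the median TPOT. A quick check in Python against the Qwen2.5-7B-Instruct column of this table reproduces those relationships (values copied from the table; small rounding differences are expected):

```python
# Sanity check of how the aggregate numbers relate to each other,
# using the Qwen2.5-7B-Instruct column of the first table.
total_input_tokens = 5000
total_generated_tokens = 23817
duration_s = 17.81
num_requests = 50
median_tpot_ms = 28.67

# Output throughput = generated tokens / wall-clock duration.
print(total_generated_tokens / duration_s)   # ~1337 tokens/s (table: 1337.24)

# Total throughput additionally counts the prompt tokens.
print((total_input_tokens + total_generated_tokens) / duration_s)  # ~1618 tokens/s

# Request throughput = requests / duration.
print(num_requests / duration_s)             # ~2.81 req/s

# Per-request eval rate is approximately the inverse of the median TPOT.
print(1000 / median_tpot_ms)                 # ~34.9 tokens/s (table: 34.88)
```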
Models | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Llama-3.1-8B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B |
---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 15 | 28 | 15 | 15 | 28 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 22.41 | 49.77 | 21.52 | 22.51 | 37.09 |
Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 |
Total Generated Tokens | 49662 | 51632 | 47223 | 54501 | 38003 |
Request Throughput (req/s) | 4.46 | 2.01 | 4.65 | 4.44 | 2.70 |
Input Throughput (tokens/s) | 446.22 | 200.93 | 464.6 | 444.22 | 269.61 |
Output Throughput (tokens/s) | 2216.02 | 1037.44 | 2193.97 | 2421.04 | 1024.60 |
Total Throughput (tokens/s) | 2662.24 | 1238.37 | 2658.57 | 2865.26 | 1294.21 |
Median TTFT (ms) | 414.18 | 804.91 | 543.01 | 585.56 | 1673.75 |
P99 TTFT (ms) | 950.56 | 1688.46 | 1015.96 | 1086.52 | 2830.12 |
Median TPOT (ms) | 36.74 | 69.58 | 34.86 | 36.42 | 59.22 |
P99 TPOT (ms) | 232.43 | 80.00 | 55.98 | 47.82 | 126.96 |
Median Eval Rate (tokens/s) | 27.22 | 14.37 | 29.68 | 27.46 | 16.89 |
P99 Eval Rate (tokens/s) | 4.30 | 12.5 | 17.86 | 20.91 | 7.87 |
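This second run doubles the prompt length (10,000 vs. 5,000 total input tokens across the same 50 requests). The latency rows (TTFT, TPOT, eval rate) are the kind of per-request metrics you get by streaming responses from the server and timing each chunk. As a rough illustration only, the sketch below measures TTFT and TPOT for a single streamed request against a vLLM server exposing the OpenAI-compatible API; the endpoint URL, model name, and prompt are placeholders and not the exact benchmark setup used above.

```python
"""Minimal sketch: time-to-first-token (TTFT) and time-per-output-token (TPOT)
for one streamed request against a vLLM OpenAI-compatible endpoint.
Assumes the server was started with something like `vllm serve <model>`
and is reachable at http://localhost:8000/v1 (placeholder values)."""
import time

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_one_request(model: str, prompt: str, max_tokens: int = 500):
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0

    # Stream the completion so each chunk's arrival time is observable.
    stream = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True,
    )
    for _chunk in stream:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        n_chunks += 1  # roughly one token per streamed chunk

    end = time.perf_counter()
    ttft_ms = (first_token_at - start) * 1000
    # TPOT: average time per output token after the first one arrives.
    tpot_ms = (end - first_token_at) * 1000 / max(n_chunks - 1, 1)
    return ttft_ms, tpot_ms, n_chunks

if __name__ == "__main__":
    ttft, tpot, n = measure_one_request(
        "Qwen/Qwen2.5-7B-Instruct", "Explain KV caching in one paragraph."
    )
    print(f"TTFT {ttft:.1f} ms, TPOT {tpot:.1f} ms over {n} chunks")
```

Running many such requests concurrently (e.g., with a thread pool or asyncio) and aggregating the per-request numbers yields the median and P99 figures reported in the tables.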
Interested in optimizing your vLLM deployment? Check out our cheap GPU server hosting services or explore alternative GPUs for high-end AI inference.
- Enterprise GPU Dedicated Server - RTX A6000
- Enterprise GPU Dedicated Server - RTX 4090
- Enterprise GPU Dedicated Server - A100
- Enterprise GPU Dedicated Server - A100 (80GB)
The benchmark results demonstrate that A6000 GPUs with the vLLM backend can effectively host medium-sized LLMs (7B-14B parameters) for production workloads. The DeepSeek-R1-Distill-Llama-8B model delivered the highest total throughput in both runs while keeping latency competitive, making it an excellent choice for A6000 deployments that need balanced throughput and latency.
For applications prioritizing throughput over model size, the 7B/8B-class models generally provide better performance characteristics on A6000 hardware. The 14B models remain viable for applications where model capability outweighs raw performance.
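As a starting point for reproducing such a deployment, the sketch below loads one of the benchmarked models with vLLM's offline Python API on a single 48 GB A6000. The model choice, memory fraction, and context length are illustrative assumptions, not the exact configuration used for the benchmark above.

```python
"""Minimal sketch of hosting one of the benchmarked models with vLLM
on a single 48 GB RTX A6000 (illustrative settings only)."""
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    dtype="bfloat16",             # 16-bit weights, matching the tables above
    gpu_memory_utilization=0.90,  # let vLLM use up to 90% of VRAM for weights + KV cache
    max_model_len=4096,           # cap context length to keep the KV cache modest
)

params = SamplingParams(max_tokens=512, temperature=0.7)
outputs = llm.generate(["Summarize the benefits of paged attention."], params)
print(outputs[0].outputs[0].text)
```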
Tags: A6000 vLLM benchmark, vLLM A6000 performance, A6000 LLMs multi-concurrency, A6000 hosting LLMs, Qwen vs LLaMA-3 A6000, DeepSeek-R1 A6000 benchmark, Best LLM for A6000, A6000 token throughput, LLM inference speed A6000, vLLM vs Ollama A6000