Models | Qwen2.5-3B-Instruct | Qwen2.5-VL-7B-Instruct | gemma-2-9b-it | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-7B |
---|---|---|---|---|---|---|
Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 5.8 | 16 | 18 | 3.4 | 15 | 15 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 50 | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 12.15 | 19.56 | 6.37 | 7.56 | 21.86 | 19.55 |
Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
Total Generated Tokens | 27997 | 21092 | 1057 | 24743 | 28101 | 26383 |
Request Throughput (req/s) | 4.11 | 2.56 | 7.85 | 6.62 | 2.29 | 2.56 |
Input Throughput (tokens/s) | 411.38 | 255.62 | 785.23 | 661.67 | 228.75 | 255.70 |
Output Throughput (tokens/s) | 2303.50 | 1078.30 | 166.00 | 3274.32 | 1285.59 | 1349.22 |
Total Throughput (tokens/s) | 2714.88 | 1333.92 | 951.23 | 3935.99 | 1514.34 | 1604.92 |
Median TTFT (ms) | 522.38 | 1031.72 | 1117.28 | 300.35 | 899.63 | 878.94 |
P99 TTFT (ms) | 534.04 | 1042.32 | 1447.61 | 322.22 | 1051.76 | 1010.26 |
Median TPOT (ms) | 19.35 | 30.89 | 90.27 | 12.04 | 34.94 | 31.14 |
P99 TPOT (ms) | 19.84 | 32.40 | 460.90 | 30.65 | 36.63 | 33.87 |
Median Eval Rate (tokens/s) | 51.68 | 32.37 | 11.08 | 83.06 | 28.62 | 32.11 |
P99 Eval Rate (tokens/s) | 50.40 | 30.86 | 2.17 | 32.63 | 27.30 | 29.52 |
Models | Qwen2.5-3B-Instruct | Qwen2.5-VL-7B-Instruct | gemma-2-9b-it | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-7B |
---|---|---|---|---|---|---|
Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 5.8 | 16 | 18 | 3.4 | 15 | 15 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 300 | 300 | 300 | 300 | 300 | 300 |
Benchmark Duration (s) | 43.45 | 67.64 | 30.71 | 39.46 | 116.80 | 80.15 |
Total Input Tokens | 30000 | 30000 | 30000 | 30000 | 30000 | 30000 |
Total Generated Tokens | 169912 | 131231 | 6106 | 160989 | 166598 | 161837 |
Request Throughput (req/s) | 6.9 | 4.44 | 9.77 | 7.6 | 2.57 | 3.74 |
Input Throughput (tokens/s) | 690.49 | 443.52 | 976.88 | 760.3 | 256.85 | 374.28 |
Output Throughput (tokens/s) | 3910.80 | 1940.12 | 198.83 | 4079.99 | 1426.39 | 2019.08 |
Total Throughput (tokens/s) | 4601.29 | 2383.64 | 1175.71 | 4840.29 | 1683.24 | 2393.36 |
Median TTFT (ms) | 1769.48 | 3833.79 | 4464.43 | 1159.54 | 3237.97 | 3172.35 |
P99 TTFT (ms) | 33867.66 | 13073.93 | 14602.77 | 28362.24 | 72254.94 | 60533.94 |
Median TPOT (ms) | 53.44 | 82.10 | 96.79 | 55.46 | 103.34 | 83.60 |
P99 TPOT (ms) | 55.10 | 2282.99 | 318.39 | 123.10 | 361.44 | 166.28 |
Median Eval Rate (tokens/s) | 18.71 | 12.18 | 10.33 | 18.03 | 9.68 | 11.96 |
P99 Eval Rate (tokens/s) | 18.14 | 0.44 | 3.14 | 8.12 | 2.77 | 6.01 |
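The TTFT, TPOT, and eval-rate figures above are streaming-serving metrics, so they can be reproduced by firing concurrent streaming requests at a vLLM OpenAI-compatible endpoint and timing the first chunk and the gaps between chunks. Below is a minimal sketch of such a client for the 50-request run; the endpoint URL, model name, prompt, and max_tokens are illustrative assumptions rather than the exact settings behind the tables (vLLM's own benchmarks/benchmark_serving.py script reports the same metrics).

```python
import asyncio
import time

import aiohttp

# Assumptions (not from the article): a vLLM OpenAI-compatible server is already
# running locally, e.g. started with
#   vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --dtype float16
# and each request carries a ~100-token prompt, mirroring 5,000 input tokens / 50 requests.
BASE_URL = "http://localhost:8000/v1/completions"      # assumed endpoint
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"      # any model from the tables
NUM_REQUESTS = 50
PROMPT = "Summarize the benefits of GPU inference for large language models. " * 10


async def one_request(session: aiohttp.ClientSession) -> tuple[float, float, int]:
    """Send one streaming completion; return (ttft_s, total_s, n_chunks)."""
    payload = {
        "model": MODEL,
        "prompt": PROMPT,
        "max_tokens": 600,      # assumed cap, roughly matching ~560 output tokens/request
        "stream": True,
    }
    start = time.perf_counter()
    ttft = None
    n_chunks = 0
    async with session.post(BASE_URL, json=payload) as resp:
        async for raw in resp.content:              # one server-sent-event line per chunk
            line = raw.decode().strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            n_chunks += 1                           # roughly one generated token per chunk
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total, n_chunks


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        t0 = time.perf_counter()
        results = await asyncio.gather(*(one_request(session) for _ in range(NUM_REQUESTS)))
        duration = time.perf_counter() - t0

    ttfts = sorted(r[0] for r in results)
    # TPOT = decode time spread over the tokens generated after the first one.
    tpots = sorted((r[1] - r[0]) / max(r[2] - 1, 1) for r in results)
    total_tokens = sum(r[2] for r in results)

    print(f"Benchmark duration : {duration:.2f} s")
    print(f"Output throughput  : {total_tokens / duration:.1f} tokens/s")
    print(f"Median TTFT (approx): {ttfts[len(ttfts) // 2] * 1000:.1f} ms")
    print(f"Median TPOT (approx): {tpots[len(tpots) // 2] * 1000:.1f} ms")


if __name__ == "__main__":
    asyncio.run(main())
```

Counting SSE chunks only approximates the generated-token count, so treat the output of a sketch like this as a sanity check rather than a like-for-like comparison with the tables.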
Interested in optimizing your vLLM deployment? Check out our cheap GPU server hosting services or explore alternative GPUs for high-end AI inference.
- Advanced GPU Dedicated Server - A5000
- Enterprise GPU Dedicated Server - RTX 4090
- Enterprise GPU Dedicated Server - A100
- Enterprise GPU Dedicated Server - A100 (80GB)
The NVIDIA A5000 proves to be a solid choice for LLM inference with vLLM on Hugging Face models. It runs 8B-and-smaller models efficiently and handles up to roughly 150 concurrent requests smoothly; at 300 concurrent requests the tables above show heavy queueing, with P99 TTFT ranging from roughly 13 to 72 seconds.
For AI developers and cloud providers looking for cost-effective vLLM server rental, the A5000 is a great alternative to A100 for mid-sized LLM deployments.
The A5000 vLLM benchmark results highlight the importance of strategic model selection and performance tuning when deploying LLMs. By understanding the strengths and limitations of different models and hardware configurations, organizations can optimize their vLLM deployments for maximum efficiency and user satisfaction. Whether you're running benchmarks, considering a vLLM server rental, or planning a hardware upgrade, these insights will help you make informed decisions to meet your performance goals.
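As a concrete example of the tuning levers involved, the sketch below shows the vLLM engine arguments that most directly trade memory for concurrency on a 24 GB card like the A5000. The specific values are illustrative assumptions, not the settings used for the benchmarks above.

```python
from vllm import LLM, SamplingParams

# Illustrative tuning knobs only; the article does not state the engine settings it
# used, and the right values depend on your GPU memory and traffic pattern.
llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",   # any of the benchmarked models
    dtype="float16",                     # matches the 16-bit weights in the tables
    gpu_memory_utilization=0.90,         # reserve most of the 24 GB A5000 for weights + KV cache
    max_model_len=2048,                  # shorter context frees KV-cache room for more sequences
    max_num_seqs=256,                    # upper bound on concurrently scheduled requests
)

params = SamplingParams(temperature=0.7, max_tokens=600)
outputs = llm.generate(["Summarize the benefits of GPU inference."], params)
print(outputs[0].outputs[0].text)
```

In general, raising max_num_seqs and gpu_memory_utilization improves aggregate throughput under heavy load, while lowering max_model_len leaves more KV-cache memory for concurrent sequences at the cost of shorter prompts and outputs.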
For more detailed benchmarks and performance tuning tips, stay tuned to our blog and explore our comprehensive guides on vLLM performance tuning, A5000 benchmark results, and Hugging Face LLM deployments.