| Models | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Llama-8B |
|---|---|---|---|---|---|---|
| Quantization (bit) | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 15 | 15 | 15 | 15 | 15 | 15 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| GPU Server | RTX4090*1 | RTX4090*2 | RTX4090*2 | RTX4090*1 | RTX4090*2 | RTX4090*2 |
| Tensor Parallel Size | 1 | 1 | 2 | 1 | 1 | 2 |
| Number of Requests | 300 | 300 | 300 | 300 | 300 | 300 |
| Benchmark Duration (s) | 48.35 | 106.97 | 35.08 | 72.42 | 190.69 | 49.22 |
| Total Input Tokens | 30000 | 30000 | 30000 | 30000 | 30000 | 30000 |
| Total Generated Tokens | 161719 | 162207 | 162191 | 165500 | 166019 | 164855 |
| Request Throughput (req/s) | 6.21 | 2.80 | 8.55 | 4.14 | 1.57 | 6.10 |
| Input Throughput (tokens/s) | 620.50 | 280.46 | 855.28 | 414.28 | 157.32 | 609.55 |
| Output Throughput (tokens/s) | 3344.91 | 1516.37 | 4623.98 | 2285.44 | 870.61 | 3349.59 |
| Total Throughput (tokens/s) | 3965.41 | 1796.83 | 5479.26 | 2699.72 | 1027.93 | 3959.14 |
| Median TTFT (ms) | 1818.40 | 34768.42 | 2226.99 | 1934.14 | 80444.02 | 2450.92 |
| P99 TTFT (ms) | 34395.81 | 88093.83 | 3921.81 | 43966.05 | 178349.91 | 4586.66 |
| Median TPOT (ms) | 48.38 | 32.72 | 54.67 | 63.14 | 33.33 | 64.61 |
| P99 TPOT (ms) | 91.12 | 86.83 | 109.14 | 144.73 | 96.93 | 121.51 |
| Median Eval Rate (tokens/s) | 20.70 | 30.56 | 18.29 | 15.83 | 30.00 | 15.48 |
| P99 Eval Rate (tokens/s) | 10.97 | 11.50 | 9.16 | 6.91 | 10.31 | 8.23 |
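A quick sanity check on the numbers: Total Throughput is simply (Total Input Tokens + Total Generated Tokens) / Benchmark Duration, e.g. (30000 + 161719) / 48.35 ≈ 3965 tokens/s for the first column, and the Median Eval Rate is roughly 1000 / Median TPOT. The exact server launch flags used for these runs are not shown above, but the TP=1 and TP=2 configurations can be started along the lines of the sketch below; the `--max-model-len` and `--gpu-memory-utilization` values are assumptions, not the benchmark's actual settings.

```bash
# Single RTX 4090, tensor parallel size 1 (columns 1 and 4; shown here with
# the Qwen-7B model, swap in DeepSeek-R1-Distill-Llama-8B for the Llama runs).
# --max-model-len and --gpu-memory-utilization are illustrative assumptions.
CUDA_VISIBLE_DEVICES=0 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --port 8000 --tensor-parallel-size 1 \
  --max-model-len 4096 --gpu-memory-utilization 0.9

# Dual RTX 4090, tensor parallel size 2 (columns 3 and 6): the model's weights
# are sharded across both GPUs, which is what drives the higher throughput.
CUDA_VISIBLE_DEVICES=0,1 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --port 8000 --tensor-parallel-size 2 \
  --max-model-len 4096 --gpu-memory-utilization 0.9
```

On older vLLM releases, the equivalent entry point is `python -m vllm.entrypoints.openai.api_server --model <model> ...` with the same flags.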
    administrator@Newt:~$ nvidia-smi topo -m
            GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
    GPU0    X       SYS     0-17,36-53      0               N/A
    GPU1    SYS     X       18-35,54-71     1               N/A
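The `SYS` entry means the two cards can only talk to each other across the inter-socket link (no NVLink and no shared PCIe switch), and each GPU hangs off a different NUMA node. For single-GPU (TP=1) instances it can therefore help to pin the serving process to its GPU's own NUMA node. A minimal sketch using `numactl`, assuming GPU0 on node 0 as reported above:

```bash
# Pin a single-GPU vLLM instance to the NUMA node its GPU is attached to
# (GPU0 -> node 0, cores 0-17,36-53 per the topology output above).
# Requires the numactl package; model and port match the benchmark command.
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 \
  vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --port 8000

# A second instance on GPU1 would bind to node 1 (cores 18-35,54-71) instead,
# e.g. with --cpunodebind=1 --membind=1 and a different --port.
```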
CUDA_VISIBLE_DEVICES=0 python benchmark_serving.py --backend vllm --base-url http://127.0.0.1:8000 --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --dataset-name=random --num-prompts 300 --request-rate inf --random-input-len 100 --random-output-len 600
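Before firing off all 300 requests at once (`--request-rate inf`), it is worth confirming that the OpenAI-compatible endpoint at `http://127.0.0.1:8000` is actually serving the model; a quick check, where the prompt and `max_tokens` below are arbitrary:

```bash
# List the model(s) the server has loaded
curl http://127.0.0.1:8000/v1/models

# Send one completion request (prompt and max_tokens are arbitrary)
curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "prompt": "Explain tensor parallelism in one sentence.",
        "max_tokens": 100
      }'
```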
By properly configuring vLLM’s distributed architecture, you can maximize throughput, minimize latency, and fully utilize your multi-GPU hardware. 🚀
Interested in optimizing your vLLM deployment? Check out our cheap GPU server hosting services or explore alternative GPUs for high-end AI inference.
Enterprise GPU Dedicated Server - RTX 4090
Multi-GPU Dedicated Server - 2xRTX 4090
Enterprise GPU Dedicated Server - A100
Enterprise GPU Dedicated Server - A100(80GB)