Benchmark results with 50 requests:

Models | google/gemma-3-12b-it | google/gemma-3-27b-it | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Qwen/QwQ-32B |
---|---|---|---|---|---|---|
Quantization | 16-bit | 16-bit | 16-bit | 16-bit | 16-bit | 16-bit |
Size (GB) | 23 | 51 | 15 | 28 | 62 | 62 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 50 | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 12.97 | 22.77 | 7.28 | 11.60 | 29.16 | 33.39 |
Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
Total Generated Tokens | 27004 | 23052 | 22103 | 17528 | 22393 | 27584 |
Request Throughput (req/s) | 3.86 | 2.20 | 6.87 | 4.31 | 1.71 | 1.50 |
Input Throughput (tokens/s) | 385.61 | 219.59 | 687.19 | 430.91 | 171.45 | 149.74 |
Output Throughput (tokens/s) | 2082.65 | 1012.40 | 3037.77 | 1510.59 | 767.89 | 826.03 |
Total Throughput (tokens/s) | 2468.26 | 1231.99 | 3724.96 | 1941.50 | 939.34 | 975.77 |
Median TTFT (ms) | 128.42 | 710.96 | 206.38 | 386.49 | 783.65 | 785.46 |
P99 TTFT (ms) | 154.23 | 888.83 | 307.44 | 524.14 | 1006.79 | 1003.58 |
Median TPOT (ms) | 21.37 | 34.97 | 11.70 | 18.70 | 36.59 | 36.77 |
P99 TPOT (ms) | 22.48 | 215.47 | 15.78 | 29.99 | 53.03 | 53.96 |
Median Eval Rate (tokens/s) | 46.79 | 28.60 | 85.47 | 53.48 | 27.33 | 27.20 |
P99 Eval Rate (tokens/s) | 44.48 | 4.64 | 63.37 | 33.34 | 36.59 | 18.53 |
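How the derived rows relate to the raw counts is simple arithmetic; the sketch below recomputes them from the Llama-3.1-8B-Instruct column of the 50-request table. The relationships (throughput = tokens / duration, eval rate = 1000 / TPOT) are an assumption about how the benchmark reports its numbers, but they reproduce the table values to within rounding.

```python
# Recompute the derived metrics for meta-llama/Llama-3.1-8B-Instruct at 50 requests,
# using the raw counts from the table above (relationships assumed, not taken
# from the benchmark source).

num_requests = 50
duration_s = 7.28              # Benchmark Duration (s)
total_input_tokens = 5000
total_generated_tokens = 22103
median_tpot_ms = 11.70         # median Time Per Output Token

request_throughput = num_requests / duration_s              # ~6.87 req/s
input_throughput = total_input_tokens / duration_s          # ~687 tokens/s
output_throughput = total_generated_tokens / duration_s     # ~3036 tokens/s
total_throughput = input_throughput + output_throughput     # ~3723 tokens/s

# Per-request decode speed is the inverse of TPOT.
median_eval_rate = 1000.0 / median_tpot_ms                  # ~85.5 tokens/s

print(f"{request_throughput:.2f} req/s, {total_throughput:.0f} tokens/s total, "
      f"{median_eval_rate:.2f} tokens/s per request")
```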
Benchmark results with 100 requests:

Models | google/gemma-3-12b-it | google/gemma-3-27b-it | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Qwen/QwQ-32B |
---|---|---|---|---|---|---|
Quantization | 16-bit | 16-bit | 16-bit | 16-bit | 16-bit | 16-bit |
Size (GB) | 23 | 51 | 15 | 28 | 62 | 62 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 100 | 100 | 100 | 100 | 100 | 100 |
Benchmark Duration (s) | 17.79 | 41.23 | 9.39 | 13.13 | 47.30 | 60.46 |
Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
Total Generated Tokens | 49724 | 45682 | 47902 | 37764 | 44613 | 56858 |
Request Throughput (req/s) | 5.62 | 2.43 | 10.65 | 7.62 | 2.11 | 1.65 |
Input Throughput (tokens/s) | 561.97 | 242.55 | 1065.37 | 761.66 | 211.43 | 165.40 |
Output Throughput (tokens/s) | 2794.32 | 1108.03 | 5103.34 | 2876.36 | 943.22 | 940.42 |
Total Throughput (tokens/s) | 3356.29 | 1350.58 | 6168.71 | 3638.02 | 1154.65 | 1105.82 |
Median TTFT (ms) | 541.36 | 651.69 | 215.63 | 272.59 | 865.14 | 872.33 |
P99 TTFT (ms) | 873.22 | 1274.52 | 432.56 | 654.96 | 1666.14 | 1652.29 |
Median TPOT (ms) | 28.66 | 49.01 | 15.21 | 21.45 | 54.91 | 66.68 |
P99 TPOT (ms) | 146.27 | 325.90 | 21.52 | 38.99 | 181.85 | 102.40 |
Median Eval Rate (tokens/s) | 34.89 | 20.40 | 65.75 | 46.62 | 18.21 | 14.99 |
P99 Eval Rate (tokens/s) | 6.84 | 3.07 | 46.47 | 25.65 | 5.50 | 9.77 |
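The TTFT and TPOT rows come from streaming each completion and timestamping the chunks as they arrive. If you want to reproduce numbers like these yourself, a minimal sketch is shown below. It assumes a vLLM OpenAI-compatible server is already running locally (for example, started with `vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2`); the URL, model name, and prompt are placeholders, and this is an illustration rather than the exact script used for the tables above.

```python
import json
import time

import requests  # talks to a vLLM OpenAI-compatible server

URL = "http://localhost:8000/v1/completions"    # placeholder endpoint
MODEL = "meta-llama/Llama-3.1-8B-Instruct"      # placeholder model name

def measure_one(prompt: str, max_tokens: int = 512):
    """Stream one completion and report TTFT and TPOT for that request."""
    payload = {"model": MODEL, "prompt": prompt,
               "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    with requests.post(URL, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            _ = json.loads(data)       # one streamed chunk of the completion
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now   # time to first token
            # Note: a chunk usually carries one token, but that is not guaranteed.
            n_chunks += 1
    end = time.perf_counter()
    ttft_ms = (first_token_at - start) * 1000
    # TPOT: average inter-token latency after the first token.
    tpot_ms = (end - first_token_at) * 1000 / max(n_chunks - 1, 1)
    return ttft_ms, tpot_ms

if __name__ == "__main__":
    ttft, tpot = measure_one("Explain tensor parallelism in two sentences.")
    print(f"TTFT {ttft:.1f} ms, TPOT {tpot:.1f} ms, "
          f"eval rate {1000 / tpot:.1f} tokens/s")
```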
Interested in optimizing your LLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference.
- Enterprise GPU Dedicated Server - A100
- Multi-GPU Dedicated Server - 2xA100
- Enterprise GPU Dedicated Server - A100 (80GB)
- Enterprise GPU Dedicated Server - H100
Dual A100 40GB GPUs (with NVLink) are an excellent choice for hosting 14B-32B models, including Gemma-3-27B: at 100 requests, the 8B-14B models reach roughly 3K-6K total tokens/s, while the 27B-32B models still deliver around 1.1K-1.4K tokens/s.
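The rough memory arithmetic behind that recommendation: 16-bit weights need about 2 bytes per parameter, so a 27B-32B model cannot fit on a single 40 GB card but splits comfortably across two GPUs with `--tensor-parallel-size 2`, leaving the remaining VRAM for the KV cache. A back-of-the-envelope sketch (the 2 bytes/parameter figure and the even split across GPUs are simplifying assumptions):

```python
# Back-of-the-envelope check of which models fit on 2x A100 40GB in 16-bit.
# Assumptions: ~2 bytes per parameter for weights, and vLLM splitting weights
# evenly across GPUs with --tensor-parallel-size 2; KV cache, activations and
# runtime overhead take whatever is left.

GPU_MEM_GB = 40
NUM_GPUS = 2

models = {  # parameter counts in billions (approximate)
    "Llama-3.1-8B-Instruct": 8,
    "gemma-3-12b-it": 12,
    "DeepSeek-R1-Distill-Qwen-14B": 14,
    "gemma-3-27b-it": 27,
    "DeepSeek-R1-Distill-Qwen-32B": 32,
    "QwQ-32B": 32,
}

for name, params_b in models.items():
    weights_gb = params_b * 2               # 16-bit weights, ~2 bytes/param
    per_gpu_gb = weights_gb / NUM_GPUS      # tensor-parallel shard per GPU
    headroom_gb = GPU_MEM_GB - per_gpu_gb   # left over for KV cache etc.
    fits = "fits" if headroom_gb > 0 else "does not fit"
    print(f"{name:32s} ~{weights_gb:3.0f} GB weights -> "
          f"{per_gpu_gb:4.1f} GB/GPU, {headroom_gb:4.1f} GB headroom ({fits})")
```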
Tags: Dual A100 vLLM Benchmark, 2x A100 40GB LLM Hosting, Hugging Face 32B Inference, Multi-GPU vLLM Benchmark, tensor-parallel-size 2 benchmark, Infer gemma-3 on A100, Qwen 32B GPU requirements, A100 vs H100 for LLMs, Best GPU for 32B LLM inference, Hosting 14B-32B LLMs on vLLM