Models | gemma-2-9b-it | gemma-2-27b-it | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1-Distill-Qwen-32B |
---|---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 18.5 | 54.5 | 16.1 | 15.2 | 29.5 | 65.5 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
CPU Utilization | 1.4% | 1.4% | 1.5% | 1.4% | 1.4% | 1.4% |
RAM Utilization | 5.5% | 5.8% | 4.8% | 5.7% | 5.8% | 4.8% |
GPU VRAM Utilization | 90% | 91.4% | 90.3% | 90.3% | 90.2% | 91.5% |
GPU Utilization | 80-86% | 88-94% | 80-82% | 80% | 72-87% | 91-92% |
Requests (req/s) | 5.99 | 2.62 | 9.27 | 10.45 | 6.15 | 1.99 |
Total Duration | 49s | 1min54s | 32s | 28s | 48s | 2min30s |
Input (tokens/s) | 599.06 | 262.45 | 926.91 | 1044.98 | 614.86 | 198.66 |
Output (tokens/s) | 3594.38 | 1574.67 | 5561.45 | 6269.87 | 3689.21 | 1191.95 |
Total Throughput (tokens/s) | 4193.44 | 1837.12 | 6488.36 | 7314.85 | 4304.07 | 1390.61 |
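
The throughput metrics above (requests/s and input/output tokens/s) map directly onto what vLLM's offline Python API can report. Below is a minimal sketch of such a measurement; the model name, prompt batch, and sampling settings are illustrative assumptions, not the exact configuration behind these tables:

```python
import time
from vllm import LLM, SamplingParams

# Illustrative model choice; any of the benchmarked checkpoints would work.
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

llm = LLM(
    model=MODEL,
    dtype="bfloat16",            # 16-bit weights, matching the Quantization (bits) = 16 row
    gpu_memory_utilization=0.9,  # consistent with the ~90% GPU VRAM utilization reported
)
sampling = SamplingParams(temperature=0.7, max_tokens=600)

prompts = ["Summarize the theory of relativity."] * 100  # synthetic batch

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# Sum prompt and completion tokens across the whole batch.
in_tokens = sum(len(o.prompt_token_ids) for o in outputs)
out_tokens = sum(len(c.token_ids) for o in outputs for c in o.outputs)

print(f"Requests:          {len(outputs) / elapsed:.2f} req/s")
print(f"Input throughput:  {in_tokens / elapsed:.2f} tokens/s")
print(f"Output throughput: {out_tokens / elapsed:.2f} tokens/s")
print(f"Total throughput:  {(in_tokens + out_tokens) / elapsed:.2f} tokens/s")
```

Setting `gpu_memory_utilization=0.9` tells vLLM to claim roughly 90% of VRAM for weights and KV cache, which lines up with the ~90% GPU VRAM utilization reported in both tables.
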
Models | gemma-2-9b-it | gemma-2-27b-it | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1-Distill-Qwen-32B |
---|---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 18.5 | 54.5 | 16.1 | 15.2 | 29.5 | 65.5 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
CPU Utilization | 1.4% | 1.4% | 2.8% | 3.0% | 2.5% | 2.0% |
RAM Utilization | 13.9% | 15.4% | 11.5% | 18.0% | 15.9% | 11.8% |
GPU VRAM Utilization | 90.2% | 90.9% | 90.2% | 90.3% | 90% | 90.3% |
GPU Utilization | 93% | 86-96% | 83-85% | 50-91% | 80-90% | 92-96% |
Requests (req/s) | 27.03 | 8.95 | 8.80 | 10.79 | 9.81 | 2.67 |
Total Duration | 11s | 33s | 33s | 27s | 30s | 1min52s |
Input (tokens/s) | 2703.21 | 894.9 | 900.25 | 1078.68 | 981.01 | 267.43 |
Output (tokens/s) | 566.59 | 1376.16 | 5033.38 | 5803.79 | 3840.07 | 1214.19 |
Total Throughput (tokens/s) | 3269.80 | 2271.06 | 5933.63 | 6882.47 | 4821.08 | 1481.62 |
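
The same workload can also be driven through vLLM's OpenAI-compatible server (started with `vllm serve <model>`), measuring request and token throughput from the client side. Here is a minimal sketch using `asyncio` and `httpx`; the endpoint URL, request count, and prompt are assumptions for illustration:

```python
import asyncio
import time
import httpx

# Assumes a server started with: vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
URL = "http://localhost:8000/v1/completions"
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
NUM_REQUESTS = 100

async def one_request(client: httpx.AsyncClient) -> tuple[int, int]:
    resp = await client.post(URL, json={
        "model": MODEL,
        "prompt": "Summarize the theory of relativity.",
        "max_tokens": 600,
    }, timeout=300)
    usage = resp.json()["usage"]  # vLLM returns OpenAI-style token usage
    return usage["prompt_tokens"], usage["completion_tokens"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        results = await asyncio.gather(
            *(one_request(client) for _ in range(NUM_REQUESTS))
        )
        elapsed = time.perf_counter() - start

    in_tok = sum(p for p, _ in results)
    out_tok = sum(c for _, c in results)
    print(f"Requests:          {NUM_REQUESTS / elapsed:.2f} req/s")
    print(f"Output throughput: {out_tok / elapsed:.2f} tokens/s")
    print(f"Total throughput:  {(in_tok + out_tok) / elapsed:.2f} tokens/s")

asyncio.run(main())
```

For more detailed latency percentiles and request-rate shaping, the vLLM repository also ships a dedicated benchmark_serving.py script.
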