Dual A100 vLLM Benchmark: Best-In-Class Inference for 14B–32B Hugging Face LLMs

Large Language Models (LLMs) continue to grow in scale and capability, but serving them efficiently, especially in production environments, demands the right infrastructure. In this benchmark, we explore the inference performance of dual NVIDIA A100 40GB GPUs (80GB total) using vLLM across popular Hugging Face models such as Gemma-3-27B, QwQ-32B, Llama-3.1-8B, and the DeepSeek-R1 Distill 14B/32B models.

With --tensor-parallel-size set to 2 and NVLink enabled, this setup represents the gold standard for high-throughput, low-latency inference of large 14B–32B models.

Test Overview

1. Single A100 40GB GPU Details (with NVLink):

  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • Compute capability: 8.0
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • Memory: 40GB HBM2
  • FP32 performance: 19.5 TFLOPS

2. Test Project Code Source:

3. The Following Models from Hugging Face were Tested:

  • google/gemma-3-12b-it
  • google/gemma-3-27b-it
  • meta-llama/Llama-3.1-8B-Instruct
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
  • Qwen/QwQ-32B

4. The Test Parameters are Preset as Follows:

  • Input length: 100 tokens
  • Output length: 600 tokens
  • --tensor-parallel-size 2
  • --max-model-len 4096

5. We conducted two rounds of 2*A100 vLLM tests under different concurrent request loads (example launch and benchmark commands are sketched after this overview):

  • Scenario 1: 50 concurrent requests
  • Scenario 2: 100 concurrent requests
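
For reference, the sketch below shows how a run like this can be set up with vLLM's OpenAI-compatible server and its bundled benchmarks/benchmark_serving.py script. Treat it as an illustrative outline rather than the exact commands used in this test; flag names can vary between vLLM versions.

```bash
# Launch the OpenAI-compatible server across both A100s (tensor parallelism = 2).
vllm serve Qwen/QwQ-32B \
  --tensor-parallel-size 2 \
  --max-model-len 4096

# In a second shell: drive the server with synthetic prompts
# (~100 input tokens, ~600 output tokens, 50 concurrent requests).
# benchmark_serving.py ships in the vLLM source tree under benchmarks/.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Qwen/QwQ-32B \
  --dataset-name random \
  --random-input-len 100 \
  --random-output-len 600 \
  --num-prompts 50 \
  --max-concurrency 50
```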

2*A100 Benchmark for Scenario 1: 50 Concurrent Requests

| Models | google/gemma-3-12b-it | google/gemma-3-27b-it | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Qwen/QwQ-32B |
|---|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 23 | 51 | 15 | 28 | 62 | 62 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Request Numbers | 50 | 50 | 50 | 50 | 50 | 50 |
| Benchmark Duration (s) | 12.97 | 22.77 | 7.28 | 11.60 | 29.16 | 33.39 |
| Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
| Total Generated Tokens | 27004 | 23052 | 22103 | 17528 | 22393 | 27584 |
| Request (req/s) | 3.86 | 2.20 | 6.87 | 4.31 | 1.71 | 1.50 |
| Input (tokens/s) | 385.61 | 219.59 | 687.19 | 430.91 | 171.45 | 149.74 |
| Output (tokens/s) | 2082.65 | 1012.40 | 3037.77 | 1510.59 | 767.89 | 826.03 |
| Total Throughput (tokens/s) | 2468.26 | 1231.99 | 3724.96 | 1941.50 | 939.34 | 975.77 |
| Median TTFT (ms) | 128.42 | 710.96 | 206.38 | 386.49 | 783.65 | 785.46 |
| P99 TTFT (ms) | 154.23 | 888.83 | 307.44 | 524.14 | 1006.79 | 1003.58 |
| Median TPOT (ms) | 21.37 | 34.97 | 11.70 | 18.70 | 36.59 | 36.77 |
| P99 TPOT (ms) | 22.48 | 215.47 | 15.78 | 29.99 | 53.03 | 53.96 |
| Median Eval Rate (tokens/s) | 46.79 | 28.60 | 85.47 | 53.48 | 27.33 | 27.20 |
| P99 Eval Rate (tokens/s) | 44.48 | 4.64 | 63.37 | 33.34 | 36.59 | 18.53 |

✅ Key Takeaways:

  • 8B models (Llama-3) dominate in speed (~3.7K tokens/s). Gemma-3-12B performs well (~2.4K tokens/s, low TTFT). 32B models (DeepSeek, Qwen) are usable but slower (~1K tokens/s).
  • Even at 50 concurrent requests, models up to 32B run smoothly without hitting performance ceilings.

2*A100 Benchmark for Scenario 2: 100 Concurrent Requests

| Models | google/gemma-3-12b-it | google/gemma-3-27b-it | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Qwen/QwQ-32B |
|---|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 23 | 51 | 15 | 28 | 62 | 62 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Request Numbers | 100 | 100 | 100 | 100 | 100 | 100 |
| Benchmark Duration (s) | 17.79 | 41.23 | 9.39 | 13.13 | 47.30 | 60.46 |
| Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| Total Generated Tokens | 49724 | 45682 | 47902 | 37764 | 44613 | 56858 |
| Request (req/s) | 5.62 | 2.43 | 10.65 | 7.62 | 2.11 | 1.65 |
| Input (tokens/s) | 561.97 | 242.55 | 1065.37 | 761.66 | 211.43 | 165.4 |
| Output (tokens/s) | 2794.32 | 1108.03 | 5103.34 | 2876.36 | 943.22 | 940.42 |
| Total Throughput (tokens/s) | 3356.29 | 1350.58 | 6168.71 | 3638.02 | 1154.65 | 1105.82 |
| Median TTFT (ms) | 541.36 | 651.69 | 215.63 | 272.59 | 865.14 | 872.33 |
| P99 TTFT (ms) | 873.22 | 1274.52 | 432.56 | 654.96 | 1666.14 | 1652.29 |
| Median TPOT (ms) | 28.66 | 49.01 | 15.21 | 21.45 | 54.91 | 66.68 |
| P99 TPOT (ms) | 146.27 | 325.90 | 21.52 | 38.99 | 181.85 | 102.40 |
| Median Eval Rate (tokens/s) | 34.89 | 20.40 | 65.75 | 46.62 | 18.21 | 14.99 |
| P99 Eval Rate (tokens/s) | 6.84 | 3.07 | 46.47 | 25.65 | 5.50 | 9.77 |

✅ Key Takeaways:

  • Throughput scales almost linearly (Llama-3.1-8B hits ~6.2K tokens/s). The 14B-32B models remain stable at 100 requests: median latency holds up, though tail (P99) TPOT rises noticeably for the 27B/32B models.
  • Gemma-3-12B runs smoothly (3.3K tokens/s), proving multi-GPU efficiency.
  • Even at 100 requests, throughput keeps climbing and median latency stays reasonable, suggesting the system still has headroom and can likely handle 200+ concurrent inferences.

Why Dual A100 40GB Is the Best Fit for 14B–32B Models

✅ Perfect Match for Memory-Hungry Giants

Models like Gemma-3-27B (which is also multimodal) and QwQ-32B are memory-intensive: at 16-bit precision they simply cannot run reliably with less than roughly 80GB of combined VRAM. A dual-A100 setup solves this.
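
The arithmetic behind that claim: at 16-bit precision each parameter takes 2 bytes, so a 32B-parameter model needs roughly 62-65 GB for weights alone (matching the ~62 GB sizes in the tables above), before any KV cache or activations. That already exceeds a single 40GB A100, but shards comfortably across two of them with tensor parallelism.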

✅ NVLink & Tensor Parallelism Matter

Always use --tensor-parallel-size 2 to fully utilize both A100 GPUs and take advantage of NVLink for faster inter-GPU communication. Avoid pipeline parallelism unless absolutely necessary: for models in this size range, tensor parallelism delivers lower latency and higher throughput.
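
Before launching vLLM, it is worth confirming that the two cards really are connected over NVLink rather than falling back to PCIe for inter-GPU traffic. A minimal check with the stock NVIDIA tooling:

```bash
# Show the GPU interconnect topology. An NV# entry (e.g. NV12) between GPU0 and
# GPU1 indicates NVLink; PHB/PIX/SYS entries indicate PCIe-only paths.
nvidia-smi topo -m

# Optionally, query NVLink status and per-link information directly.
nvidia-smi nvlink --status
```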

Operational Tips & Cautions

⚠️ --max-model-len 4096 or 8192

For 32B models such as QwQ-32B and DeepSeek-R1-Distill-Qwen-32B, always cap the context, e.g. --max-model-len 4096 (or 8192 if you have headroom). Otherwise, you may encounter out-of-memory or shape mismatch errors.
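
As a sketch (assuming current vLLM flag names), a capped-context launch for a 32B model looks like the following; --gpu-memory-utilization is vLLM's knob for how much VRAM to reserve (0.9 by default) and is shown only to indicate where to claw back headroom if startup still fails:

```bash
# Cap the context window so the KV cache fits in the VRAM that remains after
# the ~62 GB of 16-bit weights are sharded across the two GPUs.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```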

⚠️ GPU Temperatures

Watch GPU temperatures: at 100+ concurrent requests, GPU temperature may exceed 80°C. Tip: monitor with nvitop and manually boost fan speed for about 5 minutes whenever the temperature climbs above 80°C.
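
A lightweight way to watch temperatures during a long run (nvitop is a third-party pip package; the nvidia-smi query is an alternative using only the stock tooling):

```bash
# Interactive monitor (install once with: pip install nvitop).
nvitop

# Or log temperature, utilization, and power for both GPUs every 5 seconds.
nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,power.draw \
  --format=csv -l 5
```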

Get Started with 2*A100 GPU Server Hosting

Interested in optimizing your LLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference.

Flash Sale to May 13

Enterprise GPU Dedicated Server - A100

$469.00/mo
41% OFF Recurring (Was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Multi-GPU Dedicated Server - 2xA100

$1099.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Free NVLink Included
  • A powerful dual-GPU solution for demanding AI workloads, large-scale inference, ML training, etc. A cost-effective alternative to the A100 80GB and H100, delivering exceptional performance at a competitive price.

New Arrival

Enterprise GPU Dedicated Server - A100 (80GB)

$1559.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS

Enterprise GPU Dedicated Server - H100

$2099.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia H100
  • Microarchitecture: Hopper
  • CUDA Cores: 14,592
  • Tensor Cores: 456
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 183 TFLOPS

Conclusion: Dual A100 (Total 80GB) Is the Production-Ready Workhorse for 14B–32B LLMs

Dual A100 40GB GPUs (with NVLink) are an excellent choice for 14B-32B models, including Gemma-3-27B, delivering roughly 1.1K-6.2K total tokens/s at 100 concurrent requests depending on model size.

  • For startups & researchers: Cost-effective alternative to H100.
  • For enterprises: likely stable at 200-300 concurrent users, based on the headroom observed at 100 requests (higher loads were not tested here).
  • Always use --tensor-parallel-size 2 for maximum throughput!
For ultra-large models (70B+), consider 2×H100, 2×A100 80GB, or 4×A100 40GB, but 2×A100 40GB is perfect for 32B and below.

Attachment: Video Recording of 2*A100 vLLM Benchmark

Screenshot: 2*A100 40GB vLLM benchmark with 50 Concurrent Requests
Screenshot: 2*A100 40GB vLLM benchmark with 100 Concurrent Requests

Data Item Explanation in the Table:

  • Quantization: The number of quantization bits. This test uses 16-bit (full-precision, unquantized) models.
  • Size(GB): Model size in GB.
  • Backend: The inference backend used. In this test, vLLM is used.
  • Successful Requests: The number of requests processed.
  • Benchmark duration(s): The total time to complete all requests.
  • Total input tokens: The total number of input tokens across all requests.
  • Total generated tokens: The total number of output tokens generated across all requests.
  • Request (req/s): The number of requests processed per second.
  • Input (tokens/s): The number of input tokens processed per second.
  • Output (tokens/s): The number of output tokens generated per second.
  • Total Throughput (tokens/s): The total number of tokens processed per second (input + output); a worked example follows this list.
  • Median TTFT(ms): The time from when the request is made to when the first token is received, in milliseconds. A lower TTFT means that the user is able to get a response faster.
  • P99 TTFT (ms): The 99th percentile Time to First Token, representing the worst-case latency for 99% of requests—lower is better to ensure consistent performance.
  • Median TPOT(ms): The time required to generate each output token, in milliseconds. A lower TPOT indicates that the system is able to generate a complete response faster.
  • P99 TPOT (ms): The 99th percentile Time Per Output Token, showing the worst-case delay in token generation—lower is better to minimize response variability.
  • Median Eval Rate(tokens/s): The number of tokens evaluated per second per user. A high evaluation rate indicates that the system is able to serve each user efficiently.
  • P99 Eval Rate(tokens/s): The per-user token evaluation rate at the 99th percentile, representing the worst-case user experience.
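
As a quick sanity check on these definitions, take meta-llama/Llama-3.1-8B-Instruct in the 50-request run: (5000 input + 22103 generated) tokens / 7.28 s ≈ 3723 tokens/s, which matches the reported Total Throughput of 3724.96 once rounding of the duration is accounted for.
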
Tags:

Dual A100 vLLM Benchmark, 2x A100 40GB LLM Hosting, Hugging Face 32B Inference, Multi-GPU vLLM Benchmark, tensor-parallel-size 2 benchmark, Infer Gemma-3 on A100, Qwen 32B GPU requirements, A100 vs H100 for LLMs, Best GPU for 32B LLM inference, Hosting 14B–32B LLMs on vLLM