A6000 vLLM Benchmark Report: Multi-Concurrent LLM Inference Performance

In the rapidly evolving field of large language models (LLMs), performance benchmarking has become crucial for researchers and developers. This report analyzes the comparative performance of five popular Hugging Face LLMs running on NVIDIA A6000 GPUs using the vLLM backend, tested under both 50 and 100 concurrent request scenarios.

Whether you're evaluating vLLM server rental, tuning vLLM performance, or simply trying to understand A6000 benchmark results, this report offers key takeaways.

Test Overview

1. A6000 GPU Details:

  • GPU: NVIDIA RTX A6000
  • Microarchitecture: Ampere
  • Compute capability: 8.6
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • Memory: 48GB GDDR6
  • FP32 performance: 38.71 TFLOPS

2. Test Project Code Source:

3. The Following Models from Hugging Face were Tested:

  • Qwen/Qwen2.5-7B-Instruct
  • Qwen/Qwen2.5-14B-Instruct
  • meta-llama/Llama-3.1-8B-Instruct
  • deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

4. The Test Parameters are Preset as Follows:

  • Input length: 100 tokens
  • Output length: 600 tokens

5. We conducted two rounds of A6000 vLLM tests under different concurrent request loads (a sketch of a comparable load-generation client follows this list):

  • Scenario 1: 50 concurrent requests
  • Scenario 2: 100 concurrent requests
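
For reference, this load pattern can be reproduced against any OpenAI-compatible vLLM endpoint with a small asynchronous client. The sketch below is not the exact harness used to produce the numbers in this report; it assumes a vLLM server is already listening at http://localhost:8000, and the model name, prompt construction, and endpoint path are illustrative. It fires N concurrent streaming requests and records time to first token (TTFT) and aggregate output throughput.

```python
# Minimal async load generator for an OpenAI-compatible vLLM endpoint (sketch).
# Requires: pip install aiohttp. URL, model, and prompt are assumptions,
# not the exact benchmark harness behind the tables below.
import asyncio
import time

import aiohttp

BASE_URL = "http://localhost:8000/v1/completions"  # assumed vLLM server address
MODEL = "Qwen/Qwen2.5-7B-Instruct"                 # any of the five tested models
PROMPT = "benchmark " * 100                        # roughly 100 input tokens, per the preset
CONCURRENCY = 50                                   # 50 or 100, matching the two scenarios


async def one_request(session: aiohttp.ClientSession, results: list) -> None:
    payload = {
        "model": MODEL,
        "prompt": PROMPT,
        "max_tokens": 600,   # preset output length
        "temperature": 0.0,
        "stream": True,
    }
    start = time.perf_counter()
    ttft, chunks = None, 0
    async with session.post(BASE_URL, json=payload) as resp:
        async for raw in resp.content:              # server-sent events, one line per chunk
            line = raw.decode().strip()
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            chunks += 1                             # ~1 token per streamed chunk (approximation)
    results.append((ttft, chunks))


async def main() -> None:
    results: list = []
    async with aiohttp.ClientSession() as session:
        t0 = time.perf_counter()
        await asyncio.gather(*(one_request(session, results) for _ in range(CONCURRENCY)))
        duration = time.perf_counter() - t0
    total_output = sum(c for _, c in results)
    ttfts = sorted(t for t, _ in results if t is not None)
    print(f"duration: {duration:.2f} s, output throughput: {total_output / duration:.2f} tokens/s")
    print(f"median TTFT: {ttfts[len(ttfts) // 2] * 1000:.2f} ms")


if __name__ == "__main__":
    asyncio.run(main())
```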

A6000 Benchmark for Scenario 1: 50 Concurrent Requests

| Models | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Llama-3.1-8B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B |
|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 15 | 28 | 15 | 15 | 28 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
| Request Numbers | 50 | 50 | 50 | 50 | 50 |
| Benchmark Duration (s) | 17.81 | 36.25 | 18.26 | 19.37 | 31.04 |
| Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 |
| Total Generated Tokens | 23817 | 25213 | 22247 | 26839 | 17572 |
| Request (req/s) | 2.81 | 1.38 | 2.74 | 2.58 | 1.61 |
| Input (tokens/s) | 280.74 | 137.93 | 273.89 | 258.16 | 161.07 |
| Output (tokens/s) | 1337.24 | 695.51 | 1218.63 | 1385.77 | 566.08 |
| Total Throughput (tokens/s) | 1617.98 | 833.44 | 1492.52 | 1643.93 | 727.15 |
| Median TTFT (ms) | 583.54 | 1095.08 | 593.26 | 652.56 | 474.60 |
| P99 TTFT (ms) | 761.02 | 1407.86 | 810.46 | 857.21 | 637.96 |
| Median TPOT (ms) | 28.67 | 58.59 | 29.40 | 31.15 | 51.18 |
| P99 TPOT (ms) | 160.33 | 63.10 | 39.80 | 38.20 | 64.46 |
| Median Eval Rate (tokens/s) | 34.88 | 17.07 | 34.01 | 32.10 | 19.54 |
| P99 Eval Rate (tokens/s) | 6.23 | 15.85 | 25.13 | 26.18 | 15.51 |
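
The per-second rows in the table are simple ratios of the raw counts over the benchmark duration. A quick check using the Qwen2.5-7B-Instruct column (small differences versus the table come from the duration being rounded to two decimals here):

```python
# Deriving the rate rows from the raw counts (Qwen2.5-7B-Instruct, Scenario 1).
requests = 50
duration_s = 17.81
input_tokens = 5000
generated_tokens = 23817

print(requests / duration_s)                            # ~2.81 req/s
print(input_tokens / duration_s)                        # ~280.74 input tokens/s
print(generated_tokens / duration_s)                    # ~1337 output tokens/s (table: 1337.24)
print((input_tokens + generated_tokens) / duration_s)   # ~1618 total tokens/s (table: 1617.98)
```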

✅ Key Takeaways:

  • The DeepSeek-R1-Distill-Llama-8B model achieved the highest total throughput at 1,643.93 tokens/s.
  • Qwen2.5-7B-Instruct followed closely with 1,617.98 tokens/s.
  • The larger Qwen2.5-14B-Instruct model showed expectedly lower throughput at 833.44 tokens/s, which is still a respectable rate for a 14B model.

A6000 Benchmark for Scenario 2: 100 Concurrent Requests

| Models | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Llama-3.1-8B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B |
|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 15 | 28 | 15 | 15 | 28 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM |
| Request Numbers | 100 | 100 | 100 | 100 | 100 |
| Benchmark Duration (s) | 22.41 | 49.77 | 21.52 | 22.51 | 37.09 |
| Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 |
| Total Generated Tokens | 49662 | 51632 | 47223 | 54501 | 38003 |
| Request (req/s) | 4.46 | 2.01 | 4.65 | 4.44 | 2.70 |
| Input (tokens/s) | 446.22 | 200.93 | 464.6 | 444.22 | 269.61 |
| Output (tokens/s) | 2216.02 | 1037.44 | 2193.97 | 2421.04 | 1024.60 |
| Total Throughput (tokens/s) | 2662.24 | 1238.37 | 2658.57 | 2865.26 | 1294.21 |
| Median TTFT (ms) | 414.18 | 804.91 | 543.01 | 585.56 | 1673.75 |
| P99 TTFT (ms) | 950.56 | 1688.46 | 1015.96 | 1086.52 | 2830.12 |
| Median TPOT (ms) | 36.74 | 69.58 | 34.86 | 36.42 | 59.22 |
| P99 TPOT (ms) | 232.43 | 80.00 | 55.98 | 47.82 | 126.96 |
| Median Eval Rate (tokens/s) | 27.22 | 14.37 | 29.68 | 27.46 | 16.89 |
| P99 Eval Rate (tokens/s) | 4.30 | 12.5 | 17.86 | 20.91 | 7.87 |

✅ Key Takeaways:

  • DeepSeek-R1-Distill-Llama-8B maintained its lead with 2,865.26 tokens/s.
  • Llama-3.1-8B-Instruct showed strong scaling at 2,658.57 tokens/s.
  • Qwen2.5-14B-Instruct remained the slowest at 1,238.37 tokens/s. Still, that is a substantial jump from the 833.44 tokens/s it delivered at 50 concurrent requests, which indicates it had not hit its throughput ceiling at the lower concurrency; as the quick calculation below shows, a single A6000 can support 100 concurrent requests even for a 14B model.
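
A quick scaling calculation using the Total Throughput rows of the two tables makes this concrete (all values are copied from the tables above):

```python
# Throughput scaling from 50 to 100 concurrent requests (tokens/s, from the tables above).
throughput_50 = {
    "Qwen2.5-7B-Instruct": 1617.98,
    "Qwen2.5-14B-Instruct": 833.44,
    "Llama-3.1-8B-Instruct": 1492.52,
    "DeepSeek-R1-Distill-Llama-8B": 1643.93,
    "DeepSeek-R1-Distill-Qwen-14B": 727.15,
}
throughput_100 = {
    "Qwen2.5-7B-Instruct": 2662.24,
    "Qwen2.5-14B-Instruct": 1238.37,
    "Llama-3.1-8B-Instruct": 2658.57,
    "DeepSeek-R1-Distill-Llama-8B": 2865.26,
    "DeepSeek-R1-Distill-Qwen-14B": 1294.21,
}
for model, t50 in throughput_50.items():
    print(f"{model}: {throughput_100[model] / t50:.2f}x")
# Every model gains from doubling the concurrency; Qwen2.5-14B-Instruct improves ~1.49x.
```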

Insights for A6000 vLLM Performance

✅ Model Size Considerations

The 7B/8B models consistently outperformed their larger 14B counterparts in throughput metrics while maintaining competitive latency figures. The DeepSeek distilled models showed particularly strong performance relative to their base model sizes.

✅ A6000 Supports 14B Models with High Concurrency

Both Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B were successfully hosted on a single A6000, even under 100 concurrent requests, with reasonable latency and high throughput. This shows that the A6000 is well-suited for production-grade inference of models ≤14B, particularly for chat and generation applications.
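
The report does not list the exact engine arguments used for these runs. For orientation, a plausible way to host one of the 14B models on a single 48GB A6000 with vLLM's offline Python API is sketched below; the parameter values are assumptions, not the tested configuration.

```python
# Sketch: running a 14B model on a single 48 GB A6000 with vLLM's Python API.
# The engine arguments below are plausible defaults, not the benchmarked settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",  # ~28 GB of 16-bit weights (see the Size (GB) row)
    dtype="bfloat16",
    gpu_memory_utilization=0.90,        # leave headroom; remaining memory becomes KV cache
    max_model_len=2048,                 # 100-token prompts + 600-token outputs fit comfortably
    max_num_seqs=100,                   # allow up to 100 sequences per batch (Scenario 2)
)

params = SamplingParams(temperature=0.0, max_tokens=600)
prompts = ["Summarize the benefits of paged attention."] * 100  # 100 requests, batched by vLLM
outputs = llm.generate(prompts, params)
print(len(outputs), "completions generated")
```

For a server deployment like the one benchmarked here, the same model and memory settings would instead be passed to vLLM's OpenAI-compatible server and the requests sent over HTTP, as in the client sketch earlier in this report.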

✅ Best Performer: DeepSeek R1-8B

DeepSeek-R1-Distill-Llama-8B outperformed the rest of the field, posting the highest total throughput in both scenarios with competitive latency, making it a top-tier choice for cost-efficient A6000 hosting.

✅ Llama-3.1-8B is a Sweet Spot

The newer Llama-3.1-8B-Instruct model pairs a modern architecture with excellent performance, trailing only slightly behind DeepSeek-R1-Distill-Llama-8B.

Get Started with RTX A6000 Server Rental!

Interested in optimizing your vLLM deployment? Check out our cheap GPU server hosting services or explore alternative GPUs for high-end AI inference.

Flash Sale to May 6

Enterprise GPU Dedicated Server - RTX A6000

$329.00/mo
40% OFF Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: NVIDIA RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimal for running AI, deep learning, data visualization, HPC, etc.

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.
Flash Sale to May 6

Enterprise GPU Dedicated Server - A100

$469.00/mo
41% OFF Recurring (Was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation for large-scale inference, AI training, ML, etc.
New Arrival

Enterprise GPU Dedicated Server - A100(80GB)

$1559.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS

⚠️ Note: Gemma 3 12B is not compatible with the A6000

During testing, we encountered significant thermal challenges with Gemma 3 12B on the A6000. The GPU temperature reached 90°C and stayed at 83°C for extended periods without active cooling intervention. Neither Ollama nor vLLM could run Gemma 3 12B successfully on our A6000 setup. This suggests that Gemma 3's multimodal architecture may require more powerful GPUs, such as the H100, RTX 4090, or RTX 5090, for stable operation.

Conclusion: A6000 is the Cheapest Choice for 7B-14B LLMs

The benchmark results demonstrate that A6000 GPUs with the vLLM backend can effectively host medium-sized LLMs (7B-14B parameters) for production workloads. The DeepSeek-R1-Distill-Llama-8B model showed particularly strong performance across all metrics, making it an excellent choice for A6000 deployments requiring balanced throughput and latency.

For applications prioritizing throughput over model size, the 7B/8B class models generally provide better performance characteristics on A6000 hardware. The 14B models remain viable for applications where model capability outweighs pure performance metrics.

Attachment: Video Recording of A6000 vLLM Benchmark

Screenshot: A6000 vLLM benchmark with 50 Concurrent Requests
(Per-model screenshots: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B-Instruct, meta-llama/Llama-3.1-8B-Instruct, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)
Screenshot: A6000 vLLM benchmark with 100 Concurrent Requests
(Per-model screenshots: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B-Instruct, meta-llama/Llama-3.1-8B-Instruct, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)

Data Item Explanation in the Table:

  • Quantization: The number of quantization bits. This test uses 16-bit (FP16/BF16), i.e., the full-precision, unquantized models.
  • Size(GB): Model size in GB.
  • Backend: The inference backend used. In this test, vLLM is used.
  • Request Numbers: The number of requests successfully processed.
  • Benchmark duration(s): The total time to complete all requests.
  • Total input tokens: The total number of input tokens across all requests.
  • Total generated tokens: The total number of output tokens generated across all requests.
  • Request (req/s): The number of requests processed per second.
  • Input (tokens/s): The number of input tokens processed per second.
  • Output (tokens/s): The number of output tokens generated per second.
  • Total Throughput (tokens/s): The total number of tokens processed per second (input + output).
  • Median TTFT(ms): The time from when the request is made to when the first token is received, in milliseconds. A lower TTFT means that the user is able to get a response faster.
  • P99 TTFT (ms): The 99th percentile Time to First Token, representing the worst-case latency for 99% of requests—lower is better to ensure consistent performance.
  • Median TPOT(ms): The time required to generate each output token, in milliseconds. A lower TPOT indicates that the system is able to generate a complete response faster.
  • P99 TPOT (ms): The 99th percentile Time Per Output Token, showing the worst-case delay in token generation—lower is better to minimize response variability.
  • Median Eval Rate(tokens/s): The number of tokens evaluated per second per user. A high evaluation rate indicates that the system is able to serve each user efficiently.
  • P99 Eval Rate(tokens/s): The per-user token generation rate at the 99th percentile, representing the worst-case user experience; higher is better. (A quick consistency check against the TPOT figures follows this list.)
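
The eval-rate rows are consistent with the eval rate being the reciprocal of TPOT. A quick check against the Scenario 1 medians (values copied from the first table):

```python
# Eval rate is the per-user decode speed, i.e. roughly 1000 / TPOT(ms).
# Median TPOT values from the Scenario 1 table:
median_tpot_ms = {
    "Qwen2.5-7B-Instruct": 28.67,
    "Qwen2.5-14B-Instruct": 58.59,
    "Llama-3.1-8B-Instruct": 29.40,
    "DeepSeek-R1-Distill-Llama-8B": 31.15,
    "DeepSeek-R1-Distill-Qwen-14B": 51.18,
}
for model, tpot in median_tpot_ms.items():
    print(f"{model}: {1000 / tpot:.2f} tokens/s")
# Prints ~34.88, 17.07, 34.01, 32.10, 19.54 -- matching the Median Eval Rate row.
```
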
Tags:

A6000 vLLM benchmark, vLLM A6000 performance, A6000 LLMs multi-concurrency, A6000 hosting LLMs, Qwen vs LLaMA-3 A6000, DeepSeek-R1 A6000 benchmark, Best LLM for A6000, A6000 token throughput, LLM inference speed A6000, vLLM vs Ollama A6000