Test Overview
1. A40 GPU Details:
- GPU: Nvidia A40
- Microarchitecture: Ampere
- Compute capability: 8.6
- CUDA Cores: 10,752
- Tensor Cores: 336
- Memory: 48GB GDDR6
- FP32 performance: 37.48 TFLOPS
2. Test Project Code Source:
- We built the test environment from the vLLM project: https://github.com/vllm-project/vllm
3. The Following Models from Hugging Face were Tested:
- meta-llama/Llama-3.1-8B-Instruct
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
- Qwen/Qwen2.5-14B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- google/gemma-3-12b-it
4. The Test Parameters are Preset as Follows:
- Input length: 100 tokens
- Output length: 600 tokens
5. We conducted two rounds of A40 vLLM tests under different concurrent request loads (a minimal load-generation sketch follows this list):
- Scenario 1: 50 concurrent requests
- Scenario 2: 100 concurrent requests
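To make the setup concrete, here is a minimal Python load-generation sketch, not the actual benchmark script behind the tables below. It assumes a vLLM OpenAI-compatible server is already running locally (for example, started with `vllm serve meta-llama/Llama-3.1-8B-Instruct`) and that the `openai` client package is installed; the endpoint URL, placeholder API key, and prompt are illustrative assumptions.

```python
# Minimal concurrent-load sketch (illustrative, not the benchmark script used here).
# Assumes a vLLM OpenAI-compatible server at localhost:8000.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

CONCURRENCY = 50        # Scenario 1; set to 100 for Scenario 2
PROMPT = "word " * 100  # roughly 100 input tokens
MAX_TOKENS = 600        # fixed output length used in this test


async def one_request() -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = await client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    start = time.perf_counter()
    generated = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    duration = time.perf_counter() - start
    print(f"duration: {duration:.2f}s")
    print(f"output throughput: {sum(generated) / duration:.2f} tokens/s")


asyncio.run(main())
```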
A40 Benchmark for Scenario 1: 50 Concurrent Requests
Models | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | Qwen/Qwen2.5-14B-Instruct | Qwen/Qwen2.5-7B-Instruct | google/gemma-3-12b-it |
---|---|---|---|---|---|---|
Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
Size(GB) | 15 | 15 | 28 | 28 | 15 | 23 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Request Numbers | 50 | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration(s) | 21.81 | 23.81 | 38.77 | 47.58 | 23.48 | 65.02 |
Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
Total Generated Tokens | 22170 | 27517 | 17321 | 25213 | 25703 | 27148 |
Request (req/s) | 2.29 | 2.10 | 1.29 | 1.05 | 2.13 | 0.77 |
Input (tokens/s) | 229.27 | 210.02 | 128.96 | 105.08 | 212.93 | 76.9 |
Output (tokens/s) | 1016.60 | 1155.80 | 446.73 | 529.88 | 1094.54 | 417.54 |
Total Throughput (tokens/s) | 1245.87 | 1365.82 | 575.69 | 634.96 | 1307.47 | 494.44 |
Median TTFT(ms) | 219.88 | 756.72 | 1153.98 | 1434.29 | 761.11 | 401.22 |
P99 TTFT(ms) | 242.82 | 940.90 | 1486.31 | 1847.22 | 983.28 | 573.97 |
Median TPOT(ms) | 35.96 | 38.41 | 63.45 | 76.88 | 37.82 | 107.74 |
P99 TPOT(ms) | 36.12 | 41.08 | 95.96 | 80.39 | 190.41 | 115.72 |
Median Eval Rate(tokens/s) | 27.81 | 26.03 | 15.76 | 13.01 | 26.44 | 9.28 |
P99 Eval Rate(tokens/s) | 27.69 | 24.34 | 10.42 | 12.44 | 5.25 | 8.64 |
✅ Key Takeaways:
- The 7B-8B models (Llama-3.1-8B, DeepSeek-R1-Distill-Llama-8B, Qwen2.5-7B) perform best, each sustaining over 1,200 total tokens/s.
- 14B models are usable but slower.
- The Gemma-3-12B and DeepSeek 14B models begin to push the limits of the GPU memory, especially when using longer sequence lengths.
- At 50 requests, the A40 shows strong baseline performance across models, especially in token generation and throughput. However, it’s clear from later results that the GPU is not fully saturated at this level for most models.
A40 Benchmark for Scenario 2: 100 Concurrent Requests
Models | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | Qwen/Qwen2.5-14B-Instruct | Qwen/Qwen2.5-7B-Instruct | google/gemma-3-12b-it |
---|---|---|---|---|---|---|
Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
Size(GB) | 15 | 15 | 28 | 28 | 15 | 23 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Request Numbers | 100 | 100 | 100 | 100 | 100 | 100 |
Benchmark Duration(s) | 26.61 | 40.83 | 45.25 | 87.61 | 42.94 | 123.86 |
Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
Total Generated Tokens | 45375 | 54430 | 37135 | 51177 | 48267 | 50548 |
Request (req/s) | 3.76 | 2.45 | 2.21 | 1.14 | 2.33 | 0.81 |
Input (tokens/s) | 375.86 | 244.94 | 220.99 | 114.14 | 241.87 | 80.74 |
Output (tokens/s) | 1705.49 | 1333.21 | 820.62 | 584.14 | 1123.99 | 408.12 |
Total Throughput (tokens/s) | 2081.35 | 1578.15 | 1041.61 | 698.28 | 1356.86 | 488.86 |
Median TTFT(ms) | 487.81 | 671.18 | 546.93 | 1426.07 | 1093.55 | 1865.75 |
P99 TTFT(ms) | 1065.56 | 1422.67 | 1838.40 | 5381.54 | 2443.18 | 4022.71 |
Median TPOT(ms) | 43.43 | 66.86 | 74.32 | 143.52 | 69.68 | 154.06 |
P99 TPOT(ms) | 64.93 | 67.43 | 118.59 | 166.48 | 793.62 | 1091.31 |
Median Eval Rate(tokens/s) | 23.03 | 14.86 | 13.45 | 6.97 | 14.35 | 6.49 |
P99 Eval Rate(tokens/s) | 43.42 | 14.83 | 8.43 | 6.01 | 1.26 | 0.92 |
✅ Key Takeaways:
- A40 can handle 8B models efficiently even at high concurrency.
- 14B models (Qwen/DeepSeek) are usable but with higher latency.
- Gemma-3-12B is not recommended for A40 due to poor scaling.
- The A40 successfully handled 100 concurrent requests with both tested 14B models, proving that it is viable for real-world LLM inference at scale, especially for models like DeepSeek and Qwen.
- Doubling to 100 requests raises total throughput for most models, indicating the A40 still had headroom at 50 requests; the gain is clearest for Llama-3.1-8B and DeepSeek-R1-Distill-Qwen-14B (roughly 1.7-1.8x, see the scaling check below).
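The scaling claim can be checked directly from the two tables. The short Python sketch below divides each model's total throughput at 100 requests by its value at 50 requests (all numbers copied from the tables above): a ratio near 2x means headroom remained at 50 requests, while a ratio near 1x means the GPU was already saturated.

```python
# Throughput scaling from 50 to 100 concurrent requests,
# using the "Total Throughput (tokens/s)" rows of the two tables above.
throughput_50 = {
    "Llama-3.1-8B-Instruct": 1245.87,
    "DeepSeek-R1-Distill-Llama-8B": 1365.82,
    "DeepSeek-R1-Distill-Qwen-14B": 575.69,
    "Qwen2.5-14B-Instruct": 634.96,
    "Qwen2.5-7B-Instruct": 1307.47,
    "gemma-3-12b-it": 494.44,
}
throughput_100 = {
    "Llama-3.1-8B-Instruct": 2081.35,
    "DeepSeek-R1-Distill-Llama-8B": 1578.15,
    "DeepSeek-R1-Distill-Qwen-14B": 1041.61,
    "Qwen2.5-14B-Instruct": 698.28,
    "Qwen2.5-7B-Instruct": 1356.86,
    "gemma-3-12b-it": 488.86,
}

for name, t50 in throughput_50.items():
    print(f"{name}: {throughput_100[name] / t50:.2f}x")

# Llama-3.1-8B scales ~1.67x and DeepSeek-R1-Distill-Qwen-14B ~1.81x,
# while gemma-3-12b-it stays flat (~0.99x), matching the takeaways above.
```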
Insights for A40 vLLM Performance
✅ Best for 7B-8B models
✅ 14B models are possible, but slower
✅ Avoid Gemma-3-12B at 100 concurrent requests on the A40
⚠️ Operational Notes
- Set --max-model-len 4096 when running 14B models. Without it, vLLM will fail to load them, typically because the model's default context length requires more KV-cache memory than remains on the 48GB card (see the loading sketch after these notes).
- Temperature warning: the A40 can exceed 80°C under load. It is recommended to manually increase fan cooling for about 5 minutes during heavy workloads.
- Gemma-3-12B results are not ideal: on the A40 it is only practical at up to 50 concurrent requests. If higher concurrency is required, an A100 40GB is recommended for Gemma-3-12B inference.
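As an illustration of the first note, a 14B model can also be loaded with a capped context window through vLLM's offline Python API, the equivalent of passing --max-model-len 4096 to the server. This is a minimal sketch under those assumptions; the gpu_memory_utilization value and the prompt are illustrative, not tuned settings from this benchmark.

```python
# Minimal sketch: load a 14B model on the A40 with a capped context length.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    max_model_len=4096,           # same effect as --max-model-len 4096
    gpu_memory_utilization=0.90,  # illustrative; leaves some headroom on the 48GB card
)

params = SamplingParams(max_tokens=600)  # output length used in this test
outputs = llm.generate(["Summarize the benefits of batched LLM inference."], params)
print(outputs[0].outputs[0].text)
```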
Get Started with Nvidia A40 Server Rental!
Interested in optimizing your vLLM deployment? Check out our cheap GPU server hosting services or explore alternative GPUs for high-end AI inference.
Enterprise GPU Dedicated Server - RTX 4090
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: GeForce RTX 4090
- Microarchitecture: Ada Lovelace
- CUDA Cores: 16,384
- Tensor Cores: 512
- GPU Memory: 24 GB GDDR6X
- FP32 Performance: 82.6 TFLOPS
- Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.
Enterprise GPU Dedicated Server - A40
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A40
- Microarchitecture: Ampere
- CUDA Cores: 10,752
- Tensor Cores: 336
- GPU Memory: 48GB GDDR6
- FP32 Performance: 37.48 TFLOPS
- Ideal for hosting AI image generator, deep learning, HPC, 3D Rendering, VR/AR etc.
Enterprise GPU Dedicated Server - A100
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 40GB HBM2
- FP32 Performance: 19.5 TFLOPS
- Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training/ML, etc.
Enterprise GPU Dedicated Server - A100(80GB)
- 256GB RAM
- Dual 18-Core E5-2697v4
- 240GB SSD + 2TB NVMe + 8TB SATA
- 100Mbps-1Gbps
- OS: Windows / Linux
- GPU: Nvidia A100
- Microarchitecture: Ampere
- CUDA Cores: 6912
- Tensor Cores: 432
- GPU Memory: 80GB HBM2e
- FP32 Performance: 19.5 TFLOPS
Conclusion: A40 is the Cheapest Choice for 14B LLMs
The NVIDIA A40 is excellent for 7B-8B models (DeepSeek, Llama-3, Qwen-7B) even at 100 concurrent requests. While 14B models work, they require optimizations (--max-model-len 4096) and thermal monitoring.
The A40 is a robust and cost-effective GPU for LLM inference up to 14B, comfortably supporting up to 100 concurrent requests. With minor tuning, it is well-suited for multi-user inference workloads and production-scale deployments of open-weight LLMs. For scalable hosting of LLaMA, Qwen, DeepSeek and other 7–14B models, the A40 stands out as a reliable mid-tier GPU choice.
Attachment: Video Recording of A40 vLLM Benchmark
Data Item Explanation in the Table:
- Quantization: The number of quantization bits. This test uses 16-bit (FP16/BF16), i.e., the full-precision, unquantized model.
- Size(GB): Model size in GB.
- Backend: The inference backend used. In this test, vLLM is used.
- Request Numbers: The number of requests successfully processed.
- Benchmark duration(s): The total time to complete all requests.
- Total input tokens: The total number of input tokens across all requests.
- Total generated tokens: The total number of output tokens generated across all requests.
- Request (req/s): The number of requests processed per second.
- Input (tokens/s): The number of input tokens processed per second.
- Output (tokens/s): The number of output tokens generated per second.
- Total Throughput (tokens/s): The total number of tokens processed per second, i.e., (total input tokens + total generated tokens) / benchmark duration (see the worked check after this list).
- Median TTFT(ms): The time from when the request is made to when the first token is received, in milliseconds. A lower TTFT means that the user is able to get a response faster.
- P99 TTFT (ms): The 99th percentile Time to First Token, representing the worst-case latency for 99% of requests—lower is better to ensure consistent performance.
- Median TPOT(ms): The time required to generate each output token, in milliseconds. A lower TPOT indicates that the system is able to generate a complete response faster.
- P99 TPOT (ms): The 99th percentile Time Per Output Token, showing the worst-case delay in token generation—lower is better to minimize response variability.
- Median Eval Rate(tokens/s): The number of tokens evaluated per second per user. A high evaluation rate indicates that the system is able to serve each user efficiently.
- P99 Eval Rate(tokens/s): The per-user token evaluation rate at the 99th percentile, representing the worst-case user experience.
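As a worked check of these definitions, total throughput is simply (total input tokens + total generated tokens) divided by the benchmark duration. The sketch below plugs in the Llama-3.1-8B-Instruct numbers from the 50-request table; the tiny gap against the reported values comes from rounding of the published duration.

```python
# Worked check of the throughput definitions, using the
# Llama-3.1-8B-Instruct column of the 50-request table.
total_input_tokens = 5000
total_generated_tokens = 22170
benchmark_duration_s = 21.81

total_throughput = (total_input_tokens + total_generated_tokens) / benchmark_duration_s
output_throughput = total_generated_tokens / benchmark_duration_s

print(f"total throughput:  {total_throughput:.2f} tokens/s")   # ~1245.8 vs 1245.87 in the table
print(f"output throughput: {output_throughput:.2f} tokens/s")  # ~1016.5 vs 1016.60 in the table
```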
A40 Hosting, A40 vLLM benchmark, A40 LLM performance, A40 vs A6000 for LLMs, Best LLM for A40 GPU, vLLM A40 throughput, A40 14B model benchmark, A40 thermal management LLMs, DeepSeek vs Llama-3 A40, Qwen2.5 A40 speed test, Gemma-3 A40 compatibility