Evaluation: Performance of Running LLMs with Ollama on Nvidia A40 GPU Server

As large language models (LLMs) continue to advance, researchers and enterprises increasingly seek high-performance hardware for hosting and running these models. This report evaluates the performance of Nvidia A40 GPUs when running LLMs with the Ollama platform, offering detailed analysis and insights from practical data.

Test Environment and Configuration

Here’s the detailed specification of the Nvidia A40 hosting server used in our tests:

Server Configuration:

  • CPU: Dual 18-Core E5-2697v4 (36 cores, 72 threads)
  • RAM: 256GB
  • Storage: 240GB SSD + 2TB NVMe + 8TB SATA
  • Network: 100Mbps-1Gbps connection
  • OS: Windows 11 Pro

GPU Details:

  • GPU: Nvidia A40
  • Compute Capability: 8.6
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 37.48 TFLOPS

Its ultra-high VRAM (48GB) ensures that the A40 server can run 70B models.

LLM Inference Tested on Ollama with the A40

The tests covered Meta's Llama series (Llama2 through Llama3.3, all 70B), Alibaba's Qwen series (14B-72B), and others such as Gemma2, Llava, and QwQ. Each model was run with 4-bit quantization to reduce memory use. The following language models were tested (a minimal sketch of how each can be invoked follows the list):
  • Llama2 (70B)
  • Llama3 (70B)
  • Llama3.1 (70B)
  • Llama3.3 (70B)
  • Qwen (32B, 72B)
  • Qwen2.5 (14B, 32B, 72B)
  • Gemma2 (27B)
  • Llava (34B)
  • QwQ (32B)
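
As an illustration (not part of the original test setup), here is a minimal Python sketch of pulling and querying one of these models through Ollama's local REST API. It assumes Ollama is running on its default port 11434; the model tag and prompt are placeholders, and Ollama's default library tags such as llama2:70b ship as 4-bit quantized weights:

```python
# Minimal sketch: pull a model and run one generation via Ollama's REST API.
# Assumes a local Ollama server on the default port (http://localhost:11434).
import requests

OLLAMA = "http://localhost:11434"

# Pull the model (blocks until the ~39GB download finishes).
requests.post(f"{OLLAMA}/api/pull",
              json={"model": "llama2:70b", "stream": False},
              timeout=None)

# Run a single non-streaming generation.
resp = requests.post(f"{OLLAMA}/api/generate",
                     json={"model": "llama2:70b",
                           "prompt": "Why is the sky blue?",
                           "stream": False},
                     timeout=None)
print(resp.json()["response"])
```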

Benchmark Results: Ollama Performance Metrics on the Nvidia A40

The Nvidia A40 performed well, particularly with medium-sized models. Key metrics are summarized below:
All models were run with 4-bit quantization on Ollama 0.5.4, and downloads averaged 11 MB/s across the board.

| Model    | Parameters | Size | CPU Rate | RAM Rate | GPU Util | Eval Rate (tokens/s) |
|----------|------------|------|----------|----------|----------|----------------------|
| llama2   | 70b        | 39GB | 2%       | 3%       | 98%      | 13.52                |
| llama3   | 70b        | 40GB | 2%       | 3%       | 94%      | 13.15                |
| llama3.1 | 70b        | 43GB | 3%       | 3%       | 94%      | 12.09                |
| llama3.3 | 70b        | 43GB | 2%       | 3%       | 94%      | 12.10                |
| qwen     | 32b        | 18GB | 3%       | 3%       | 90%      | 24.88                |
| qwen     | 72b        | 41GB | 17-22%   | 3%       | 66%      | 8.46                 |
| qwen2.5  | 14b        | 9GB  | 2%       | 3%       | 83%      | 44.59                |
| qwen2.5  | 32b        | 20GB | 2%       | 3%       | 92%      | 23.04                |
| qwen2.5  | 72b        | 47GB | 30-40%   | 3%       | 42-50%   | 5.78                 |
| gemma2   | 27b        | 16GB | 3%       | 3%       | 89%      | 29.17                |
| llava    | 34b        | 19GB | 3%       | 3%       | 94%      | 25.84                |
| qwq      | 32b        | 20GB | 2%       | 3%       | 90%      | 23.11                |
[Video: real-time A40 GPU server resource consumption during the tests.]
[Screenshots: ollama run sessions for each of the twelve models, from llama2:70b through qwq:32b.]
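
For readers who want to reproduce the Eval Rate column: ollama run <model> --verbose prints these statistics after each reply, and the same numbers can be computed from the REST API response, which reports eval_count (tokens generated) and eval_duration (in nanoseconds). A minimal sketch, assuming a local Ollama server and using an illustrative model and prompt:

```python
# Sketch: derive "Eval Rate (tokens/s)" from Ollama's /api/generate response.
# eval_count = tokens generated; eval_duration = generation time in nanoseconds.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:14b",
          "prompt": "Explain GPU memory bandwidth.",
          "stream": False},
    timeout=None,
).json()

eval_rate = resp["eval_count"] / resp["eval_duration"] * 1e9  # tokens per second
print(f"eval rate: {eval_rate:.2f} tokens/s")
```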

Key Insights

1. Upper Limits of A40's Processing Capability

The Nvidia A40's 48GB of memory allows stable operation of models up to 70 billion parameters (e.g., Llama2:70b and Llama3:70b). At this scale, GPU utilization reached 94%-98%, with an evaluation rate of 12-13.5 tokens/s.
However, 72B models hit the card's ceiling: Qwen:72b dropped to 8.46 tokens/s, and Qwen2.5:72b (a 47GB model that barely fits in VRAM) fell to 5.78 tokens/s with 30-40% CPU usage and only 42-50% GPU utilization, suggesting partial offload to the CPU. The A40 is therefore not ideal for sustained operation of ultra-large models (72B or larger); the rough arithmetic below shows why.
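
A back-of-the-envelope VRAM check (our approximation, not an exact rule) makes the pattern visible: the on-disk size of a 4-bit model plus a few GB of headroom for the KV cache and CUDA buffers must fit within the card's 48GB. The headroom figure below is an assumption; the model sizes come from the benchmark table:

```python
# Rough fit check: Q4 model size + assumed runtime headroom vs. 48GB VRAM.
VRAM_GB = 48
HEADROOM_GB = 4  # assumed allowance for KV cache / CUDA buffers

models = {"llama2:70b": 39, "llama3.3:70b": 43, "qwen:72b": 41, "qwen2.5:72b": 47}
for name, size_gb in models.items():
    fits = size_gb + HEADROOM_GB <= VRAM_GB
    print(f"{name}: {size_gb}GB + {HEADROOM_GB}GB headroom -> "
          f"{'fits in VRAM' if fits else 'spills to CPU'}")
```

Under this estimate the 70B models fit, while Qwen2.5:72b spills past 48GB, which is consistent with its elevated CPU usage and reduced GPU utilization in the table.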

2. Performance on Medium-Scale Models: 32b-34b

For models with 32b-34b parameters (e.g., Qwen:32b, Llava:34b), the A40 performed exceptionally well:
  • GPU utilization: Stable at 90%-94%
  • Evaluation rate: 23-26 tokens/s
This makes the A40 well-suited for medium-scale models, delivering high inference speed while efficiently utilizing memory and compute resources.

3. High Efficiency for Small Models

For smaller models (e.g., Qwen2.5:14b), memory and compute demands decrease significantly. In this case:
  • GPU utilization: 83%
  • Evaluation rate: 44.59 tokens/s
With so much headroom left, the A40 offers excellent value for scenarios that serve several smaller models, or several parallel requests, at once (see the sketch below).
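
As one illustration of that scenario, the sketch below fans several requests out to a small model over the REST API. It assumes the Ollama server was started with OLLAMA_NUM_PARALLEL set (e.g., OLLAMA_NUM_PARALLEL=4 ollama serve) so that requests are processed concurrently rather than queued; the prompts are placeholders:

```python
# Sketch: several simultaneous requests against one small model on the A40.
# Assumes OLLAMA_NUM_PARALLEL is set on the server so requests run in parallel.
from concurrent.futures import ThreadPoolExecutor
import requests

def ask(prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5:14b", "prompt": prompt, "stream": False},
        timeout=None,
    )
    return r.json()["response"]

prompts = ["Summarize RAID levels.", "What is NVMe?",
           "Define TFLOPS.", "Explain VRAM."]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80], "...")
```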

Advantages of Nvidia A40

1. Efficient Memory Management

With 48GB GDDR6, the A40 can stably run models up to 70b parameters and supports 4-bit quantization to save memory.

2. High Inference Performance

When handling medium-sized models (32B-34B), evaluation rates of 23-26 tokens/s significantly boost inference efficiency.

3. Versatile Adaptability

From 14B models that leave headroom for concurrent requests to 70B models that nearly fill its VRAM, the A40 handles a wide range of model sizes on a single card.

Limitations and Recommendations

1. Limited Support for Ultra-Large Models

For models exceeding 70b parameters, the A40's performance is constrained. For such workloads, consider GPUs like the A100 or H100.

2. Network Bottleneck

The download speed during tests averaged 11 MB/s; at that rate, a 40GB 70B model takes roughly an hour to fetch, which delays first-time model loading. Upgrading to a faster network connection is advisable if models are pulled frequently.

Comparison: Nvidia A40 vs A6000

Both the Nvidia A40 and Nvidia A6000 come equipped with 48GB of VRAM, making them capable of running models up to 70 billion parameters with similar performance. In practical scenarios, these GPUs are nearly interchangeable for tasks involving models like LLaMA2:70b, with evaluation rates and GPU utilization showing minimal differences.

However, when it comes to models exceeding 72 billion parameters, neither the A40 nor the A6000 can maintain sufficient evaluation speed or stability. For such ultra-large models, GPUs with higher memory capacities, such as the A100 80GB or the H100, are highly recommended. These GPUs offer significantly better memory bandwidth and compute capabilities, ensuring smooth performance for next-generation LLMs.

The A100 80GB is a particularly cost-effective choice for hosting multiple large models simultaneously, while the H100 represents the pinnacle of performance for AI workloads, ideal for cutting-edge research and production.

Enterprise GPU Dedicated Server - A40

$439.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A40
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 37.48 TFLOPS
  • Ideal for hosting AI image generators, deep learning, HPC, 3D rendering, VR/AR, etc.

Enterprise GPU Dedicated Server - RTX 4090

$302.00/mo
44% Off Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.

Enterprise GPU Dedicated Server - RTX A6000

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimal for running AI, deep learning, data visualization, HPC, etc.

Multi-GPU Dedicated Server - 2x RTX 4090

$729.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS

Conclusion: How Well Does Nvidia A40 Perform for LLMs?

Overall, the Nvidia A40 is a highly cost-effective GPU, especially for small and medium-sized LLM inference tasks. Its 48GB of VRAM enables stable support for models up to 70B parameters at evaluation rates of 12-13.5 tokens/s, while 32B-34B models achieve even better results at 23-26 tokens/s.

If you're looking for a GPU server to host LLMs, the Nvidia A40 is a strong candidate. It delivers excellent performance at a reasonable cost, making it suitable for both model development and production deployment.

Tags:

A40 benchmark, Nvidia A40, Ollama benchmark, LLM A40, A40 test, A40 GPU, Nvidia A40 GPU, A40 hosting, A40 vs A6000, LLM hosting, Nvidia A40 server, A100 vs A40, H100 vs A40, Ollama A40