Benchmarking LLMs on Ollama with Nvidia A100 40GB GPU: Extreme Performance for Models Under 70B

In the world of Large Language Models (LLMs), having the right infrastructure is critical to achieving high performance without overspending on hardware. For AI workloads that require large-scale model inference, Nvidia A100 40GB GPUs offer a powerful solution. This article will evaluate the performance of running LLMs on Ollama using a dedicated Nvidia A100 40GB GPU server.

The A100 40GB GPU is known for its exceptional performance with models under 70B. This server configuration is priced at $799/month, offering an optimal balance between performance and cost for AI developers and businesses running demanding language models. Let's take a closer look at the server's performance and why it stands out for multi-concurrent LLM inference tasks.

Server Specifications

Here are the key specs of the Nvidia A100 40GB GPU server used for testing:

Server Configuration:

  • Price: $799.00/month
  • CPU: Dual 18-Core E5-2697v4 (36 cores & 72 threads)
  • RAM: 256GB
  • Storage: 240GB SSD + 2TB NVMe + 8TB SATA
  • Network: 1Gbps
  • OS: Windows 11 Pro

GPU Details:

  • GPU: Nvidia A100 40GB
  • Compute Capability: 8.0
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS

This server setup is highly efficient for running demanding LLMs while providing a cost-effective solution for businesses that require high-performance inference without the premium costs associated with more powerful GPUs.
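If you want to verify that a rented server actually exposes the advertised GPU, the NVIDIA Management Library bindings (the nvidia-ml-py / pynvml package) can read the device name and total memory directly. The short sketch below is a minimal check, assuming the package and the NVIDIA driver are installed and the A100 is device index 0.

```python
# Minimal check that GPU 0 matches the advertised spec (A100, 40GB HBM2).
# Assumes the nvidia-ml-py (pynvml) package and the NVIDIA driver are installed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)
if isinstance(name, bytes):          # older pynvml versions return bytes
    name = name.decode()
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU 0: {name}, {mem.total / 1024**3:.1f} GiB total VRAM")
pynvml.nvmlShutdown()
```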

Benchmark Results: Ollama GPU A100 40GB Performance Metrics

For our tests, we ran various models such as DeepSeek-R1, Qwen, and LLaMA using Ollama 0.5.7. The following table showcases the performance of the A100 40GB GPU with these models:
| Model | Parameters | Size (GB) | Quantization | Running on | Download Speed (MB/s) | CPU Rate | RAM Rate | GPU UTL | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|---|---|---|
| deepseek-r1:8b | 8B | 4.9 | 4-bit | Ollama 0.5.7 | 113 | 3% | 3% | 74% | 108.18 |
| deepseek-r1:14b | 14B | 9 | 4-bit | Ollama 0.5.7 | 113 | 2% | 4% | 74% | 61.33 |
| deepseek-r1:32b | 32B | 20 | 4-bit | Ollama 0.5.7 | 113 | 2% | 4% | 81% | 35.01 |
| llama3.1:8b | 8B | 4.9 | 4-bit | Ollama 0.5.7 | 113 | 3% | 4% | 25% | 106.72 |
| llama2:13b | 13B | 7.4 | 4-bit | Ollama 0.5.7 | 113 | 3% | 4% | 86% | 93.61 |
| llama3:70b | 70B | 40 | 4-bit | Ollama 0.5.7 | 113 | 3% | 4% | 91% | 24.09 |
| qwen2.5:14b | 14B | 9 | 4-bit | Ollama 0.5.7 | 113 | 3% | 4% | 73% | 64.98 |
| qwen2.5:32b | 32B | 20 | 4-bit | Ollama 0.5.7 | 113 | 2% | 4% | 83% | 35.44 |
| qwen:32b | 32B | 18 | 4-bit | Ollama 0.5.7 | 113 | 1% | 4% | 84% | 42.05 |
| gemma2:27b | 27B | 16 | 4-bit | Ollama 0.5.7 | 113 | 1% | 4% | 80% | 46.17 |
| falcon:40b | 40B | 24 | 4-bit | Ollama 0.5.7 | 113 | 3% | 4% | 88% | 37.27 |
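For readers who want to reproduce the Eval Rate column, the same figures Ollama prints with the --verbose flag can be pulled from its local HTTP API, which reports eval_count (generated tokens) and eval_duration (nanoseconds) in the final response. The sketch below is a minimal example, assuming Ollama is serving on its default port 11434 and the model has already been pulled; the model name and prompt are only examples.

```python
# Minimal sketch: measure Ollama eval rate (tokens/s) over the local HTTP API.
# Assumes Ollama is serving on its default port 11434 and the model is pulled.
import requests

def eval_rate(model: str, prompt: str) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / data["eval_duration"] * 1e9

if __name__ == "__main__":
    print(f"{eval_rate('deepseek-r1:14b', 'Explain KV caching in one paragraph.'):.2f} tokens/s")
```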
A video recording the real-time resource consumption of the A100 40GB GPU server during the benchmark:

Screenshots of Benchmarking LLMs on Ollama with the Nvidia A100 40GB GPU

  • ollama run deepseek-r1:8b
  • ollama run deepseek-r1:14b
  • ollama run deepseek-r1:32b
  • ollama run llama3.1:8b
  • ollama run llama2:13b
  • ollama run llama3:70b
  • ollama run qwen2.5:14b
  • ollama run qwen2.5:32b
  • ollama run qwen:32b
  • ollama run gemma2:27b
  • ollama run falcon:40b

Analysis: Performance and Cost Effectiveness

1. Ultimate Performance with 8B-14B Models:

When running 8B-14B models, the A100 40GB GPU showed extreme performance, with eval rates reaching 60-110 tokens per second. At these speeds, the server can comfortably serve multiple concurrent requests, making it well suited to business scenarios with many requests per minute (see the concurrency sketch below).
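To sanity-check the multi-request claim yourself, you can fire several generation requests at the Ollama server in parallel and compare per-request throughput. The sketch below is a rough test harness, not the method used in this benchmark; it assumes the server was started with the OLLAMA_NUM_PARALLEL environment variable set (e.g. OLLAMA_NUM_PARALLEL=4) so that requests are actually processed in parallel rather than queued, and the model name is just an example.

```python
# Rough concurrency sketch: send N simultaneous generation requests to Ollama
# and report per-request eval rates. Assumes the server was started with
# OLLAMA_NUM_PARALLEL set (otherwise requests are processed one at a time).
import concurrent.futures
import requests

MODEL = "deepseek-r1:14b"   # example model from the benchmark table
N_REQUESTS = 4

def one_request(i: int) -> float:
    data = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": MODEL, "prompt": f"Write a haiku about GPU #{i}.", "stream": False},
        timeout=600,
    ).json()
    return data["eval_count"] / data["eval_duration"] * 1e9

with concurrent.futures.ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    rates = list(pool.map(one_request, range(N_REQUESTS)))

print("per-request tokens/s:", [f"{r:.1f}" for r in rates])
```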

2. Ease with 32B Models:

The A100 40GB can handle 32B models with ease, offering great GPU utilization (80%+) and solid evaluation rates (up to 35.01 tokens/s). This makes it an excellent choice for developers working with 32B LLMs such as DeepSeek-R1 and Qwen.

3. 40GB Memory Limitation:

As the model size increases toward 70B, the A100 40GB still performs reasonably well, but the GPU memory limitation becomes apparent. In our tests, llama3:70b was the only 70B-class model that could run on the A100 40GB server, thanks to its 4-bit quantized size of roughly 39GB (a rough sizing rule of thumb is sketched below).
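A quick way to anticipate this limit is to estimate the size of the quantized weights from the parameter count. Roughly 4.5 bits per weight approximates Ollama's default 4-bit quantization; that figure is an assumption, and the KV cache plus runtime overhead come on top. A back-of-the-envelope sketch:

```python
# Rough lower bound on VRAM needed for the quantized weights alone.
# KV cache and runtime overhead come on top, so models near the limit
# may spill into CPU RAM and slow down.
VRAM_GB = 40
BITS_PER_WEIGHT = 4.5   # assumed average for Ollama's default 4-bit quantization

def weights_gb(params_billion: float) -> float:
    return params_billion * 1e9 * BITS_PER_WEIGHT / 8 / 1e9

for name, params in [("deepseek-r1:14b", 14), ("qwen2.5:32b", 32),
                     ("llama3:70b", 70), ("qwen2.5:72b", 72)]:
    print(f"{name}: ~{weights_gb(params):.1f} GB of weights vs {VRAM_GB} GB of VRAM")
```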

4. Cost vs Performance:

At $799/month, the A100 40GB offers a strong performance-to-cost ratio. It’s significantly cheaper than the H100, yet it still delivers excellent performance for 32B models. Given the price point, this makes it a perfect choice for users who need a high-performance server without the steep costs of cutting-edge GPUs like H100.
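To put the monthly price in perspective, a back-of-the-envelope cost per million generated tokens can be derived from the measured eval rates, assuming the server is kept fully busy around the clock (an obviously optimistic assumption):

```python
# Cost per million generated tokens at continuous, full utilization (optimistic).
MONTHLY_COST_USD = 799.00
SECONDS_PER_MONTH = 30 * 24 * 3600

for model, tokens_per_s in [("deepseek-r1:8b", 108.18),
                            ("deepseek-r1:32b", 35.01),
                            ("llama3:70b", 24.09)]:
    tokens_per_month = tokens_per_s * SECONDS_PER_MONTH
    usd_per_million = MONTHLY_COST_USD / tokens_per_month * 1e6
    print(f"{model}: ~${usd_per_million:.2f} per million generated tokens")
```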

Conclusion

The Nvidia A100 40GB GPU server offers a cost-effective and high-performance solution for running LLMs up to around 32B parameters, such as DeepSeek-R1, Qwen, and LLaMA. It handles mid-range models well, offering excellent performance and scalable hosting for AI inference tasks. This setup is ideal for businesses that need to manage multiple concurrent requests at an affordable price.

Although it can run llama3:70b at about 24 tokens/s, it cannot handle models whose quantized size exceeds 40GB (such as most other 70B and 72B models).

For developers and enterprises that require efficient and high-quality AI model hosting for mid-range models, the A100 40GB server is an outstanding choice that offers a balance of cost and performance.

Get Started with Nvidia A100 Hosting for LLMs

Ready to harness the power of Nvidia A100 40GB GPUs for your AI applications? Explore our dedicated hosting options today and get the best performance at an unbeatable price.

Enterprise GPU Dedicated Server - A100

$639.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Enterprise GPU Dedicated Server - A100(80GB)

$1559.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.

Enterprise GPU Dedicated Server - RTX A6000

$384.00/mo
30% OFF Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimal for running AI, deep learning, data visualization, HPC, etc.
Tags:

Nvidia A100, LLM hosting, AI server, Ollama, AI performance, A100 server, DeepSeek-R1, Qwen model, LLM inference, Nvidia A100 GPU, A100 hosting