Benchmarking LLMs on Ollama with Dual Nvidia A100 GPUs (80GB Total): Best Choice for 70B–110B Models

When it comes to running Large Language Models (LLMs), having the right server configuration is crucial to balance performance and cost. In this article, we explore the performance of running LLMs on Ollama using dual Nvidia A100 GPUs. Specifically, we will examine the ability of the A100*2 setup to handle models ranging from 70B to 110B parameters, including popular models like DeepSeek-R1, Qwen, and LLaMA.

This setup is priced at $1399/month and offers a solid performance-to-cost ratio for AI projects requiring large-scale computations. Let's dive into the benchmark results to understand how dual A100 GPUs handle these demanding tasks.

Server Configuration: Dual Nvidia A100 GPUs

Here are the key specifications of the dual Nvidia A100 GPU server used in our tests:

Server Configuration:

  • Price: $1399/month
  • CPU: Dual 18-Core E5-2697v4 (36 cores & 72 threads)
  • RAM: 256GB
  • Storage: 240GB SSD + 2TB NVMe + 8TB SATA
  • Network: 1Gbps
  • OS: Windows 10 Pro

GPU Details:

  • GPU: Dual Nvidia A100
  • Microarchitecture: Ampere
  • Compute Capability: 8.0
  • CUDA Cores: 6912 per card
  • Tensor Cores: 432 per card
  • Memory: 40GB HBM2 per card
  • FP32 Performance: 19.5 TFLOPS per card

The dual A100 GPUs provide a combined 80GB of GPU memory, ideal for running large language models efficiently. This configuration allows us to process models with parameter counts as high as 110B with reasonable speed and efficiency.
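As a rough sanity check on why 80GB is enough for the 110B-class models benchmarked below: a 4-bit (Q4) quantized model needs on the order of 0.55–0.6GB of VRAM per billion parameters, plus a few GB for the KV cache and runtime buffers. The short Python sketch below is a back-of-the-envelope estimate under those assumptions (the per-parameter footprint and overhead constants are approximations chosen to match the model sizes Ollama reports, not official figures):

```python
# Back-of-the-envelope VRAM estimate for 4-bit (Q4) quantized models.
# Assumptions: ~0.57 GB per billion parameters at Q4 (roughly matches the
# ~63GB image Ollama reports for qwen:110b) plus a fixed allowance for the
# KV cache, CUDA context, and framework buffers.

GB_PER_B_PARAMS_Q4 = 0.57   # approximate Q4 footprint per billion parameters
OVERHEAD_GB = 6.0           # KV cache + runtime overhead (assumption)
TOTAL_VRAM_GB = 2 * 40      # dual A100 40GB cards

for params_b in (14, 32, 70, 72, 110):
    weights_gb = params_b * GB_PER_B_PARAMS_Q4
    needed_gb = weights_gb + OVERHEAD_GB
    fits = "fits" if needed_gb <= TOTAL_VRAM_GB else "does NOT fit"
    print(f"{params_b:>4}B  ~{weights_gb:5.1f} GB weights "
          f"(+{OVERHEAD_GB:.0f} GB overhead) -> {fits} in {TOTAL_VRAM_GB} GB total")
```

Running this shows that even the ~63GB qwen:110b image leaves headroom on 2×40GB, while anything above roughly 60B parameters at Q4 would not fit on a single 40GB card.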

👉 Benchmarking LLMs with Dual A100 GPUs

We tested a range of models, including DeepSeek-R1 (14B–70B), Qwen (32B–110B), Qwen2 72B, and the Llama3/3.1/3.3 70B models, all on Ollama 0.5.7. Below is a breakdown of the performance results for the A100*2 setup:
| Model | Parameters | Size | Quantization | Running on | Download Speed (MB/s) | CPU Rate | RAM Rate | GPU Utilization (2 cards) | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|---|---|---|
| deepseek-r1 | 14b | 9GB | 4-bit | Ollama 0.5.7 | 117 | 0% | 4% | 0%, 80% | 66.79 |
| deepseek-r1 | 32b | 20GB | 4-bit | Ollama 0.5.7 | 117 | 2% | 4% | 0%, 88% | 36.36 |
| deepseek-r1 | 70b | 43GB | 4-bit | Ollama 0.5.7 | 117 | 3% | 4% | 44%, 44% | 19.34 |
| qwen | 32b | 18GB | 4-bit | Ollama 0.5.7 | 117 | 2% | 4% | 36%, 39% | 32.07 |
| qwen | 72b | 41GB | 4-bit | Ollama 0.5.7 | 117 | 1% | 4% | 42%, 45% | 20.13 |
| qwen | 110b | 63GB | 4-bit | Ollama 0.5.7 | 117 | 1% | 4% | 50%, 50% | 16.06 |
| qwen2 | 72b | 41GB | 4-bit | Ollama 0.5.7 | 117 | 1% | 4% | 38%, 37% | 19.88 |
| llama3 | 70b | 40GB | 4-bit | Ollama 0.5.7 | 117 | 2% | 3% | 92%, 0% | 24.41 |
| llama3.1 | 70b | 43GB | 4-bit | Ollama 0.5.7 | 117 | 1% | 4% | 44%, 43% | 19.01 |
| llama3.3 | 70b | 43GB | 4-bit | Ollama 0.5.7 | 117 | 1% | 3% | 44%, 43% | 18.91 |
Real-time resource consumption on the dual A100 server was recorded (as screenshots) for each run: ollama run deepseek-r1:14b, deepseek-r1:32b, deepseek-r1:70b, qwen:32b, qwen:72b, qwen:110b, qwen2:72b, llama3:70b, llama3.1:70b, and llama3.3:70b.
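The eval rates in the table are derived from counters Ollama reports itself. If you want to reproduce a comparable measurement, the sketch below queries the local /api/generate endpoint and computes tokens/s from the eval_count and eval_duration fields in the response (the prompt, model tags, and timeout are illustrative choices; the endpoint and field names follow the Ollama REST API):

```python
# Minimal sketch: measure decode throughput (tokens/s) for an Ollama model.
# Requires a local Ollama server (default http://localhost:11434) and the
# `requests` package. eval_count / eval_duration are reported by Ollama.

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def measure_eval_rate(model: str, prompt: str) -> float:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is in nanoseconds; convert to tokens per second.
    return data["eval_count"] / data["eval_duration"] * 1e9

if __name__ == "__main__":
    for model in ("deepseek-r1:14b", "qwen:32b", "llama3.3:70b"):
        rate = measure_eval_rate(model, "Explain what an LLM is in one paragraph.")
        print(f"{model:<18} {rate:6.2f} tokens/s")
```

Note that single-run numbers vary with prompt length and generation length, so averaging several prompts per model gives a more stable figure than the one-shot reading shown here.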

Analysis & Insights: Dual A100 GPUs Performance

1️⃣. A High-Performance Choice for 70B–110B Models

The dual Nvidia A100 GPUs demonstrate impressive performance when running LLMs on Ollama across a wide range of sizes, especially for models up to 70B parameters. The 80GB of memory provided by the two GPUs ensures that these models can be loaded and processed efficiently, with roughly 40%–50% utilization on each card.
  • DeepSeek-R1 runs efficiently with 66.79 tokens/s for 14B parameters and 36.36 tokens/s for 32B parameters.
  • Qwen:72B, Qwen2:72B, and the Llama3-series 70B models deliver solid performance at roughly 19–24 tokens/s.
  • When processing the 110B models like Qwen, dual A100 GPUs experience a slight reduction in evaluation speed (16.06 tokens/s). Despite the drop in performance, the dual A100 setup still offers a cost-effective solution compared to the H100, especially for users running AI workloads on a budget.

2️⃣. GPU Utilization in a Dual-GPU Setup

When a model is split across the two A100 GPUs, per-card utilization is noticeably lower than when a model runs on a single GPU. For the 70B-class models (DeepSeek-R1:70B, Qwen:72B, Llama3.1/3.3:70B) we observed utilization fluctuating around 40%–45% on each card, indicating that the load is divided between the two GPUs, which reduces per-card efficiency. Smaller models such as DeepSeek-R1:32B load entirely onto one card (0% / 88% utilization) and reach higher throughput.
In contrast, a single-GPU configuration can provide better utilization and faster inference whenever the model fits in one card's memory, because splitting a model across two GPUs adds memory overhead and PCIe traffic, which introduces latency.
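The per-card utilization figures quoted above were read from live monitoring while each model generated text. A comparable reading can be taken programmatically with NVIDIA's NVML bindings; the sketch below assumes the pynvml package is installed (nvidia-smi on the command line reports the same counters):

```python
# Sketch: sample per-GPU utilization and memory use on a dual-A100 box
# while an Ollama model is generating. Requires `pip install pynvml`.

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    for _ in range(10):                 # take 10 samples, one per second
        readings = []
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            readings.append(f"GPU{i}: {util:3d}% util, "
                            f"{mem.used / 1024**3:5.1f} GB used")
        print(" | ".join(readings))
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```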

3️⃣. Memory Distribution in Multi-GPU Configuration

Memory distribution in a multi-GPU setup brings both benefits and challenges. The 80GB of combined GPU memory (two 40GB A100 cards) makes 70B and even 110B models feasible, but any model larger than a single card must be split across GPUs, which introduces inefficiencies. This is why smaller models such as DeepSeek-R1:32B perform relatively well on this setup, while larger models like Qwen:110B show a performance drop due to the overhead of the memory split.
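One quick way to see how a loaded model has been placed is Ollama's /api/ps endpoint, which reports each running model's total size and how much of it resides in VRAM. The sketch below reads it from a local server (the size and size_vram field names follow recent Ollama releases; verify against your version):

```python
# Sketch: list models currently loaded by Ollama and how much of each
# resides in VRAM, via the GET /api/ps endpoint of a local Ollama server.

import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size_gb = m["size"] / 1024**3
    vram_gb = m["size_vram"] / 1024**3
    pct = 100 * vram_gb / size_gb if size_gb else 0
    print(f"{m['name']:<20} total {size_gb:5.1f} GB, "
          f"in VRAM {vram_gb:5.1f} GB ({pct:.0f}%)")
```

A model that reports less than 100% in VRAM is partially offloaded to system RAM, which typically shows up as a sharp drop in eval rate.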

4️⃣. A100 vs. H100

At $1399/month, the dual A100 GPU setup provides a strong performance-to-cost ratio. The A100's ability to handle models up to 110B is impressive, but if you're aiming for top-tier performance in extremely large models (e.g., 110B+), the H100 may be more suitable, although at nearly double the cost.

📢 Get Started with Dual A100 GPU Hosting for LLMs

If you're looking to optimize AI inference for 70B–110B models, explore our dedicated GPU server hosting options today!
Flash Sale to Mar.16

Enterprise GPU Dedicated Server - RTX A6000

$384.00/mo
30% OFF Recurring (Was $549.00)
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimally running AI, deep learning, data visualization, HPC, etc.
Flash Sale to Mar.16

Multi-GPU Dedicated Server - 2xA100

$951.00/mo
32% OFF Recurring (Was $1399.00)
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
New Arrival

Enterprise GPU Dedicated Server - A100(80GB)

$1559.00/mo
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS
Flash Sale to Mar.16

Enterprise GPU Dedicated Server - H100

$1819.00/mo
30% OFF Recurring (Was $2599.00)
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia H100
  • Microarchitecture: Hopper
  • CUDA Cores: 14,592
  • Tensor Cores: 456
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 183 TFLOPS

Conclusion

The Dual Nvidia A100 GPU server is a powerful and cost-effective solution for running LLMs with parameter sizes up to 110B. It offers excellent performance for mid-range to large models like Qwen:32B, DeepSeek-R1:70B, and Qwen:72B, with a significant price advantage over higher-end GPUs like the H100.

For users who need to process large-scale models without the steep costs of more premium GPUs, A100*2 hosting offers a compelling option that balances performance and affordability.

Tags:

Nvidia A100, LLM hosting, AI server, Ollama, performance analysis, GPU server hosting, DeepSeek-R1, Qwen model, AI performance, A100 server, Nvidia GPUs