Exploring Ollama's Performance on A5000 GPU Servers: A Comprehensive Benchmark

As the AI landscape evolves, demand for high-performance hardware to deploy and host large language models (LLMs) such as DeepSeek-R1 and Llama2 continues to grow. This article explores Ollama's performance on an NVIDIA Quadro RTX A5000-powered server, analyzing benchmark results and evaluating the setup's suitability for hosting LLMs.

Server Specifications

The benchmarks were conducted on the following hardware and software setup:

Server Configuration:

  • CPU: Dual 12-Core E5-2697v2 (24 cores, 48 threads)
  • Memory: 128GB RAM
  • Storage: 240GB SSD + 2TB SSD
  • Operating System: Windows 11 Pro
  • Network: 100Mbps-1Gbps bandwidth
  • Software: Ollama version 0.5.7

GPU Details:

  • GPU: Nvidia Quadro RTX A5000
  • Compute Capability: 8.6
  • Microarchitecture: Ampere
  • CUDA Cores: 8192
  • Tensor Cores: 256
  • GPU Memory: 24GB GDDR6
  • FP32 Performance: 27.8 TFLOPS

This robust configuration positions the A5000 server as a top-tier option for AI workloads, balancing performance, memory capacity, and compatibility with modern LLM frameworks.

Benchmark Results for LLMs on A5000

The following table summarizes the benchmark performance of several models running on Ollama:
All ten models ran on Ollama 0.5.7 with 4-bit quantization, and download speed held steady at about 12 MB/s in every case:

| Model         | Parameters | Size  | CPU | RAM | VRAM Used | GPU Util | Eval Rate (tokens/s) |
|---------------|------------|-------|-----|-----|-----------|----------|----------------------|
| deepseek-r1   | 14b        | 7.4GB | 3%  | 6%  | 43%       | 95%      | 45.63                |
| deepseek-r1   | 32b        | 4.9GB | 3%  | 6%  | 90%       | 97%      | 24.21                |
| llama2        | 13b        | 8.2GB | 3%  | 6%  | 60%       | 97%      | 60.49                |
| qwen          | 32b        | 18GB  | 3%  | 6%  | 72%       | 96%      | 26.06                |
| qwen2.5       | 14b        | 9GB   | 3%  | 6%  | 36%       | 94%      | 45.52                |
| qwen2.5       | 32b        | 20GB  | 3%  | 6%  | 90%       | 92%      | 23.93                |
| gemma2        | 27b        | 9.1GB | 3%  | 6%  | 80%       | 93%      | 28.79                |
| mistral-small | 22b        | 13GB  | 3%  | 5%  | 50%       | 97%      | 37.07                |
| qwq           | 32b        | 20GB  | 3%  | 6%  | 80%       | 97%      | 24.14                |
| llava         | 34b        | 19GB  | 3%  | 6%  | 78%       | 96%      | 27.16                |
Each model was benchmarked interactively with `ollama run <model>:<parameters>` (for example, `ollama run deepseek-r1:14b` or `ollama run llava:34b`); a reproduction sketch follows below.
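
For anyone who wants to reproduce the eval-rate column, the minimal sketch below queries a local Ollama instance through its REST API and derives tokens per second from the `eval_count` and `eval_duration` fields Ollama returns. The endpoint, prompt, and model subset are illustrative assumptions; `ollama run <model> --verbose` prints the same statistics interactively.

```python
# Minimal eval-rate sketch against a local Ollama instance (assumed defaults).
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint
MODELS = ["deepseek-r1:14b", "llama2:13b", "qwen2.5:14b"]  # illustrative subset
PROMPT = "Explain the difference between GDDR6 and HBM2 memory."

for model in MODELS:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,  # large models can take a while to load and generate
    )
    resp.raise_for_status()
    stats = resp.json()
    # eval_count = generated tokens; eval_duration is reported in nanoseconds
    rate = stats["eval_count"] / stats["eval_duration"] * 1e9
    print(f"{model}: {rate:.2f} tokens/s")
```

Single-run numbers vary with prompt length and context size, so averaging several runs is advisable.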

Performance Analysis

1. Token Evaluation Rate

The evaluation rate (tokens/s) is a critical metric for LLM hosting, reflecting how quickly the model processes inputs and generates outputs:
  • Llama2 (13b) leads the pack at an impressive 60.49 tokens/s, thanks to its smaller parameter count and efficient architecture.
  • DeepSeek-R1 (14b) performs admirably at 45.63 tokens/s, with 4-bit quantization helping it stay fast.
  • Larger models like Gemma2 (27b) and LLaVA (34b) show moderate rates, making them better suited to tasks that prioritize output quality over raw speed.

2. GPU Utilization and Memory Usage

  • The A5000's 24GB GDDR6 memory provides ample room for most models, with DeepSeek-R1 (32b) utilizing 90% VRAM at peak.
  • Utilization rates remain consistently high (92%-97%), demonstrating efficient GPU resource usage across all tested models (a logging sketch follows below).
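
One way to log these figures during a run is to poll `nvidia-smi`'s query mode from a side script; the sampling count and interval below are arbitrary choices, and a single GPU is assumed.

```python
# Log GPU utilization and VRAM usage once per second (single-GPU assumption).
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

for _ in range(10):  # ten one-second samples
    out = subprocess.check_output(QUERY, text=True).strip()
    util, used, total = (float(x) for x in out.split(","))
    print(f"GPU util {util:.0f}%  VRAM {used:.0f}/{total:.0f} MiB ({used / total:.0%})")
    time.sleep(1)
```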

3. CPU and RAM Efficiency

  • The CPU rate remains low at 3% for all models, highlighting the GPU-centric nature of LLM hosting on this server.
  • RAM usage hovers around 6%, leaving the bulk of system memory available for additional workloads; a companion host-side sketch follows below.
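
This sketch assumes the third-party `psutil` package (`pip install psutil`) and mirrors the CPU and RAM percentages reported in the table:

```python
# Sample system-wide CPU and RAM usage once per second.
import psutil  # third-party: pip install psutil

for _ in range(10):
    cpu = psutil.cpu_percent(interval=1.0)  # blocks for 1s, averaged across cores
    ram = psutil.virtual_memory().percent   # share of total system RAM in use
    print(f"CPU {cpu:.0f}%  RAM {ram:.0f}%")
```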

Advantages of Hosting LLMs on A5000

1. Versatile Performance: The A5000 handles both lightweight models like Llama2 and heavyweights like Gemma2 without performance bottlenecks.
2. Efficient Resource Allocation: High GPU utilization and low CPU/RAM overhead make this server ideal for multitasking.
3. Future-Proof Design: With 24GB VRAM, the A5000 supports larger models and advanced quantization techniques, as the rough sizing arithmetic below illustrates.
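
As a rough sanity check on that sizing claim, 4-bit weights occupy about half a gigabyte per billion parameters, before KV cache and runtime overhead are added (the overhead caveat is an assumption, not a measured figure). A quick sketch of the arithmetic:

```python
# Back-of-the-envelope VRAM estimate for 4-bit quantized weights.
def quantized_weights_gb(params_billion: float, bits: int = 4) -> float:
    """GB occupied by the weights alone: bits / 8 bytes per parameter."""
    return params_billion * bits / 8

for params in (13, 14, 22, 27, 32, 34):
    gb = quantized_weights_gb(params)
    print(f"{params}b @ 4-bit ≈ {gb:.1f} GB of weights (+ KV cache and overhead)")
```

This is why the 32b-class models land near 90% of the A5000's 24GB while the 13b-14b models leave roughly half the card free.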

Optimal Use Cases

Based on the benchmark results, here are some recommended use cases for LLM hosting on the A5000:
  • Real-Time Applications: Models like Llama2 and Qwen2.5 deliver high token evaluation rates, making them suitable for chatbots and conversational AI (see the streaming example after this list).
  • Research and Development: The flexibility to host diverse models like DeepSeek-R1 and Mistral-Small enables experimentation with state-of-the-art architectures.
  • Content Generation: Larger models such as Gemma2 and LLaVA are ideal for tasks requiring rich, context-aware outputs.
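
To illustrate the real-time case, here is a hedged sketch of a streaming chat loop against Ollama's `/api/chat` endpoint; the model choice and endpoint are assumptions to adapt to your own deployment.

```python
# Stream a chat reply token-by-token from a local Ollama instance.
import json
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default endpoint

def stream_reply(messages, model="llama2:13b"):
    """Print a streamed completion as it arrives and return the full text."""
    reply = ""
    with requests.post(
        OLLAMA_CHAT_URL,
        json={"model": model, "messages": messages, "stream": True},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            piece = chunk.get("message", {}).get("content", "")
            print(piece, end="", flush=True)
            reply += piece
    print()
    return reply

history = [{"role": "user", "content": "Give me one fun fact about GPUs."}]
history.append({"role": "assistant", "content": stream_reply(history)})
```

At Llama2 (13b)'s roughly 60 tokens/s on the A5000, output arrives well above typical reading speed, which is what makes this interactive pattern feel instant.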

Get Started with A5000 Dedicated Server

The NVIDIA Quadro RTX A5000 significantly outperforms older GPUs. When hosting an LLM like DeepSeek-R1 (14b), the A5000 evaluates at 45.63 tokens/s, more than double the Tesla P100's 18.99 tokens/s. Its 24GB of GDDR6 memory and 27.8 TFLOPS of FP32 performance make it an obvious choice over older GPUs or entry-level graphics cards for high-load AI tasks.
Flash Sale through Mar. 16

Professional GPU Dedicated Server - P100

$129.00/mo
35% Off Recurring (Was $199.00)
  • 128GB RAM
  • Dual 10-Core E5-2660v2
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Tesla P100
  • Microarchitecture: Pascal
  • CUDA Cores: 3584
  • GPU Memory: 16 GB HBM2
  • FP32 Performance: 9.5 TFLOPS
  • Suitable for AI, Data Modeling, High Performance Computing, etc.

Advanced GPU Dedicated Server - A5000

$349.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A5000
  • Microarchitecture: Ampere
  • CUDA Cores: 8192
  • Tensor Cores: 256
  • GPU Memory: 24GB GDDR6
  • FP32 Performance: 27.8 TFLOPS
  • $174.50 for the first month, then a 20% recurring discount on renewals.

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.

Enterprise GPU Dedicated Server - A100

$639.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Conclusion

The NVIDIA Quadro RTX A5000, paired with Ollama, is a powerhouse for LLM hosting. Its exceptional GPU performance, efficient resource usage, and flexibility make it a top choice for developers, researchers, and enterprises deploying AI solutions.

Whether you're running DeepSeek-R1, Llama2, or other cutting-edge models, the A5000 delivers the performance you need to unlock their full potential. For AI enthusiasts and professionals alike, this GPU server represents a smart investment in the future of machine learning.

Tags:

NVIDIA A5000, Ollama benchmark, LLM hosting, DeepSeek-R1, Llama2, AI GPU server, GPU performance test, AI hardware, Language model hosting, AI research tools