Benchmarking Nvidia Quadro RTX A6000: Running LLMs on a GPU Server with Ollama

The Nvidia Quadro RTX A6000 is a powerhouse GPU known for its exceptional performance in AI and machine learning tasks. In this article, we delve into its performance when running Large Language Models (LLMs) on a GPU dedicated server. The benchmarks utilize the Ollama environment, testing models such as Llama2, Qwen, and others.

Server Specifications

Our test environment is a high-performance GPU dedicated server equipped with:

Server Configuration:

  • Price: $549.00/month
  • CPU: Dual 18-Core E5-2697v4 (36 cores, 72 threads)
  • RAM: 256GB
  • Storage: 240GB SSD + 2TB NVMe + 8TB SATA
  • Network: 100Mbps-1Gbps connection
  • OS: Windows 10 Pro

GPU Details:

  • GPU: Nvidia Quadro RTX A6000
  • Compute Capability: 8.6
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS

These specifications provide an excellent foundation for LLM testing using Ollama.

LLMs Tested on Ollama with the A6000

Ollama, a popular platform for running LLMs, offers robust support for high-performance GPUs like the Nvidia Quadro RTX A6000. This combination allows for testing a variety of LLMs with different parameter sizes and quantization levels. We tested the following models:
  • Llama2 (13B, 70B)
  • Llama3 (70B)
  • Llama3.3 (70B)
  • Qwen (32B, 72B)
  • Qwen2.5 (14B, 32B)
  • Gemma2 (27B)
  • Llava (34B)
  • QWQ (32B)
  • Phi4 (14B)
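All of these models can be pulled and exercised directly from the Ollama CLI. A minimal sketch (model tags assumed to match the Ollama library; the `--verbose` flag prints timing statistics after each response, including the eval rate reported in the results below):

```shell
# Pull a model image, then run a one-off prompt with timing statistics.
# --verbose makes Ollama report load time, prompt eval rate, and eval rate
# (tokens/s) when the response finishes.
ollama pull llama2:13b
ollama run llama2:13b --verbose "Summarize the benefits of 4-bit quantization."
```

The same two commands, with the tag swapped, reproduce each row of the benchmark.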

Benchmark Results: Ollama GPU A6000 Performance Metrics

The table below showcases the performance metrics, including CPU, RAM, GPU utilization, and evaluation rates for each LLM tested. The server demonstrated outstanding stability and efficiency across all tests.
All models ran with 4-bit quantization on Ollama 0.5.4, and the download speed was 11 MB/s in every case.

| Model    | Parameters | Size   | CPU Rate | RAM Rate | GPU vRAM | GPU UTL | Eval Rate (tokens/s) |
|----------|------------|--------|----------|----------|----------|---------|----------------------|
| llama2   | 13B        | 7.4GB  | 3%       | 3%       | 30%      | 87%     | 63.63                |
| llama2   | 70B        | 39GB   | 3%       | 3%       | 85%      | 96%     | 15.28                |
| llama3   | 70B        | 40GB   | 3%       | 3%       | 88%      | 94%     | 14.67                |
| llama3.3 | 70B        | 43GB   | 3%       | 3%       | 91%      | 94%     | 13.56                |
| qwen     | 32B        | 18GB   | 5%       | 4%       | 42%      | 89%     | 27.96                |
| qwen     | 72B        | 41GB   | 3%       | 3%       | 91%      | 94%     | 14.51                |
| qwen2.5  | 14B        | 9GB    | 3%       | 3%       | 22%      | 83%     | 50.32                |
| qwen2.5  | 32B        | 20GB   | 3%       | 4%       | 67%      | 89%     | 26.08                |
| gemma2   | 27B        | 16GB   | 3%       | 3%       | 40%      | 84%     | 31.59                |
| llava    | 34B        | 19GB   | 3%       | 4%       | 85%      | 68%     | 28.67                |
| qwq      | 32B        | 20GB   | 3%       | 4%       | 91%      | 89%     | 25.57                |
| phi4     | 14B        | 9.1GB  | 3%       | 4%       | 70%      | 83%     | 52.62                |
Real-time server resource consumption was recorded during each run. Every model was launched with `ollama run <model>:<tag>`, from `ollama run llama2:13b` through `ollama run phi4:14b`.
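The vRAM and utilization numbers above can be captured with `nvidia-smi` while a model is generating. A minimal sketch, assuming the standard Nvidia driver utilities are installed (the 1-second interval and CSV format are arbitrary choices):

```shell
# Poll GPU utilization and memory once per second during a benchmark run.
# Redirect to a file (e.g. >> gpu_log.csv) to keep a record; stop with Ctrl+C.
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
  --format=csv -l 1
```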

Key Insights

1. LLM Compatibility

The A6000 seamlessly handles a wide range of LLMs, from large models like Llama2 70B and Qwen 72B down to compact models like Qwen2.5 14B.

2. Evaluation Rate

Smaller models (e.g., Llama2 13B, Phi4 14B) achieved higher token evaluation rates due to their reduced computational requirements. Larger models maintained stability with slightly lower rates.

3. GPU Utilization

The A6000 consistently operated at high GPU utilization, demonstrating its capacity for demanding workloads.

4. vRAM Usage

The largest models, such as Llama3.3 70B and Qwen 72B, used up to 91% of the 48GB of GDDR6 memory, showcasing the A6000's capacity for memory-intensive tasks.
| Metric               | Value Across Models |
|----------------------|---------------------|
| Downloading speed    | 11 MB/s for all models (118 MB/s with the 1Gbps bandwidth add-on) |
| CPU utilization      | Steady at around 3% (5% peak) |
| RAM utilization      | Steady at 3-4% |
| GPU vRAM utilization | 22%-91%; the larger the model, the higher the usage |
| GPU utilization      | Consistently high, mostly 80%+ |
| Evaluation speed     | 13.56-63.63 tokens/s; the larger the model, the slower the inference |
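The vRAM pattern follows a simple rule of thumb: a 4-bit quantized model occupies roughly 4.5 bits per parameter. That 4.5-bit figure is an assumption fitted to the download sizes in the table above, not an Ollama constant; a minimal sketch:

```shell
# Rule-of-thumb Q4 footprint: params (billions) * ~4.5 bits/param / 8 bits/byte.
# The 4.5 bits/param value is fitted to the observed sizes above
# (13B -> ~7.3 GB vs 7.4 GB observed; 70B -> ~39.4 GB vs 39-43 GB observed).
for p in 13 32 70; do
  awk -v p="$p" 'BEGIN { printf "%2dB params -> ~%.1f GB at Q4\n", p, p * 4.5 / 8 }'
done
```

This also makes clear why 70B-class models at 39-43 GB fit the A6000's 48GB card but leave little headroom for the context cache.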

Performance Comparison with Other Graphics Cards

Among GPUs in its price range, the A6000 is a professional-grade card, particularly well suited to high-performance computing, AI inference, and graphics work. In pure LLM tasks, however, its price-performance ratio may trail consumer-grade cards such as the RTX 4090 or higher-end data-center GPUs such as the A100 and H100. This makes the A6000 best targeted at professional creators and users who also need workstation graphics performance and large vRAM.
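That price-performance point can be made concrete with the FP32 figures and monthly prices from the hosting plans below; a rough sketch only, since FP32 TFLOPS per dollar is a crude proxy that ignores vRAM capacity and memory bandwidth:

```shell
# FP32 TFLOPS per monthly dollar for the three single-GPU plans listed below.
# Crude proxy: it ignores vRAM capacity (the 4090's 24GB cannot hold a 70B Q4
# model) and memory bandwidth, both of which matter for LLM inference.
awk 'BEGIN {
  printf "RTX 4090: %.3f TFLOPS per dollar\n", 82.6  / 302
  printf "A6000:    %.3f TFLOPS per dollar\n", 38.71 / 409
  printf "A100:     %.3f TFLOPS per dollar\n", 19.5  / 469
}'
```

By this measure the 4090 leads (~0.274) over the A6000 (~0.095) and A100 (~0.042), while the A6000's 48GB is what lets it serve 70B-class models on a single card.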
AI Servers, Smarter Deals!

Enterprise GPU Dedicated Server - RTX 4090

$302.00/mo (44% off recurring; was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.

Enterprise GPU Dedicated Server - RTX A6000

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimal for running AI, deep learning, data visualization, HPC, and more.

Enterprise GPU Dedicated Server - A100

$469.00/mo (41% off recurring; was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation for large-scale inference, AI training, ML, etc.

Multi-GPU Dedicated Server - 3xRTX A6000

$899.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 3 x Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752 (per GPU)
  • Tensor Cores: 336 (per GPU)
  • GPU Memory: 48GB GDDR6 (per GPU)
  • FP32 Performance: 38.71 TFLOPS (per GPU)

Summary and Recommendations

The Nvidia Quadro RTX A6000, hosted on a dedicated GPU server, excels in running LLMs via Ollama. Its robust performance metrics, efficient utilization of computational resources, and compatibility with diverse models make it a top-tier option for AI developers.

If you’re looking for high-performance A6000 hosting or testing environments for LLM benchmarks, this setup offers exceptional value for both research and production use cases.

Tags:

Nvidia Quadro RTX A6000, A6000 benchmark, LLM benchmark, Ollama benchmark, A6000 GPU performance, running LLMs on A6000, Nvidia A6000 hosting, Ollama GPU test, AI GPU benchmarking, GPU for large language models, A6000 vs RTX 4090, AI GPU hosting