Running GGUF Models on Multi-GPU Dedicated Server – 3×RTX A5000

I have been a Database Mart customer for quite some time, having purchased around 20 VPSs and a few dedicated GPU servers, including one dual-GPU and one triple-GPU system. My main objective was to run GGUF models on powerful GPU hardware.
Before this, I experimented with running models on a CPU, but the results were inefficient. The move to a Multi-GPU Dedicated Server – 3×RTX A5000 significantly improved my workflow.
One of the key reasons I chose Database Mart is their flexibility and customer-friendly approach. On multiple occasions, I changed my mind shortly after purchasing a server, and their support team promptly refunded me in the form of account credits, allowing me to choose a more suitable configuration. This level of service is rare and highly appreciated.

*Submitted by user "nm......j@gmail.com"*

Application Scenario

My main use case involved running various large language models (LLMs) concurrently. On my 3×RTX A5000 (72GB total VRAM) setup, I ran:

A 27B model with a 32k context
An 8B model with a 24k context
Two 4B models — one with an 8k context and the other with a 96k context

All were running at the same time in different quantizations, using Koboldcpp in multiple versions. In total, my server stored about 2TB of LLM models on disk. The typical workload involved running REST APIs and webhooks for real-time interactions.
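To illustrate how several Koboldcpp instances can run side by side, here is a minimal Python sketch that starts one process per model, each on its own port and pinned to a GPU through `CUDA_VISIBLE_DEVICES`. It is not the exact setup used on this server: the model paths, ports, GPU assignments, and the `python koboldcpp.py` launcher are assumptions for illustration, and the flags shown (`--model`, `--contextsize`, `--port`, `--usecublas`, `--gpulayers`) may differ between Koboldcpp versions.

```python
import os
import subprocess
from dataclasses import dataclass


@dataclass
class Instance:
    model_path: str  # hypothetical path to a .gguf file
    context: int     # context window in tokens
    gpu: str         # value for CUDA_VISIBLE_DEVICES, e.g. "0"
    port: int        # HTTP port this instance listens on


def build_command(inst, launcher=("python", "koboldcpp.py")):
    """Return the argv list for one Koboldcpp instance (launcher is assumed)."""
    return [
        *launcher,
        "--model", inst.model_path,
        "--contextsize", str(inst.context),
        "--port", str(inst.port),
        "--usecublas",           # CUDA backend; flag name may vary by version
        "--gpulayers", "999",    # ask for as many layers on the GPU as possible
    ]


def launch(inst):
    """Start the instance with only its assigned GPU visible to the process."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=inst.gpu)
    return subprocess.Popen(build_command(inst), env=env)


# Hypothetical plan mirroring the four models listed above.
PLAN = [
    Instance("models/27b.gguf", 32768, "0", 5001),
    Instance("models/8b.gguf", 24576, "1", 5002),
    Instance("models/4b-a.gguf", 8192, "2", 5003),
    Instance("models/4b-b.gguf", 98304, "2", 5004),
]

if __name__ == "__main__":
    # Print the commands instead of launching, so the plan can be reviewed first.
    for inst in PLAN:
        print(f"GPU {inst.gpu}:", " ".join(build_command(inst)))
```

Calling `launch(inst)` for each entry would start the processes. Whether a given model actually fits on its assigned card depends on the quantization and context size chosen.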

Server Specifications

✅ Product Name: Multi-GPU Dedicated Server – 3×RTX A5000
✅ GPU: 3 x Quadro RTX A5000
✅ RAM: 256GB
✅ CPU: Dual 18-Core E5-2697v4
✅ Storage: 240GB SSD + 2TB NVMe + 8TB SATA
✅ Bandwidth: 1Gbps

Deployment Process

Setting up the environment was a learning journey. Initially, I had to downgrade NVIDIA drivers to match CUDA compatibility for the A-series cards on Debian 11. I configured the environment step-by-step:
1. Installed the CUDA 11.8 toolkit and a matching NVIDIA driver
2. Added necessary bash scripts and Node.js dependencies
3. Made numerous custom modifications for performance and usability
4. Adjusted application behavior whenever required
5. Implemented personal scripts to improve navigation and operational efficiency in the terminal
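
After a driver and toolkit install like the one in step 1, it helps to confirm that every card is visible and reports the expected driver. The sketch below is one way to do that from Python by parsing `nvidia-smi --query-gpu` output. The function names are hypothetical, and the sample text in the usage note is illustrative rather than captured from this server.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; with "nounits", memory is reported in MiB.
QUERY = "index,name,driver_version,memory.total"


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, driver, mem = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "driver": driver,
            "memory_mib": int(mem),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed GPU list."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        for gpu in query_gpus():
            print(gpu)
    else:
        print("nvidia-smi not found; is the NVIDIA driver installed?")
```

For example, `parse_gpu_csv("0, NVIDIA RTX A5000, 520.61.05, 24564")` returns a single entry with `memory_mib` equal to `24564`. On a three-card server, the list should contain three entries with the same driver string.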

Performance Review

During peak usage, I successfully ran 7–8 models (ranging from 2B to 8B) in parallel without crashes, although GPU temperatures sometimes reached 96°C under sustained load, prompting me to manage workloads carefully.
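
Since temperatures became the limiting factor, a small watcher can flag hot cards before workloads need to be throttled by hand. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH. The 85°C limit and 30-second interval are arbitrary placeholders, not recommended values, and the function names are hypothetical.

```python
import subprocess
import time


def parse_temps(text):
    """Parse 'index, temperature' CSV lines into {gpu_index: temp_celsius}."""
    temps = {}
    for line in text.strip().splitlines():
        index, temp = line.split(",")
        temps[int(index)] = int(temp)
    return temps


def hot_gpus(temps, limit_c):
    """Return the sorted indices of GPUs at or above the limit."""
    return sorted(i for i, t in temps.items() if t >= limit_c)


def read_temps():
    """Query current GPU temperatures via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_temps(out)


def watch(limit_c=85, interval_s=30):
    """Poll forever and print a warning whenever a GPU crosses the limit."""
    while True:
        temps = read_temps()
        for gpu in hot_gpus(temps, limit_c):
            print(f"GPU {gpu} at {temps[gpu]}°C (limit {limit_c}°C)")
        time.sleep(interval_s)
```

Running `watch()` in a terminal or a background service prints a line for each hot card on every poll. The warning could just as well trigger a webhook or pause a model instance.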

Reliability Evaluation

The server ran continuously for 3 months without interruption, apart from one data center incident (not DBM's fault) that caused a restart. Overall uptime was excellent, and the hardware proved stable under long-term load.

Optimization Tips

✅ Be patient when setting up a GPU environment, and solve problems one step at a time
✅ Use LLMs themselves to troubleshoot and discover optimization ideas
✅ Consider running Koboldcpp in Docker containers so that several GGUF models can share a single GPU and keep it fully utilized

Notes & Troubleshooting

One challenge was that the three GPUs were attached to different CPU sockets and PCIe slots, so I couldn't always split a single model across all three cards. I learned to work around this by managing VRAM usage carefully on each card.
For multi-model workloads, running Koboldcpp in Docker let more than one model share a GPU, which kept each card fully utilized.
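
When a model cannot be split across cards, VRAM management becomes a packing problem: each model has to fit entirely on one GPU. The sketch below shows one simple way to plan that, a first-fit-decreasing assignment. It is an illustration rather than the method actually used here. The per-model VRAM figures in the usage note are made-up estimates, and the 1 GiB reserve per card is an assumption.

```python
def assign_models(models_gib, gpu_capacity_gib, n_gpus, reserve_gib=1.0):
    """Place each model on the first GPU with enough free VRAM.

    models_gib: {model_name: estimated VRAM need in GiB} (estimates are the
    caller's responsibility). Largest models are placed first.
    Returns (placement, free), where placement maps model name to GPU index
    and free lists the remaining GiB per GPU.
    """
    free = [gpu_capacity_gib - reserve_gib] * n_gpus
    placement = {}
    largest_first = sorted(models_gib.items(), key=lambda kv: kv[1], reverse=True)
    for name, need in largest_first:
        for gpu in range(n_gpus):
            if need <= free[gpu]:
                free[gpu] -= need
                placement[name] = gpu
                break
        else:
            # No single card can hold this model alongside what is already placed.
            raise ValueError(f"{name} ({need} GiB) does not fit on any GPU")
    return placement, free


if __name__ == "__main__":
    # Hypothetical estimates (weights plus KV cache) for illustration only.
    estimates = {"27b-32k": 21.0, "8b-24k": 9.0, "4b-8k": 4.0, "4b-96k": 12.0}
    placement, free = assign_models(estimates, gpu_capacity_gib=24.0, n_gpus=3)
    for name, gpu in placement.items():
        print(f"{name} -> GPU {gpu}")
    print("free GiB per GPU:", free)
```

Placing the largest models first reduces the chance that a big model is left without a card that can hold it. Real requirements depend on quantization and context length, so the estimates should be measured rather than guessed.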

Conclusion & Recommendations

The 3×RTX A5000 server is a powerful and versatile solution for running multiple LLM instances concurrently. Proper setup, efficient memory management, and temperature monitoring are key to long-term stability.
For anyone working with GPU-heavy AI workloads, I recommend this server for its high performance and stability.

Why Choose DBM?

Database Mart offers:
• Flexible server replacement policies
• Prompt and human-centered support
• Fair pricing without unnecessary extras
• Direct access to the server without bloated control panels
They have consistently been personable, reliable, and technically supportive — exactly what’s needed for demanding AI workloads.

Deploy Your Own Version of This Use Case Now?

Multi-GPU Dedicated Server - 3xRTX A5000

256GB RAM
GPU: 3 x Quadro RTX A5000
Dual 18-Core E5-2697v4
240GB SSD + 2TB NVMe + 8TB SATA
1Gbps
OS: Windows / Linux

Single GPU Specifications:
Microarchitecture: Ampere
CUDA Cores: 8192
Tensor Cores: 256
GPU Memory: 24GB GDDR6
FP32 Performance: 27.8 TFLOPS

$ 539.00/mo
