Performance Analysis: Running LLMs on Ollama with an RTX 3060 Ti GPU Server

In this article, we evaluate the performance of large language models (LLMs) running on Ollama 0.5.4 on a dedicated GPU server equipped with an NVIDIA GeForce RTX 3060 Ti. This setup appeals to AI enthusiasts and developers looking for an affordable yet capable platform for hosting AI workloads. With 128GB of RAM and dual 12-core Xeon processors (24 cores in total), the server offers ample computational headroom for machine learning benchmarks and RTX 3060 hosting solutions.

Using popular LLMs like Llama 2, Mistral, and Falcon 2, we ran a series of Ollama benchmarks to assess GPU utilization, memory consumption, and inference speed. If you want to understand how the RTX 3060 Ti compares to other GPUs for LLM workloads, this review provides actionable insights.

Hardware Introduction: RTX 3060 Ti Overview

The tested server (hosted in the USA) is configured as follows:

Server Configuration:

  • CPU: Dual 12-Core E5-2697v2 (24 Cores & 48 Threads)
  • Memory: 128GB RAM
  • Storage: 240GB SSD + 2TB SSD
  • Network: 100Mbps-1Gbps
  • Operating system: Windows 11 Pro

GPU Details:

  • GPU: GeForce RTX 3060 Ti
  • Microarchitecture: Ampere
  • Compute Capability: 8.6
  • CUDA cores: 4864
  • Tensor Cores: 152
  • Video memory: 8GB GDDR6
  • FP32 performance: 16.2 TFLOPS

This GPU strikes a balance between cost and performance, making it ideal for AI workloads and gaming benchmarks alike. For LLM hosting, the 8GB VRAM is sufficient for running quantized models (4-bit precision), which drastically reduce memory requirements without significant loss in performance.
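As a quick sanity check on that claim, weight memory scales linearly with bits per parameter: a 7B model needs about 13GB at FP16 but only around 3.3GB at 4-bit, which is why the quantized model sizes in the results below hover around 4-5GB. Here is a minimal Python sketch of the arithmetic; the overhead note is a rule-of-thumb assumption, not a measured figure.

```python
# Back-of-the-envelope VRAM estimate for LLM weights at a given precision.
# Rule of thumb only: runtime overhead (KV cache, CUDA context, activation
# buffers) typically adds another 1-2 GB on top of the raw weights.

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for bits, label in [(16, "FP16"), (4, "4-bit (Q4)")]:
    print(f"7B model at {label}: ~{weights_gb(7, bits):.1f} GB of weights")

# Approximate output:
# 7B model at FP16: ~13.0 GB of weights
# 7B model at 4-bit (Q4): ~3.3 GB of weights
```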

Tested Models and Configurations

All models were tested on Ollama 0.5.4 with 4-bit quantization, ensuring efficient use of the RTX 3060 Ti's memory and compute resources. We evaluated 12 popular models (a script sketch for pulling them all follows the list):
  • Llama 2 (7B, 13B)
  • Llama 3.1 (8B)
  • Mistral (7B)
  • Gemma (7B)
  • Gemma 2 (9B)
  • LLaVA (7B)
  • WizardLM 2 (7B)
  • Qwen2 (7B)
  • Qwen2.5 (7B)
  • Falcon 2 (11B)
  • StableLM 2 (12B)
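
If you want to reproduce the benchmark, the model tags can all be fetched up front. Below is a minimal sketch using the Ollama CLI from Python; it assumes `ollama` is installed, on the PATH, and that its daemon is running.

```python
# Pull every model tag used in this benchmark via the Ollama CLI.
import subprocess

MODELS = [
    "llama2:7b", "llama2:13b", "llama3.1:8b", "mistral:7b",
    "gemma:7b", "gemma2:9b", "llava:7b", "wizardlm2:7b",
    "qwen2:7b", "qwen2.5:7b", "falcon2:11b", "stablelm2:12b",
]

for tag in MODELS:
    # `ollama pull` is idempotent: layers already on disk are skipped.
    subprocess.run(["ollama", "pull", tag], check=True)
```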

Results: LLM Benchmarking on Ollama

Below are the benchmark results obtained when running the models on the Nvidia RTX 3060 Ti GPU:
| Model | Parameters | Size (GB) | Quantization | CPU Rate | RAM Rate | GPU VRAM | GPU Utilization | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|---|---|
| llama2 | 7B | 3.8 | 4-bit | 2% | 3% | 63% | 98% | 73.07 |
| llama2 | 13B | 7.4 | 4-bit | 27-42% | 7% | 84% | 30-40% | 9.25 |
| llama3.1 | 8B | 4.9 | 4-bit | 3% | 5% | 80% | 98% | 57.34 |
| mistral | 7B | 4.1 | 4-bit | 3% | 5% | 70% | 88% | 71.16 |
| gemma | 7B | 5.0 | 4-bit | 20% | 9% | 81% | 93% | 31.95 |
| gemma2 | 9B | 5.4 | 4-bit | 21% | 6% | 83% | 68% | 23.80 |
| llava | 7B | 4.7 | 4-bit | 3% | 5% | 80% | 98% | 72.00 |
| wizardlm2 | 7B | 4.1 | 4-bit | 3% | 5% | 70% | 100% | 70.79 |
| qwen2 | 7B | 4.4 | 4-bit | 3% | 5% | 65% | 98% | 63.73 |
| qwen2.5 | 7B | 4.7 | 4-bit | 3% | 5% | 68% | 96% | 58.13 |
| stablelm2 | 12B | 7.0 | 4-bit | 15% | 5% | 90% | 90% | 18.73 |
| falcon2 | 11B | 6.4 | 4-bit | 8% | 5% | 85% | 80% | 31.20 |

All models ran on Ollama 0.5.4, and download speeds held steady at roughly 11 MB/s for every pull.
Real-time server resource consumption was recorded for each test, with runs launched via `ollama run <model>:<size>` (e.g., `ollama run llama2:7b`, `ollama run gemma2:9b`, `ollama run falcon2:11b`).
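
For readers who want to reproduce the eval rates above, Ollama exposes a local REST API (default port 11434) whose non-streaming responses include token counts and timings. The sketch below derives tokens/s from those fields; the prompt is a placeholder, and absolute numbers will vary with prompt content and length.

```python
# Measure generation speed via Ollama's local REST API.
import requests

def eval_rate(model: str, prompt: str) -> float:
    """Return generated tokens per second for a single request."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = tokens generated; eval_duration is in nanoseconds.
    return data["eval_count"] / data["eval_duration"] * 1e9

print(f"{eval_rate('llama2:7b', 'Explain 4-bit quantization briefly.'):.2f} tokens/s")
```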

Observations on Nvidia RTX 3060 Ti Server

1. Efficiency of Smaller Models

Models like Llama 2 (7B) and Mistral (7B) showed exceptional performance, with high GPU utilization (98%) and fast inference speeds (70+ tokens/s). These models are well-suited for real-time applications, particularly when hosted on RTX 3060 Ti servers.

2. Challenges with Larger Models

Models like StableLM 2 (12B) and Falcon 2 (11B) pushed the limits of the RTX 3060 Ti's 8GB VRAM, dropping to 18.73 and 31.20 tokens/s respectively. Llama 2 (13B) no longer fit fully on the GPU: utilization fell to 30-40% while CPU usage climbed to 27-42%, and the eval rate collapsed to 9.25 tokens/s. Models of this size are better served by GPUs with more memory, such as the RTX 3090 or 4090.

3. Quantization is Essential

Running all models at 4-bit precision proved crucial for fitting them into the RTX 3060 Ti's 8GB of VRAM. Without quantization, even a 7B model (roughly 13GB of weights at FP16) would exceed the GPU's memory and fall back to much slower CPU-assisted processing.
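
To make that fit check concrete, here is a hypothetical helper mirroring the decision: if quantized weights plus runtime overhead exceed VRAM, layers spill to the CPU. The 1.5GB overhead constant is our illustrative assumption, not a number taken from Ollama.

```python
# Hypothetical fit check: will a quantized model run fully on the GPU?
# The 1.5 GB runtime overhead is an assumed constant for illustration.

def fits_in_vram(model_size_gb: float, vram_gb: float = 8.0,
                 overhead_gb: float = 1.5) -> bool:
    return model_size_gb + overhead_gb <= vram_gb

for name, size_gb in [("llama2:7b", 3.8), ("llama2:13b", 7.4)]:
    verdict = "full GPU offload" if fits_in_vram(size_gb) else "partial CPU offload"
    print(f"{name} ({size_gb} GB) -> {verdict}")

# llama2:7b (3.8 GB) -> full GPU offload
# llama2:13b (7.4 GB) -> partial CPU offload (matches the 13B slowdown above)
```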

4. CPU and RAM Usage

CPU and RAM usage remained relatively low for most models, highlighting that the GPU handles the majority of the workload in this setup. This is a strong argument for using the RTX 3060 Ti for Ollama hosting, as it minimizes additional system resource requirements.
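
The GPU-side figures in the table came from watching system monitors during each run. One lightweight way to capture them programmatically is to poll `nvidia-smi`, as in this sketch; it is not the exact tooling behind the screenshots above.

```python
# Poll nvidia-smi for GPU utilization and VRAM usage during a run.
import subprocess
import time

def gpu_snapshot() -> tuple[int, int]:
    """Return (GPU utilization %, VRAM used in MiB) for the first GPU."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ], text=True)
    util, mem = out.splitlines()[0].split(", ")
    return int(util), int(mem)

for _ in range(5):  # sample once per second for ~5 seconds
    util, mem = gpu_snapshot()
    print(f"GPU util: {util}% | VRAM used: {mem} MiB")
    time.sleep(1)
```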

Comparison to Higher-End GPUs

While the RTX 3060 Ti performs admirably in this benchmark, it falls short of GPUs with higher VRAM capacity, like the RTX 3090 (24GB) or RTX 4090 (24GB), which can run larger models in the 13B-34B range. However, for developers prioritizing cost-efficiency, the RTX 3060 Ti strikes a great balance, especially for hosting small and mid-sized LLMs.

Basic GPU Dedicated Server - T1000

$99.00/mo
  • 64GB RAM
  • Eight-Core Xeon E5-2690
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro T1000
  • Microarchitecture: Turing
  • CUDA Cores: 896
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 2.5 TFLOPS

Advanced GPU Dedicated Server - RTX 3060 Ti

$179.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 3060 Ti
  • Microarchitecture: Ampere
  • CUDA Cores: 4864
  • Tensor Cores: 152
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 16.2 TFLOPS

Enterprise GPU Dedicated Server - RTX 4090

$302.00/mo
44% Off Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.

Enterprise GPU Dedicated Server - RTX A6000

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Ideal for AI, deep learning, data visualization, HPC, and similar workloads.

Conclusion: Is RTX 3060 Ti Good for LLM Hosting?

The RTX 3060 Ti proves to be a cost-effective choice for LLM benchmarks, especially when paired with Ollama's efficient quantization. For tasks involving models under 13 billion parameters, this setup offers competitive performance, high efficiency, and low resource consumption. If you're searching for an affordable RTX 3060 hosting solution to run LLMs on Ollama, this GPU delivers solid results without breaking the bank.

This server is a good fit if you want to:
  • Run small and mid-sized large language models (4B-12B parameters).
  • Test the reasoning capabilities of LLMs while keeping power consumption low.
  • Get a high-performance GPU dedicated server on a limited budget.
For larger-scale applications, consider upgrading to GPUs with more VRAM. However, for most developers working with quantized LLMs or seeking a compact RTX 3060 benchmark, this setup remains highly recommended.
Tags:

RTX 3060 benchmark, Ollama benchmark, LLM benchmark, Ollama test, Nvidia RTX 3060 benchmark, Ollama 3060, RTX 3060 Hosting, Ollama RTX Server