Performance Analysis: Running LLMs on Ollama with an RTX 3060 Ti GPU Server

In this article, we evaluate the performance of large language models (LLMs) running on Ollama 0.5.4 on a dedicated GPU server equipped with an NVIDIA GeForce RTX 3060 Ti. This setup appeals to AI enthusiasts and developers looking for an affordable yet capable platform for hosting AI workloads. With 128GB of RAM and dual 12-core Xeon processors (24 cores in total), the server offers ample computational headroom for machine learning benchmarks and RTX 3060 hosting solutions.

Using popular LLMs like Llama 2, Mistral, and Falcon 2, we ran a series of Ollama benchmarks to assess GPU utilization, memory consumption, and inference speed. If you want to understand how the RTX 3060 Ti compares to other GPUs for LLM workloads, this review provides actionable insights.

Hardware Introduction: RTX 3060 Ti Overview

The tested server (hosted in the USA) is configured as follows:

Server Configuration:

  • CPU: Dual 12-Core E5-2697v2 (24 Cores & 48 Threads)
  • Memory: 128GB RAM
  • Storage: 240GB SSD + 2TB SSD
  • Network: 100Mbps-1Gbps
  • Operating system: Windows 11 Pro

GPU Details:

  • GPU: GeForce RTX 3060 Ti
  • Microarchitecture: Ampere
  • Compute Capability: 8.6
  • CUDA cores: 4864
  • Tensor Cores: 152
  • Video memory: 8GB GDDR6
  • FP32 performance: 16.2 TFLOPS

This GPU strikes a balance between cost and performance, making it ideal for AI workloads and gaming benchmarks alike. For LLM hosting, the 8GB VRAM is sufficient for running quantized models (4-bit precision), which drastically reduce memory requirements without significant loss in performance.
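As a quick sanity check on that claim, weight memory scales linearly with bits per parameter: a 7B model needs about 13GB at FP16 but only around 3.3GB at 4-bit, which is why the quantized model sizes in the results below hover around 4-5GB. Here is a minimal Python sketch of the arithmetic; the overhead note is a rule-of-thumb assumption, not a measured figure.

```python
# Back-of-the-envelope VRAM estimate for LLM weights at a given precision.
# Rule of thumb only: runtime overhead (KV cache, CUDA context, activation
# buffers) typically adds another 1-2 GB on top of the raw weights.

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for bits, label in [(16, "FP16"), (4, "4-bit (Q4)")]:
    print(f"7B model at {label}: ~{weights_gb(7, bits):.1f} GB of weights")

# Approximate output:
# 7B model at FP16: ~13.0 GB of weights
# 7B model at 4-bit (Q4): ~3.3 GB of weights
```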

Tested Models and Configurations

All models were tested on Ollama 0.5.4 with 4-bit quantization, ensuring efficient use of the RTX 3060 Ti's memory and compute resources. We evaluated 12 popular models (a script sketch for pulling them all follows the list):
  • Llama 2 (7B, 13B)
  • Llama 3.1 (8B)
  • Mistral (7B)
  • Gemma (7B)
  • Gemma 2 (9B)
  • LLaVA (7B)
  • WizardLM 2 (7B)
  • Qwen2 (7B)
  • Qwen2.5 (7B)
  • Falcon 2 (11B)
  • StableLM 2 (12B)
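
If you want to reproduce the benchmark, the model tags can all be fetched up front. Below is a minimal sketch using the Ollama CLI from Python; it assumes `ollama` is installed, on the PATH, and that its daemon is running.

```python
# Pull every model tag used in this benchmark via the Ollama CLI.
import subprocess

MODELS = [
    "llama2:7b", "llama2:13b", "llama3.1:8b", "mistral:7b",
    "gemma:7b", "gemma2:9b", "llava:7b", "wizardlm2:7b",
    "qwen2:7b", "qwen2.5:7b", "falcon2:11b", "stablelm2:12b",
]

for tag in MODELS:
    # `ollama pull` is idempotent: layers already on disk are skipped.
    subprocess.run(["ollama", "pull", tag], check=True)
```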

Results: LLM Benchmarking on Ollama

Below are the benchmark results obtained when running the models on the Nvidia RTX 3060 Ti GPU:
| Model | Parameters | Size (GB) | Quantization | CPU Rate | RAM Rate | GPU VRAM | GPU Utilization | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|---|---|
| llama2 | 7B | 3.8 | 4-bit | 2% | 3% | 63% | 98% | 73.07 |
| llama2 | 13B | 7.4 | 4-bit | 27-42% | 7% | 84% | 30-40% | 9.25 |
| llama3.1 | 8B | 4.9 | 4-bit | 3% | 5% | 80% | 98% | 57.34 |
| mistral | 7B | 4.1 | 4-bit | 3% | 5% | 70% | 88% | 71.16 |
| gemma | 7B | 5.0 | 4-bit | 20% | 9% | 81% | 93% | 31.95 |
| gemma2 | 9B | 5.4 | 4-bit | 21% | 6% | 83% | 68% | 23.80 |
| llava | 7B | 4.7 | 4-bit | 3% | 5% | 80% | 98% | 72.00 |
| wizardlm2 | 7B | 4.1 | 4-bit | 3% | 5% | 70% | 100% | 70.79 |
| qwen2 | 7B | 4.4 | 4-bit | 3% | 5% | 65% | 98% | 63.73 |
| qwen2.5 | 7B | 4.7 | 4-bit | 3% | 5% | 68% | 96% | 58.13 |
| stablelm2 | 12B | 7.0 | 4-bit | 15% | 5% | 90% | 90% | 18.73 |
| falcon2 | 11B | 6.4 | 4-bit | 8% | 5% | 85% | 80% | 31.20 |

All models ran on Ollama 0.5.4, and download speeds held steady at roughly 11 MB/s for every pull.
Real-time server resource consumption was recorded for each test, with runs launched via `ollama run <model>:<size>` (e.g., `ollama run llama2:7b`, `ollama run gemma2:9b`, `ollama run falcon2:11b`).
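
For readers who want to reproduce the eval rates above, Ollama exposes a local REST API (default port 11434) whose non-streaming responses include token counts and timings. The sketch below derives tokens/s from those fields; the prompt is a placeholder, and absolute numbers will vary with prompt content and length.

```python
# Measure generation speed via Ollama's local REST API.
import requests

def eval_rate(model: str, prompt: str) -> float:
    """Return generated tokens per second for a single request."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = tokens generated; eval_duration is in nanoseconds.
    return data["eval_count"] / data["eval_duration"] * 1e9

print(f"{eval_rate('llama2:7b', 'Explain 4-bit quantization briefly.'):.2f} tokens/s")
```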

Observations on Nvidia RTX 3060 Ti Server

1. Efficiency of Smaller Models

Models like Llama 2 (7B) and Mistral (7B) showed exceptional performance, with high GPU utilization (98%) and fast inference speeds (70+ tokens/s). These models are well-suited for real-time applications, particularly when hosted on RTX 3060 Ti servers.

2. Challenges with Larger Models

Models like StableLM 2 (12B) and Falcon 2 (11B) pushed the limits of the RTX 3060 Ti's 8GB VRAM, dropping to 18.73 and 31.20 tokens/s respectively. Llama 2 (13B) no longer fit fully on the GPU: utilization fell to 30-40% while CPU usage climbed to 27-42%, and the eval rate collapsed to 9.25 tokens/s. Models of this size are better served by GPUs with more memory, such as the RTX 3090 or 4090.

3. Quantization is Essential

Running all models at 4-bit precision proved crucial for fitting them into the RTX 3060 Ti's 8GB of VRAM. Without quantization, even a 7B model (roughly 13GB of weights at FP16) would exceed the GPU's memory and fall back to much slower CPU-assisted processing.
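
To make that fit check concrete, here is a hypothetical helper mirroring the decision: if quantized weights plus runtime overhead exceed VRAM, layers spill to the CPU. The 1.5GB overhead constant is our illustrative assumption, not a number taken from Ollama.

```python
# Hypothetical fit check: will a quantized model run fully on the GPU?
# The 1.5 GB runtime overhead is an assumed constant for illustration.

def fits_in_vram(model_size_gb: float, vram_gb: float = 8.0,
                 overhead_gb: float = 1.5) -> bool:
    return model_size_gb + overhead_gb <= vram_gb

for name, size_gb in [("llama2:7b", 3.8), ("llama2:13b", 7.4)]:
    verdict = "full GPU offload" if fits_in_vram(size_gb) else "partial CPU offload"
    print(f"{name} ({size_gb} GB) -> {verdict}")

# llama2:7b (3.8 GB) -> full GPU offload
# llama2:13b (7.4 GB) -> partial CPU offload (matches the 13B slowdown above)
```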

4. CPU and RAM Usage

CPU and RAM usage remained relatively low for most models, highlighting that the GPU handles the majority of the workload in this setup. This is a strong argument for using the RTX 3060 Ti for Ollama hosting, as it minimizes additional system resource requirements.
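
The GPU-side figures in the table came from watching system monitors during each run. One lightweight way to capture them programmatically is to poll `nvidia-smi`, as in this sketch; it is not the exact tooling behind the screenshots above.

```python
# Poll nvidia-smi for GPU utilization and VRAM usage during a run.
import subprocess
import time

def gpu_snapshot() -> tuple[int, int]:
    """Return (GPU utilization %, VRAM used in MiB) for the first GPU."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ], text=True)
    util, mem = out.splitlines()[0].split(", ")
    return int(util), int(mem)

for _ in range(5):  # sample once per second for ~5 seconds
    util, mem = gpu_snapshot()
    print(f"GPU util: {util}% | VRAM used: {mem} MiB")
    time.sleep(1)
```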

Comparison to Higher-End GPUs

While the RTX 3060 Ti performs admirably in this benchmark, it falls short of GPUs with higher VRAM capacity, like the RTX 3090 (24GB) or RTX 4090 (24GB), which can run larger models in the 13B-34B range. However, for developers prioritizing cost-efficiency, the RTX 3060 Ti strikes a great balance, especially for hosting small and mid-sized LLMs.

Basic GPU Dedicated Server - T1000

$99.00/mo
  • 64GB RAM
  • Eight-Core Xeon E5-2690
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro T1000
  • Microarchitecture: Turing
  • CUDA Cores: 896
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 2.5 TFLOPS

Advanced GPU Dedicated Server - RTX 3060 Ti

$179.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 3060 Ti
  • Microarchitecture: Ampere
  • CUDA Cores: 4864
  • Tensor Cores: 152
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 16.2 TFLOPS

Enterprise GPU Dedicated Server - RTX 4090

$302.00/mo
44% Off Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.

Enterprise GPU Dedicated Server - RTX A6000

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Ideal for AI, deep learning, data visualization, HPC, and similar workloads.

Conclusion: Is RTX 3060 Ti Good for LLM Hosting?

The RTX 3060 Ti proves to be a cost-effective choice for LLM benchmarks, especially when paired with Ollama's efficient quantization. For tasks involving models under 13 billion parameters, this setup offers competitive performance, high efficiency, and low resource consumption. If you're searching for an affordable RTX 3060 hosting solution to run LLMs on Ollama, this GPU delivers solid results without breaking the bank.

This server is a good fit if you want to:
  • Run small and mid-sized large language models (4B-12B parameters).
  • Test the reasoning capabilities of LLMs while keeping power consumption low.
  • Get a high-performance GPU dedicated server on a limited budget.
For larger-scale applications, consider upgrading to GPUs with more VRAM. However, for most developers working with quantized LLMs or seeking a compact RTX 3060 benchmark, this setup remains highly recommended.
Tags:

RTX 3060 benchmark, Ollama benchmark, LLM benchmark, Ollama test, Nvidia RTX 3060 benchmark, Ollama 3060, RTX 3060 Hosting, Ollama RTX Server