Benchmarking LLMs on Ollama with Nvidia GTX 1660 GPU Server

Introduction to GTX 1660 GPU Hosting: The Nvidia GeForce GTX 1660, a mid-tier gaming GPU, is increasingly being used to run LLMs (Large Language Models) in server environments. With 6GB of GDDR6 memory, 1408 CUDA cores, and 5.0 TFLOPS of FP32 performance, it is an affordable option for smaller-scale language model inference. Let's dive into the performance of LLMs running on a GTX 1660 GPU server.

Test Server Configuration

Before diving into the Ollama GTX 1660 benchmark, let's take a look at the server specs:

Server Configuration:

  • Price: $159/month
  • CPU: Dual 10-Core Xeon E5-2660v2
  • RAM: 64GB
  • Storage: 120GB + 960GB SSD
  • Network: 100Mbps Unmetered
  • OS: Windows 11 Pro

GPU Details:

  • GPU: Nvidia GeForce GTX 1660
  • Compute Capability: 7.5
  • Microarchitecture: Turing
  • CUDA Cores: 1408
  • Memory: 6GB GDDR6
  • FP32 Performance: 5.0 TFLOPS

This setup makes Nvidia GTX 1660 hosting a viable option for running small LLM inference workloads efficiently while keeping costs in check.

Ollama Benchmark: Testing LLMs on GTX 1660 Server

For testing, we utilized Ollama 0.5.11 to benchmark a variety of LLMs on the Nvidia GTX 1660 GPU. The results provide valuable insight into the GPU's performance when tasked with smaller language models.
All twelve models were run on Ollama 0.5.11 with 4-bit (Q4) quantization; download speed held steady at 12 MB/s throughout.

| Model | Parameters | Size (GB) | CPU Rate | RAM Rate | GPU UTL | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|
| deepseek-r1 | 1.5b | 1.1 | 6% | 8% | 38% | 30.16 |
| deepseek-r1 | 7b | 4.7 | 20% | 9% | 37% | 18.29 |
| deepseek-r1 | 8b | 4.9 | 28% | 10% | 37% | 16.26 |
| deepseek-coder | 6.7b | 3.8 | 18% | 9% | 42% | 21.31 |
| llama3.2 | 3b | 2.0 | 6% | 8% | 50% | 38.36 |
| llama3.1 | 8b | 4.9 | 30% | 10% | 35% | 16.67 |
| codellama | 7b | 3.8 | 17% | 8% | 42% | 21.42 |
| mistral | 7b | 4.1 | 4% | 8% | 20% | 40.2 |
| gemma | 7b | 5.0 | 45% | 12% | 30% | 9.69 |
| codegemma | 7b | 5.0 | 30% | 12% | 36% | 9.39 |
| qwen2.5 | 3b | 1.9 | 6% | 8% | 36% | 26.28 |
| qwen2.5 | 7b | 4.7 | 18% | 9% | 37% | 18.24 |
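Utilization figures like those in the table can be sampled yourself with `nvidia-smi`'s CSV query mode while a model is generating. A minimal sketch (the query flags are standard `nvidia-smi` options; the helper names are our own):

```python
import csv
import io
import subprocess

# Standard nvidia-smi query flags; "nounits" strips " %" and " MiB" suffixes.
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_sample(csv_line: str) -> dict:
    """Parse one CSV row from the query above into plain numbers."""
    util, mem_used, mem_total = next(csv.reader(io.StringIO(csv_line)))
    return {"gpu_util_pct": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total)}

def sample_gpu() -> dict:
    """Run nvidia-smi once and return the current load (needs an Nvidia driver)."""
    out = subprocess.check_output(QUERY, text=True)
    return parse_gpu_sample(out.strip().splitlines()[0])
```

Calling `sample_gpu()` in a loop (say, once per second) while `ollama run` is generating gives the GPU UTL and VRAM figures reported above.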
A video recording the real-time GTX 1660 GPU server resource consumption during the benchmark:
Screenshots of benchmarking LLMs on Ollama with the Nvidia GTX 1660 GPU server
The commands used for each test:

  • ollama run deepseek-r1:1.5b
  • ollama run deepseek-r1:7b
  • ollama run deepseek-r1:8b
  • ollama run deepseek-coder:6.7b
  • ollama run llama3.2:3b
  • ollama run llama3.1:8b
  • ollama run codellama:7b
  • ollama run mistral:7b
  • ollama run gemma:7b
  • ollama run codegemma:7b
  • ollama run qwen2.5:3b
  • ollama run qwen2.5:7b
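Eval rates like those in the table can be read from the stats block that `ollama run --verbose` prints after each response. A small parser, assuming the Ollama 0.5.x stats format:

```python
import re

def parse_eval_rate(stats: str) -> float:
    """Return the generation speed (tokens/s) from an `ollama run --verbose`
    stats block. The block also contains a 'prompt eval rate' line, so the
    regex is anchored to lines that start with 'eval rate'."""
    match = re.search(r"^eval rate:\s*([\d.]+)\s*tokens/s", stats, re.MULTILINE)
    if match is None:
        raise ValueError("no 'eval rate' line found in stats block")
    return float(match.group(1))

# Example block in the shape Ollama prints (values illustrative):
sample = """\
prompt eval rate:     112.40 tokens/s
eval count:           185 token(s)
eval rate:            30.16 tokens/s
"""
print(parse_eval_rate(sample))  # 30.16
```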

Key Findings from the Benchmark

1️⃣. Best for Small Models (7B and below)

The GTX 1660 GPU shines when running small models like DeepSeek-R1 (1.5B), Llama 3.2 (3B), and Mistral (7B). These models run smoothly with low CPU load, meaning inference stays almost entirely on the GPU, and deliver solid speeds (roughly 26-40 tokens/s).

2️⃣. CPU Load Increases with Larger Models

For models at 8B and above, such as DeepSeek-R1 (8B) and Llama 3.1 (8B), CPU utilization climbs to around 28-30%, signaling that the 6GB of GPU memory has become a bottleneck and part of the work is spilling over to the CPU, capping eval rates at roughly 16-17 tokens/s.

3️⃣. Suboptimal for 8B+ Models

The GTX 1660 struggles with models in the 8B+ range: eval rates drop to around 16 tokens/s while CPU usage rises significantly. It's clear that larger models are not a good fit for this GPU.

4️⃣. VRAM Footprint Matters More Than Parameter Count

Not all 7B models behave alike. Mistral (7B, 4.1GB) fits comfortably within the 6GB VRAM limit and reaches 40.2 tokens/s, while Gemma and CodeGemma (7B, 5.0GB each) leave too little headroom and collapse to under 10 tokens/s. What matters is the model's total memory footprint relative to the 6GB of VRAM, not the parameter count alone.
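A rough way to predict which models will stay on the GPU is to compare the quantized model size plus some runtime headroom against the 6GB of VRAM. In this sketch, the 1.5GB headroom figure (KV cache plus CUDA buffers) is an assumption, not something measured on this server:

```python
# Rule-of-thumb VRAM check for Q4-quantized models on a 6GB card.
# RUNTIME_HEADROOM_GB (KV cache + CUDA buffers) is an assumed figure,
# not measured on this server.
RUNTIME_HEADROOM_GB = 1.5

def fits_in_vram(model_size_gb: float, vram_gb: float = 6.0) -> bool:
    """True if the quantized model plus runtime headroom should fit on the GPU."""
    return model_size_gb + RUNTIME_HEADROOM_GB <= vram_gb

# Matches the pattern in the benchmark: mistral:7b (4.1GB) stays on the GPU,
# while gemma:7b (5.0GB) leaves too little headroom and slows down.
print(fits_in_vram(4.1))  # True
print(fits_in_vram(5.0))  # False
```

Under this assumption the 4.9GB 8B models also fail the check, which lines up with their elevated CPU usage in the table.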

Get Started with GTX1660 Hosting for Small LLMs

For those deploying LLMs on Ollama, choosing the right Nvidia GTX 1660 hosting solution can significantly impact performance and costs. If you're working with 0.5B-7B models, the GTX 1660 is a solid choice for AI inference at an affordable price.

Basic GPU Dedicated Server - GTX 1660

$139.00/mo
  • 64GB RAM
  • Dual 10-Core Xeon E5-2660v2
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia GeForce GTX 1660
  • Microarchitecture: Turing
  • CUDA Cores: 1408
  • GPU Memory: 6GB GDDR6
  • FP32 Performance: 5.0 TFLOPS
Flash Sale to Mar.16

Professional GPU Dedicated Server - RTX 2060

$109.00/mo
45% OFF Recurring (Was $199.00)
  • 128GB RAM
  • Dual 10-Core E5-2660v2
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia GeForce RTX 2060
  • Microarchitecture: Turing
  • CUDA Cores: 1920
  • Tensor Cores: 240
  • GPU Memory: 6GB GDDR6
  • FP32 Performance: 6.5 TFLOPS
  • Powerful for Gaming, OBS Streaming, Video Editing, Android Emulators, 3D Rendering, etc
Flash Sale to Mar.16

Basic GPU Dedicated Server - RTX 4060

$106.00/mo
40% OFF Recurring (Was $179.00)
  • 64GB RAM
  • Eight-Core E5-2690
  • 120GB SSD + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia GeForce RTX 4060
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 3072
  • Tensor Cores: 96
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 15.11 TFLOPS
  • Ideal for video editing, rendering, Android emulators, gaming, and light AI tasks.
Flash Sale to Mar.16

Professional GPU VPS - A4000

$102.00/mo
43% OFF Recurring (Was $179.00)
  • 32GB RAM
  • 24 CPU Cores
  • 320GB SSD
  • 300Mbps Unmetered Bandwidth
  • Once per 2 Weeks Backup
  • OS: Linux / Windows 10
  • Dedicated GPU: Nvidia RTX A4000
  • CUDA Cores: 6,144
  • Tensor Cores: 192
  • GPU Memory: 16GB GDDR6
  • FP32 Performance: 19.2 TFLOPS
  • Available for Rendering, AI/Deep Learning, Data Science, CAD/CGI/DCC.

Conclusion

The Nvidia GTX 1660 GPU is a cost-effective solution for running small LLMs (1.5B-7B) at usable inference speeds (roughly 18-40 tokens/s) on hosting plans starting around $159/month. For larger models, 8B and above, scale up to a GPU with more VRAM for acceptable performance. A GTX 1660 dedicated server is an excellent fit for developers working with smaller language models, LLM inference, and budget-conscious projects.

Tags:

ollama 1660, small llms ollama, ollama GTX1660, Nvidia GTX1660 hosting, benchmark GTX1660, ollama benchmark, GTX1660 for llms inference, nvidia GTX1660 rental, GTX 1660 LLM hosting, Nvidia 1660 performance