Ollama Benchmark: Performance of Running LLMs on Nvidia T1000 GPU Server

With the popularity of large language models (LLMs), more and more developers and researchers want to run these models on local or cloud servers. Built on the Turing microarchitecture with 8GB of GDDR6 memory, the Nvidia Quadro T1000 delivers 2.5 TFLOPS of FP32 performance, making it a viable option for lightweight AI workloads. In this article, we benchmark the performance of various LLMs running on the Ollama platform on an Nvidia Quadro T1000 GPU.

Hardware Overview: GPU Dedicated Server - T1000

The tested server, located in the USA, is configured as follows:

Server Configuration:

  • Price: $99.00/month
  • CPU: Eight-core Xeon E5-2690
  • Memory: 64GB RAM
  • Storage: 120GB + 960GB SSD
  • Network: 100Mbps-1Gbps
  • Operating system: Windows 11 Pro

GPU Details:

  • GPU: Nvidia Quadro T1000
  • Microarchitecture: Turing
  • CUDA cores: 896
  • Video memory: 8GB GDDR6
  • FP32 performance: 2.5 TFLOPS

This configuration offers a good price/performance ratio for running Ollama and similar LLM workloads; in particular, how the Nvidia Quadro T1000 holds up under computationally intensive tasks is worth exploring.

Tested Several Mainstream LLMs

All models were run through Ollama using the command ollama run <model>, and all were quantized to 4-bit to reduce vRAM usage. A reproduction sketch follows the list below.
  • Llama2 (7B)
  • Llama3.1 (8B)
  • Mistral (7B)
  • Gemma (7B) / Gemma2 (9B)
  • Llava (7B)
  • WizardLM2 (7B)
  • Qwen (4B) / Qwen2 (7B) / Qwen2.5 (7B)
  • Nemotron-Mini (4B)
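
To reproduce numbers like these, the run can be scripted against Ollama's REST API. The sketch below is a minimal example, assuming a local Ollama instance on the default port 11434; the prompt is an arbitrary placeholder, not the one used for the published results.

```python
import requests

# Assumption: a local Ollama instance is listening on the default port.
OLLAMA_URL = "http://localhost:11434"

# The 4-bit quantized tags benchmarked in this article.
MODELS = [
    "llama2:7b", "llama3.1:8b", "mistral:7b", "gemma:7b", "gemma2:9b",
    "llava:7b", "wizardlm2:7b", "qwen:4b", "qwen2:7b", "qwen2.5:7b",
    "nemotron-mini:4b",
]

PROMPT = "Explain the difference between a CPU and a GPU in one paragraph."

for model in MODELS:
    # Pull first so download time does not pollute the generation timing.
    requests.post(f"{OLLAMA_URL}/api/pull",
                  json={"model": model, "stream": False}, timeout=3600)

    # A non-streaming generate request returns timing counters in its JSON.
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    ).json()

    # eval_duration is reported in nanoseconds.
    eval_rate = resp["eval_count"] / resp["eval_duration"] * 1e9
    print(f"{model:>18}: {eval_rate:6.2f} tokens/s")
```

The same statistics can also be read interactively: ollama run <model> --verbose prints the eval rate after each response.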

Benchmark Results: Ollama GPU T1000 Performance Metrics

Below are the benchmark results obtained when running the models on the Nvidia T1000 GPU:
| Model | Parameters | Size (GB) | Quantization | Running on | Download Speed (MB/s) | CPU Rate | RAM Rate | GPU vRAM | GPU Util | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| llama2 | 7B | 3.8 | 4-bit | Ollama 0.5.4 | 11 | 8% | 8% | 63% | 98% | 26.55 |
| llama3.1 | 8B | 4.9 | 4-bit | Ollama 0.5.4 | 11 | 8% | 7% | 80% | 98% | 21.51 |
| mistral | 7B | 4.1 | 4-bit | Ollama 0.5.4 | 11 | 7% | 7% | 71% | 97% | 23.79 |
| gemma | 7B | 5.0 | 4-bit | Ollama 0.5.4 | 11 | 23% | 9% | 81% | 90% | 15.78 |
| gemma2 | 9B | 5.4 | 4-bit | Ollama 0.5.4 | 11 | 20% | 9% | 83% | 96% | 12.83 |
| llava | 7B | 4.7 | 4-bit | Ollama 0.5.4 | 11 | 8% | 7% | 79% | 98% | 26.70 |
| wizardlm2 | 7B | 4.1 | 4-bit | Ollama 0.5.4 | 11 | 8% | 7% | 70% | 98% | 17.51 |
| qwen | 4B | 2.3 | 4-bit | Ollama 0.5.4 | 11 | 6% | 8% | 72% | 95% | 37.64 |
| qwen2 | 7B | 4.4 | 4-bit | Ollama 0.5.4 | 11 | 8% | 8% | 65% | 96% | 24.02 |
| qwen2.5 | 7B | 4.7 | 4-bit | Ollama 0.5.4 | 11 | 8% | 8% | 65% | 99% | 21.08 |
| nemotron-mini | 4B | 2.7 | 4-bit | Ollama 0.5.4 | 11 | 9% | 8% | 50% | 97% | 34.91 |
Real-time GPU server resource consumption was recorded during each run, with one screenshot per model (ollama run llama2:7b through ollama run nemotron-mini:4b); the screenshots are omitted here.
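
The screenshots were captured while watching GPU load during each run. A simple poller like the hypothetical sketch below (assuming nvidia-smi is on the PATH) logs the same utilization and vRAM figures to a CSV file; stop it with Ctrl-C.

```python
import csv
import subprocess
import time

# Fields to query from nvidia-smi once per second.
QUERY = "utilization.gpu,memory.used,memory.total"

with open("gpu_usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "gpu_util_pct",
                     "vram_used_mib", "vram_total_mib"])
    while True:  # Ctrl-C to stop
        out = subprocess.check_output(
            ["nvidia-smi", f"--query-gpu={QUERY}",
             "--format=csv,noheader,nounits"],
            text=True,
        ).strip()
        util, used, total = [v.strip() for v in out.split(",")]
        writer.writerow([time.time(), util, used, total])
        f.flush()
        time.sleep(1)
```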

Ollama's Performance Analysis on Nvidia T1000

1. Video Memory

The T1000's 8GB of video memory performs well with 4-bit quantized models and can even hold larger ones, such as Gemma2 with 9B parameters (83% vRAM utilization). Compared with similarly priced consumer-grade graphics cards, the T1000 allocates its video memory efficiently and reduces overflow problems.
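
As a crude sanity check before pulling a model, 4-bit weights occupy roughly half a byte per parameter. The helper below is a back-of-the-envelope sketch, not an exact calculation; the 2GB allowance for the KV cache, CUDA context, and activations is an assumption.

```python
# Rough check: does a 4-bit quantized model fit in 8 GB of vRAM?
def fits_in_vram(params_billions: float, vram_gb: float = 8.0,
                 overhead_gb: float = 2.0) -> bool:
    weights_gb = params_billions * 0.5  # 4 bits = 0.5 bytes per parameter
    return weights_gb + overhead_gb <= vram_gb

for name, params in [("qwen 4B", 4), ("llama2 7B", 7),
                     ("gemma2 9B", 9), ("13B model", 13)]:
    print(f"{name:>10}: {'fits' if fits_in_vram(params) else 'too large'}")
```

This matches the observed behavior: everything up to 9B parameters fits, while 13B-class models would risk spilling into system RAM.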

2. Computational Efficiency

The T1000's 2.5 TFLOPS of FP32 performance is adequate for LLM inference, but would be insufficient for training. Even when running the smaller models (such as Qwen 4B), the GPU operates close to full utilization.
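
Single-stream token generation is typically bound by memory bandwidth rather than FP32 throughput, since every generated token reads the full set of weights once. Assuming the published ~160 GB/s GDDR6 bandwidth for the 8GB T1000 (an assumption; check your card's spec), a rough throughput ceiling can be compared with the measured rates:

```python
# Back-of-the-envelope decode ceiling: bandwidth / model size.
BANDWIDTH_GBPS = 160.0  # assumed GDDR6 bandwidth of the T1000 8GB

for name, size_gb, measured in [("llama3.1 8B", 4.9, 21.51),
                                ("qwen 4B", 2.3, 37.64)]:
    ceiling = BANDWIDTH_GBPS / size_gb  # theoretical tokens/s upper bound
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured {measured} tok/s")
```

The measured rates land at roughly 50-70% of this ceiling, which is consistent with memory-bound inference and explains why the 4B models are so much faster than the 9B one.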

3. Resource Utilization

Ollama's CPU and RAM utilization on the T1000 server is very low for most models (6%-9%, rising to about 20% only for the Gemma models), which leaves the machine free for parallel tasks such as running other services or tools at the same time.

4. Speed Performance

Qwen 4B and Llama2 7B reach evaluation speeds of 37.64 tokens/s and 26.55 tokens/s respectively, which is excellent for graphics cards in this price range.
| Metric | Value for Various Models |
|---|---|
| Download Speed | 11 MB/s for all models; 118 MB/s with the 1Gbps bandwidth add-on |
| CPU Utilization Rate | ~8% on average |
| RAM Utilization Rate | 7%-9% |
| GPU vRAM Utilization | 63%-83% |
| GPU Utilization | 90%-99% |
| Evaluation Speed | 12.83-37.64 tokens/s |

Comparison with Other Graphics Cards

Compared with higher-end graphics cards such as the RTX 3060 or A100, the T1000 is clearly slower, but its low power consumption and stability are real advantages. For developers on a limited budget, the T1000 is an ideal choice, especially for running 4-bit quantized models under Ollama.

Basic GPU Dedicated Server - T1000

$99.00/mo
  • 64GB RAM
  • Eight-Core Xeon E5-2690
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro T1000
  • Microarchitecture: Turing
  • CUDA Cores: 896
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 2.5 TFLOPS

Basic GPU Dedicated Server - GTX 1660

$139.00/mo
  • 64GB RAM
  • Dual 10-Core Xeon E5-2660v2
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia GeForce GTX 1660
  • Microarchitecture: Turing
  • CUDA Cores: 1408
  • GPU Memory: 6GB GDDR6
  • FP32 Performance: 5.0 TFLOPS

Advanced GPU Dedicated Server - RTX 3060 Ti

$119.50/mo (50% off recurring, was $239.00)
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 3060 Ti
  • Microarchitecture: Ampere
  • CUDA Cores: 4864
  • Tensor Cores: 152
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 16.2 TFLOPS

Enterprise GPU Dedicated Server - A100

$639.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation and large-scale inference/AI training/ML workloads.

Summary and Recommendations

This review shows that the Nvidia Quadro T1000 is one of the most cost-effective GPUs for running Ollama, and it is especially suitable for the following scenarios:

  • Running small and medium-sized language models (4B-9B parameters).
  • Developers who want to test LLM inference capabilities at low power consumption.
  • Limited budgets that still require a dedicated GPU server.
If you are looking for a reliable and affordable graphics card to run Ollama, the GPU Dedicated Server - T1000 is worth trying. More complex training tasks call for a higher-end GPU, but for inference tasks the T1000's performance is entirely satisfactory.
I hope this article helped you better understand the real-world performance of the Nvidia T1000 and of Ollama running on it!
Tags:

Ollama GPU Performance, Ollama benchmark, Nvidia T1000 benchmark, Nvidia Quadro T1000 benchmark, Ollama T1000, GPU Dedicated Server T1000, Ollama test, Llama2 benchmark, Qwen benchmark, T1000 AI performance, T1000 LLM test, Nvidia T1000 AI tasks, running LLMs on T1000, affordable GPU for LLM.