Benchmarking Large Language Models (LLMs) on Ollama with an NVIDIA V100 GPU Server

With the rising demand for LLM inference and AI model deployment, finding the right GPU hosting solution is crucial. The NVIDIA V100 server is a popular choice for LLM inference because it balances compute power, affordability, and availability. In this benchmark, we test various LLMs on Ollama running on an NVIDIA V100 (16GB) GPU server, analyzing performance metrics such as token evaluation rate, GPU utilization, and resource consumption.

Test Server Configuration

Before diving into the V100 Ollama benchmark, here’s a quick look at our test server specs:

Server Configuration:

  • Price: $229.00/month
  • CPU: Dual 12-Core E5-2690v3 (24 cores & 48 threads)
  • RAM: 128GB
  • Storage: 240GB SSD + 2TB SSD
  • Network: 100Mbps
  • OS: Windows 11 Pro

GPU Details:

  • GPU: NVIDIA V100 16GB
  • Compute Capability 7.0
  • Microarchitecture: Volta
  • CUDA Cores: 5,120
  • Tensor Cores: 640
  • GPU Memory: 16GB HBM2
  • FP32 Performance: 14 TFLOPS

This NVIDIA V100 hosting setup ensures an optimal balance between cost and performance, making it a great option for AI hosting, deep learning, and LLM deployment.

Benchmarking LLMs on Ollama with NVIDIA V100 Server

Using Ollama 0.5.11, we tested multiple LLMs, including DeepSeek, LLaMA 2, Mistral, and Qwen. Below are the key results of our benchmarking tests.
All models were run with 4-bit quantization on Ollama 0.5.11, and each downloaded at roughly 11 MB/s on this server.

| Model | Parameters | Size (GB) | CPU Rate | RAM Rate | GPU UTL | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|
| deepseek-r1 | 7b | 4.7 | 2% | 5% | 71% | 87.10 |
| deepseek-r1 | 8b | 4.9 | 2% | 6% | 78% | 83.03 |
| deepseek-r1 | 14b | 9 | 3% | 5% | 80% | 48.63 |
| deepseek-coder-v2 | 16b | 8.9 | 3% | 5% | 70% | 69.16 |
| llama2 | 7b | 3.8 | 2% | 5% | 85% | 107.49 |
| llama2 | 13b | 7.4 | 3% | 5% | 87% | 67.51 |
| llama3.1 | 8b | 4.9 | 3% | 5% | 76% | 84.07 |
| mistral | 7b | 4.1 | 3% | 6% | 84% | 107.31 |
| gemma2 | 9b | 5.4 | 3% | 6% | 69% | 59.90 |
| gemma2 | 27b | 16 | 42% | 7% | 13~24% | 8.37 |
| qwen2.5 | 7b | 4.7 | 3% | 5% | 73% | 86.00 |
| qwen2.5 | 14b | 9.0 | 3% | 6% | 80% | 49.38 |
A video recording real-time resource consumption on the V100 GPU server during these tests:
Screenshots of benchmarking LLMs on Ollama with the NVIDIA V100 GPU were captured for each run:

  • ollama run deepseek-r1:7b
  • ollama run deepseek-r1:8b
  • ollama run deepseek-r1:14b
  • ollama run deepseek-coder-v2:16b
  • ollama run llama2:7b
  • ollama run llama2:13b
  • ollama run llama3.1:8b
  • ollama run mistral:7b
  • ollama run gemma2:9b
  • ollama run gemma2:27b
  • ollama run qwen2.5:7b
  • ollama run qwen2.5:14b
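The eval rates above are the statistics Ollama itself reports; appending --verbose to any of the ollama run commands listed above prints the same figures after each response. For scripted benchmarking, a minimal sketch along these lines can collect tokens/s programmatically. It assumes Ollama's default local REST endpoint at http://localhost:11434; the model and prompt are placeholders, not the exact prompts used in our tests.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def eval_rate(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return tokens/s from Ollama's own stats."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / data["eval_duration"] * 1e9


if __name__ == "__main__":
    # Placeholder model and prompt; swap in any model from the table above.
    rate = eval_rate("mistral:7b", "Explain GPU memory bandwidth in one paragraph.")
    print(f"{rate:.2f} tokens/s")
```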

Key Takeaways from the NVIDIA V100 Ollama Benchmark

1. Best LLM for Performance: LLaMA 2 (7B) & Mistral (7B)

  • Both LLaMA 2 (7B) and Mistral (7B) deliver the highest token evaluation rates (107+ tokens/s).
  • Ideal for real-time inference, chatbots, and AI applications requiring fast response times (a minimal streaming example follows this list).
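For context on what 107+ tokens/s means in practice, here is a minimal streaming-chat sketch against Ollama's /api/chat endpoint. The default local endpoint, model name, and prompt are illustrative assumptions rather than part of the benchmark itself.

```python
import json

import requests


def stream_chat(model: str, user_message: str) -> None:
    """Stream a chat response token-by-token from a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,
    }
    with requests.post(
        "http://localhost:11434/api/chat", json=payload, stream=True, timeout=600
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            # Each streamed line is a JSON object carrying a partial assistant message.
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
    print()


stream_chat("mistral:7b", "Draft a two-sentence product description for a GPU server.")
```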

2. Medium Models (13-14B) Have Slightly Reduced Performance

  • DeepSeek-R1 14B, Qwen2.5 14B, and LLaMA 2 13B show lower evaluation rates of roughly 49-68 tokens/s.
  • Higher GPU utilization (~80-87%) results in increased latency.
  • NVIDIA V100 hosting nonetheless remains a practical choice for 13B-14B models.

3. Not suitable for running large models (27B+)

  • gemma2:27b drops to 8.37 tokens/s: its quantized weights alone fill the V100's 16GB of VRAM, so layers spill to the CPU (note the 42% CPU rate and 13~24% GPU utilization above), making the V100 impractical for models of 27B parameters or larger, as the rough estimate below illustrates.
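A quick sanity check using the quantized model sizes from the benchmark table makes the cutoff clear: once the model file plus KV cache and runtime buffers no longer fits in 16GB of VRAM, Ollama offloads layers to system RAM and throughput collapses. The overhead allowance in this sketch is an assumed figure for illustration, not something we measured.

```python
V100_VRAM_GB = 16


def fits_on_gpu(model_size_gb: float, overhead_gb: float = 1.5) -> bool:
    """Crude check: quantized model size plus a flat allowance for KV cache and CUDA buffers."""
    return model_size_gb + overhead_gb <= V100_VRAM_GB


# Quantized sizes taken from the benchmark table above.
for name, size_gb in [("qwen2.5:14b", 9.0), ("deepseek-coder-v2:16b", 8.9), ("gemma2:27b", 16.0)]:
    verdict = "fits in VRAM" if fits_on_gpu(size_gb) else "spills to system RAM / CPU offload"
    print(f"{name}: {size_gb} GB + overhead -> {verdict}")
```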

4. GPU Utilization & Resource Efficiency

  • For models that fit entirely in VRAM, GPU utilization ranges from about 69% to 87%, indicating that Ollama keeps the V100 well loaded.
  • CPU & RAM usage remain low (~2-6%), leaving headroom for potential multi-instance deployments (a simple monitoring sketch follows this list).
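The utilization figures above were read from system monitors during each run. A small logger like the sketch below, using the nvidia-ml-py and psutil packages (our choice of tooling for illustration, not necessarily what the benchmark itself used), reports the same GPU, VRAM, CPU, and RAM metrics once per second while a model is generating.

```python
import time

import psutil
import pynvml  # pip install nvidia-ml-py psutil

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # the single V100 sits at index 0

try:
    for _ in range(30):  # sample once per second for 30 seconds
        util = pynvml.nvmlDeviceGetUtilizationRates(gpu)
        mem = pynvml.nvmlDeviceGetMemoryInfo(gpu)
        print(
            f"GPU {util.gpu:3d}% | "
            f"VRAM {mem.used / 1e9:5.1f}/{mem.total / 1e9:.1f} GB | "
            f"CPU {psutil.cpu_percent():5.1f}% | "
            f"RAM {psutil.virtual_memory().percent:5.1f}%"
        )
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```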

Is NVIDIA V100 Hosting a Good Choice for Ollama LLM Inference?

1. Pros of Using NVIDIA V100 for Ollama

  • Affordable & widely available compared to newer A100/H100 servers.
  • Good balance of 16GB memory and compute power for models in the 7B-24B range.
  • Strong inference speed for optimized models like LLaMA 2 and Mistral.
  • Can run multiple smaller models efficiently due to moderate CPU/RAM requirements.

2. Limitations of NVIDIA V100 for Ollama

  • Struggles with larger models (27B+), leading to much slower evaluation rates.
  • 16GB VRAM is limiting for multi-GPU scaling.
  • Not ideal for training—only suitable for inference workloads.

Get Started with V100 Server Hosting for LLMs

For those deploying LLMs on Ollama, choosing the right NVIDIA V100 hosting solution can significantly impact performance and costs. If you're working with 7B-24B models, the V100 is a solid choice for AI inference at an affordable price.

Advanced GPU Dedicated Server - V100

$229.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2690v3
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia V100
  • Microarchitecture: Volta
  • CUDA Cores: 5,120
  • Tensor Cores: 640
  • GPU Memory: 16GB HBM2
  • FP32 Performance: 14 TFLOPS
  • Cost-effective for AI, deep learning, data visualization, HPC, etc.

Professional GPU VPS - A4000

$102.00/mo
43% OFF Recurring (Was $179.00)
  • 32GB RAM
  • 24 CPU Cores
  • 320GB SSD
  • 300Mbps Unmetered Bandwidth
  • Once per 2 Weeks Backup
  • OS: Linux / Windows 10
  • Dedicated GPU: Quadro RTX A4000
  • CUDA Cores: 6,144
  • Tensor Cores: 192
  • GPU Memory: 16GB GDDR6
  • FP32 Performance: 19.2 TFLOPS
  • Available for Rendering, AI/Deep Learning, Data Science, CAD/CGI/DCC.

Advanced GPU Dedicated Server - RTX 3060 Ti

$179.00/mo
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 3060 Ti
  • Microarchitecture: Ampere
  • CUDA Cores: 4864
  • Tensor Cores: 152
  • GPU Memory: 8GB GDDR6
  • FP32 Performance: 16.2 TFLOPS

Professional GPU Dedicated Server - P100

$129.00/mo
35% Off Recurring (Was $199.00)
  • 128GB RAM
  • Dual 10-Core E5-2660v2
  • 120GB + 960GB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Tesla P100
  • Microarchitecture: Pascal
  • CUDA Cores: 3584
  • GPU Memory: 16 GB HBM2
  • FP32 Performance: 9.5 TFLOPS
  • Suitable for AI, Data Modeling, High Performance Computing, etc.

Final Verdict: Is V100 Hosting Worth It for Ollama?

For those looking for an affordable LLM hosting solution, NVIDIA V100 rental services offer a cost-effective option for deploying models like LLaMA 2, Mistral, and DeepSeek-R1. With Ollama's efficient inference engine, the V100 performs well on models in the 7B-24B parameter range, making it a great choice for chatbots, AI assistants, and other real-time NLP applications.

However, for larger models (24B+), upgrading to an RTX 4090 (24GB) or A100 (40GB) would be necessary. What LLMs are you running on your NVIDIA V100 server? Let us know in the comments!

Tags:

Ollama, LLM, NVIDIA V100, AI, Deep Learning, Mistral, LLaMA2, DeepSeek, GPU, Machine Learning, AI Inference