Running LLMs on Ollama: Performance Benchmark on Nvidia H100 GPU Server

With the rise of large language models (LLMs) and AI applications, the need for high-performance GPU hosting has never been greater. The Nvidia H100 GPU, powered by the Hopper architecture, is one of the most powerful GPUs for AI and deep learning workloads. This article benchmarks Ollama's performance on an H100 GPU server, analyzing its ability to handle LLMs efficiently.

Server Specifications

We ran our benchmarks on a dedicated H100 GPU server, featuring:

Server Configuration:

  • Price: $2599.00/month
  • CPU: Dual 18-Core Intel Xeon E5-2697v4
  • RAM: 192GB
  • Storage: 240GB SSD + 2TB NVMe + 8TB SATA
  • Network: 1Gbps
  • OS: Windows Server 2022

GPU Details:

  • GPU: Nvidia H100
  • Microarchitecture: Hopper
  • Compute Capability: 9.0
  • CUDA Cores: 14,592
  • Tensor Cores: 456
  • Memory: 80GB HBM2e
  • FP32 Performance: 183 TFLOPS

This configuration makes it an ideal H100 hosting solution for deep learning, LLM inference, and AI model training.

Benchmarking Ollama on H100 GPU

We tested various large language models (LLMs) on Ollama 0.5.7, including DeepSeek, Qwen, and LLaMA models, across different parameter sizes.
| Model | Parameters | Size | Quantization | Running on | Download Speed (MB/s) | CPU Rate | RAM Rate | GPU Utilization | Eval Rate (tokens/s) |
|---|---|---|---|---|---|---|---|---|---|
| deepseek-r1 | 14b | 9GB | 4 | Ollama 0.5.7 | 113 | 5% | 4% | 75% | 75.02 |
| deepseek-r1 | 32b | 20GB | 4 | Ollama 0.5.7 | 113 | 4% | 3% | 83% | 45.36 |
| deepseek-r1 | 70b | 43GB | 4 | Ollama 0.5.7 | 113 | 4% | 4% | 92% | 24.94 |
| qwen | 32b | 18GB | 4 | Ollama 0.5.7 | 113 | 4% | 4% | 72% | 48.23 |
| qwen | 72b | 41GB | 4 | Ollama 0.5.7 | 113 | 3% | 4% | 83% | 28.17 |
| qwen | 110b | 63GB | 4 | Ollama 0.5.7 | 113 | 3% | 3% | 90% | 20.19 |
| qwen2 | 72b | 41GB | 4 | Ollama 0.5.7 | 113 | 4% | 3% | 86% | 28.28 |
| llama3 | 70b | 40GB | 4 | Ollama 0.5.7 | 113 | 3% | 3% | 91% | 26.94 |
| llama3.1 | 70b | 43GB | 4 | Ollama 0.5.7 | 113 | 4% | 3% | 90% | 25.20 |
| llama3.3 | 70b | 43GB | 4 | Ollama 0.5.7 | 113 | 3% | 3% | 93% | 24.34 |
| zephyr | 141b | 80GB | 4 | Ollama 0.5.7 | 113 | 2% | 4% | 83% | 38.62 |
| mixtral | 8x22b | 80GB | 4 | Ollama 0.5.7 | 113 | 3% | 4% | 83% | 38.28 |
Real-time H100 GPU server resource consumption was recorded as screenshots (not reproduced here) for each run: ollama run deepseek-r1:14b, ollama run deepseek-r1:32b, ollama run deepseek-r1:70b, ollama run qwen:32b, ollama run qwen:72b, ollama run qwen:110b, ollama run qwen2:72b, ollama run llama3:70b, ollama run llama3.1:70b, ollama run llama3.3:70b, ollama run zephyr:141b, ollama run mixtral:8x22b.
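If you want to reproduce the eval-rate figures on your own server, Ollama's REST API reports token counts and timings for each generation. Below is a minimal sketch that computes tokens per second from the API response; the prompt text and model list are placeholders, not the exact inputs used in this benchmark.

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint
MODELS = ["deepseek-r1:14b", "qwen:32b", "llama3:70b"]  # subset of the benchmarked models
PROMPT = "Explain the Hopper GPU architecture in one paragraph."  # placeholder prompt

def eval_rate(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return decode speed in tokens/s."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = decode time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

for m in MODELS:
    print(f"{m}: {eval_rate(m, PROMPT):.2f} tokens/s")
```

The first request to a freshly loaded model also includes load time (reported separately as load_duration), so it is worth running each prompt several times and averaging the results.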

Analysis & Insights

1. CPU Utilization

Across all models, CPU usage stayed at or below 5%, indicating that Ollama offloads almost all computation to the H100 GPU.

2. RAM Utilization

RAM usage stayed around 3-4%, confirming that model weights and activations reside in the H100's HBM2e memory rather than in system RAM.
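The CPU, RAM, and GPU figures in the table were read from the server's resource monitors during each run. If you want to log comparable numbers yourself, here is a minimal sketch (assuming the psutil package and the nvidia-smi CLI are installed; the one-second sampling interval is an arbitrary choice):

```python
import subprocess

import psutil

def sample():
    """Return (cpu %, ram %, gpu %, vram MiB) for one polling interval."""
    cpu = psutil.cpu_percent(interval=1.0)  # averaged over 1 s
    ram = psutil.virtual_memory().percent
    # Query the first GPU via nvidia-smi: utilization % and used memory in MiB
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ], text=True)
    gpu, vram = [float(x) for x in out.strip().split(",")]
    return cpu, ram, gpu, vram

if __name__ == "__main__":
    # Run this alongside `ollama run <model>` to capture resource usage over time
    while True:
        cpu, ram, gpu, vram = sample()
        print(f"CPU {cpu:.0f}%  RAM {ram:.0f}%  GPU {gpu:.0f}%  VRAM {vram:.0f} MiB")
```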

3. Ollama's Performance on H100

  • The DeepSeek-R1 14B model had the highest token throughput at 75.02 tokens/s, making it the most efficient choice for lighter LLM workloads.
  • Larger models such as Qwen 110B and LLaMA 3.3 70B pushed GPU utilization to 90-93%, with a corresponding drop in evaluation speed (roughly 20-24 tokens/s).
  • The H100 GPU handled 70B+ models efficiently, even under high workloads, making it ideal for LLM hosting and AI inference.

H100 vs. A100 for LLMs

  • The H100 significantly outperforms the A100 for Ollama benchmarks and LLM inference, thanks to its higher FLOPS and more advanced tensor cores.
  • If you need to run 70B-110B models, the H100 is the better choice, especially for real-time applications.
| Metric | Nvidia H100 | Nvidia A100 80GB |
|---|---|---|
| Architecture | Hopper | Ampere |
| CUDA Cores | 14,592 | 6,912 |
| Tensor Cores | 456 | 432 |
| Memory | 80GB HBM2e | 80GB HBM2 |
| FP32 TFLOPS | 183 | 19.5 |
| LLM Performance | 2x faster | Baseline |

H100 GPU Hosting for LLMs

Our dedicated H100 GPU server is optimized for LLM inference, fine-tuning, and deep learning workloads. With 80GB of HBM2e memory, it can efficiently handle 4-bit quantized models of up to roughly 110B parameters, as well as MoE builds such as Mixtral 8x22B.
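As a rough sanity check on what fits, you can estimate the VRAM footprint of a quantized model from its parameter count. The sketch below uses a back-of-the-envelope formula: the 4-bit weight size matches the quantization used in this benchmark, while the flat 8GB allowance for KV cache and runtime buffers is an assumption, not a measured value.

```python
def est_vram_gb(params_b: float, bits_per_weight: float = 4.0,
                overhead_gb: float = 8.0) -> float:
    """Rough VRAM estimate: quantized weight size plus a flat allowance for
    KV cache, activations, and runtime buffers (the overhead is an assumption)."""
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# Total parameter counts; Mixtral 8x22B has roughly 141B parameters in total
for name, params in [("qwen:110b", 110), ("llama3.3:70b", 70), ("mixtral:8x22b", 141)]:
    print(f"{name}: ~{est_vram_gb(params):.0f} GB needed vs 80 GB on the H100")
```

These estimates (about 63GB, 43GB, and 78GB respectively) line up with the 4-bit model sizes observed in the benchmark table above.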

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.

Enterprise GPU Dedicated Server - RTX A6000

$384.00/mo
30% OFF Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimized for AI, deep learning, data visualization, HPC, and similar workloads.

Enterprise GPU Dedicated Server - A100

$639.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Enterprise GPU Dedicated Server - H100

$1819.00/mo
30% OFF Recurring (Was $2599.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia H100
  • Microarchitecture: Hopper
  • CUDA Cores: 14,592
  • Tensor Cores: 456
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 183 TFLOPS

Summary and Recommendations

The Nvidia H100 GPU delivers outstanding performance for LLM inference and AI workloads. Running Ollama on an H100 server allows users to efficiently process large-scale AI models, with high throughput and low latency.

For anyone needing LLM hosting, H100 hosting, or high-performance AI computing, our dedicated H100 GPU server is the best choice.

Tags:

Nvidia H100, GPU server, LLM inference, Ollama AI Reasoning, large model, deep learning, GPU cloud computing, H100 vs A100, AI hosting