With 82.6 TFLOPS of FP32 compute, 16,384 CUDA cores, 512 Tensor Cores, and 24 GB of GDDR6X memory, the NVIDIA RTX 4090 outshines most consumer-grade GPUs in both compute capability and cost-efficiency.
Model | deepseek-r1 | deepseek-r1 | llama2 | llama3.1 | qwen2.5 | qwen2.5 | gemma2 | phi4 | qwq | llava |
---|---|---|---|---|---|---|---|---|---|---|
Parameters | 14B | 32B | 13B | 8B | 14B | 32B | 27B | 14B | 32B | 34B |
Size (GB) | 9 | 20 | 7.4 | 4.9 | 9 | 20 | 16 | 9.1 | 20 | 19 |
Quantization | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit | 4-bit |
Ollama Version | 0.5.7 | 0.5.7 | 0.5.4 | 0.5.4 | 0.5.4 | 0.5.4 | 0.5.4 | 0.5.4 | 0.5.4 | 0.5.4 |
Download Speed (MB/s) | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
CPU Utilization | 2% | 3% | 1% | 2% | 3% | 3% | 2% | 3% | 2% | 2% |
RAM Utilization | 3% | 3% | 3% | 3% | 3% | 3% | 3% | 3% | 3% | 3% |
GPU vRAM Utilization | 45% | 90% | 41% | 65% | 45% | 90% | 78% | 47% | 90% | 92% |
GPU Utilization | 95% | 98% | 92% | 94% | 96% | 97% | 96% | 97% | 99% | 97% |
Eval Rate (tokens/s) | 58.62 | 34.22 | 70.90 | 95.51 | 63.92 | 34.39 | 37.97 | 68.62 | 31.80 | 36.67 |
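The eval rates above come from Ollama's own timing output. As a minimal sketch of how to reproduce such a measurement, the snippet below queries a local Ollama instance (assumed to be running on the default port 11434) through its REST API and derives tokens/s from the `eval_count` and `eval_duration` fields of the response. The model tags and prompt are illustrative:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def measure_eval_rate(model: str, prompt: str) -> float:
    """Run one non-streaming generation and derive tokens/s from
    Ollama's reported eval_count / eval_duration (nanoseconds)."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds; convert to seconds.
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    # Example model tags matching the table; adjust to what you have pulled.
    for model in ["deepseek-r1:14b", "qwen2.5:32b", "llama3.1:8b"]:
        rate = measure_eval_rate(model, "Explain quantization in one paragraph.")
        print(f"{model}: {rate:.2f} tokens/s")
```

Averaging several runs with varied prompts gives a steadier number than a single generation, since the first request also pays model load time.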
Metric | Value Across Models |
---|---|
Download Speed | 12 MB/s for all models; 118 MB/s with the optional 1 Gbps bandwidth add-on. |
CPU Utilization | Stays low at 1–3%. |
RAM Utilization | Stays low at 2–4%. |
GPU vRAM Utilization | 41–92%; the larger the model, the higher the utilization. |
GPU Utilization | 92%+; consistently near saturation. |
Evaluation Speed | 30+ tokens/s across the board; models below roughly 36B parameters are recommended. |
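The GPU vRAM and utilization figures can be watched live while a model runs. Below is a minimal monitoring sketch, assuming the nvidia-ml-py (pynvml) bindings are installed; it polls the first GPU once per second:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU (the RTX 4090 here)

try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)      # bytes used / total
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
        vram_pct = 100 * mem.used / mem.total
        print(f"vRAM: {vram_pct:.0f}% ({mem.used / 2**30:.1f} GiB)  GPU: {util.gpu}%")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```

Running this in a second terminal while Ollama serves requests reproduces the vRAM and GPU utilization rows of the benchmark table.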
Available GPU dedicated server plans for comparison include:

- Enterprise GPU Dedicated Server - RTX 4090
- Enterprise GPU Dedicated Server - RTX A6000
- Enterprise GPU Dedicated Server - A100
- Multi-GPU Dedicated Server - 2x RTX 4090
The NVIDIA RTX 4090 benchmark showcases its exceptional ability to handle LLM inference workloads, particularly for small-to-medium models of up to roughly 32B parameters at 4-bit quantization. Paired with Ollama, this setup provides a robust, cost-effective solution for developers and enterprises seeking high performance without breaking the bank.
For larger models and extended scalability, however, GPUs with more vRAM, such as the NVIDIA A100, or multi-GPU configurations may be necessary; the rough sizing estimate below shows why. For most practical use cases, though, the RTX 4090 hosting solution is a game-changer in the world of LLM inference.
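As a back-of-envelope check on that claim, the sketch below estimates weight-only vRAM footprints at 4-bit quantization. The 20% overhead factor for KV cache and runtime buffers is an assumption, not a measured value:

```python
def estimated_vram_gb(params_billions: float, bits: int = 4,
                      overhead: float = 1.2) -> float:
    """Weight-only footprint: parameters x (bits / 8) bytes, inflated by
    ~20% for KV cache and runtime buffers (the 1.2 factor is an assumption)."""
    return params_billions * bits / 8 * overhead

# 14B and 32B land near the ~9 GB and ~20 GB sizes in the benchmark table;
# 70B overflows the RTX 4090's 24 GB and calls for an A100 or a second GPU.
for p in (14, 32, 70):
    est = estimated_vram_gb(p)
    verdict = "fits" if est <= 24 else "exceeds"
    print(f"{p}B @ 4-bit ~= {est:.1f} GB -> {verdict} 24 GB")
```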