Why VRAM Requirements Vary
VRAM usage depends on several factors:
- Precision (FP16, FP8, INT8, INT4, GPTQ, AWQ, GGUF, etc.): Lower precision significantly reduces VRAM usage.
- Serving framework (vLLM, TensorRT-LLM, Hugging Face Transformers): Some runtimes use more KV-cache or pre-allocated memory.
- Context length: Longer context needs more KV-cache VRAM.
- Batch size / concurrent requests: Larger batches increase memory usage.
- Model architecture: Mixture-of-Experts vs. dense models will differ.
All VRAM values below assume typical vLLM or Transformers inference with standard batching and context lengths of 2K–4K tokens. Actual usage will vary with the setup.
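As a rough cross-check on the precision factor, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is a minimal illustration, not a sizing tool: `BYTES_PER_PARAM` and `weight_memory_gb` are hypothetical names introduced here, and the estimate covers weights only, ignoring KV-cache, activations, and runtime overhead.

```python
# Approximate bytes per parameter for each precision (weights only).
# Assumption: INT4 is treated as exactly 0.5 bytes; real GPTQ/AWQ files
# carry extra scales and zero-points, which this sketch ignores.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Return the approximate size of the model weights in GB (10^9 bytes)."""
    # (params_billion * 1e9 params) * bytes / 1e9 bytes-per-GB
    return params_billion * BYTES_PER_PARAM[precision]


for size in (7, 33, 70):
    row = ", ".join(
        f"{p}: {weight_memory_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{size}B -> {row}")
```

For a 7B model at FP16 this gives 14 GB of weights alone, so real deployments need several GB on top once cache and overhead are included.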
VRAM Requirements by Model Size
7B Models
| Precision | Minimum VRAM | Notes |
|---|---|---|
| FP16 | ~16–20 GB | Usually fits on a single A100 40GB, L40, or RTX 4090 |
| FP8 / INT8 | ~10–14 GB | Good balance of quality and memory |
| INT4 / GPTQ | ~6–8 GB | Fits on mid-range GPUs (RTX 3060 12GB, 4060 Ti 16GB) |
| GGUF (CPU) | 0 VRAM | VRAM not needed if running CPU-only |
Practical Recommendation
If you want smooth 7B inference with moderate batch size, 12–24 GB VRAM is ideal.
33B Models
| Precision | Minimum VRAM | Notes |
|---|---|---|
| FP16 | ~70–80 GB | Requires multi-GPU (2×40GB) or a single 80GB GPU |
| FP8 / INT8 | ~45–60 GB | Fits on A100 80GB / H100 / some multi-GPU setups |
| INT4 / GPTQ | ~20–28 GB | Can run on a single 24GB–32GB GPU with limits |
| GGUF (CPU) | 0 VRAM | CPU inference possible but slower |
Practical Recommendation
For stable hosting or API workloads, 48–80 GB VRAM is preferred.
INT4 quantization makes 33B models usable on a single L40S or RTX 4090, but throughput may be limited.
70B Models
| Precision | Minimum VRAM | Notes |
|---|---|---|
| FP16 | ~140–160 GB | Requires 2×80GB or 4×40GB GPUs |
| FP8 / INT8 | ~110–130 GB | Still multi-GPU only |
| INT4 / GPTQ | ~40–48 GB | Fits on a single 48GB GPU (A6000 / RTX 6000 Ada) |
| GGUF (CPU) | 0 VRAM | CPU inference works but is very slow |
Practical Recommendation
For production use, plan for 80–160 GB of VRAM, depending on precision.
INT4 makes it possible to run a 70B model on a single 48GB GPU, though performance is modest.
What About Context Length?
The KV-cache grows linearly with context length: doubling the context doubles the cache memory.
Typical VRAM overhead for KV-cache:
| Model | +2K Tokens | +4K Tokens | +8K Tokens |
|---|---|---|---|
| 7B | ~1–2 GB | ~3 GB | ~6 GB |
| 33B | ~3–5 GB | ~6–8 GB | ~12–14 GB |
| 70B | ~5–8 GB | ~10–12 GB | ~20+ GB |
If your workload needs 8K–32K context, plan additional VRAM accordingly.
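The linear growth can be made concrete with the usual per-token estimate for a decoder-only transformer: two tensors (keys and values) per layer, each of size KV heads times head dimension. The helper below is a sketch under stated assumptions: `kv_cache_gb` is a hypothetical name, the cache is assumed to be stored in FP16, and the 7B-class shape (32 layers, 32 KV heads, head dimension 128) is illustrative rather than tied to a specific checkpoint.

```python
def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # assumption: FP16 cache entries
) -> float:
    """Approximate KV-cache size in GiB for a decoder-only transformer."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * batch_size / 1024**3


# Illustrative 7B-class shape (an assumption, see the text above).
for tokens in (2048, 4096, 8192):
    print(tokens, kv_cache_gb(32, 32, 128, tokens))
```

With these assumptions a single sequence needs about 1, 2, and 4 GiB at 2K, 4K, and 8K tokens. The figure scales with `batch_size`, so concurrent requests multiply it. Models that use grouped-query attention have fewer KV heads, which shrinks the cache proportionally.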
Multi-GPU Considerations
Large models can be split across GPUs using:
- Tensor Parallelism (TP)
- Pipeline Parallelism (PP)
- ZeRO inference
- vLLM multi-GPU offloading
For example:
- 70B FP16 on 2×80GB GPUs → Fully supported
- 33B FP16 on 2×40GB GPUs → Smooth
- 7B models rarely need multi-GPU unless doing training or large batch serving.
In hosting environments, we usually distribute model weights across GPUs rather than cutting batch sizes.
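A quick way to sanity-check a multi-GPU plan is to divide the total footprint evenly across GPUs and compare each share with the memory a GPU can actually use. This is an idealized sketch: `per_gpu_gb` and `fits` are hypothetical helpers, the even split ignores communication buffers and per-GPU duplication, and the 0.9 usable fraction is an assumed headroom, not a measured value.

```python
def per_gpu_gb(total_model_gb: float, num_gpus: int) -> float:
    """Even split of the total memory footprint across GPUs (idealized)."""
    return total_model_gb / num_gpus


def fits(
    total_model_gb: float,
    num_gpus: int,
    gpu_gb: float,
    usable_fraction: float = 0.9,  # assumption: keep ~10% headroom per GPU
) -> bool:
    """True if each GPU's share fits in the usable part of its memory."""
    return per_gpu_gb(total_model_gb, num_gpus) <= gpu_gb * usable_fraction


# Low and high ends of the 70B FP16 estimate on 2x80GB.
print(fits(140, 2, 80), fits(160, 2, 80))
# Low end of the 33B FP16 estimate on 2x40GB.
print(fits(70, 2, 40))
```

Under these assumptions, the low end of each FP16 range fits with headroom, while the high end does not, which is a reminder to size for the upper estimate when context or batch size is large.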
Recommended Setups by Model Size
For 7B Model Hosting (API, Chat, Apps)
- Ideal: 24–48 GB GPU
- Minimum: 12GB (INT4)
Good GPU examples: L40S 48GB, RTX 4090, A100 40GB.
For 33B Model Hosting
- Ideal: 80GB GPU
- Minimum: 24–32GB (INT4)
Good GPU examples: A100 80GB, H100 80GB, 2×A100 40GB.
For 70B Model Hosting
- Ideal: 2×80GB or single 80GB FP8/INT8
- Minimum: 48GB INT4
Good GPU examples: A100 80GB, H100 80GB, RTX 6000 Ada (48GB).
Summary Table
| Model Size | FP16 | INT8 | INT4 | Practical VRAM Needed |
|---|---|---|---|---|
| 7B | 16–20 GB | 10–14 GB | 6–8 GB | 12–24 GB |
| 33B | 70–80 GB | 45–60 GB | 20–28 GB | 48–80 GB |
| 70B | 140–160 GB | 110–130 GB | 40–48 GB | 80–160 GB |
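The summary table can also be encoded as a small lookup to answer "which precisions fit in the VRAM I have?". The snippet is a sketch only: `VRAM_GB` and `precisions_that_fit` are hypothetical names, and using the upper end of each range as the requirement is a conservative assumption made here.

```python
# Upper end of each range from the summary table, in GB.
VRAM_GB = {
    "7B": {"FP16": 20, "INT8": 14, "INT4": 8},
    "33B": {"FP16": 80, "INT8": 60, "INT4": 28},
    "70B": {"FP16": 160, "INT8": 130, "INT4": 48},
}


def precisions_that_fit(model: str, available_gb: float) -> list[str]:
    """Return the precisions whose upper-bound estimate fits in available_gb."""
    return [p for p, need in VRAM_GB[model].items() if need <= available_gb]


print(precisions_that_fit("70B", 48))
print(precisions_that_fit("33B", 80))
print(precisions_that_fit("7B", 12))
```

Passing the total VRAM across all GPUs as `available_gb` gives a first approximation for multi-GPU setups. It does not account for how evenly the model splits.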
Final Thoughts
Choosing the right VRAM depends on your goals:
- Light chat or small apps → 7B is enough and runs on mid-range GPUs.
- More advanced reasoning → 33B models are a good balance.
- High-end performance similar to GPT-4-class reasoning → 70B with large VRAM or multi-GPU setups.
If your application demands stable inference, consistent latency, and multi-user concurrency, plan for more VRAM than the absolute minimum, especially when using longer contexts or higher batch sizes.
