How Much GPU VRAM Do You Need for 7B, 33B, or 70B LLMs?

When deploying large language models (LLMs) such as LLaMA, Mistral, Qwen, DeepSeek, or similar architectures, GPU memory (VRAM) is usually the most important resource to consider. VRAM determines whether your model can load, how fast it can run, and how many concurrent requests you can support.

This article provides practical VRAM guidelines for 7B, 33B, and 70B models under common precision formats used in real-world hosting environments.

Why VRAM Requirements Vary

VRAM usage depends on several factors:

  • Precision and quantization format (FP16, FP8, INT8, INT4, GPTQ, AWQ, GGUF, etc.)
    Lower precision significantly reduces VRAM usage.

  • Serving Framework (vLLM, TensorRT-LLM, Hugging Face Transformers)
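    Runtimes differ in how much memory they reserve up front. vLLM, for example, pre-allocates a configurable fraction of GPU memory (its gpu_memory_utilization setting) for weights and KV-cache.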

  • Context Length
    Longer context needs more KV-cache VRAM.

  • Batch Size / Concurrent Requests
    Larger batches increase memory usage.

  • Model Architecture
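    Mixture-of-Experts (MoE) and dense models differ: an MoE model activates only some experts per token but must still keep every expert's weights in memory.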

All VRAM values below assume typical vLLM or Transformers inference with standard batching and context lengths of 2K–4K tokens. Treat them as estimates: actual usage shifts with the runtime, batch size, and context length described above.
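As a rough cross-check on the figures that follow, the memory taken by the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes 2 bytes for FP16, 1 byte for FP8/INT8, and 0.5 bytes for INT4; the function name and the table of byte sizes are illustrative, and real quantized formats store extra scaling data, so actual files run somewhat larger.

```python
# Approximate storage per parameter, in bytes (assumed values; quantized
# formats add some overhead for scales and zero-points that is ignored here).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Return the approximate size of the weights alone, in GB (10^9 bytes)."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB simplifies to:
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (7, 33, 70):
        row = ", ".join(
            f"{p}: {weight_memory_gb(size, p):.1f} GB"
            for p in ("fp16", "int8", "int4")
        )
        print(f"{size}B -> {row}")
```

The gap between these weight-only numbers and the larger minimums quoted in this article is the room needed for KV-cache, activations, and runtime overhead.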

VRAM Requirements by Model Size

7B Models

| Precision | Minimum VRAM | Notes |
|---|---|---|
| FP16 | ~16–20 GB | Usually fits on a single A100 40GB, L40, or RTX 4090 |
| FP8 / INT8 | ~10–14 GB | Good balance of quality and memory |
| INT4 / GPTQ | ~6–8 GB | Fits on mid-range GPUs (RTX 3060 12GB, 4060 Ti 16GB) |
| GGUF (CPU) | 0 VRAM | VRAM not needed if running CPU-only |

Practical Recommendation

If you want smooth 7B inference with moderate batch size, 12–24 GB VRAM is ideal.

33B Models

| Precision | Minimum VRAM | Notes |
|---|---|---|
| FP16 | ~70–80 GB | Requires multi-GPU (2×40GB) or a single 80GB GPU |
| FP8 / INT8 | ~45–60 GB | Fits on A100 80GB / H100 / some multi-GPU setups |
| INT4 / GPTQ | ~20–28 GB | Can run on a single 24GB–32GB GPU with limits |
| GGUF (CPU) | 0 VRAM | CPU inference possible but slower |

Practical Recommendation

For stable hosting or API workloads, 48–80 GB VRAM is preferred.
INT4 quantization makes 33B models usable on a single L40S (48GB) or RTX 4090 (24GB), but throughput may be limited, especially on the 24GB card.

70B Models

| Precision | Minimum VRAM | Notes |
|---|---|---|
| FP16 | ~140–160 GB | Requires 2×80GB or 4×40GB GPUs |
| FP8 / INT8 | ~110–130 GB | Still multi-GPU only |
| INT4 / GPTQ | ~40–48 GB | Fits on a single 48GB GPU (A6000 / RTX 6000 Ada) |
| GGUF (CPU) | 0 VRAM | CPU inference works but is very slow |

Practical Recommendation

For production use, plan for 80–160 GB of VRAM, depending on precision.
INT4 makes it possible to run a 70B model on a single 48GB GPU, though performance is modest.

What About Context Length?

KV-cache grows linearly with context: doubling the context length doubles the cache memory. It also scales with the number of sequences processed concurrently.
Typical VRAM overhead for KV-cache:

| Model | +2K Tokens | +4K Tokens | +8K Tokens |
|---|---|---|---|
| 7B | ~1–2 GB | ~3 GB | ~6 GB |
| 33B | ~3–5 GB | ~6–8 GB | ~12–14 GB |
| 70B | ~5–8 GB | ~10–12 GB | ~20+ GB |

If your workload needs 8K–32K context, plan additional VRAM accordingly.
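The linear growth can be made concrete with the usual per-token accounting: each layer stores one key and one value vector per KV head for every token. The sketch below is a minimal estimate under that assumption. The function name is illustrative, and the shape used in the example (32 layers, 32 KV heads, head dimension 128, FP16 cache) is an assumed 7B-class dense configuration, so check your model's config for the real values.

```python
def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes per value for an FP16 cache
) -> float:
    """Estimate KV-cache size in GiB for the given model shape and workload."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * batch_size / 1024**3


if __name__ == "__main__":
    # Assumed 7B-class shape with full multi-head attention.
    for tokens in (2048, 4096, 8192):
        print(f"{tokens} tokens -> {kv_cache_gib(32, 32, 128, tokens):.1f} GiB")
```

Models that use grouped-query attention have fewer KV heads than attention heads, which shrinks the cache proportionally, so the same context length can cost noticeably less on those architectures.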

Multi-GPU Considerations

Large models can be split across GPUs using:

  • Tensor Parallelism (TP)
  • Pipeline Parallelism (PP)
  • ZeRO-Inference (DeepSpeed)
  • vLLM's built-in multi-GPU serving (tensor and pipeline parallelism)

For example:

  • 70B FP16 on 2×80GB GPUs → Fits, but leaves little headroom for KV-cache
  • 33B FP16 on 2×40GB GPUs → Fits, but gets tight at longer contexts
  • 7B models rarely need multi-GPU unless doing training or large batch serving.

In hosting environments, we usually distribute model weights across GPUs rather than cutting batch sizes.
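To see how much room a multi-GPU split leaves, it helps to divide the total requirement by the number of GPUs and compare the share with each card's capacity. The sketch below assumes a perfectly even split and ignores per-GPU duplication such as communication buffers, so real headroom will be smaller; the function name is illustrative.

```python
def per_gpu_headroom_gb(required_gb: float, gpu_gb: float, num_gpus: int) -> float:
    """Return free VRAM per GPU (GB) after an even split of the requirement.

    A negative result means the model does not fit on this configuration.
    """
    share = required_gb / num_gpus  # idealized even split across GPUs
    return gpu_gb - share


if __name__ == "__main__":
    # Lower-bound FP16 estimates from this article's tables.
    print("70B FP16 on 2x80GB:", per_gpu_headroom_gb(140, 80, 2), "GB free per GPU")
    print("33B FP16 on 2x40GB:", per_gpu_headroom_gb(70, 40, 2), "GB free per GPU")
    print("70B FP16 on 4x40GB:", per_gpu_headroom_gb(140, 40, 4), "GB free per GPU")
```

Whatever headroom remains is what the KV-cache can grow into, which is why configurations that only just fit tend to limit context length and concurrency.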

For 7B Model Hosting (API, Chat, Apps)

  • Ideal: 24–48 GB GPU
  • Minimum: 12GB (INT4)

Good GPU examples: L40S 48GB, RTX 4090, A100 40GB.

For 33B Model Hosting

  • Ideal: 80GB GPU
  • Minimum: 24–32GB (INT4)

Good GPU examples: A100 80GB, H100 80GB, 2×A100 40GB.

For 70B Model Hosting

  • Ideal: 2×80GB (FP16, or FP8/INT8 with room for KV-cache)
  • Minimum: 48GB INT4

Good GPU examples: A100 80GB, H100 80GB, RTX 6000 Ada (48GB).

Summary Table

| Model Size | FP16 | INT8 | INT4 | Practical VRAM Needed |
|---|---|---|---|---|
| 7B | 16–20 GB | 10–14 GB | 6–8 GB | 12–24 GB |
| 33B | 70–80 GB | 45–60 GB | 20–28 GB | 48–80 GB |
| 70B | 140–160 GB | 110–130 GB | 40–48 GB | 80–160 GB |
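The summary figures can also be used as a quick lookup when sizing hardware. The sketch below copies the ranges from the summary table into a dictionary and returns the precisions whose upper estimate fits within a given amount of total VRAM. The names are illustrative, and comparing against the upper bound is a deliberately conservative assumption.

```python
# (low, high) VRAM estimates in GB, copied from the summary table.
VRAM_TABLE = {
    "7B": {"FP16": (16, 20), "INT8": (10, 14), "INT4": (6, 8)},
    "33B": {"FP16": (70, 80), "INT8": (45, 60), "INT4": (20, 28)},
    "70B": {"FP16": (140, 160), "INT8": (110, 130), "INT4": (40, 48)},
}


def precisions_that_fit(model_size: str, total_vram_gb: float) -> list[str]:
    """List precisions whose upper VRAM estimate fits in the available VRAM."""
    return [
        precision
        for precision, (_low, high) in VRAM_TABLE[model_size].items()
        if high <= total_vram_gb
    ]


if __name__ == "__main__":
    print(precisions_that_fit("7B", 24))    # single 24GB card
    print(precisions_that_fit("33B", 80))   # single 80GB card
    print(precisions_that_fit("70B", 48))   # single 48GB card
```

An empty list means none of the tabulated precisions fits, which points to either a smaller model or more GPUs.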

Final Thoughts

Choosing the right VRAM depends on your goals:

  • Light chat or small apps → 7B is enough and runs on mid-range GPUs.
  • More advanced reasoning → 33B models are a good balance.
  • Highest output quality of the three sizes → 70B with large VRAM or multi-GPU setups.

If your application demands stable inference, consistent latency, and multi-user concurrency, plan for more VRAM than the absolute minimum, especially when using longer contexts or higher batch sizes.
