How to Speed Up Ollama Performance



Introduction

By default, Ollama unloads a model from memory after 5 minutes of inactivity. The first request after an idle period therefore has to reload the model, which can take several seconds before any tokens are generated. To keep inference latency consistent, you can set the OLLAMA_KEEP_ALIVE environment variable so that models stay resident in memory for longer, or indefinitely.

This article explains the purpose of the OLLAMA_KEEP_ALIVE environment variable and how to configure it to significantly reduce the initial response time for your inferences with Ollama.

What is OLLAMA_KEEP_ALIVE?

OLLAMA_KEEP_ALIVE is an environment variable that specifies the duration for which a model stays loaded in memory after its last use. By default, models are kept in memory for 5 minutes (5m) before being unloaded. If you frequently interact with a model, keeping it loaded avoids the overhead of reloading it from disk each time.

You can set OLLAMA_KEEP_ALIVE to:

A time duration (e.g., 10m, 2h, 24h)
0 to disable keep-alive (models unload immediately)
A negative value (e.g., -1) to keep models loaded indefinitely

How to Set OLLAMA_KEEP_ALIVE

1. For a Single Session (Temporary)

If you're running Ollama directly from the command line, you can set the variable for that session:

OLLAMA_KEEP_ALIVE=24h ollama serve

This keeps all models loaded for 24 hours after their last use.

2. System-wide (Persistent, Linux/Windows)

On Linux with systemd:
If Ollama runs as a systemd service (common in Linux installations), you should configure the environment variable via systemd:

sudo systemctl edit ollama.service

Then add the following content:

[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"

Reload and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

On Windows with PowerShell:

[Environment]::SetEnvironmentVariable("OLLAMA_KEEP_ALIVE", "-1", "User")
# Quit any running Ollama instance, open a new PowerShell window so the variable is picked up, then start Ollama
ollama serve

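With OLLAMA_KEEP_ALIVE set to -1, Ollama keeps loaded models in memory indefinitely. Because the variable is stored in the systemd unit or in the Windows user environment, the setting persists across reboots.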

3. Docker

If you're using Ollama in a Docker container, pass the variable using the -e flag:

docker run -e OLLAMA_KEEP_ALIVE=24h -d --gpus=all -p 11434:11434 ollama/ollama

4. Verify keep-alive status

After starting the service and running a model once, check running models with:

ollama ps

Expected output example:

NAME                                                 ID              SIZE      PROCESSOR          CONTEXT    UNTIL
hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q5_K_M    5990780e3b75    2.1 GB    10%/90% CPU/GPU    4096       Forever

A value of Forever in the UNTIL column confirms that the model is pinned in memory and will not be unloaded automatically.
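For scripted checks, the same information can be filtered out of the ollama ps output. The sketch below uses a hypothetical helper, pinned_models, which is not part of Ollama. It assumes the column layout shown above (NAME first, UNTIL last) and prints the name of every model whose UNTIL value is Forever.

```shell
# pinned_models: hypothetical helper, not an Ollama command.
# Reads `ollama ps` output on stdin, skips the header row, and prints the
# first column (NAME) of every row whose last field is "Forever".
# Assumes UNTIL is the last column, as in the sample output above.
pinned_models() {
  awk 'NR > 1 && $NF == "Forever" { print $1 }'
}

# Usage (needs a running Ollama server, so it is left commented out):
# ollama ps | pinned_models
```

An empty result means no model is currently pinned, either because nothing has been loaded yet or because the server is still using a finite keep-alive.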

Performance Comparison

Tested on Llama-3.2-1B-Instruct-GGUF:Q5_K_M with 50-token generation.

Scenario	First call latency	Next call latency	Memory behavior
Default (model unloaded)	8.4s (load + infer)	8.2s again after timeout	Memory freed after 5 min
Default (within 5 min)	0.7s	–	Model cached temporarily
Permanent (`-1`)	0.7s	0.7s at any time	Model always in memory

Setting a longer OLLAMA_KEEP_ALIVE value (e.g., 24h or -1) significantly improves inference response speed for repeated or frequent requests because the model remains in GPU or system memory. However, this also increases memory usage, so choose a duration that balances performance and resource constraints.
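One way to check on your own hardware whether a request paid the reload cost is the load_duration field, which Ollama's /api/generate endpoint reports in nanoseconds when called with "stream": false. The sketch below is illustrative only: load_ms is a hypothetical helper, the extraction assumes the JSON response arrives as a single line, and the URL assumes the default port 11434.

```shell
# load_ms: hypothetical helper, not an Ollama command.
# Reads a non-streaming /api/generate JSON response on stdin, extracts the
# load_duration value (nanoseconds), and prints it in whole milliseconds.
load_ms() {
  ns="$(sed -n 's/.*"load_duration":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1)"
  if [ -z "$ns" ]; then
    echo "load_duration not found in response" >&2
    return 1
  fi
  echo $(( $ns / 1000000 ))
}

# Usage (needs a running Ollama server, so it is left commented out):
# curl -s http://localhost:11434/api/generate \
#   -d '{"model": "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q5_K_M", "prompt": "Hello", "stream": false}' \
#   | load_ms
```

Running this once after a long idle period and again straight afterwards should show a large value for the cold call and a much smaller one while the model is still resident.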

Summary

The OLLAMA_KEEP_ALIVE environment variable applies globally to all models served by the Ollama instance.
The time format follows Go’s time.ParseDuration syntax: use s (seconds), m (minutes), h (hours), etc.
This variable sets the server-wide default; a keep_alive parameter in an individual API request (/api/generate or /api/chat) takes precedence over it for that request.
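As a sketch of the per-request keep_alive parameter mentioned above: the API accepts either a number of seconds or a duration string, and a request that names only a model (no prompt) loads that model without generating text. The helper keep_alive_payload below is hypothetical and only builds the JSON body. The URL assumes the default port 11434.

```shell
# keep_alive_payload: hypothetical helper that prints a minimal JSON body
# for /api/generate, setting only the model and its keep_alive value.
# Plain integers (-1, 0, 300) are emitted as JSON numbers (seconds);
# anything else (10m, 24h) is emitted as a quoted duration string.
keep_alive_payload() {
  model="$1"
  keep_alive="$2"
  case "$keep_alive" in
    ''|*[!0-9-]*|?*-*|-)  # not a plain integer: send as a string
      printf '{"model": "%s", "keep_alive": "%s"}\n' "$model" "$keep_alive" ;;
    *)                    # plain integer: send as a number
      printf '{"model": "%s", "keep_alive": %s}\n' "$model" "$keep_alive" ;;
  esac
}

# Usage (needs a running Ollama server, so it is left commented out):
# Preload a model and pin it in memory:
# curl -s http://localhost:11434/api/generate -d "$(keep_alive_payload llama3.2 -1)"
# Unload the same model immediately:
# curl -s http://localhost:11434/api/generate -d "$(keep_alive_payload llama3.2 0)"
```

The model name llama3.2 is a placeholder; substitute any model that is already pulled on your server.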

By tuning OLLAMA_KEEP_ALIVE, you can optimize your Ollama setup for low-latency, interactive applications without unnecessary model reloads.
