Introduction
By default, Ollama unloads a model from memory after 5 minutes of inactivity. The first request after a longer idle period must therefore reload the model, which can add several seconds to the response. To keep inference latency consistent, you can set the OLLAMA_KEEP_ALIVE environment variable so that models stay resident in memory for longer, or indefinitely.
This article explains the purpose of the OLLAMA_KEEP_ALIVE environment variable and how to configure it to significantly reduce the initial response time for your inferences with Ollama.
What is OLLAMA_KEEP_ALIVE?
OLLAMA_KEEP_ALIVE is an environment variable that specifies the duration for which a model stays loaded in memory after its last use. By default, models are kept in memory for 5 minutes (5m) before being unloaded. If you frequently interact with a model, keeping it loaded avoids the overhead of reloading it from disk each time.
You can set OLLAMA_KEEP_ALIVE to:
- A time duration (e.g., 10m, 2h, 24h)
- 0 to disable keep-alive (models unload immediately)
- A negative value (e.g., -1) to keep models loaded indefinitely
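To make the three cases concrete, here is a minimal Python sketch that classifies a keep-alive value the way this article describes it. It is an illustration only, not Ollama's own parser: the function name `describe_keep_alive` is hypothetical, only the s, m, and h units are handled, and treating a bare number as seconds is an assumption.

```python
import re

# Seconds per unit, covering only the units named in this article.
_UNITS = {"s": 1, "m": 60, "h": 3600}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([smh])")


def describe_keep_alive(value: str) -> str:
    """Describe what a keep-alive value means (illustrative, not Ollama's parser)."""
    text = value.strip()
    negative = text.startswith("-")
    body = text.lstrip("+-")
    if body.isdigit():
        # Assumption: a bare number is a count of seconds.
        seconds = float(body)
    else:
        tokens = _TOKEN.findall(body)
        # Reject anything that is not fully made of <number><unit> pairs.
        if not tokens or "".join(n + u for n, u in tokens) != body:
            raise ValueError(f"unrecognized duration: {value!r}")
        seconds = sum(float(n) * _UNITS[u] for n, u in tokens)
    if negative and seconds > 0:
        return "keep loaded indefinitely"
    if seconds == 0:
        return "unload immediately"
    return f"unload after {int(seconds)} seconds of inactivity"
```

For example, `describe_keep_alive("24h")` reports an unload after 86400 seconds of inactivity, while `describe_keep_alive("-1")` reports that the model stays loaded indefinitely.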
How to Set OLLAMA_KEEP_ALIVE
1. For a Single Session (Temporary)
If you're running Ollama directly from the command line, you can set the variable for that session:
OLLAMA_KEEP_ALIVE=24h ollama serve
This keeps all models loaded for 24 hours after their last use.
2. System-wide (Persistent, Linux/Windows)
On Linux with Systemd:
If Ollama runs as a systemd service (common in Linux installations), you should configure the environment variable via systemd:
sudo systemctl edit ollama.service
Then add the following content:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
Reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
On Windows with PowerShell:
[Environment]::SetEnvironmentVariable("OLLAMA_KEEP_ALIVE", "-1", "User")
# Restart PowerShell and then start Ollama
ollama serve
Ollama will now keep models in memory indefinitely. Because the variable is stored in the service configuration on Linux or in the user environment on Windows, the setting persists across reboots.
3. Docker
If you're using Ollama in a Docker container, pass the variable using the -e flag:
docker run -e OLLAMA_KEEP_ALIVE=24h -d --gpus=all -p 11434:11434 ollama/ollama
4. Verify keep-alive status
After starting the service and running a model once, check running models with:
ollama ps
Example output:
NAME                                                 ID              SIZE      PROCESS            CONTEXT    UNTIL
hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q5_K_M    5990780e3b75    2.1 GB    10%/90% CPU/GPU    4096       Forever
An UNTIL value of Forever confirms that the model is pinned in memory and will not be unloaded automatically.
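If you want to automate this check, for example in a startup script, the sketch below scans the text printed by `ollama ps` for rows that end in Forever. It assumes the layout shown in the sample above (a header line, NAME as the first column, UNTIL as the last); the helper names `pinned_models` and `read_ollama_ps` are hypothetical.

```python
import subprocess


def pinned_models(ps_output: str) -> list:
    """Return the names of models whose row ends with "Forever".

    Assumes a header line followed by one row per model, with NAME as the
    first column and UNTIL as the last, as in the sample output above.
    """
    rows = [line for line in ps_output.splitlines() if line.strip()][1:]
    return [row.split()[0] for row in rows if row.rstrip().endswith("Forever")]


def read_ollama_ps() -> str:
    """Run `ollama ps` and return its text output (requires the ollama CLI)."""
    result = subprocess.run(
        ["ollama", "ps"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `pinned_models(read_ollama_ps())` on a machine with Ollama running returns the list of models that are currently pinned; an empty list means nothing is being kept indefinitely.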
Performance Comparison
Tested on Llama-3.2-1B-Instruct-GGUF:Q5_K_M with 50-token generation.
| Scenario | First call latency | Next call latency | Memory behavior |
|---|---|---|---|
| Default (model unloaded) | 8.4s (load + infer) | 8.2s again after timeout | Memory freed after 5 min |
| Default (within 5 min) | 0.7s | – | Model cached temporarily |
| Permanent (-1) | 0.7s | 0.7s at any time | Model always in memory |
Setting a longer OLLAMA_KEEP_ALIVE value (e.g., 24h or -1) significantly improves inference response speed for repeated or frequent requests because the model remains in GPU or system memory. However, this also increases memory usage, so choose a duration that balances performance and resource constraints.
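To run a similar comparison on your own hardware, a small timing helper is enough. The sketch below is one possible approach and not necessarily how the figures above were collected; `time_calls` is a hypothetical name, and the callable you pass in is assumed to send exactly one request to your Ollama server.

```python
import time
from typing import Callable, List


def time_calls(call: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run `call` repeatedly and return each wall-clock duration in seconds.

    With a cold server, the first entry includes the model load time, while
    later entries reflect inference on an already loaded model.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations
```

Comparing the first duration with the later ones, once after a fresh start and once after the keep-alive window has expired, shows how much of the latency is load time on your system.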
Summary
- The OLLAMA_KEEP_ALIVE environment variable applies globally to all models served by the Ollama instance.
- The time format follows Go’s time.ParseDuration syntax: use s (seconds), m (minutes), h (hours), etc.
- This setting only defines the server-wide default; a keep_alive parameter in an individual API request takes precedence over it for that request.
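The per-request keep_alive parameter from the last point can be set as in the following sketch, which builds a non-streaming request for the /api/generate endpoint using only the Python standard library. The server address is assumed to be the default port used in the Docker command earlier, and the helper names `build_generate_request` and `send` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, keep_alive) -> urllib.request.Request:
    """Build a non-streaming /api/generate request with a per-request keep_alive.

    keep_alive may be a duration string such as "10m" or a number of seconds
    (0 unloads right away, a negative number keeps the model loaded).
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With a server running, `send(build_generate_request("<your-model>", "Hello", "10m"))` asks Ollama to keep that model loaded for ten minutes after this request, regardless of the global setting.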
By tuning OLLAMA_KEEP_ALIVE, you can optimize your Ollama setup for low-latency, interactive applications without unnecessary model reloads.
