Was 30m, which evicts models after 30 minutes of inactivity and forces a reload penalty on the next request. Setting -1 holds models in VRAM indefinitely; MAX_LOADED_MODELS=3 caps how many can stay resident simultaneously (up from the previous 2). Raise MAX_LOADED_MODELS if you're rotating between more than three models AND your GPU has the VRAM for it; the comment in the compose file explains the trade-off.

For the live srvno.de stack: OLLAMA_KEEP_ALIVE=-1 takes effect on the next `docker compose up -d ollama`. Loaded models survive the restart only if they're re-requested before swap-out anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
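
For orientation, the settings described in this commit might look roughly like the fragment below. This is a sketch, not the actual compose file: the `ollama` service name comes from the `docker compose up -d ollama` command above, the `ollama/ollama` image is an assumption, and `OLLAMA_MAX_LOADED_MODELS` is assumed to be the variable the message abbreviates as MAX_LOADED_MODELS.

```yaml
services:
  ollama:
    image: ollama/ollama  # assumed image; the real stack may pin a tag or use another build
    environment:
      # -1 keeps loaded models in VRAM indefinitely (previously 30m,
      # which evicted after 30 minutes of inactivity).
      - OLLAMA_KEEP_ALIVE=-1
      # At most 3 models resident at once (previously 2). Raise only if
      # more models are in rotation AND the GPU has the VRAM for them.
      - OLLAMA_MAX_LOADED_MODELS=3
```

After editing the file, `docker compose up -d ollama` recreates the service so the new environment applies.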