prompt-cache-keepalive
Your LLM prompt cache has a 5-minute memory. Give it a heartbeat.

Provider-agnostic keepalive for LLM prompt caches — with a cost model that
proves the savings instead of asserting them.
The core insight. Autonomous loops should wake at intervals under 270s (or
send a keepalive touch) while idle. The Anthropic prompt cache has a ~5-minute
(300s) TTL, so a wake under ~270s keeps the prefix warm and the next
turn reads cache instead of re-creating it. Keepalive is worth it whenever
the session's resume probability exceeds the breakeven, which is 17.4% for a
9-minute idle covered by two touches (see the breakeven).
pip install prompt-cache-keepalive
The 5-minute problem
Modern LLM APIs let you cache a long conversation prefix: pay full price
once, then ~0.1× on every later turn. But the cache has a short idle TTL —
Anthropic's is ~5 minutes, sliding. Step away for a coffee, and the cached
prefix is evicted on the provider's servers. Your next turn reprocesses the
entire prefix at full price.
without keepalive               with keepalive
─────────────────               ──────────────
turn ─[ cache written ]         turn ─[ cache written ]
  ·  idle 6 min                   ♥ touch (every 270s)
  ✗  evicted                      ♥ touch   prefix stays warm
resume ████ full reprocess      resume ─[ cache read ]
  ~110,000 tok-equiv              ~8,800 tok-equiv
A touch is a minimal request that re-uses the cached prefix. A cache read
refreshes the TTL — so the prefix never goes cold and your next real turn is
cheap.
60-second start
import anthropic  # only your code imports the SDK; the library does not
from prompt_cache_keepalive import PromptCacheKeepalive, KeepaliveConfig, TouchResult

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# SYSTEM, PREFIX and session_still_open() are yours: the cached system blocks,
# the conversation so far, and a predicate that says the session is still open.

# You inject the touch; the library never imports an SDK (see examples/).
def touch() -> TouchResult:
    resp = client.messages.create(
        model="claude-opus-4-8", max_tokens=1,
        system=SYSTEM,  # last block: cache_control
        messages=PREFIX + [{"role": "user", "content": "."}],
    )
    u = resp.usage
    return TouchResult(ok=True, cache_read_tokens=u.cache_read_input_tokens,
                       output_tokens=u.output_tokens)

keepalive = PromptCacheKeepalive(touch, KeepaliveConfig(ttl_seconds=300, margin_seconds=30))
keepalive.run(is_active=lambda: session_still_open())  # touches every 270s while active
Full Anthropic wiring → examples/anthropic_keepalive.py.
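To make the timing concrete, here is a minimal sketch of what such a loop could look like. It is an illustration under stated assumptions, not the library's implementation: `sketch_keepalive_loop` and its injectable `sleep` parameter are hypothetical names, and the only facts taken from this README are the interval (TTL minus margin, 300 − 30 = 270s) and the default cap of 12 touches.

```python
import time
from typing import Callable


def sketch_keepalive_loop(
    touch_fn: Callable[[], object],
    is_active: Callable[[], bool],
    ttl_seconds: float = 300.0,
    margin_seconds: float = 30.0,
    max_touches: int = 12,
    sleep: Callable[[float], None] = time.sleep,  # hypothetical hook, handy for tests
) -> int:
    """Touch the cache once per interval until inactive or capped.

    Returns the number of touches sent. Illustrative only.
    """
    interval = ttl_seconds - margin_seconds  # 270s with the defaults
    touches = 0
    while touches < max_touches:
        sleep(interval)          # wait just under one TTL
        if not is_active():      # session closed: stop and let the cache expire
            break
        touch_fn()               # cache read refreshes the sliding TTL
        touches += 1
    return touches


# Dry run with a no-op sleep, so nothing actually waits.
sent = sketch_keepalive_loop(lambda: None, lambda: True, sleep=lambda s: None)
```

Passing `sleep` in as a parameter is a design choice of this sketch: a test can replace it with a recorder and assert on the intervals without waiting.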
It proves it pays — and tells you when it doesn't
A touch is not free (~0.1× the prefix). This package ships a cost model so
you can see exactly when keepalive wins. Costs are in base-input-token-
equivalents (Anthropic: cache-write 1.25×, cache-read 0.10×).
from prompt_cache_keepalive import CacheEconomics, net_savings, breakeven_resume_probability
econ = CacheEconomics(prefix_tokens=88_000)
net_savings(econ, idle_seconds=540, interval_seconds=270, max_touches=12, resume_probability=1.0)
# -> 83_600.0 saved on a 9-minute idle that resumes
breakeven_resume_probability(econ, 540, 270, 12)
# -> 0.174 keepalive wins whenever the session is >17% likely to resume
| Idle gap | Touches | Resumes? | Net (88k prefix) |
|---|---|---|---|
| 9 min | 2 | yes | +83,600 tok |
| 9 min | 2 | no | −17,600 tok (wasted touches) |
| 67 min (past cap) | 12 | yes | −105,600 tok (honest loss) |
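The three rows can be reproduced by hand. The sketch below is not the library's `net_savings`; it is a standalone re-derivation under assumptions of my own: touches are counted as `idle // interval` capped at `max_touches`, the cache counts as warm only if the last touch landed less than one TTL before the resume, and the function names are hypothetical.

```python
WRITE_MULT = 1.25  # cache-write cost in base-input-token-equivalents
READ_MULT = 0.10   # cache-read cost


def sketch_net(prefix_tokens, idle_seconds, interval_seconds,
               max_touches, resumes, ttl_seconds=300):
    """Token-equivalents saved (+) or lost (-) versus sending no touches."""
    touches = min(idle_seconds // interval_seconds, max_touches)
    touch_cost = touches * READ_MULT * prefix_tokens
    if not resumes:
        return -touch_cost  # every touch was wasted
    # Without keepalive, the prefix is re-written once the idle outlasts the TTL.
    cold = idle_seconds >= ttl_seconds
    baseline = (WRITE_MULT if cold else READ_MULT) * prefix_tokens
    # With keepalive, the resume is a cheap read only if the last touch is recent.
    since_last_touch = idle_seconds - touches * interval_seconds
    warm = touches > 0 and since_last_touch < ttl_seconds
    resume_cost = (READ_MULT if warm else WRITE_MULT) * prefix_tokens
    return baseline - (touch_cost + resume_cost)


def sketch_breakeven(prefix_tokens, idle_seconds, interval_seconds, max_touches):
    """Resume probability at which expected net savings are zero."""
    gain = sketch_net(prefix_tokens, idle_seconds, interval_seconds, max_touches, True)
    loss = -sketch_net(prefix_tokens, idle_seconds, interval_seconds, max_touches, False)
    return loss / (gain + loss)


row1 = sketch_net(88_000, 540, 270, 12, resumes=True)     # about +83,600
row2 = sketch_net(88_000, 540, 270, 12, resumes=False)    # about -17,600
row3 = sketch_net(88_000, 67 * 60, 270, 12, resumes=True) # about -105,600
p_star = sketch_breakeven(88_000, 540, 270, 12)           # about 0.174
```

Under these assumptions the breakeven reduces to 17,600 / (83,600 + 17,600) ≈ 0.174, and it does not depend on the prefix size, only on the number of touches.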
The library caps touches (max_touches, default 12) near the breakeven
against a single re-write: 12 touches cost 12 × 0.10 = 1.2× the prefix, just
under the 1.25× of writing it once. A session that goes quiet for an hour
can't bleed tokens forever; the library stops touching and lets the cache
expire. No silent waste.
"Millions of tokens" — the honest at-scale math
A platform with 10,000 daily coding sessions, ~5 short coffee-break idles each
over an 88k cached prefix, mostly resuming:
10,000 × 5 × ~83,600 saved ≈ 4.18 billion token-equivalents / day
Even at 1% addressable, ~42M token-equivalents a day. The win is real because
it's bounded and conditional — the cost model is the proof, not the marketing.
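The 4.18 billion figure is the ceiling where every idle resumes. As a rough sensitivity check, the sketch below recomputes the daily total for a given resume rate, reusing the per-idle numbers from the table (+83,600 on resume, −17,600 otherwise). The function name and the 90% rate are illustrative assumptions, not measurements.

```python
SESSIONS_PER_DAY = 10_000
IDLES_PER_SESSION = 5
SAVED_IF_RESUMED = 83_600   # tok-equiv, 9-minute idle, 88k prefix
LOST_IF_ABANDONED = 17_600  # tok-equiv spent on two wasted touches


def sketch_daily_savings(resume_percent: int) -> int:
    """Expected token-equivalents saved per day at a given resume rate (0-100)."""
    idles = SESSIONS_PER_DAY * IDLES_PER_SESSION
    # Work per 100 idles in integers to avoid float rounding.
    per_100 = (resume_percent * SAVED_IF_RESUMED
               - (100 - resume_percent) * LOST_IF_ABANDONED)
    return idles * per_100 // 100


ceiling = sketch_daily_savings(100)  # 4_180_000_000
at_90 = sketch_daily_savings(90)     # 3_674_000_000
```

The sign flips between 17% and 18%, which matches the 17.4% breakeven from the cost model.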
Why it's provider-agnostic (and trivially testable)
The core never imports an LLM SDK. You inject a touch_fn; the library owns
only the timing and the spend bound: