prompt-cache-keepalive

Your LLM prompt cache has a 5-minute memory. Give it a heartbeat.

Provider-agnostic keepalive for LLM prompt caches — with a cost model that proves the savings instead of asserting them.

The core insight: an idle autonomous loop should wake (or send a keepalive touch) at least once every 270 seconds. The Anthropic prompt cache has a ~5-minute (300s) sliding TTL, so a wake inside ~270s keeps the prefix warm, and the next turn reads the cache instead of re-creating it. Keepalive pays off whenever the session's probability of resuming exceeds the breakeven, about 17.4% in the 9-minute example below (see the breakeven).

Documentation site →

pip install prompt-cache-keepalive

The 5-minute problem

Modern LLM APIs let you cache a long conversation prefix: pay to write it once (1.25× the base input price on Anthropic), then ~0.1× on every later turn. But the cache has a short idle TTL: Anthropic's is ~5 minutes, sliding, so every read resets the clock. Step away for a coffee, and the cached prefix is evicted on the provider's servers. Your next turn reprocesses the entire prefix and pays the write price again.

  without keepalive                 with keepalive
  ─────────────────                 ──────────────
  turn  ─[ cache written ]          turn  ─[ cache written ]
         · idle 6 min                      ♥ touch  (every 270s)
         ✗ evicted                         ♥ touch   prefix stays warm
  resume ████ full reprocess         resume ─[ cache read ]
         ~110,000 tok-equiv                 ~8,800 tok-equiv

A touch is a minimal request that re-uses the cached prefix. A cache read refreshes the TTL — so the prefix never goes cold and your next real turn is cheap.
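
The sliding-TTL behaviour described above can be checked with a toy simulation. This is a minimal sketch, not the provider's eviction logic: the function name is hypothetical, the TTL is taken as exactly 300 seconds, and a touch that arrives after expiry is simply counted as a miss.

```python
from typing import Optional

# Toy model of a sliding-TTL cache: every read or write resets the expiry clock.
TTL_SECONDS = 300  # the ~5-minute TTL described above (assumed exact here)


def is_warm_at_resume(idle_seconds: int, touch_interval: Optional[int]) -> bool:
    """Return True if the prefix is still cached when the session resumes.

    The cache is written at t=0. If touch_interval is set, a touch lands every
    touch_interval seconds during the idle gap, and each touch that still finds
    the prefix cached resets the expiry clock.
    """
    last_refresh = 0
    if touch_interval is not None:
        t = touch_interval
        while t <= idle_seconds:
            if t - last_refresh >= TTL_SECONDS:
                return False  # the touch arrived too late; treated as a miss
            last_refresh = t
            t += touch_interval
    return idle_seconds - last_refresh < TTL_SECONDS


print(is_warm_at_resume(360, None))  # 6-minute idle, no keepalive
print(is_warm_at_resume(360, 270))   # same idle, one touch at 270s
```

The first call reports a cold cache and the second a warm one, matching the two columns of the diagram. An interval longer than the TTL (say 330s) never helps, which is why the margin sits below 300s.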

60-second start

import anthropic  # your code imports the SDK; the library does not
from prompt_cache_keepalive import PromptCacheKeepalive, KeepaliveConfig, TouchResult

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# SYSTEM, PREFIX and session_still_open() come from your application.

# You inject the touch: the library never imports an SDK (see examples/).
def touch() -> TouchResult:
    # One output token over the cached prefix: cheap, and it refreshes the TTL.
    resp = client.messages.create(
        model="claude-opus-4-8", max_tokens=1,
        system=SYSTEM,                                   # last block: cache_control
        messages=PREFIX + [{"role": "user", "content": "."}],
    )
    u = resp.usage
    return TouchResult(ok=True, cache_read_tokens=u.cache_read_input_tokens,
                       output_tokens=u.output_tokens)

keepalive = PromptCacheKeepalive(touch, KeepaliveConfig(ttl_seconds=300, margin_seconds=30))
keepalive.run(is_active=lambda: session_still_open())    # touches every 270s while active

Full Anthropic wiring → examples/anthropic_keepalive.py.

It proves it pays — and tells you when it doesn't

A touch is not free: each one costs ~0.1× the prefix. This package ships a cost model so you can see exactly when keepalive wins. Costs are in base-input-token equivalents (Anthropic: cache-write 1.25×, cache-read 0.10×).

from prompt_cache_keepalive import CacheEconomics, net_savings, breakeven_resume_probability

econ = CacheEconomics(prefix_tokens=88_000)
net_savings(econ, idle_seconds=540, interval_seconds=270, max_touches=12, resume_probability=1.0)
# -> 83_600.0   saved on a 9-minute idle that resumes
breakeven_resume_probability(econ, 540, 270, 12)
# -> 0.174      keepalive wins whenever the session is >17% likely to resume

Idle gap	Touches	Resumes?	Net (88k prefix)
9 min	2	yes	+83,600 tok
9 min	2	no	−17,600 tok (wasted touches)
67 min (past cap)	12	yes	−105,600 tok (honest loss)
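
The table rows can be reproduced from the two multipliers alone. The sketch below is an independent re-derivation, not the package's source: every name in it (`Rates`, `net_savings_sketch`, `breakeven_sketch`) is hypothetical, and it assumes one touch per full interval, an exact 300s TTL, and touches priced as a pure cache read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Cost multipliers relative to one base input token (values from the text)."""
    write: float = 1.25      # re-creating the cached prefix
    read: float = 0.10       # reading it (a touch or a warm resume)
    ttl_seconds: int = 300   # assumed exact for this sketch


def touches_sent(idle_s: int, interval_s: int, max_touches: int) -> int:
    """Touches fired during the idle gap: one per full interval, up to the cap."""
    return min(idle_s // interval_s, max_touches)


def resume_cost(prefix: int, idle_s: int, last_refresh_s: int, r: Rates) -> float:
    """Cost of the resuming turn: a read if still warm, otherwise a re-write."""
    warm = idle_s - last_refresh_s < r.ttl_seconds
    return prefix * (r.read if warm else r.write)


def net_savings_sketch(prefix, idle_s, interval_s, max_touches, p_resume, r=Rates()):
    """Expected token-equivalents saved by keepalive versus doing nothing."""
    n = touches_sent(idle_s, interval_s, max_touches)
    touch_cost = n * prefix * r.read
    without = resume_cost(prefix, idle_s, 0, r)
    with_keepalive = resume_cost(prefix, idle_s, n * interval_s, r)
    return p_resume * (without - with_keepalive) - touch_cost


def breakeven_sketch(prefix, idle_s, interval_s, max_touches, r=Rates()):
    """Resume probability at which expected savings are zero (inf if never)."""
    n = touches_sent(idle_s, interval_s, max_touches)
    gain = resume_cost(prefix, idle_s, 0, r) - resume_cost(prefix, idle_s, n * interval_s, r)
    return float("inf") if gain <= 0 else n * prefix * r.read / gain


for idle_s, p in [(540, 1.0), (540, 0.0), (67 * 60, 1.0)]:
    print(idle_s, p, round(net_savings_sketch(88_000, idle_s, 270, 12, p)))
print(round(breakeven_sketch(88_000, 540, 270, 12), 3))
```

Under those assumptions the loop prints 83600, -17600 and -105600 for the three rows, and the breakeven is 17,600 / 101,200 ≈ 0.174: two touches spent, against a re-write (110,000) avoided in favour of a warm read (8,800).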

The library caps touches (max_touches, default 12) near the breakeven against a single re-write, so a session that goes quiet for an hour can't bleed tokens forever — it stops and lets the cache expire. No silent waste.
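
One way to read "near the breakeven against a single re-write" is: keep touching only while the total touch spend stays at or below the cost of writing the prefix once. The sketch below works that out; it is an interpretation, not necessarily how the library derives its default, and both function names are hypothetical.

```python
def touch_cap(write_pct: int, read_pct: int) -> int:
    """Largest touch count whose total cost stays at or below one re-write.

    Multipliers are integer percentages of the base input price
    (125 for 1.25x, 10 for 0.10x) so the division is exact.
    """
    return write_pct // read_pct


def worst_case_loss(prefix_tokens: int, touches: int, read_pct: int) -> int:
    """Token-equivalents spent on touches when the session never benefits."""
    return prefix_tokens * touches * read_pct // 100


cap = touch_cap(125, 10)
print(cap, worst_case_loss(88_000, cap, 10))
```

With Anthropic's multipliers this gives 12 touches and a worst-case spend of 105,600 token-equivalents on an 88k prefix, which lines up with the default cap and the last table row.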

"Millions of tokens" — the honest at-scale math

A platform with 10,000 daily coding sessions, ~5 short coffee-break idles each over an 88k cached prefix, mostly resuming:

10,000 × 5 × ~83,600 saved  ≈  4.18 billion token-equivalents / day

That figure assumes every idle resumes, so treat it as an upper bound. Even if only 1% of it is addressable, that is still ~42M token-equivalents a day. The win is real because it is bounded and conditional; the cost model is the proof, not the marketing.
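
The at-scale figure is sensitive to how often idles actually resume. A short sketch of that arithmetic, assuming every idle is the 9-minute, two-touch case from the table (the function name and the sample probabilities are illustrative, not measured data):

```python
PREFIX = 88_000
# Re-write avoided minus the warm read paid instead: 110_000 - 8_800.
GAIN_ON_RESUME = PREFIX * 125 // 100 - PREFIX * 10 // 100
# Two touches during a 9-minute idle at 0.10x each.
TOUCH_COST = 2 * PREFIX * 10 // 100


def daily_savings(sessions: int, idles_per_session: int,
                  p_resume: float, addressable: float = 1.0) -> float:
    """Expected token-equivalents saved per day across the platform."""
    per_idle = p_resume * GAIN_ON_RESUME - TOUCH_COST
    return sessions * idles_per_session * per_idle * addressable


for p in (1.0, 0.9, 0.5):
    print(p, round(daily_savings(10_000, 5, p)))
print(round(daily_savings(10_000, 5, 1.0, addressable=0.01)))
```

At a resume probability of 1.0 this gives the 4.18 billion headline, and 41.8 million at 1% addressable. Lower resume probabilities shrink the total but keep it positive as long as they stay above the ~17.4% breakeven.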

Why it's provider-agnostic (and trivially testable)

The core never imports an LLM SDK. You inject a touch_fn; the library owns only the timing and the spend bound.
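
To illustrate why injection makes this easy to test, here is a self-contained sketch of a keepalive loop driven by a fake touch and a fake sleep. It is not the library's implementation or API: `run_keepalive_sketch` and its parameters are hypothetical stand-ins for the pattern.

```python
from typing import Callable, List


def run_keepalive_sketch(
    touch_fn: Callable[[], bool],
    is_active: Callable[[], bool],
    sleep_fn: Callable[[float], None],
    interval_s: float = 270.0,
    max_touches: int = 12,
) -> int:
    """Touch once per interval while the session is active, up to the cap.

    Returns the number of touches sent. Stops early if a touch reports failure.
    """
    sent = 0
    while sent < max_touches and is_active():
        sleep_fn(interval_s)      # wait out most of the TTL
        if not is_active():       # the session may have closed while waiting
            break
        if not touch_fn():
            break
        sent += 1
    return sent


# Test doubles: no network, no SDK, no real sleeping.
sleeps: List[float] = []
touches: List[int] = []


def fake_touch() -> bool:
    touches.append(len(touches) + 1)
    return True


sent = run_keepalive_sketch(fake_touch, lambda: True, sleeps.append, max_touches=3)
print(sent, sleeps)
```

Because the clock and the provider call are both parameters, the whole run finishes instantly and records exactly three 270-second waits and three touches.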
