context-cache-compress | adk-sessions-memory

context-cache-compress

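Long sessions eventually exceed the model's context window. ADK 2.0 supports two mitigations: prompt caching (Gemini-side), which cuts the cost and latency of resending a long static prefix, and event compression (ADK-side), which shrinks the history itself. The sliding window and hierarchical summary below are variants of event compression.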

Prompt caching (Gemini)

Cache long static prefixes (system prompt + few-shot examples) so subsequent calls reuse them at lower cost/latency.

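from pathlib import Path

from google.adk.agents import LlmAgent

# Read the static prefix once at import time (~50KB of few-shot examples).
LARGE_INSTRUCTION = Path("./few_shot_examples.md").read_text(encoding="utf-8")

root_agent = LlmAgent(
    name="cached_agent",
    model="gemini-2.5-flash",
    instruction=LARGE_INSTRUCTION,
    cache_config={
        "cache_instruction": True,   # cache the instruction prefix
        "cache_ttl_seconds": 3600,   # keep the cache entry for one hour
    },
)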

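ADK creates an explicit Vertex cache resource and reuses it across invocations until the TTL (3600 seconds in the example above) expires. The prefix must stay byte-identical between calls for the cache to be reused.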
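To make the reuse semantics concrete, here is a toy model of a TTL-bounded prefix cache. It is a sketch only: `PrefixCache` and `lookup` are hypothetical names, not ADK or Vertex APIs, and the real cache lives server-side. It shows the two conditions for a hit: the prefix is unchanged and the entry has not expired.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PrefixCache:
    """Toy stand-in for a server-side prompt cache (hypothetical, not an ADK class)."""

    ttl_seconds: float
    _entries: dict = field(default_factory=dict)  # prefix hash -> expiry time

    def lookup(self, prefix: str, now: Optional[float] = None) -> bool:
        """Return True on a hit; on a miss, store the prefix and return False."""
        now = time.monotonic() if now is None else now
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        expiry = self._entries.get(key)
        if expiry is not None and now < expiry:
            return True
        self._entries[key] = now + self.ttl_seconds
        return False


cache = PrefixCache(ttl_seconds=3600)
instruction = "You are a helpful agent.\n" + "example\n" * 1000

first = cache.lookup(instruction, now=0)             # miss: entry is created
second = cache.lookup(instruction, now=1800)         # hit: same prefix, within TTL
third = cache.lookup(instruction, now=4000)          # miss: entry expired at t=3600
changed = cache.lookup(instruction + "!", now=4100)  # miss: prefix differs
```

The last lookup is the practical lesson: anything that varies per request (timestamps, user names) should sit after the cached prefix, not inside it.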

Event compression

Summarize old events when total tokens exceed a threshold:

from google.adk.callbacks import on_before_model_call

TOKEN_THRESHOLD = 100_000

@on_before_model_call
async def compress_history(ctx, request):
    events = ctx.session.events
    if ctx.session.token_count > TOKEN_THRESHOLD:
        # Replace the oldest half of the events with a single summary event.
        # summarize_events is a user-supplied coroutine that returns one event.
        split = len(events) // 2
        summary = await summarize_events(events[:split])
        ctx.session.events = [summary, *events[split:]]
    return request
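The callback above calls `summarize_events` without defining it. A minimal sketch of what such a helper might look like follows, assuming a simplified `Event` shape (hypothetical, not the ADK event class). It truncates each event to a short line instead of calling a model, so it is deterministic and dependency-free. It is also synchronous for brevity; a drop-in version for the callback would be an `async def`.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Minimal stand-in for a session event (hypothetical shape)."""

    author: str
    text: str


def summarize_events(events: list[Event], max_chars_per_event: int = 80) -> Event:
    """Collapse events into one summary event, one truncated line per event.

    A production version would likely ask a model for the summary instead.
    """
    lines = [f"{e.author}: {e.text[:max_chars_per_event]}" for e in events]
    return Event(author="system", text="Summary of earlier turns:\n" + "\n".join(lines))


def compress_oldest_half(events: list[Event]) -> list[Event]:
    """Replace the oldest half of the history with a single summary event."""
    cut = len(events) // 2
    if cut == 0:
        return list(events)  # nothing worth summarizing
    return [summarize_events(events[:cut]), *events[cut:]]


history = [Event("user" if i % 2 == 0 else "model", f"message {i}") for i in range(10)]
compressed = compress_oldest_half(history)
```

With ten events, the five oldest collapse into one summary event and the five newest are kept verbatim, so `compressed` holds six events.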

Sliding window

Cap to last N turns:

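from google.adk.callbacks import on_before_model_call

MAX_TURNS = 20
MAX_EVENTS = MAX_TURNS * 2  # assumes one user event and one assistant event per turn

@on_before_model_call
async def sliding_window(ctx, request):
    # Keep only the most recent events; older ones are dropped, not summarized.
    if len(ctx.session.events) > MAX_EVENTS:
        ctx.session.events = ctx.session.events[-MAX_EVENTS:]
    return request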
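Counting events as `MAX_TURNS * 2` only works when every turn is exactly one user event plus one assistant event. If a turn can also contain tool calls or several model events, a fixed slice may start the window mid-turn. One way around that, sketched below with a hypothetical `Event` shape, is to cut at user-event boundaries instead.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Minimal stand-in for a session event (hypothetical shape)."""

    author: str
    text: str


def last_n_turns(events: list[Event], max_turns: int) -> list[Event]:
    """Keep the events of the last `max_turns` turns.

    A turn starts at a user event and runs until the next user event,
    so the returned window always begins with a user event.
    """
    if max_turns <= 0:
        return []
    user_indices = [i for i, e in enumerate(events) if e.author == "user"]
    if len(user_indices) <= max_turns:
        return list(events)  # history is already short enough
    return events[user_indices[-max_turns]:]


history = [
    Event("user", "find flights"),
    Event("model", "calling search tool"),
    Event("tool", "3 results"),
    Event("model", "here are 3 flights"),
    Event("user", "book the first"),
    Event("model", "booked"),
    Event("user", "send receipt"),
    Event("model", "sent"),
]
window = last_n_turns(history, max_turns=2)
```

Here the first turn spans four events, so a plain `events[-4:]` happens to line up, but `events[-6:]` for three turns would start on a tool result. Cutting at user events avoids that.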

Hierarchical summary

Keep recent verbatim, summarize middle, archive oldest:

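from google.adk.callbacks import on_before_model_call

@on_before_model_call
async def hierarchical(ctx, request):
    events = ctx.session.events
    if len(events) > 60:
        recent = events[-20:]                                     # kept verbatim
        middle_summary = await summarize_events(events[-60:-20])  # one summary event
        new_archive = await summarize_events(events[:-60])        # oldest tier
        # Append the new archive summary to any earlier one, skipping empty parts.
        # str() assumes the summary returned by summarize_events renders to its text.
        previous = ctx.session.state.get("archive_summary", "")
        ctx.session.state["archive_summary"] = "\n".join(
            part for part in (previous, str(new_archive)) if part
        )
        ctx.session.events = [middle_summary, *recent]
    return request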
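The three tiers are easier to reason about on plain lists. The sketch below uses hypothetical helper names (`split_tiers`, `build_context`) and strings in place of events. It shows where the boundaries fall, and one possible way to put both summaries back in front of the recent events. The second part is an assumption, since the callback above only stores the archive summary in session state.

```python
def split_tiers(events, recent_size=20, middle_size=40):
    """Split events (oldest first) into (archive, middle, recent) tiers."""
    recent_start = max(len(events) - recent_size, 0)
    middle_start = max(recent_start - middle_size, 0)
    return events[:middle_start], events[middle_start:recent_start], events[recent_start:]


def build_context(archive_summary, middle_summary, recent):
    """Assemble the model's view: non-empty summaries first, then recent events."""
    header = [s for s in (archive_summary, middle_summary) if s]
    return header + list(recent)


events = [f"event {i}" for i in range(100)]
archive, middle, recent = split_tiers(events)
context = build_context("archive: events 0-39", "middle: events 40-79", recent)
```

With 100 events the tiers hold 40, 40, and 20 events respectively, and the assembled context has 22 entries: two summaries plus the 20 recent events.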

Validation

Token count drops after compression (ctx.session.token_count)
Summary preserves key facts (test with retrieval questions about old turns)
Cache hits visible in Vertex logs / billing
Compression callback runs before model call, not after (avoid losing the response)
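The first two checks can be automated cheaply. The sketch below is one possible approach, not an ADK facility: `estimate_tokens` uses a rough four-characters-per-token heuristic (an assumption; use the real tokenizer count where available), and `missing_facts` is a substring check that stands in for asking the model retrieval questions about old turns.

```python
def estimate_tokens(texts):
    """Very rough token estimate: about 4 characters per token (heuristic)."""
    return sum(len(t) for t in texts) // 4


def missing_facts(summary, facts):
    """Return the facts that do not appear (case-insensitively) in the summary."""
    lowered = summary.lower()
    return [fact for fact in facts if fact.lower() not in lowered]


before = [
    "My order number is A-1042 and it shipped on Monday. " * 5,
    "Thanks, noted. " * 20,
    "Can you also change the address?",
]
summary = "User's order A-1042 shipped Monday; user asked to change the address."
after = [summary, before[-1]]

tokens_dropped = estimate_tokens(after) < estimate_tokens(before)
lost = missing_facts(summary, ["A-1042", "Monday"])
```

A substring check only catches facts that survive verbatim, so it suits identifiers and dates better than paraphrased content.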

See also

session-rewind-checkpoint if you need to revert compressions