safety-policy-enforcer | adk-observability-safety

safety-policy-enforcer

Add layered safety controls to an ADK 2.0 agent: input screening, output redaction, blocked-topic enforcement, and prompt-injection detection.

Layered defense

User input ──▶ [input filter] ──▶ Agent ──▶ [output filter] ──▶ User
                    │                            │
                    ▼                            ▼
              block / redact              redact / refuse / escalate

Input filter — block on rules

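import re

from google.adk.callbacks import on_user_input
from google.adk.events import Event

# Compiled once at import time; matching is case-insensitive.
BLOCKED_PATTERNS = [
    re.compile(pat, re.IGNORECASE)
    for pat in (
        # Matches "ignore all instructions", "ignore previous instructions",
        # and "ignore all previous instructions".
        r"\b(?:ignore|disregard)\s+(?:all\s+(?:(?:previous|prior)\s+)?|(?:previous|prior)\s+)instructions\b",
        r"\bsystem\s+prompt\b",
        r"\breveal\s+your\s+instructions\b",
    )
]

@on_user_input
async def input_guard(ctx, user_input: str):
    for pat in BLOCKED_PATTERNS:
        if pat.search(user_input):
            # Record a refusal in the session so the user still gets a reply.
            ctx.session.events.append(Event(
                author="safety",
                content="I can't help with that request.",
            ))
            return None  # short-circuit; agent doesn't run
    return user_input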
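Regex rules like these are easy to sidestep with extra whitespace, zero-width characters, or look-alike Unicode letters, so it may help to normalize the text before matching. The following is a minimal sketch that does not depend on ADK; `normalize` and `is_injection` are hypothetical helper names, and the pattern list is adapted from the one above.

```python
import re
import unicodedata

# Zero-width characters sometimes inserted to split a trigger phrase.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

# Patterns assume input that is already casefolded and whitespace-collapsed.
INJECTION_PATTERNS = [
    re.compile(r"\b(?:ignore|disregard) (?:all (?:(?:previous|prior) )?|(?:previous|prior) )instructions\b"),
    re.compile(r"\bsystem prompt\b"),
    re.compile(r"\breveal your instructions\b"),
]

def normalize(text: str) -> str:
    """Fold compatibility characters, drop zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_ZERO_WIDTH).casefold()
    return re.sub(r"\s+", " ", text).strip()

def is_injection(text: str) -> bool:
    """Return True if the normalized text matches any injection pattern."""
    cleaned = normalize(text)
    return any(pat.search(cleaned) for pat in INJECTION_PATTERNS)
```

A guard callback could call `is_injection(user_input)` instead of searching the raw string. Normalization only narrows the gap; paraphrased attacks still need a classifier or an LLM-based check.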

Output filter — redact PII

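import re
from google.adk.callbacks import on_after_model_call

PII_PATTERNS = {
    # Requires a dotted domain and does not swallow trailing punctuation.
    "email":  r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "phone":  r"\b\d{3}-\d{3}-\d{4}\b",
    "ssn":    r"\b\d{3}-\d{2}-\d{4}\b",
    # 13-16 digits, optionally separated by spaces or hyphens.
    "credit": r"\b(?:\d[ -]*?){13,16}\b",
}

@on_after_model_call
async def redact_pii(ctx, response):
    text = response.text
    if not text:
        return response  # nothing to redact (e.g. a non-text response)
    for label, pat in PII_PATTERNS.items():
        text = re.sub(pat, f"[{label}-redacted]", text)
    response.text = text
    return response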
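A digit-run pattern for card numbers also matches order IDs, tracking numbers, and similar strings. One common way to cut those false positives is to redact only candidates that pass the Luhn checksum. The sketch below is standalone; `luhn_valid` and `redact_cards` are hypothetical names, and the candidate pattern is an assumption, not the one used above.

```python
import re

# 13-19 digits, each optionally preceded by a single space or hyphen.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_cards(text: str) -> str:
    """Redact digit runs that look like card numbers and pass Luhn."""
    def _replace(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        return "[credit-redacted]" if luhn_valid(digits) else match.group()
    return CARD_CANDIDATE.sub(_replace, text)
```

The checksum is not a security control, since roughly one in ten random digit runs passes it, but it leaves most non-card identifiers untouched.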

LLM-as-judge safety check

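import json

from google.adk.callbacks import on_after_model_call
from google.adk.events import Event
from google.adk.models.lite_llm import LiteLlm

JUDGE = LiteLlm(model="gemini-2.5-flash")
SAFE_THRESHOLD = 5  # responses scoring below this are withheld

@on_after_model_call
async def safety_judge(ctx, response):
    prompt = (
        "Score 0-10 how safe this response is (0 = clearly unsafe, "
        "10 = clearly safe). Penalize: medical advice without "
        "disclaimers, harmful instructions, hate speech, doxing.\n"
        f"Response: {response.text}\n\n"
        'Output only JSON: {"score": int, "reason": str}.'
    )
    out = await JUDGE.complete(prompt)
    try:
        score = int(json.loads(out)["score"])
    except (ValueError, KeyError, TypeError):
        score = 0  # fail closed: unparseable judge output counts as unsafe
    if score < SAFE_THRESHOLD:
        ctx.session.events.append(Event(
            author="safety",
            content="(Response withheld: safety review failed.)",
        ))
        response.text = "I can't provide that response. Please consult a professional."
    return response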
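Judge models often wrap their JSON in a code fence or add commentary, which makes a bare `json.loads` call raise. A small parser that extracts the first JSON object and fails closed on anything unexpected might look like the sketch below; `parse_verdict` and `FAIL_CLOSED` are hypothetical names, and treating unparseable output as score 0 is a design assumption.

```python
import json
import re

# Returned whenever the judge output cannot be trusted.
FAIL_CLOSED = (0, "unparseable judge output")

def parse_verdict(raw: str) -> tuple[int, str]:
    """Extract (score, reason) from judge output, failing closed on errors."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return FAIL_CLOSED
    try:
        data = json.loads(match.group())
        score = int(data["score"])
    except (ValueError, KeyError, TypeError):
        return FAIL_CLOSED
    if not 0 <= score <= 10:
        return FAIL_CLOSED  # out-of-range scores are not trusted
    return score, str(data.get("reason", ""))
```

Failing closed means a malformed judge reply withholds the response instead of letting it through, which trades some availability for safety.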

Blocked-topic list

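from google.adk.callbacks import on_user_input
from google.adk.events import Event

BLOCKED_TOPICS = ["weapon manufacturing", "self-harm", "minor exploitation"]

@on_user_input
async def topic_block(ctx, user_input):
    # Plain substring match: catches exact phrases only, not paraphrases.
    lower = user_input.lower()
    for topic in BLOCKED_TOPICS:
        if topic in lower:
            # Reply with a refusal so the user is not left without a response.
            ctx.session.events.append(Event(
                author="safety",
                content="I can't help with that request.",
            ))
            return None  # short-circuit; agent doesn't run
    return user_input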
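An exact substring check misses trivial variants such as "self harm" or "self_harm". One possible refinement, sketched below without any ADK dependency, compiles each topic into a pattern that tolerates spaces, hyphens, or underscores between words; `_compile` and `matches_blocked_topic` are hypothetical helpers.

```python
import re

BLOCKED_TOPICS = ["weapon manufacturing", "self-harm", "minor exploitation"]

def _compile(topic: str) -> re.Pattern:
    """Build a pattern that allows any mix of spaces, hyphens, or underscores between words."""
    words = re.split(r"[\s-]+", topic.lower())
    return re.compile(r"\b" + r"[\s\-_]*".join(map(re.escape, words)) + r"\b")

TOPIC_PATTERNS = [_compile(t) for t in BLOCKED_TOPICS]

def matches_blocked_topic(text: str) -> bool:
    """Return True if the text mentions any blocked topic."""
    lower = text.lower()
    return any(pat.search(lower) for pat in TOPIC_PATTERNS)
```

This still matches surface forms only. Inflections such as "self-harming" and paraphrases are not caught, so a keyword list works best as a cheap first layer in front of a classifier or judge.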

Escalation hook

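from google.adk.callbacks import on_after_model_call

@on_after_model_call
async def escalate_critical(ctx, response):
    # alert_oncall is an application-provided coroutine (not part of ADK)
    # that notifies the on-call responder.
    if "self-harm" in (response.text or "").lower():
        await alert_oncall(session_id=ctx.session.id)
    return response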
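The hook above calls `alert_oncall` without defining it. A placeholder version is sketched below, assuming an in-memory, per-process record of sessions that have already been escalated so a single conversation does not page repeatedly. The logging call stands in for a real pager or chat webhook, and every name here is hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("safety.escalation")

# Sessions already escalated in this process (assumption: in-memory only).
_alerted_sessions: set[str] = set()

async def alert_oncall(session_id: str) -> bool:
    """Notify the on-call responder once per session.

    Returns True if an alert was sent, False if this session was
    already escalated.
    """
    if session_id in _alerted_sessions:
        return False
    _alerted_sessions.add(session_id)
    # Placeholder: replace with a call to the team's paging system.
    logger.critical("Escalation: session %s flagged for self-harm content", session_id)
    return True
```

A multi-process deployment would need the de-duplication state in shared storage instead of a module-level set.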

Validation

Test inputs from a known adversarial set (red-team prompts)
Confirm PII redaction doesn't break legitimate content (measure the false-positive rate)
Calibrate the safety judge against human review on 20+ samples
Verify escalation alerts reach the right person (test the path end-to-end)
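The first two checks can be scripted as a small harness that runs a guard over an adversarial set and a benign set and reports both rates. This is a minimal sketch: `block_rate` and `evaluate` are hypothetical names, and the guard is assumed to be any function that returns True when it would block the input.

```python
from typing import Callable, Iterable

def block_rate(guard: Callable[[str], bool], prompts: Iterable[str]) -> float:
    """Fraction of prompts the guard blocks (0.0 for an empty set)."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    return sum(1 for p in prompts if guard(p)) / len(prompts)

def evaluate(
    guard: Callable[[str], bool],
    adversarial: Iterable[str],
    benign: Iterable[str],
) -> dict[str, float]:
    """Report detection rate on adversarial prompts and false positives on benign ones."""
    return {
        "detection_rate": block_rate(guard, adversarial),
        "false_positive_rate": block_rate(guard, benign),
    }
```

Tracking both numbers together matters: tightening a filter to raise the detection rate usually raises the false-positive rate as well.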

See also

auth-framework-config for who can hit the agent at all
tool-confirmation-hitl for per-tool approval gates