Skill

ai-agent-safety-review

**DEFAULT for AI agent safety reviews — dispatches security-auditor + risk-and-controls-reviewer with AI/agent safety focus.**

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ultraprompt:ai-agent-safety-review [system|tool|prompt|focus]

User invocable

Model invocation disabled

Inline context

Default effort

When to use

Manual-only. Invoke for AI/LLM safety review: tool-calling boundaries, prompt injection vectors, retrieval trust, memory/context handling, autonomy controls, or prompt hardening for a specific prompt or skill body.

Argument hint[system|tool|prompt|focus]

Tool Access

This skill is limited to the following tools:

ReadGrepGlobBashWriteEditMultiEditAgent

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Apply discipline per `${CLAUDE_PLUGIN_ROOT}/_shared/DISCIPLINE.md` (covers `$ARGUMENTS` handling, evidence, validation, and safety).

SKILL.md

109 lines · ~1.2k tokens

Stats

LanguagePython

Stars0

MaintenanceExcellent

Last CommitJun 2, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

AI Agent Safety Review

Apply discipline per ${CLAUDE_PLUGIN_ROOT}/_shared/DISCIPLINE.md (covers $ARGUMENTS handling, evidence, validation, and safety).

Dispatch policy (V8)

Dispatch target: ultraprompt:auditor (focus: ai-safety). See ${CLAUDE_PLUGIN_ROOT}/_shared/DISPATCH-POLICY.md for the full V8 dispatch decision tree, Task call template, and inline-override conditions.

Distinctive judgment

LLM systems have a unique attack surface: prompts and retrieved content are mixed with instructions, and the model treats both as authoritative. Trust boundaries must be explicit (this is data, that is instruction). Tool-calling is privilege escalation: each tool a model can call is an action it can take. Autonomy controls (human-in-loop, confirmation, audit) are part of the design, not afterthoughts.

First signals to inspect

System prompt content: are instructions clearly separated from user data?
Retrieval pipeline: is retrieved content marked as untrusted? Is it shown verbatim to the model?
Tool inventory: what can the model call? With what side effects?
Memory/context: what persists across turns or sessions? Who can poison it?
Autonomy: where does the model act vs ask?
Output channels: where does model output go? (UI, logs, downstream systems, other models)
Prompt injection vectors: any path where attacker-controlled text reaches the model

Failure modes specific to this lane

Concatenating user input directly into the system prompt
Showing retrieved web content to the model without trust boundary marking (classic prompt injection vector)
Tool that takes a free-text argument and executes it (shell, eval, SQL)
Memory shared across users (one user can poison another's session)
Auto-approval of tool calls that should require human confirmation
Logging the full prompt + completion (often contains secrets, PII)
Retrieved context that the model trusts as authoritative when it shouldn't

Workflow

Inventory: prompts, tools, retrieval sources, memory, output channels.
Map trust boundaries. Confirm separation of instruction vs data.
Audit each tool: what does it do, what's the side-effect blast radius, what's the confirmation gate.
Audit retrieval: is content marked, sandboxed, or sanitized before reaching the model?
Audit memory/context for cross-user contamination.
Audit autonomy controls: where can the model act, where must it ask?
Audit output channels for log/audit safety.
Apply hardening: prompt restructuring, tool argument validation, retrieval marking, confirmation gates.

Validation

Run prompt-injection eval cases (test corpus of known injection strings). Test tool argument validation with adversarial inputs. Test memory isolation across simulated users. For prompts: run before/after eval suite if available.

Output contract

Schema below + ${CLAUDE_PLUGIN_ROOT}/_shared/OUTPUT-CONTRACT.md + concise-review style.

schema:
  - field: Scope
    type: section
    required: true
    evidence_rule: "none"
  - field: Trust Boundary Map
    type: section
    required: true
    evidence_rule: "none"
  - field: Tool Audit
    type: section
    required: true
    evidence_rule: "none"
  - field: Retrieval Audit
    type: section
    required: true
    evidence_rule: "none"
  - field: Memory/Context Audit
    type: section
    required: true
    evidence_rule: "none"
  - field: Autonomy Audit
    type: section
    required: true
    evidence_rule: "none"
  - field: Hardenings Applied
    type: section
    required: true
    evidence_rule: "none"
  - field: Remaining Risks
    type: section
    required: true
    evidence_rule: "named risk + likelihood + impact"

Subagent delegation

Dispatch auditor with focus=ai-safety. See _shared/playbooks/prompt-injection-patterns.md and _shared/playbooks/prompt-hardening-checklist.md.

V4 aliases

This skill answers to V4 names: prompt-hardening. The router resolves them to ai-agent-safety-review and notes the alias in its response.

ai-agent-safety-review

Invocation

Tool Access

Context Preview

SKILL.md

ai-agent-safety-review

Invocation

Tool Access

Context Preview

SKILL.md

AI Agent Safety Review

Dispatch policy (V8)

Distinctive judgment

First signals to inspect

Failure modes specific to this lane

Workflow

Validation

Output contract

Subagent delegation

V4 aliases

Similar Skills

AI Agent Safety Review

Dispatch policy (V8)

Distinctive judgment

First signals to inspect

Failure modes specific to this lane

Workflow

Validation

Output contract

Subagent delegation

V4 aliases

Similar Skills