Skill

guardrail-review

Review or design the content-safety guardrails of an AI system — input/output classifiers, refusal and safe-completion behavior, escalation/human handoff, and coverage across harm categories, languages, and modalities. Use when assessing or building the safety controls around a model.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command: /ai-safety:guardrail-review

User invocable

Model invocable

Inline context

Default effort

SKILL.md

50 lines · ~569 tokens

Stats

Parent stars: 1

Maintenance: Good

Last Commit: May 31, 2026

Goal

An assessment (or design) of the guardrail stack: what's blocked, how well, where the gaps are, and whether it balances under- vs over-blocking.

What to cover

Input-side — moderation/classification of prompts before the model; handling of disallowed and borderline requests; prompt-injection interaction (cross-ref llm-security).
Output-side — moderation of generations before they reach the user or downstream systems; safe-completion vs hard refusal; PII/sensitive-data filters.
Refusal behavior — are refusals correct, consistent, and helpful (offer safe alternatives)? Measure over-refusal of benign requests, not just under-refusal.
Coverage — across all harm categories from harm-modeling, across languages, and across modalities (image/audio/doc — cross-ref multimodal-security). Gaps usually hide in non-English and non-text.
Escalation & oversight — human-in-the-loop for high-stakes/uncertain cases; user reporting; appeal/override paths.
Robustness & monitoring — do guardrails hold under adversarial pressure (safety-red-team)? Is there logging, drift monitoring, and an update process?
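
To make the areas above concrete, here is a minimal sketch of how input-side checks, output-side checks, refusal, escalation, and logging might fit together in one request path. Everything in it is an assumption for illustration: the classifier signature (text in, risk score in [0, 1] out), the two thresholds, the canned messages, and the toy stubs are hypothetical and do not come from this skill or from any particular moderation library.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signature: a classifier maps text to a risk score in [0, 1].
Classifier = Callable[[str], float]

REFUSAL = "I can't help with that request, but I can share general safety information."
HANDOFF = "This request has been routed to a human reviewer."


@dataclass
class Decision:
    action: str                                   # "allow", "refuse", or "escalate"
    text: str                                     # what the user actually sees
    log: List[str] = field(default_factory=list)  # audit trail for monitoring


def _gate(score: float, block_at: float, escalate_at: float) -> str:
    """Map a risk score to an action using two assumed thresholds."""
    if score >= block_at:
        return "refuse"
    if score >= escalate_at:
        return "escalate"
    return "allow"


def run_guardrails(prompt: str, model: Callable[[str], str],
                   input_clf: Classifier, output_clf: Classifier,
                   block_at: float = 0.9, escalate_at: float = 0.6) -> Decision:
    log: List[str] = []

    # Input-side layer: score the prompt before the model sees it.
    in_score = input_clf(prompt)
    log.append(f"input_score={in_score:.2f}")
    action = _gate(in_score, block_at, escalate_at)
    if action != "allow":
        return Decision(action, REFUSAL if action == "refuse" else HANDOFF, log)

    # Output-side layer: score the generation before it reaches the user.
    draft = model(prompt)
    out_score = output_clf(draft)
    log.append(f"output_score={out_score:.2f}")
    action = _gate(out_score, block_at, escalate_at)
    if action != "allow":
        return Decision(action, REFUSAL if action == "refuse" else HANDOFF, log)
    return Decision("allow", draft, log)


# Toy stand-ins so the sketch runs; real classifiers and models replace these.
def toy_input_clf(text: str) -> float:
    return 0.95 if "BLOCKME" in text else 0.7 if "UNSURE" in text else 0.1


def toy_output_clf(text: str) -> float:
    return 0.95 if "LEAK" in text else 0.1


def toy_model(prompt: str) -> str:
    return "LEAK: secret" if "trigger" in prompt else f"echo: {prompt}"
```

In this sketch a prompt that passes the input layer can still be stopped at the output layer, which is the point of layering: `run_guardrails("trigger", toy_model, toy_input_clf, toy_output_clf)` is refused only after the draft is scored. The mid-range "escalate" band is one possible way to model human handoff for uncertain cases.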

Steps

Inventory the existing guardrails (or requirements, if designing).
Assess each area above; for gaps note severity and the harm category exposed.
Check the under-/over-blocking balance with representative benign + unsafe sets.
Recommend concrete improvements and a layered (defense-in-depth) design.
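
The balance check in the steps above can be sketched as two rates measured on labeled prompt sets. This is an illustrative sketch, not a prescribed metric: the function name `blocking_balance`, the convention that a guardrail is a callable returning True when it blocks, and the keyword-based toy guardrail are all assumptions, and the unsafe items are placeholders rather than real harmful prompts.

```python
from typing import Callable, Dict, Sequence


def blocking_balance(guardrail: Callable[[str], bool],
                     benign: Sequence[str],
                     unsafe: Sequence[str]) -> Dict[str, float]:
    """Return over- and under-blocking rates for a guardrail.

    guardrail(prompt) is assumed to return True when the prompt is blocked.
    """
    over_blocked = sum(1 for p in benign if guardrail(p))       # benign but blocked
    under_blocked = sum(1 for p in unsafe if not guardrail(p))  # unsafe but allowed
    return {
        "over_block_rate": over_blocked / len(benign) if benign else 0.0,
        "under_block_rate": under_blocked / len(unsafe) if unsafe else 0.0,
    }


# Toy guardrail: a naive keyword filter, used here to show both failure modes.
def toy_guardrail(prompt: str) -> bool:
    return "kill" in prompt.lower()


benign_set = [
    "How do I kill a stuck Python process?",
    "Best way to kill weeds in a garden?",
    "What is the weather today?",
    "Recipe for sourdough bread",
]
unsafe_set = [
    "<unsafe placeholder 1: contains the word kill>",
    "<unsafe placeholder 2: paraphrased, no keyword>",
]

print(blocking_balance(toy_guardrail, benign_set, unsafe_set))
# {'over_block_rate': 0.5, 'under_block_rate': 0.5}
```

The keyword filter blocks two harmless questions and misses the paraphrased unsafe placeholder, so both rates come out at 0.5. Running the same function on per-language or per-modality subsets is one way to surface the coverage gaps mentioned under "What to cover".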

Output

A guardrail review: layer · coverage · gaps · severity · recommendation, plus a target layered design. Validate changes with safety-evaluation and safety-red-team.
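
One possible way to hold the review rows described above is a small record type with the five named fields, sorted so the most severe gaps come first. The field names follow the output line; the severity scale (`critical`, `high`, `medium`, `low`), the `Finding` class, and the `render_review` helper are assumptions made for this sketch, and the sample rows are invented placeholders.

```python
from dataclasses import dataclass
from typing import List

# Assumed severity scale; the skill itself does not define one.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    layer: str            # e.g. "input", "output", "refusal", "escalation", "monitoring"
    coverage: str         # what the layer currently handles
    gaps: str             # what it misses, including the harm category exposed
    severity: str         # one of the keys in SEVERITY_ORDER
    recommendation: str   # concrete improvement


def render_review(findings: List[Finding]) -> str:
    """Render findings as ' · '-separated rows, most severe first."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    rows = ["layer · coverage · gaps · severity · recommendation"]
    for f in ordered:
        rows.append(" · ".join([f.layer, f.coverage, f.gaps, f.severity, f.recommendation]))
    return "\n".join(rows)


# Placeholder rows for illustration only.
sample = [
    Finding("refusal", "English text", "over-refuses benign medical questions",
            "medium", "add benign test set and tune thresholds"),
    Finding("input", "English text", "no non-English moderation",
            "high", "add multilingual classifier"),
]
print(render_review(sample))
```

Sorting by severity keeps the report aligned with the step that asks for severity and exposed harm category on each gap; the target layered design would still be written up separately.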

Notes

Guardrails are defense-in-depth, not a single classifier: combine input, output, refusal, escalation, and monitoring layers. The two most common gaps are non-English/non-text coverage and over-refusal that quietly breaks legitimate use.
