From ai-safety
Design and run a safety evaluation suite for an AI model or feature across harm categories — refusals on disallowed content, robustness, over-refusal vs helpfulness, groundedness/truthfulness — with rubrics and pass/fail thresholds. Use to measure an AI system's safety, establish a baseline, or gate a release.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ai-safety:safety-evaluation
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
A repeatable safety eval: a structured test set across harm categories, clear grading rubrics, measured pass rates, and a report that supports a release/no-release decision and tracks regression over time.
harm-modeling? (toxicity, dangerous instructions, self-harm, illegal facilitation, hate, sexual content involving minors, etc.)
safety-red-team.)
rag-security.)
bias-fairness-assessment).
harm-modeling.
A safety eval report: per-category pass rate vs threshold, notable failures
(redacted), over- vs under-refusal balance, and prioritized fixes. Offer to wire it
into CI via ai-safety-engineer. Use security-diagramming:infographic for a
scorecard and security-reporting for the writeup.
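The per-category pass rate vs threshold comparison in the report could be sketched roughly as follows. This is a minimal illustration, not the skill's own implementation: the `GradedCase`, `summarize`, and `release_gate` names, the category labels, and the threshold values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedCase:
    category: str  # harm category the test case belongs to
    passed: bool   # rubric verdict for this case


def summarize(cases, thresholds):
    """Return {category: (pass_rate, threshold, meets_threshold)}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for case in cases:
        totals[case.category] += 1
        passes[case.category] += int(case.passed)
    report = {}
    for category, threshold in thresholds.items():
        n = totals[category]
        # A category with no graded cases cannot demonstrate safety,
        # so it is reported as not meeting its threshold.
        rate = passes[category] / n if n else 0.0
        report[category] = (rate, threshold, n > 0 and rate >= threshold)
    return report


def release_gate(report):
    """Support a release decision: every category must meet its threshold."""
    return all(ok for _, _, ok in report.values())


# Hypothetical thresholds and graded results, for illustration only.
THRESHOLDS = {"self-harm": 0.95, "toxicity": 0.90}
CASES = [
    GradedCase("self-harm", True),
    GradedCase("self-harm", True),
    GradedCase("toxicity", True),
    GradedCase("toxicity", False),
]

REPORT = summarize(CASES, THRESHOLDS)
for name, (rate, threshold, ok) in REPORT.items():
    verdict = "PASS" if ok else "FAIL"
    print(f"{name}: {rate:.0%} (threshold {threshold:.0%}) {verdict}")
print("release" if release_gate(REPORT) else "no-release")
```

Gating on every category separately, rather than on one averaged score, keeps a strong category from masking a weak one.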
Measure both safety and helpfulness — a model that refuses everything scores "safe" but is useless. Keep the eval set versioned so results are comparable across releases. Don't include genuinely operational harmful content in the test set; probe the boundary, not the payload.
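One way to keep both failure directions visible is to label each prompt with its expected behaviour and report two separate rates. The sketch below assumes a hypothetical labeling in which every case carries `should_refuse` and the grader records `refused`; the `Outcome` and `refusal_balance` names and the `EVAL_SET_VERSION` tag are illustrative assumptions, not part of the skill.

```python
from dataclasses import dataclass

# Hypothetical version tag, stored with every result so runs stay comparable.
EVAL_SET_VERSION = "v1"


@dataclass(frozen=True)
class Outcome:
    should_refuse: bool  # expected behaviour according to the rubric
    refused: bool        # what the model actually did


def refusal_balance(outcomes):
    """Report under-refusal (unsafe) and over-refusal (unhelpful) rates."""
    harmful = [o for o in outcomes if o.should_refuse]
    benign = [o for o in outcomes if not o.should_refuse]
    # Under-refusal: the model complied with a prompt it should have refused.
    under = sum(not o.refused for o in harmful) / len(harmful) if harmful else None
    # Over-refusal: the model refused a prompt it should have answered.
    over = sum(o.refused for o in benign) / len(benign) if benign else None
    return {
        "eval_set_version": EVAL_SET_VERSION,
        "under_refusal_rate": under,
        "over_refusal_rate": over,
    }


# A model that refuses everything: no under-refusals, but every benign
# prompt is refused as well.
REFUSE_ALL = [Outcome(True, True), Outcome(False, True), Outcome(False, True)]
BALANCE = refusal_balance(REFUSE_ALL)
print(BALANCE)
```

Reporting the two rates side by side makes the "refuses everything" case show up as a helpfulness failure instead of a perfect safety score.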
npx claudepluginhub jassics/awesome-claude-security --plugin ai-safety