From pm-copilot
Use this skill when the user asks to "prevent regressions in AI quality", "regression testing for AI", "how do I know if a prompt change broke something", "before/after evaluation for model changes", "catch quality regressions", or wants to set up a process that catches when a model update, prompt change, or system change has degraded AI output quality compared to before.
How this skill is triggered — by the user, by Claude, or both
Slash command
/pm-copilot:regression-testingThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are setting up a regression testing framework for an AI feature — a systematic process that catches quality degradations caused by model changes, prompt changes, or data/context changes before they reach users.
You are setting up a regression testing framework for an AI feature — a systematic process that catches quality degradations caused by model changes, prompt changes, or data/context changes before they reach users.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), software testing principles applied to AI.
Read memory/user-profile.md for the AI feature being protected. Read the eval suite design if available — regression tests are a subset of the broader eval suite, focused on the specific failure modes the team has already identified and fixed.
AI regressions can be caused by:
The regression test suite should catch all of these, not just obvious changes.
The regression test set is a curated collection of (input, expected behavior) pairs. "Expected behavior" means the output should PASS a specific eval.
Sources for the test set:
Size guidance:
When to run:
Pass/fail definition: The test suite passes if: (1) every individual test case passes its specific eval, AND (2) the aggregate pass rate doesn't drop by more than [threshold]% from the baseline.
Set the threshold based on the feature's criticality:
When a regression is detected, produce a report:
What changed: Which eval(s) failed? What does the failure pattern look like?
Affected input types: Are regressions concentrated on certain types of inputs (short inputs, specific user segments, specific task types)?
Severity: How many test cases failed? What's the regression % vs. baseline?
Root cause hypothesis: What change (model, prompt, context) most likely caused this?
Rollback recommendation: Should the change be reverted immediately, or is this a degradation that can be fixed forward?
Fix plan: If fixing forward, what changes to the prompt or system would address the regression?
Connect regression tests to the deployment pipeline:
# Pseudocode: regression gate in deployment pipeline
def run_regression_gate(eval_suite, test_cases, baseline_pass_rate, threshold=0.02):
results = [run_eval(test_case, eval_suite) for test_case in test_cases]
current_pass_rate = sum(1 for r in results if r.passed) / len(results)
regression = baseline_pass_rate - current_pass_rate
if regression > threshold:
raise DeploymentBlockedError(
f"Regression detected: {regression:.1%} quality drop vs. baseline. "
f"Blocking deployment. Review failing cases: {[r for r in results if not r.passed]}"
)
return {"pass_rate": current_pass_rate, "regression": regression, "status": "PASS"}
When a regression is flagged:
Produce:
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-facultyRuns EvalView regression checks against golden baselines to detect regressions in AI agent behavior after code, prompt, or model changes.
Builds structured evaluation suites for LLM and AI system performance using reproducible metrics. Use when testing model quality, prompt changes, or regression detection.
Use this skill when the user asks to "design an eval suite", "build evals for my AI feature", "create an evaluation framework", "how do I evaluate my AI", "what evals should I run", "build an eval system", or wants to create a systematic evaluation framework for an AI-powered product feature. Typically run after error-analysis has identified the failure categories to prioritize.