From evalview
Runs EvalView regression checks against golden baselines to detect regressions in AI agent behavior after code, prompt, or model changes.
Slash command: /evalview:run-eval
Use this skill after making changes to an AI agent (prompt edits, model swaps, tool changes, code refactors) to verify nothing broke.
EvalView compares current agent behavior against saved golden baselines. It runs your test cases, evaluates the outputs, and reports a diff status for each test.
Locate the test directory. Look for tests/evalview/ in the project. If it exists, use that. Otherwise check for a tests/ directory with .yaml test files.
Run a regression check using the run_check MCP tool. Call run_check with the detected test_path; to check a single test, also pass the test parameter with the test name.
Interpret the results.
If changes are intentional, offer to update the baseline by calling run_snapshot with an explanatory notes parameter.
Generate a visual report (optional) by calling generate_visual_report for a detailed HTML breakdown of traces, diffs, scores, and timelines.
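The directory lookup in the first step can be sketched in Python as follows. This is only an illustration of the rule described above, not part of EvalView: the function name find_test_path is hypothetical, and the sketch assumes that "a tests/ directory with .yaml test files" means .yaml files directly inside tests/.

```python
from pathlib import Path
from typing import Optional


def find_test_path(project_root: str) -> Optional[Path]:
    """Return the directory to check, or None if no tests are found.

    Hypothetical helper: mirrors the lookup order described in the steps.
    """
    root = Path(project_root)

    # Preferred location: tests/evalview/
    preferred = root / "tests" / "evalview"
    if preferred.is_dir():
        return preferred

    # Fallback: tests/ that directly contains at least one .yaml file
    # (assumption: nested subdirectories are not searched)
    fallback = root / "tests"
    if fallback.is_dir() and any(fallback.glob("*.yaml")):
        return fallback

    return None
```

The returned path is what would be passed as test_path to run_check; a None result means there is nothing to check yet.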
evalview check tests/evalview/
evalview check tests/evalview/ --test "my-test"
evalview snapshot tests/evalview/ --notes "updated after prompt refactor"
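When the CLI commands above need to be launched from a script, it can help to build the argument list in one place. The sketch below only assembles the three invocations shown; the helper names check_command and snapshot_command are hypothetical, and it uses only the --test and --notes flags that appear above.

```python
from typing import List, Optional


def check_command(test_path: str, test: Optional[str] = None) -> List[str]:
    """Build the argv for `evalview check`, optionally limited to one test."""
    cmd = ["evalview", "check", test_path]
    if test is not None:
        cmd += ["--test", test]
    return cmd


def snapshot_command(test_path: str, notes: str) -> List[str]:
    """Build the argv for `evalview snapshot` with an explanatory note."""
    return ["evalview", "snapshot", test_path, "--notes", notes]
```

Either list can be handed to subprocess.run(cmd) as is. Passing a list rather than a shell string avoids quoting problems when the note contains spaces.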
Call run_check frequently: it calls the Python API directly, with no subprocess overhead.
npx claudepluginhub hidai25/eval-view