By hidai25
Run regression testing for AI agents by capturing golden baselines of agent interactions and auto-detecting behavioral regressions after code, prompt, or model changes. Includes watch mode for live scorecard updates and MCP integration with OpenAI and Anthropic APIs.
Generate EvalView test cases — either from a SKILL.md file using LLM-powered generation, or by capturing real agent interactions through a proxy.
Run EvalView regression checks against golden baselines to detect regressions in AI agent behavior after code, prompt, or model changes.
Start EvalView watch mode to automatically re-run regression checks whenever project files change.
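As a rough illustration of the golden-baseline idea described above, the sketch below compares a stored "golden" run with a fresh run and reports differences. It is not EvalView's API: the `Run` type, its fields, and `detect_regressions` are hypothetical names chosen for this example, and a real tool would likely compare more than tool-call order and exact output text.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one agent interaction (not an EvalView type)."""
    tool_calls: list[str]  # names of tools the agent invoked, in order
    output: str            # final text the agent returned


def detect_regressions(golden: Run, current: Run) -> list[str]:
    """Return human-readable findings where `current` deviates from `golden`."""
    findings = []
    # A changed tool-call sequence is treated as a behavioral regression.
    if current.tool_calls != golden.tool_calls:
        findings.append(
            f"tool calls changed: {golden.tool_calls} -> {current.tool_calls}"
        )
    # Exact-match comparison of output, ignoring surrounding whitespace.
    if current.output.strip() != golden.output.strip():
        findings.append("output changed")
    return findings


# Example: the new run skips the "summarize" tool but returns the same text.
golden = Run(["search", "summarize"], "Paris is the capital of France.")
current = Run(["search"], "Paris is the capital of France.")
print(detect_regressions(golden, current))
```

Exact-match comparison is the simplest possible check; it flags any wording change in the output, so a practical setup would usually allow some tolerance or scoring instead.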
Requires secrets
Needs API keys or credentials to function
Based on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
npx claudepluginhub hidai25/eval-view
SDK Usability Benchmark: generate, execute, judge, and analyze AI agent benchmark suites
A CLI tool for validating AI coding agents
Set up evaluation of AI agents with tool call validation, correctness checks, task completion, and tool reliability using Dokimos. Framework-agnostic — works with any agent framework.
Chaos-test LLM agents for resilience. Injects latency, errors, timeouts, and malformed responses between your agent and its API to find failure modes before production does.
Trace analysis and context remediation for AI agents
Verify CLAUDE.md/AGENTS.md references, compile typed specs, and test the agent harness