From hyperagents
Evaluate a specific generation or the current codebase against fitness criteria. Supports staged (quick) and full evaluation modes. Use: /hyperagents:evaluate [--genid <id>] [--domain <domain>] [--staged]
Slash command
/hyperagents:evaluate
# HyperAgents Evaluate Command

Run fitness evaluation on a generation or the current codebase state.

## Arguments

- `--genid <id>`: Evaluate a specific archived generation (default: current working state)
- `--domain <domain>`: Evaluation domain to use
- `--staged`: Run quick staged evaluation only (smaller sample)
- `--full`: Force full evaluation even if staged eval fails
- `--samples <n>`: Number of evaluation samples (-1 for all)

## Execution Flow

### 1. Resolve Target

If `--genid` is provided:

- Load the generation's metadata from `.hyperagents/gen_<id>/metadata.json`
- Apply the...

If no `--genid` is given, the current working state is evaluated.
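
The flags and target resolution described above can be sketched in Python. This is an illustration only: the command itself is a slash command, so `build_parser` and `metadata_path` are hypothetical helpers, and the `-1` default for `--samples` is an assumption (the source only says that `-1` means "all").

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="evaluate")
    parser.add_argument("--genid", default=None)      # archived generation id
    parser.add_argument("--domain", default=None)     # evaluation domain
    parser.add_argument("--staged", action="store_true")
    parser.add_argument("--full", action="store_true")
    # Assumption: default to -1 (all samples); the source does not state a default.
    parser.add_argument("--samples", type=int, default=-1)
    return parser


def metadata_path(genid: Optional[str],
                  root: Path = Path(".hyperagents")) -> Optional[Path]:
    """Return the metadata file of an archived generation.

    Returns None when no generation id is given, meaning the current
    working state is the evaluation target.
    """
    if genid is None:
        return None
    return root / f"gen_{genid}" / "metadata.json"


if __name__ == "__main__":
    args = build_parser().parse_args(["--genid", "7", "--staged"])
    print(metadata_path(args.genid))
```

Returning `None` for the current working state keeps the "no `--genid`" case explicit instead of pointing at a path that may not exist.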
If `--domain` is specified, use it. Otherwise, fall back to the default domain in `.hyperagents/config.json`.

Staged evaluation is the default first pass. Full evaluation runs if the staged pass succeeds or the `--full` flag is set.
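
The staged-then-full gating can be sketched as below. The source does not say what counts as a staged "pass", how many samples a staged run uses, or how `--staged` and `--full` interact, so `pass_threshold`, `staged_samples`, and the rule that `staged_only` stops after the first pass are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional


def run_evaluation(
    evaluate: Callable[[int], float],
    staged_only: bool = False,    # mirrors --staged
    force_full: bool = False,     # mirrors --full
    staged_samples: int = 10,     # assumption: size of the quick sample
    full_samples: int = -1,       # -1 means all samples
    pass_threshold: float = 0.5,  # assumption: minimum staged score to continue
) -> Dict[str, Optional[float]]:
    """Run the quick staged pass, then the full pass when allowed."""
    staged_score = evaluate(staged_samples)
    result: Dict[str, Optional[float]] = {
        "staged_score": staged_score,
        "full_score": None,
    }

    # Assumption: --staged means "stop after the quick pass".
    if staged_only:
        return result

    # Full evaluation runs if staged passes, or unconditionally with --full.
    if staged_score >= pass_threshold or force_full:
        result["full_score"] = evaluate(full_samples)
    return result


if __name__ == "__main__":
    # A stand-in evaluator that always scores 0.2, so the staged pass fails.
    print(run_evaluation(lambda n: 0.2))
    print(run_evaluation(lambda n: 0.2, force_full=True))
```

Running the cheap pass first avoids paying for a full run on a generation that already fails the small sample.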
Display:
Write evaluation results to:

- `.hyperagents/gen_<id>/<domain>_eval/report.json`
- `.hyperagents/gen_<id>/<domain>_eval/predictions.csv`

Each domain must implement:

- `harness(task_list, agent_path, output_dir)`: run the agent on tasks
- `report(output_dir)`: generate score summary
- `score_key`: the JSON key in `report.json` containing the fitness score

Built-in domain types:

- `tests`: Run project test suite, score = pass rate
- `lint`: Run linters, score = 1 - (issues / baseline_issues)
- `benchmark`: Run custom benchmark script
- `review`: LLM-as-judge evaluation of code quality
- `custom`: User-defined evaluation script

npx claudepluginhub zpankz/hyperagents-plugin
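
The domain contract and the two scoring rules given above (`tests` and `lint`) can be sketched in Python. Only the names `harness`, `report`, `score_key`, `report.json`, and `predictions.csv` come from the source; the return types, the `ToyTestsDomain` class, the CSV layout, and the handling of a zero baseline are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Protocol


class Domain(Protocol):
    """The three members each domain must provide (return types are assumed)."""
    score_key: str

    def harness(self, task_list: list, agent_path: str, output_dir: str) -> None: ...
    def report(self, output_dir: str) -> dict: ...


def pass_rate(passed: int, total: int) -> float:
    """Score for the `tests` domain: the fraction of tests that passed."""
    return passed / total if total else 0.0


def lint_score(issues: int, baseline_issues: int) -> float:
    """Score for the `lint` domain: 1 - (issues / baseline_issues)."""
    if baseline_issues == 0:
        # Assumption: with a clean baseline, any new issue scores 0.
        return 1.0 if issues == 0 else 0.0
    return 1 - issues / baseline_issues


class ToyTestsDomain:
    """Hypothetical domain: each task is a (name, passed) pair, not a real test run."""
    score_key = "pass_rate"

    def harness(self, task_list, agent_path, output_dir):
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        rows = ["task,passed"] + [f"{name},{int(ok)}" for name, ok in task_list]
        (out / "predictions.csv").write_text("\n".join(rows) + "\n")

    def report(self, output_dir):
        out = Path(output_dir)
        # Skip the header row, then count rows marked as passed.
        lines = (out / "predictions.csv").read_text().splitlines()[1:]
        passed = sum(line.endswith(",1") for line in lines)
        summary = {self.score_key: pass_rate(passed, len(lines))}
        (out / "report.json").write_text(json.dumps(summary))
        return summary
```

Keeping `score_key` on the domain lets a caller read the fitness value from `report.json` without knowing which domain produced it.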