From hyperagents
Domain-agnostic fitness evaluation for evolved code generations. Defines evaluation harness interfaces, scoring contracts, and multi-domain aggregation. Triggers when evaluating code quality, running benchmarks, or scoring agent outputs.
Slash command: /hyperagents:fitness-eval
This skill implements HyperAgents' domain-agnostic evaluation pattern — a pluggable harness system that scores any code generation against configurable fitness criteria.
Every domain evaluation must implement three operations:
1. harness: execute the agent on a set of tasks and collect predictions.
   Interface: harness(task_list, agent_path, output_dir, num_samples, num_workers) -> predictions
   Output: predictions.csv with columns question_id, prediction
2. report: aggregate predictions into a fitness score.
   Interface: report(output_dir) -> report.json
   Output: report.json with at least one score key (the key name is domain-specific)
3. score_key: the JSON field name in report.json that contains the primary fitness metric.
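A minimal sketch of the two file contracts above (predictions.csv and report.json), not HyperAgents' actual implementation. It assumes a hypothetical agent that is a plain Python callable instead of the agent_path the real harness takes, and it omits num_samples and num_workers. The names run_harness and write_report, the "question" task field, the answers argument, and exact-match scoring are all illustrative assumptions.

```python
import csv
import json
from pathlib import Path
from typing import Callable


def run_harness(task_list: list[dict], agent: Callable[[str], str], output_dir: str) -> Path:
    """Run the agent on every task and write predictions.csv (question_id, prediction)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    predictions_path = out / "predictions.csv"
    with predictions_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question_id", "prediction"])
        for task in task_list:
            # "question" is an assumed task field; the source does not define the task schema.
            writer.writerow([task["question_id"], agent(task["question"])])
    return predictions_path


def write_report(output_dir: str, answers: dict[str, str], score_key: str = "accuracy") -> Path:
    """Aggregate predictions.csv into report.json under the configured score key."""
    out = Path(output_dir)
    with (out / "predictions.csv").open(newline="") as f:
        rows = list(csv.DictReader(f))
    # Exact-match scoring is an assumption made for this sketch only.
    correct = sum(1 for row in rows if answers.get(row["question_id"]) == row["prediction"])
    score = correct / len(rows) if rows else 0.0
    report_path = out / "report.json"
    report_path.write_text(json.dumps({score_key: score}))
    return report_path
```

Keeping the two steps separate means a report can be recomputed from an existing predictions.csv without re-running the agent.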
tests: Test Suite Fitness
# Fitness = test pass rate
score = tests_passed / tests_total
Config in .hyperagents/config.json:
{
  "domain": "tests",
  "test_command": "npm test -- --json",
  "score_key": "pass_rate"
}
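As a rough illustration of how the pass-rate formula and the score_key setting fit together, the sketch below builds the report payload from already-counted test results. The function name build_tests_report is hypothetical, and scoring an empty test suite as 0.0 is an assumption; parsing the output of the configured test_command is deliberately left out because its format depends on the test runner.

```python
import json


def build_tests_report(tests_passed: int, tests_total: int, score_key: str) -> dict:
    """Return the report.json payload for the tests domain."""
    if tests_total <= 0:
        # Assumption: an empty or failed test run scores 0 rather than raising.
        score = 0.0
    else:
        score = tests_passed / tests_total
    return {score_key: score}


# Config mirroring the .hyperagents/config.json example above.
config = json.loads(
    '{"domain": "tests", "test_command": "npm test -- --json", "score_key": "pass_rate"}'
)
report = build_tests_report(18, 20, config["score_key"])
print(json.dumps(report))
```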
lint: Code Quality Fitness
# Fitness = reduction in lint issues vs baseline
score = 1 - (current_issues / baseline_issues)
typecheck: Type Safety Fitness
# Fitness = reduction in type errors vs baseline
score = 1 - (current_errors / baseline_errors)
benchmark: Performance Fitness
# Fitness = custom benchmark metric
score = run_benchmark() / baseline_score
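The baseline-relative formulas for lint, typecheck, and benchmark can go negative or exceed 1 as written, and they are undefined when the baseline is zero. The sketch below shows one way to guard them. The clamping to [0, 1] and the zero-baseline handling are assumptions, not behavior confirmed by the source, and the helper names are hypothetical.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def reduction_score(current: float, baseline: float) -> float:
    """lint / typecheck: 1 - (current / baseline), where lower counts are better."""
    if baseline <= 0:
        # Assumption: with a clean baseline, staying clean scores 1 and any new issue scores 0.
        return 1.0 if current <= 0 else 0.0
    return clamp01(1.0 - current / baseline)


def ratio_score(measured: float, baseline: float) -> float:
    """benchmark: measured / baseline, where higher values are better."""
    if baseline <= 0:
        # Assumption: without a usable baseline the ratio is meaningless, so score 0.
        return 0.0
    return clamp01(measured / baseline)
```

For example, going from 20 lint issues to 5 gives 1 - 5/20 = 0.75, while a regression to 30 issues is clamped to 0 instead of producing -0.5.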
review: LLM-as-Judge Fitness
A secondary LLM evaluates the code diff against a set of review criteria.
Score = weighted average of the per-criterion scores.
composite: Multi-Metric Fitness
Combine multiple domain evaluators:
{
  "domain": "composite",
  "components": [
    {"domain": "tests", "weight": 0.5},
    {"domain": "lint", "weight": 0.2},
    {"domain": "review", "weight": 0.3}
  ]
}
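A small sketch of how the composite config above might be turned into a single score, assuming the components are combined as a weighted average. Dividing by the total weight (so that weights need not sum exactly to 1) is an assumption, and composite_score is a hypothetical name.

```python
def composite_score(components: list[dict], scores: dict[str, float]) -> float:
    """Weighted average of per-domain scores, using the weights from the composite config."""
    total_weight = sum(c["weight"] for c in components)
    if total_weight <= 0:
        raise ValueError("component weights must sum to a positive value")
    # A missing domain raises KeyError: every listed component needs a score.
    weighted_sum = sum(c["weight"] * scores[c["domain"]] for c in components)
    return weighted_sum / total_weight


components = [
    {"domain": "tests", "weight": 0.5},
    {"domain": "lint", "weight": 0.2},
    {"domain": "review", "weight": 0.3},
]
# Illustrative per-domain scores, not real results.
example = composite_score(components, {"tests": 1.0, "lint": 0.5, "review": 0.0})
```

With these illustrative inputs the result is 0.5 * 1.0 + 0.2 * 0.5 + 0.3 * 0.0, or about 0.6.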
HyperAgents uses a two-phase evaluation to save compute: a staged evaluation on a subset of the samples, and a full evaluation on all of them.
When evolving across multiple domains simultaneously:
aggregate_fitness = mean(score_domain_1, score_domain_2, ..., score_domain_N)
A generation must have valid scores in ALL domains to be a valid parent.
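The aggregation rule and the validity requirement above can be sketched together as follows. Treating a missing, None, NaN, or out-of-range score as "invalid" is an assumption about what "valid" means here, and returning None for an ineligible generation is an illustrative choice.

```python
from statistics import mean
from typing import Optional


def aggregate_fitness(scores: dict[str, Optional[float]], domains: list[str]) -> Optional[float]:
    """Mean score across all domains, or None if any domain lacks a valid score."""
    values = []
    for domain in domains:
        score = scores.get(domain)
        # NaN fails the range comparison, so it is rejected along with out-of-range values.
        if score is None or not (0.0 <= score <= 1.0):
            return None  # not a valid parent
        values.append(score)
    return mean(values) if values else None
```

A generation scored {"tests": 0.8, "lint": 0.4} across both domains aggregates to about 0.6, while one that is missing its lint score returns None and is excluded from parent selection.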
All fitness scores must be in the range [0, 1]. Metrics measured against a baseline are normalized as:
1 - (value / baseline)
When only a staged eval was run (not a full eval), the score is adjusted:
adjusted_score = raw_score * staged_eval_fraction
Where staged_eval_fraction = staged_samples / full_samples.
This prevents staged-only generations from appearing artificially competitive in parent selection.
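The staged-eval adjustment above amounts to a single multiplication. A minimal sketch follows, in which the function name, the full_eval_ran flag, and the input validation are assumptions added for illustration.

```python
def adjusted_score(raw_score: float, staged_samples: int, full_samples: int,
                   full_eval_ran: bool) -> float:
    """Scale a staged-only score by staged_samples / full_samples."""
    if full_eval_ran:
        # A fully evaluated generation keeps its raw score.
        return raw_score
    if full_samples <= 0 or not (0 <= staged_samples <= full_samples):
        raise ValueError("expected 0 <= staged_samples <= full_samples and full_samples > 0")
    staged_eval_fraction = staged_samples / full_samples
    return raw_score * staged_eval_fraction
```

Under this rule, a generation that scored 0.8 on 10 of 100 samples is ranked at roughly 0.08, so it cannot outrank a fully evaluated generation that scored 0.5.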
npx claudepluginhub zpankz/hyperagents-plugin