Audit the system evaluation report for objectivity, evidence quality, and scoring accuracy. Use after Reflector completes system evaluation.
Slash command
/airesearchorchestrator:audit-system-evaluation
This skill is limited to the following tools:
Independently verify the Reflector's system evaluation report for objectivity, evidence completeness, and scoring accuracy. Detect and correct self-leniency bias.
{"type": "system_eval_ready", "path": "..."} message
docs/reflection/system-evaluation-report.md
Read the same data sources the Reflector used:
.autoresearch/state/research-state.yaml
Form your own independent judgment on each dimension before reading the Reflector's scores.
For each of the 6 dimensions:
Check for systematic patterns:
For each dimension, verify:
If the report includes cross-project trend data:
Send findings to the Reflector:
SendMessage(to="reflector", message={
  "type": "eval_audit",
  "decision": "approve" | "revise",
  "disputes": [
    {
      "dimension": "<dimension_name>",
      "reflector_score": <score>,
      "curator_score": <score>,
      "rationale": "<why the curator disagrees>"
    }
  ],
  "bias_detected": "<none|self_leniency|excessive_criticism>",
  "evidence_issues": ["<list of evidence gaps>"],
  "arithmetic_errors": ["<list of calculation errors>"]
})
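The payload above could be assembled along the following lines. This is a minimal sketch, not the plugin's actual implementation: the function name build_eval_audit, the DISPUTE_THRESHOLD value, the rule that a mean score gap indicates bias, and the rule that any finding forces a "revise" decision are all assumptions made for illustration. The dimension names in the usage note are placeholders, since the six real dimensions are not listed here.

```python
from statistics import mean

# Hypothetical tolerance: score gaps at or below this are not disputed.
DISPUTE_THRESHOLD = 0.5


def build_eval_audit(reflector_scores, curator_scores, rationales=None,
                     evidence_issues=None, arithmetic_errors=None):
    """Build an eval_audit payload from two independently formed score sets.

    Both score arguments map dimension name -> numeric score.
    """
    rationales = rationales or {}
    evidence_issues = list(evidence_issues or [])
    arithmetic_errors = list(arithmetic_errors or [])

    disputes = []
    gaps = []
    for dimension, reflector_score in reflector_scores.items():
        curator_score = curator_scores[dimension]
        gap = reflector_score - curator_score
        gaps.append(gap)
        if abs(gap) > DISPUTE_THRESHOLD:
            disputes.append({
                "dimension": dimension,
                "reflector_score": reflector_score,
                "curator_score": curator_score,
                "rationale": rationales.get(dimension, ""),
            })

    # Assumed heuristic: a consistent gap in one direction, measured by the
    # mean gap across dimensions, is treated as a systematic bias.
    mean_gap = mean(gaps) if gaps else 0.0
    if mean_gap > DISPUTE_THRESHOLD:
        bias = "self_leniency"
    elif mean_gap < -DISPUTE_THRESHOLD:
        bias = "excessive_criticism"
    else:
        bias = "none"

    # Assumed decision rule: any finding at all means the report is revised.
    needs_revision = bool(
        disputes or evidence_issues or arithmetic_errors or bias != "none"
    )
    return {
        "type": "eval_audit",
        "decision": "revise" if needs_revision else "approve",
        "disputes": disputes,
        "bias_detected": bias,
        "evidence_issues": evidence_issues,
        "arithmetic_errors": arithmetic_errors,
    }
```

For example, calling build_eval_audit({"dimension_a": 4.5, "dimension_b": 4.0}, {"dimension_a": 3.0, "dimension_b": 3.0}) yields a "revise" decision with two disputes and bias_detected set to "self_leniency", because the Reflector's scores are higher on every dimension. The returned dictionary would then be passed as the message argument of SendMessage.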
npx claudepluginhub jacazjx/ai-research-orchestrator --plugin airesearchorchestrator