Skill

airesearchorchestrator:audit-system-evaluation

Audit the system evaluation report for objectivity, evidence quality, and scoring accuracy. Use after Reflector completes system evaluation.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/airesearchorchestrator:audit-system-evaluation

User invocable

Model invocable

Tool Access

This skill is limited to the following tools:

Read, Write, Edit, Grep, Glob

SKILL.md

101 lines · ~851 tokens

Stats

Language: Python

Stars: 1

Maintenance: Excellent

Last Commit: Mar 22, 2026

Audit System Evaluation

Purpose

Independently verify the Reflector's system evaluation report for objectivity, evidence completeness, and scoring accuracy. Detect and correct self-leniency bias.

Prerequisites

The Reflector has sent a {"type": "system_eval_ready", "path": "..."} message
System evaluation report exists at docs/reflection/system-evaluation-report.md
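
The two prerequisites can be sketched as a single guard. This is a minimal illustration, not part of the skill itself: the function name `prerequisites_met` and the `project_root` parameter are hypothetical, and the message is assumed to arrive as a plain dictionary.

```python
from pathlib import Path

# Report location taken from the prerequisites above.
REPORT_PATH = Path("docs/reflection/system-evaluation-report.md")


def prerequisites_met(message: dict, project_root: str = ".") -> bool:
    """Return True when the ready message has arrived and the report file exists."""
    # Hypothetical helper: the message shape follows the prerequisite text.
    ready = message.get("type") == "system_eval_ready"
    report_exists = (Path(project_root) / REPORT_PATH).is_file()
    return ready and report_exists
```

If the guard returns False, the audit should not start, because there is nothing reliable to compare against yet.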

Workflow

Step 1: Independent Evidence Review

Read the same data sources the Reflector used:

.autoresearch/state/research-state.yaml
All phase scorecards
Core deliverables per phase

Form your own independent judgment on each dimension before reading the Reflector's scores.

Step 2: Score Comparison

For each of the 6 dimensions:

Record your independent score
Read the Reflector's score and evidence
Calculate the deviation (Reflector score - your score)
Flag dimensions with an absolute deviation of ≥ 1 point (in either direction) as "disputed"
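
The comparison above can be sketched as follows. The dimension names (`dim_a`, `dim_b`, `dim_c`) are placeholders, since the six real dimensions are not named here, and applying the threshold to the absolute deviation is an assumption based on Step 3 treating both directions as significant.

```python
from dataclasses import dataclass

DISPUTE_THRESHOLD = 1.0  # Step 2: a deviation of 1 point or more is "disputed"


@dataclass
class Comparison:
    dimension: str
    reflector_score: float
    auditor_score: float

    @property
    def deviation(self) -> float:
        # Positive means the Reflector scored higher than the auditor.
        return self.reflector_score - self.auditor_score

    @property
    def disputed(self) -> bool:
        # Assumption: the threshold applies in both directions.
        return abs(self.deviation) >= DISPUTE_THRESHOLD


def compare_scores(reflector: dict, auditor: dict) -> list:
    """Pair up scores per dimension; both dicts must cover the same dimensions."""
    if set(reflector) != set(auditor):
        raise ValueError("score sets cover different dimensions")
    return [Comparison(d, reflector[d], auditor[d]) for d in auditor]


# Placeholder dimension names and scores, for illustration only.
results = compare_scores(
    reflector={"dim_a": 4, "dim_b": 3, "dim_c": 5},
    auditor={"dim_a": 3, "dim_b": 3, "dim_c": 4.5},
)
disputed = [c.dimension for c in results if c.disputed]
```

Recording the auditor's scores in their own structure before loading the Reflector's keeps the independent judgment from being anchored by the report.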

Step 3: Bias Detection

Check for systematic patterns:

Count how many dimensions the Reflector scored higher, and how many lower, than your assessment
If ≥ 3 dimensions are scored ≥ 1 point higher → flag "systematic self-leniency"
If ≥ 3 dimensions are scored ≥ 1 point lower → flag "excessive self-criticism"
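
A minimal sketch of the pattern check above, assuming the per-dimension deviations (Reflector score minus auditor score) are already available as a dictionary. The function name `detect_bias` is hypothetical; the return values mirror the `bias_detected` options used in the Step 7 message. Checking leniency before self-criticism when both thresholds are met is an assumption, since the rules do not say which flag wins.

```python
def detect_bias(deviations: dict, min_dimensions: int = 3, min_gap: float = 1.0) -> str:
    """Classify systematic bias from per-dimension deviations."""
    higher = sum(1 for d in deviations.values() if d >= min_gap)
    lower = sum(1 for d in deviations.values() if d <= -min_gap)
    if higher >= min_dimensions:
        return "self_leniency"
    if lower >= min_dimensions:
        # Assumption: only reached when leniency was not flagged first.
        return "excessive_criticism"
    return "none"
```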

Step 4: Evidence Audit

For each dimension, verify:

Score references specific state data or deliverable content
No vague "as observed" statements without concrete evidence
Quantitative metrics match actual state data
Diagnosis identifies root causes, not just symptoms
Recommendations are specific and actionable
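
Most of this checklist needs judgment, but the metric-matching item can be checked mechanically. The sketch below assumes the metrics cited in the report and the values in the state data have both been extracted into dictionaries by hand; the metric names are invented for illustration and the function name is hypothetical.

```python
import math


def find_metric_mismatches(claimed: dict, actual: dict, rel_tol: float = 1e-9) -> list:
    """List metrics the report cites that are absent from, or differ from, state data."""
    issues = []
    for name, claimed_value in claimed.items():
        if name not in actual:
            issues.append(f"{name}: cited in report but missing from state data")
        elif not math.isclose(claimed_value, actual[name], rel_tol=rel_tol):
            issues.append(
                f"{name}: report says {claimed_value}, state data says {actual[name]}"
            )
    return issues


# Invented metric names and values, for illustration only.
evidence_issues = find_metric_mismatches(
    claimed={"phases_completed": 5, "gate_retries": 2, "papers_reviewed": 40},
    actual={"phases_completed": 5, "gate_retries": 4},
)
```

The resulting strings can go directly into the `evidence_issues` field of the audit message.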

Step 5: Arithmetic Verification

Verify weighted total calculation is correct
Verify recommendation maps correctly to the total score
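
Both checks can be sketched in a few lines. The weights, dimension names, recommendation labels, and score bands below are all hypothetical, since the actual values are defined by the evaluation report rather than by this skill; the sketch also assumes weights are expressed as fractions summing to 1.

```python
import math


def weighted_total(scores: dict, weights: dict) -> float:
    """Recalculate the weighted total from per-dimension scores."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(scores[d] * weights[d] for d in scores)


def recommendation_for(total: float, bands: list) -> str:
    """Map a total to a label; bands are (minimum_total, label) pairs."""
    for minimum, label in sorted(bands, reverse=True):
        if total >= minimum:
            return label
    raise ValueError("total is below every band")


def check_arithmetic(scores, weights, reported_total, reported_recommendation,
                     bands, tol=0.01) -> list:
    """Return a list of arithmetic errors found in the report (empty if none)."""
    errors = []
    total = weighted_total(scores, weights)
    if abs(total - reported_total) > tol:
        errors.append(
            f"weighted total: reported {reported_total}, recalculated {total:.2f}"
        )
    # The recommendation is checked against the recalculated total.
    expected = recommendation_for(total, bands)
    if expected != reported_recommendation:
        errors.append(
            f"recommendation: reported {reported_recommendation!r}, expected {expected!r}"
        )
    return errors


# Hypothetical dimensions, weights, and bands, for illustration only.
example_scores = {"dim_a": 4, "dim_b": 3, "dim_c": 5}
example_weights = {"dim_a": 0.5, "dim_b": 0.25, "dim_c": 0.25}
example_bands = [(4.0, "continue"), (2.5, "adjust"), (0.0, "rework")]
arithmetic_errors = check_arithmetic(
    example_scores, example_weights,
    reported_total=4.3, reported_recommendation="continue",
    bands=example_bands,
)
```

In this example the recalculated total is 4.0, so the reported 4.3 is flagged while the recommendation still maps correctly.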

Step 6: Registry Consistency Check

If the report includes cross-project trend data:

Verify historical data matches the global registry
Flag any inconsistencies

Step 7: Issue Audit Report

Send findings to the Reflector:

SendMessage(to="reflector", message={
    "type": "eval_audit",
    "decision": "approve" | "revise",
    "disputes": [
        {
            "dimension": "<dimension_name>",
            "reflector_score": <score>,
            "curator_score": <score>,
            "rationale": "<why the curator disagrees>"
        }
    ],
    "bias_detected": "<none|self_leniency|excessive_criticism>",
    "evidence_issues": ["<list of evidence gaps>"],
    "arithmetic_errors": ["<list of calculation errors>"]
})

Decision Criteria

APPROVE: No disputed dimensions (every score deviates by less than 1 point from the independent assessment), evidence complete, arithmetic correct
REVISE: Any disputed dimensions, evidence gaps, or calculation errors exist
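
The decision rule and the Step 7 message can be tied together in one helper. This is a sketch only: `build_eval_audit` is a hypothetical name, and it only assembles the payload as a dictionary; sending it is left to the `SendMessage` call shown in Step 7.

```python
def build_eval_audit(disputes: list, bias_detected: str,
                     evidence_issues: list, arithmetic_errors: list) -> dict:
    """Assemble the eval_audit payload and derive the decision from the findings."""
    # REVISE if any disputed dimension, evidence gap, or calculation error exists.
    needs_revision = bool(disputes or evidence_issues or arithmetic_errors)
    return {
        "type": "eval_audit",
        "decision": "revise" if needs_revision else "approve",
        "disputes": disputes,
        "bias_detected": bias_detected,
        "evidence_issues": evidence_issues,
        "arithmetic_errors": arithmetic_errors,
    }


# Placeholder dispute entry, for illustration only.
clean_audit = build_eval_audit([], "none", [], [])
flagged_audit = build_eval_audit(
    disputes=[{
        "dimension": "dim_a",
        "reflector_score": 4,
        "curator_score": 3,
        "rationale": "placeholder counter-evidence",
    }],
    bias_detected="none",
    evidence_issues=[],
    arithmetic_errors=[],
)
```

Deriving the decision from the findings, rather than setting it by hand, prevents an "approve" from being sent alongside a non-empty list of disputes.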

Hard Rules

Independent first — Form your own scores BEFORE reading the Reflector's report
Evidence-based disagreement — Every disputed score must cite specific counter-evidence
No rubber-stamping — Even if scores seem reasonable, verify the evidence trail
Arithmetic matters — Always recalculate the weighted total independently
Do not modify the report — Send findings via message; the Reflector makes corrections
