From majestic-tools
Grades skill test outputs against expectations using transcripts and files, extracts implicit claims, and provides evaluation feedback.
How this skill is triggered — by the user, by Claude, or both
Slash command
/majestic-tools:skill-graderThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Evaluate skill test run outputs against expectations and extract implicit claims.
Evaluate skill test run outputs against expectations and extract implicit claims.
expectations: # List of verifiable statements
- "Output includes X"
- "Skill used script Y"
transcript_path: # Path to execution transcript
outputs_dir: # Directory containing output files
eval_prompt: # Original task prompt
TRANSCRIPT = Read(transcript_path)
OUTPUT_FILES = Glob(outputs_dir + "/**/*")
For each FILE in OUTPUT_FILES:
CONTENT[FILE] = Read(FILE)
Note: eval_prompt, execution steps, errors, final result.
For each EXPECTATION in expectations:
EVIDENCE = search TRANSCRIPT and CONTENT for EXPECTATION
If EVIDENCE confirms EXPECTATION genuinely (not superficially):
verdict = PASS
Else:
verdict = FAIL
PASS criteria:
FAIL criteria:
When uncertain: burden of proof is on the expectation to pass.
Beyond predefined expectations, find implicit claims:
For each CLAIM in (TRANSCRIPT + CONTENT):
CLAIM.type = "factual" | "process" | "quality"
CLAIM.verified = verify(CLAIM, available_evidence)
CLAIM.evidence = supporting_or_contradicting_text
Flag unverifiable claims.
After grading, assess whether the evals themselves could improve. Only surface suggestions when there's a clear gap:
Keep bar high. Flag things the eval author would say "good catch" about.
Write(outputs_dir + "/../grading.json", RESULTS)
{
"expectations": [
{
"text": "The output includes X",
"passed": true,
"evidence": "Found in transcript Step 3: '...'"
}
],
"summary": {
"passed": 2,
"failed": 1,
"total": 3,
"pass_rate": 0.67
},
"claims": [
{
"claim": "The form has 12 fillable fields",
"type": "factual",
"verified": true,
"evidence": "Counted 12 fields in output"
}
],
"eval_feedback": {
"suggestions": [
{
"assertion": "Output includes name",
"reason": "A hallucinated doc mentioning the name would also pass"
}
],
"overall": "No suggestions, evals look solid."
}
}
Field requirements:
| Condition | Action |
|---|---|
| Transcript not found | FAIL all expectations, note in evidence |
| Output files empty | FAIL expectations requiring output content |
| Binary files in outputs | Note as unreadable, skip content check |
| Malformed JSON in outputs | FAIL expectations about JSON structure |
npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-toolsReviews evaluation results for Claude Code skills. Presents judge scores and skill outputs for human feedback, then proposes SKILL.md improvements based on user input.
Generates structured test assertions and failure diagnostics for skill packages from definitions and task prompts via isolated verification. Triggers on verification, assertion generation, or failure diagnosis.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.