From beagle-analysis
Compares 2+ code implementations against a spec using structured multi-dimensional evaluation. Requires a spec file and repo paths.
How this skill is triggered — by the user, by Claude, or both
Slash command
/beagle-analysis:llm-judgeThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Compare code implementations across multiple repositories using structured evaluation.
Compare code implementations across multiple repositories using structured evaluation.
/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
| Argument | Required | Description |
|---|---|---|
spec | Yes | Path to spec/requirements document |
repos | Yes | 2+ paths to repositories to compare |
--labels | No | Comma-separated labels (default: directory names) |
--weights | No | Override weights, e.g. functionality:40,security:30 |
--branch | No | Branch to compare against main (default: main) |
$ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.Sequenced workflow: do not start the next phase until the current gate passes. Each pass condition must be checkable (file on disk, non-empty content, or json.load succeeds)—not “I reviewed internally.”
| Gate | Pass condition | Unblocks |
|---|---|---|
| A — Inputs | spec_path is a readable file and non-empty; len(repo_paths) ≥ 2; each path contains .git. | Phase 1 repo agents |
| B — Phase 1 facts | For each repo agent output: stdin/stdout parses as JSON; required keys/shape match references/fact-schema.md. | Phase 2 judge agents |
| C — Phase 2 scores | Five judge outputs (one per dimension) each parse as JSON; each includes a score (and justification) for every repo label. | Aggregation |
| D — Report file | .beagle/llm-judge-report.json exists; python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" exits 0. | Markdown summary to the user |
| E — Consistency | Summary table and verdict use the same labels, weights, and per-dimension scores as the JSON report. | Mark task complete |
Parallelism is allowed within a phase (all Phase 1 tasks together; all Phase 2 tasks together), but Phase 2 must not start until Gate B passes, and the user-visible summary must not precede Gate D.
Parse $ARGUMENTS to extract:
spec_path: first positional argumentrepo_paths: remaining positional arguments (must be 2+)labels: from --labels or derived from directory namesweights: from --weights or defaultsbranch: from --branch or mainDefault Weights:
{
"functionality": 30,
"security": 25,
"tests": 20,
"overengineering": 15,
"dead_code": 10
}
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }
for repo in "${REPO_PATHS[@]}"; do
[ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done
[ ${#REPO_PATHS[@]} -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
[ -z "$SPEC_CONTENT" ] && { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }
Load the llm-judge skill: Skill(skill: "beagle-analysis:llm-judge")
Spawn one Task per repo:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: "beagle-core:llm-artifacts-detection") for analysis
Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.
Do NOT score or judge. Only gather facts.
Collect all repo outputs into ALL_FACTS.
echo "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }
Spawn five judge agents, one per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric
Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.
for repo_label in labels:
scores[repo_label] = {}
for dimension in dimensions:
scores[repo_label][dimension] = judge_outputs[dimension]['scores'][repo_label]
weighted_total = sum(
scores[repo_label][dim]['score'] * weights[dim] / 100
for dim in dimensions
)
scores[repo_label]['weighted_total'] = round(weighted_total, 2)
ranking = sorted(labels, key=lambda l: scores[l]['weighted_total'], reverse=True)
Name the winner, explain why they won, and note any close calls or trade-offs.
mkdir -p .beagle
Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"
The generated report should include:
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
For each repository, spawn a Task agent with:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
Collect all repo-agent outputs into ALL_FACTS.
After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
.beagle/llm-judge-report.json.Display a markdown summary with scores, ranking, verdict, and detailed justifications.
Before completing (maps to Hard gates D and E):
.beagle/llm-judge-report.json exists and json.load succeeds.weighted_total equals the sum over dimensions of (score × weight / 100) using the configured weights; markdown summary matches the JSON report.npx claudepluginhub existential-birds/beagle --plugin beagle-analysisEvaluates a codebase across 12 pillars using 3 parallel evaluator agents, producing a scored assessment for targeted remediation.
Performs comprehensive multi-agent evaluation of code projects across 12 dimensions like safety, completeness, and design quality. Outputs scored reports with executive summaries and improvement roadmaps in 5-10 minutes.
Runs Agent-Ready Codebase Assessment scoring codebase across 8 dimensions with parallel agents, producing weighted 0-100 score, band rating, and improvement roadmap. Supports Ruby, Python, PHP, TypeScript, JavaScript, Go, Java, Scala, Rust.