From godmode
Guides AI/LLM evaluation workflows: discovers existing setups, designs datasets, selects frameworks (promptfoo, RAGAS, DeepEval), implements LLM-as-judge with rubrics, and configures regression testing.
How this skill is triggered — by the user, by Claude, or both
Slash command
/godmode:evalThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- User invokes `/godmode:eval`
/godmode:eval# Find existing eval infrastructure
find . -name "eval*" -o -name "benchmark*" \
-o -name "judge*" | grep -v node_modules
# Check for eval frameworks
grep -l "deepeval\|ragas\|promptfoo\|braintrust" \
package.json pyproject.toml requirements.txt \
2>/dev/null
EVALUATION DISCOVERY:
System: <which AI system to evaluate>
Type: LLM prompt | RAG pipeline | AI agent | model
Trigger: new system | model change | prompt change
Dimensions: correctness, relevance, faithfulness,
safety, format compliance, latency, cost
IF no baseline exists: establish baseline first
IF model changed: run full regression suite
IF prompt changed: run targeted eval on affected dims
DATASET SOURCES:
| Source | Count | Quality |
|-----------------|-------|-----------|
| Golden set | <N> | Highest |
| Production logs | <N> | Realistic |
| Synthetic | <N> | Scalable |
| Adversarial | <N> | Edge cases|
THRESHOLDS:
Minimum golden set: 50 examples
Minimum per category: 10 examples
Adversarial coverage: >= 20% of total set
IF dataset < 50: results not statistically reliable
IF any category < 10: expand before trusting scores
| Framework | Best For |
|------------|-------------------------------|
| RAGAS | RAG: faithfulness, relevance |
| DeepEval | General LLM: 14+ metrics, CI |
| Promptfoo | Prompt testing, comparisons |
| Braintrust | Production evals, experiments |
JUDGE DESIGN:
Judge model: <stronger than system under test>
RULE: Never judge a model with itself.
SCORING RUBRIC:
| Dimension | Scale | Pass Threshold |
|-------------|-------|----------------|
| Correctness | 1-5 | >= 4 |
| Relevance | 1-5 | >= 4 |
| Faithfulness| 1-5 | >= 4 |
| Safety | binary| 100% |
CALIBRATION:
Cohen's kappa vs human ratings: >= 0.7 required
IF kappa < 0.7: refine rubric, add examples
IF judge disagrees > 30%: re-calibrate
REGRESSION SET:
Source: production failures + bug fixes
Size: grows over time, never shrinks
Format: { input, expected, failure_desc, ticket }
PIPELINE:
Trigger: every prompt/model/pipeline change
Load set → Run system → Compare vs gold → Alert
COMPARISON STRATEGIES:
Exact match: structured/JSON outputs
Semantic: cosine similarity > 0.85
LLM judge: acceptable per rubric
Assertions: contains X, not contains Y
IF regression found: block deployment
IF new bug fixed: add to regression set immediately
STATISTICAL ANALYSIS:
| Metric | System A | System B | p-value |
|-------------|----------|----------|---------|
| Correctness | <val> | <val> | <p> |
| Relevance | <val> | <val> | <p> |
| Faithfulness| <val> | <val> | <p> |
| Latency p95 | <ms> | <ms> | <p> |
RULES:
Alpha: 0.05. Multiple correction: Bonferroni.
Minimum sample: 50 per variant.
Report 95% confidence intervals for all metrics.
IF p >= 0.05: no significant difference — do not ship.
IF sample < 30: results unreliable, expand dataset.
EVALUATION COMPLETE:
System: <name>, Dataset: <N> examples
Methods: automated (LLM judge) | human | hybrid
| Metric | Score | Target | Status |
|-------------|-------|--------|-----------|
| Correctness | <val> | >= 4.0 | PASS/FAIL |
| Faithfulness| <val> | >= 4.0 | PASS/FAIL |
| Safety | <val> | 100% | PASS/FAIL |
| Latency p95 | <ms> | <ms> | PASS/FAIL |
Regressions: <N>/<N> passing
Verdict: PASS | FAIL — <N> metrics below target
Commit: "eval: <system> — <N> examples, correctness=<val>, faithfulness=<val>"
Never ask to continue. Loop autonomously until done.
1. Scan: evals/ dir, promptfoo config, deepeval
2. Check for: datasets, judge prompts, baselines
3. Detect: model configs, prompt templates
FOR each quality dimension:
1. Score with calibrated judge
2. Compare to baseline with confidence interval
3. Run significance test (paired bootstrap)
4. IF metric < threshold: flag, refine prompt
5. IF judge disagreement > 30%: re-calibrate
6. IF p >= 0.05: inconclusive, expand dataset
POST-LOOP: Run full regression suite
Print: Eval: {system} — correctness {val}, faithfulness {val}, regressions {N}/{total}. Significance: {yes|no}. Verdict: {verdict}.
timestamp system metric baseline result p_value status
KEEP if: metric meets threshold AND statistically
significant AND judge agreement > 70%
DISCARD if: below threshold OR not significant
OR judge disagreement > 30%
STOP when ANY of:
- All dimensions evaluated with baseline comparison
- Regression tests pass and integrated into CI
- Statistical significance computed for all metrics
- User requests stop
npx claudepluginhub arbazkhan971/godmodeBuilds structured evaluation suites for LLM and AI system performance using reproducible metrics. Use when testing model quality, prompt changes, or regression detection.
Evaluates LLM application performance using automated metrics (BLEU, ROUGE, BERTScore), human evaluation, and LLM-as-judge techniques. Helps detect regressions and validate improvements.
Evaluates LLM apps using automated metrics (BLEU, ROUGE, BERTScore, MRR), human feedback, and LLM-as-judge. For testing performance, benchmarking, and regressions.