From nerd
Reference for intern training data formats, benchmark structure, and evaluation protocol. Use when running aptitude tests, collecting training data, or evaluating intern performance.
Slash command: `/nerd:intern-training`
The benchmark seed is located at `${CLAUDE_PLUGIN_ROOT}/skills/intern-training/benchmark-seed/`.
benchmark-seed/
├── manifest.json # Version, counts, language coverage
├── parameter-detection/ # 5 examples
│ ├── pd-001-rust-search.json
│ ├── pd-002-python-retry.json
│ ├── pd-003-ts-cache.json # Includes false positives (PI, HTTP_OK)
│ ├── pd-004-go-ratelimit.json
│ └── pd-005-python-ml.json # Includes RANDOM_SEED (not tunable)
├── result-classification/ # 5 examples
│ ├── rc-001-clear-improvement.json
│ ├── rc-002-clear-regression.json
│ ├── rc-003-neutral.json
│ ├── rc-004-mixed-signals.json # Hard: throughput up but latency/memory up
│ └── rc-005-subtle-regression.json # Medium: overfitting pattern
└── context-extraction/ # 4 examples
    ├── ce-001-auth-middleware.json
    ├── ce-002-cache-eviction.json
    ├── ce-003-rate-limiter.json
    └── ce-004-retry-backoff.json
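
As a rough illustration of how the layout above might be consumed, the sketch below groups the example files by their task-type directory. The function name `index_benchmark_seed` is hypothetical and not part of the plugin; `manifest.json` is skipped on purpose because its exact schema is not shown here.

```python
import json
from collections import defaultdict
from pathlib import Path


def index_benchmark_seed(seed_dir):
    """Group benchmark example files by task-type directory.

    Returns {task_type: [parsed example dict, ...]}, ordered by file path.
    Hypothetical helper: only the directory layout is taken from the
    document, nothing about the plugin's own loading code.
    """
    seed_dir = Path(seed_dir)
    examples = defaultdict(list)
    # "*/*.json" matches files exactly one directory deep, so the
    # top-level manifest.json is not picked up.
    for path in sorted(seed_dir.glob("*/*.json")):
        with path.open(encoding="utf-8") as fh:
            examples[path.parent.name].append(json.load(fh))
    return dict(examples)
```

Pointed at the benchmark-seed directory, the returned mapping should contain three keys whose list lengths match the counts in the tree comments (5, 5, and 4), which makes for a quick sanity check before running an aptitude test.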
Each benchmark example is a JSON file with:
- `id`: Unique identifier (e.g., "pd-001-rust-search")
- `language`: Programming language (for parameter-detection and context-extraction)
- `difficulty`: `easy`, `medium`, or `hard`
- `context_tokens`: Approximate token count of input
- `input`: The input to send to the intern
- `expected_output`: The ground truth output to score against
- `notes`: Optional notes about tricky aspects (false positives to avoid, etc.)

Training data is stored at `.nerd/intern/training-data/{task_type}.jsonl`.
Each line is a JSON object:
{
  "task_type": "result-classification",
  "input": {"experiment_id": "E001", "results": {}},
  "output": {"classification": "improved", "evidence": "..."},
  "reasoning": "Claude's chain-of-thought explanation",
  "source_agent": "report-compiler",
  "created_at": "2026-03-15T10:30:00Z",
  "run_id": "run-2026-03-15-001",
  "dedup_key": "E001:result-classification"
}
The reasoning field captures Claude's chain-of-thought for knowledge distillation in v2.
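
A minimal sketch of how a record in this shape could be appended to the per-task JSONL file follows. The helper name `append_training_record` is hypothetical, and skipping a record whose `dedup_key` is already present is an assumed policy; the document names the field but does not state how duplicates are handled.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_training_record(data_dir, record):
    """Append one record to {task_type}.jsonl unless its dedup_key already exists.

    Returns True if the record was written, False if it was skipped.
    Assumption: a repeated dedup_key means "skip", which is not documented.
    """
    path = Path(data_dir) / f"{record['task_type']}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)

    # Collect the dedup keys already on disk (one JSON object per line).
    seen = set()
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    seen.add(json.loads(line).get("dedup_key"))

    if record["dedup_key"] in seen:
        return False

    # Copy so the caller's dict is untouched; add a UTC timestamp only
    # when the caller did not supply one.
    record = dict(record)
    record.setdefault(
        "created_at", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    )
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return True
```

Calling it with `data_dir=".nerd/intern/training-data"` would write to the path pattern given above. Re-reading the file on every append keeps the sketch simple; a real collector handling many records would likely keep the seen keys in memory.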
npx claudepluginhub shawnroos/shrimpshack --plugin nerd