Canonical delegation protocol for the nerd intern. Reference this when delegating tasks to the local LLM in /nerd or /nerd-this orchestrators. Defines health checks, timeouts, confidence gating, shadow comparison, fallback, and logging.
Slash command: /nerd:intern-delegation
The orchestrator (not individual agents) handles all intern delegation. Agents stay as clean primitives — they do their job. The orchestrator wraps agent calls with an intern-first-or-shadow layer.
Run once at the start of every /nerd or /nerd-this run when intern.enabled: true.
The orchestrator resolves the config source (global or project) during pre-flight and passes the values. The health check consumes those values — it does NOT read config files directly.
```bash
# These values are passed from the orchestrator's pre-flight resolution:
# INTERN_PROVIDER, INTERN_MODEL, INTERN_ENDPOINT
if [ "$INTERN_PROVIDER" = "ollama" ] || [ -z "$INTERN_PROVIDER" ]; then
  # Ollama: use native /api/tags endpoint
  HEALTH=$(curl -s -m 5 "http://localhost:11434/api/tags" 2>/dev/null)
  # Verify the model appears in the list of locally available models
  echo "$HEALTH" | python3 -c "import json,sys; d=json.load(sys.stdin); models=[m['name'] for m in d.get('models',[])]; sys.exit(0 if any('${INTERN_MODEL}' in m for m in models) else 1)" 2>/dev/null
else
  # Other providers: use OpenAI-compatible endpoint
  HEALTH=$(curl -s -m 5 "${INTERN_ENDPOINT}/v1/models" 2>/dev/null)
fi
```
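Pass criteria:

- The model is in the provider's model list (for Ollama, the /api/tags model list)
- The listed model name matches the configured model (e.g. qwen3:4b matches qwen3:4b)

If health check fails: Disable all delegation for this run. Log: "intern_health_check": "failed" to delegation log. Continue run normally (Claude handles everything).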
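The model check can also be sketched as a standalone Python function, which is easier to unit-test than an inline one-liner. This is a minimal sketch, not the plugin's actual code: the function name `model_available` is hypothetical, and it assumes the response body has the same shape the shell one-liner reads (a `models` array whose entries carry a `name`).

```python
import json

def model_available(tags_response: str, model: str) -> bool:
    """Return True if `model` appears in an Ollama /api/tags response body."""
    try:
        data = json.loads(tags_response)
    except (json.JSONDecodeError, TypeError):
        # An empty or non-JSON body (endpoint down, timeout) fails the check
        return False
    if not isinstance(data, dict):
        return False
    names = [m.get("name", "") for m in (data.get("models") or []) if isinstance(m, dict)]
    # Substring match, mirroring the shell one-liner above
    return any(model in name for name in names)
```

A failed check here corresponds to disabling delegation for the run and letting Claude handle everything.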
For each delegatable task during the run, the orchestrator checks intern state from .nerd/intern/state.json:
```
Read task mode from state.json → intern.tasks.{task_type}.mode

If mode == "live":
  → Call intern, validate, gate on confidence
  → If passes: use result, skip Claude for this task
  → If fails: call Claude, pass intern's failed attempt as context

If mode == "shadow":
  → Call intern in background
  → Call Claude (always, result is authoritative)
  → Compare outputs, log agreement (counts toward promotion)
  → Use Claude's result

If mode == "disabled":
  → Call intern in background (always-shadow)
  → Call Claude (always, result is authoritative)
  → Compare outputs, log as passive observation (does NOT count toward promotion)
  → Use Claude's result
  → Training data still collected
```
Key difference between shadow and disabled: Both run the intern alongside Claude. Shadow agreements count toward the 20/25 promotion threshold. Disabled observations are logged but don't count — they're passive learning. This lets the intern build training data on tasks it hasn't formally "earned" yet.
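The three modes can be sketched as a single dispatch function. This is an illustrative sketch under stated assumptions: `delegate`, `call_intern`, and `call_claude` are hypothetical names, the intern call returns an already-validated dict or `None`, and for simplicity the intern runs sequentially rather than in the background.

```python
from typing import Callable, Optional

def delegate(mode: str,
             call_intern: Callable[[], Optional[dict]],
             call_claude: Callable[[Optional[dict]], dict],
             threshold: float = 0.8) -> dict:
    """Route one task according to its mode and report which result was used."""
    if mode == "live":
        attempt = call_intern()
        if attempt is not None and attempt.get("confidence", 0.0) >= threshold:
            return {"result": attempt, "result_used": "intern",
                    "counts_toward_promotion": False}
        # Intern failed the gate: Claude takes over and sees the failed attempt
        return {"result": call_claude(attempt), "result_used": "claude",
                "counts_toward_promotion": False}
    if mode in ("shadow", "disabled"):
        attempt = call_intern()      # runs alongside Claude
        result = call_claude(None)   # Claude's result is authoritative
        return {"result": result, "result_used": "claude",
                "intern_attempt": attempt,
                # Only shadow comparisons count toward promotion
                "counts_toward_promotion": mode == "shadow"}
    raise ValueError(f"unknown mode: {mode!r}")
```

Note that the shadow and disabled branches differ only in the `counts_toward_promotion` flag, which is the distinction described above.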
Provider-aware calling. Different providers have different APIs. The protocol adapts based on the provider field in config.
Use Ollama's native API (/api/chat), NOT the OpenAI-compatible endpoint. Required because:

- The native API accepts "think": false to disable the reasoning field (though models may still reason in content; see Response Parsing below)

```bash
RESPONSE=$(curl -s -m 180 --connect-timeout 5 \
  -H "Content-Type: application/json" \
  -d '{
    "model": "{intern.model}",
    "messages": [
      {"role": "system", "content": "{task-specific system prompt}"},
      {"role": "user", "content": "{task input}"}
    ],
    "stream": false,
    "think": false,
    "options": {"temperature": 0, "num_predict": 4096}
  }' \
  "http://localhost:11434/api/chat")

# Parse: extract content from native response
CONTENT=$(echo "$RESPONSE" | python3 -c "import json,sys; print(json.load(sys.stdin)['message']['content'])")
```
For other providers, use the OpenAI-compatible endpoint:

```bash
RESPONSE=$(curl -s -m 180 --connect-timeout 5 \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${NERD_INTERN_API_KEY:-EMPTY}" \
  -d '{
    "model": "{intern.model}",
    "messages": [...],
    "temperature": 0,
    "max_tokens": 4096
  }' \
  "{intern.endpoint}/v1/chat/completions")
```
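The provider switch can be sketched as one function that builds the request without sending it. This is a hedged sketch rather than the plugin's code: `build_request` is a hypothetical helper, the non-Ollama branch is assumed to send the same system and user messages as the Ollama branch, and any provider name other than `ollama` is treated as OpenAI-compatible.

```python
import json

def build_request(provider, model, endpoint, system_prompt, task_input, api_key="EMPTY"):
    """Return (url, headers, body_bytes) for one intern call, without sending it."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task_input},
    ]
    headers = {"Content-Type": "application/json"}
    base = endpoint.rstrip("/")
    if provider in ("ollama", "", None):
        # Native Ollama API: supports "think": false and an "options" object
        url = f"{base}/api/chat"
        body = {"model": model, "messages": messages, "stream": False,
                "think": False,
                "options": {"temperature": 0, "num_predict": 4096}}
    else:
        # Assumed OpenAI-compatible provider
        url = f"{base}/v1/chat/completions"
        headers["Authorization"] = f"Bearer {api_key}"
        body = {"model": model, "messages": messages,
                "temperature": 0, "max_tokens": 4096}
    return url, headers, json.dumps(body).encode("utf-8")
```

Building the body with `json.dumps` has one practical advantage over a shell template: quotes and newlines in the task input are escaped correctly.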
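Small models cannot reliably return pure JSON. Even with think: false and explicit "return ONLY JSON" instructions, models like Qwen3 produce reasoning text with JSON embedded. The delegation protocol MUST:

- Strip <think> tags: re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL)
- Locate the JSON object that contains the expected key (e.g. "parameters", "classification", "summary") using regex.

```python
import json
import re

def extract_json(text, expected_key):
    """Return the first JSON object in `text` that contains `expected_key`, else None."""
    # Drop any <think>...</think> reasoning emitted despite think: false
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()
    # Fast path: the whole response is already valid JSON
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and expected_key in obj:
            return obj
    except json.JSONDecodeError:
        pass
    # Slow path: try to decode an object starting at each '{'.
    # raw_decode copes with braces inside string values, unlike manual depth counting.
    decoder = json.JSONDecoder()
    for match in re.finditer(r'\{', text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and expected_key in obj:
            return obj
    return None
```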
| Layer | Timeout | Purpose |
|---|---|---|
| Connection | 5 seconds | Detect endpoint down (allow model loading) |
| Total request | 180 seconds | Allow cold model loading + thinking + generation |
Why 180s, not 30s: Testing showed 60-180s per call on a 4B model on M1 Pro. First calls are slowest (model loading). The failure budget (3 per run) prevents cascading delays even with generous timeouts.
| Hardware | 4B model | 1B model |
|---|---|---|
| M1 Pro 16GB | 60-180s | 20-60s |
| M2/M3/M4 32GB+ | 20-60s | 10-30s |
| CUDA 16GB+ | 10-30s | 5-15s |
Shadow mode (background) tolerates high latency. Live mode requires <30s per call to be practical — may need a smaller model or faster hardware.
Before using any intern output:

- Schema check, per task type:
  - parameter-detection: parameters array with name, file, line, value per entry
  - result-classification: classification (one of improved/regressed/neutral), evidence string
  - context-extraction: summary string (10-500 chars)
  - perf-area-mapping: areas array with file, function, characteristics per entry
  - perf-classification: classification (one of improved/regressed/neutral), evidence string, metrics object
- Confidence check: the output must include a confidence field (0.0-1.0)
- Threshold check: if confidence < intern.confidence_threshold (default 0.8), fall back to Claude

Any validation failure = confidence 0 = automatic fallback to Claude.
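The validation and confidence gate can be sketched as a small table of per-task checks. This is a minimal sketch, not the plugin's implementation: `gate` and `SCHEMA_CHECKS` are hypothetical names, and the pairing of each schema with a task type is an assumption based on the order in which the tasks are listed elsewhere in this document.

```python
CLASSES = ("improved", "regressed", "neutral")

def _entries_have(items, keys):
    """True if `items` is a list of dicts that each contain every key in `keys`."""
    return isinstance(items, list) and all(
        isinstance(e, dict) and all(k in e for k in keys) for e in items)

def _classified(out):
    return out.get("classification") in CLASSES and isinstance(out.get("evidence"), str)

# Assumed mapping of task type to schema check
SCHEMA_CHECKS = {
    "parameter-detection": lambda o: _entries_have(
        o.get("parameters"), ("name", "file", "line", "value")),
    "result-classification": _classified,
    "context-extraction": lambda o: isinstance(o.get("summary"), str)
        and 10 <= len(o["summary"]) <= 500,
    "perf-area-mapping": lambda o: _entries_have(
        o.get("areas"), ("file", "function", "characteristics")),
    "perf-classification": lambda o: _classified(o) and isinstance(o.get("metrics"), dict),
}

def gate(task_type, output, threshold=0.8):
    """Return (accepted, reason); any failure means falling back to Claude."""
    check = SCHEMA_CHECKS.get(task_type)
    if check is None or not isinstance(output, dict):
        return False, "unparseable output or unknown task type"
    if not check(output):
        return False, "schema check failed"
    confidence = output.get("confidence")
    if (isinstance(confidence, bool) or not isinstance(confidence, (int, float))
            or not 0.0 <= confidence <= 1.0):
        return False, "missing or invalid confidence"
    if confidence < threshold:
        return False, f"confidence {confidence} below threshold {threshold}"
    return True, "ok"
```

The returned reason string is the kind of value that could be passed to Claude as the failure reason when a fallback occurs.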
When a task is in shadow or disabled (always-shadow) mode, both the intern and Claude produce output. Compare them:
| Task | Agreement Metric | Agreement Threshold |
|---|---|---|
| parameter-detection | F1 score of detected parameters | F1 >= 0.8 |
| result-classification | Exact match of classification | Exact match |
| context-extraction | Jaccard similarity of key terms | Jaccard >= 0.7 |
| perf-area-mapping | F1 score of identified areas (by file+function) | F1 >= 0.7 |
| perf-classification | Exact match of classification | Exact match |
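The agreement metrics in the table can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, each output is assumed to be already reduced to a collection of hashable items (for example `(file, function)` tuples for perf-area-mapping, or a set of key terms for context-extraction), and two empty outputs are treated as full agreement.

```python
def f1_score(predicted, reference):
    """F1 between two collections of hashable items, compared as sets."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # assumption: two empty outputs agree
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty outputs agree
    return len(a & b) / len(a | b)

def agrees(task_type, intern_output, claude_output):
    """Apply the per-task agreement threshold from the table above."""
    if task_type in ("result-classification", "perf-classification"):
        return intern_output == claude_output
    if task_type == "context-extraction":
        return jaccard(intern_output, claude_output) >= 0.7
    threshold = {"parameter-detection": 0.8, "perf-area-mapping": 0.7}[task_type]
    return f1_score(intern_output, claude_output) >= threshold
```

Treating Claude's output as the reference set makes precision penalize items the intern invented and recall penalize items it missed.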
Rolling window: Track last 25 shadow comparisons per task. Promotion requires 20/25 agreements (not consecutive — tolerates Claude's non-determinism).
Demotion: If accuracy drops below the mode's threshold for 3 consecutive evals, demote one level.
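The rolling window and demotion rule can be sketched with a bounded deque. This is a minimal sketch with hypothetical names (`ShadowWindow`, `should_demote`). The per-mode accuracy threshold is not specified here, so it is left as a parameter the caller must supply.

```python
from collections import deque

class ShadowWindow:
    """Tracks the most recent shadow comparisons for one task."""

    def __init__(self, size=25, required=20):
        self.results = deque(maxlen=size)  # oldest result drops off automatically
        self.required = required

    def record(self, agreed):
        self.results.append(bool(agreed))

    def ready_for_promotion(self):
        # Agreements need not be consecutive, only 20 within the last 25
        return sum(self.results) >= self.required

def should_demote(eval_accuracies, threshold, streak=3):
    """True if the last `streak` evals were all below `threshold`."""
    recent = list(eval_accuracies)[-streak:]
    return len(recent) == streak and all(a < threshold for a in recent)
```

For example, 5 disagreements followed by 20 agreements is enough for promotion, while 19 agreements out of 25 is not.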
When the intern fails and Claude takes over, pass the intern's attempt as additional context:
"The intern attempted this task and produced: {intern_output}
It failed validation because: {failure_reason}
Please handle this task from scratch, but the intern's attempt may contain useful partial work."
This creates higher-quality training data (Claude correcting specific intern errors) and may improve Claude's response.
Track failures per run. If the intern falls back to Claude more than 3 times in a single run, disable delegation for the remainder of that run. This prevents cascading latency from a misbehaving model.
Persistent failure is handled by the shadow window's demotion criteria — if accuracy drops below threshold, the task demotes automatically. No separate circuit breaker needed.
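The per-run failure budget amounts to a counter with a cutoff. A minimal sketch, assuming a hypothetical `FailureBudget` object that the orchestrator creates at the start of each run:

```python
class FailureBudget:
    """Per-run counter that disables delegation after too many fallbacks."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def record_fallback(self):
        self.failures += 1

    @property
    def delegation_enabled(self):
        # "More than 3" fallbacks disables delegation, so the 4th one trips it
        return self.failures <= self.max_failures
```

Because the object lives only for one run, the budget resets naturally on the next run, and longer-term failure is left to the demotion criteria.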
Append to .nerd/intern/delegation-log.jsonl after each delegation attempt:
```json
{
  "run_id": "run-2026-03-15-001",
  "task_type": "parameter-detection",
  "mode": "live",
  "intern_called": true,
  "intern_latency_ms": 2340,
  "intern_confidence": 0.85,
  "validation_passed": true,
  "result_used": "intern",
  "agreement": null,
  "timestamp": "2026-03-15T10:30:00Z"
}
```
For shadow mode, result_used is always "claude" and agreement is true/false.
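Appending to the log can be sketched in a few lines. This is a hedged sketch: `log_delegation` is a hypothetical helper, and it assumes each entry is written as a single JSON line (the usual JSONL convention), with a UTC timestamp filled in when the caller does not supply one.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_delegation(log_path, entry):
    """Append one delegation record as a single JSON line and return it."""
    record = dict(entry)
    record.setdefault(
        "timestamp", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"))
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create .nerd/intern/ if needed
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Opening the file in append mode means earlier entries from the same run are never rewritten.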
After all phases complete, the orchestrator reads the delegation log for this run and updates .nerd/intern/state.json atomically:
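One common way to make such a write atomic is to write to a temporary file in the same directory and then rename it over the target. The following is a minimal sketch of that technique only, with a hypothetical function name. It does not cover how the new state is computed from the delegation log.

```python
import json
import os
import tempfile

def write_state_atomically(state_path, state):
    """Write `state` as JSON so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(state_path))
    os.makedirs(directory, exist_ok=True)
    # The temp file must be on the same filesystem for the rename to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, state_path)  # atomic replace of the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process is interrupted before `os.replace`, the existing state.json is left untouched.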
The intern is configured globally by default so it shadows across all projects. Per-project overrides are optional.
- Project override: .claude/nerd.local.md → intern: section
- Global: ~/.claude/plugins/nerd/intern/config.yaml

```bash
# Resolution logic
if grep -q "intern:" .claude/nerd.local.md 2>/dev/null; then
  # Project-level override: use it (may disable intern for this project)
  SOURCE="project"
elif [ -f ~/.claude/plugins/nerd/intern/config.yaml ]; then
  # Global config: use it
  SOURCE="global"
else
  # No intern configured
  SOURCE="none"
fi
```
State file locations:

- Project: .nerd/intern/state.json (if project config exists)
- Global: ~/.claude/plugins/nerd/intern/state.json

Why global state: The intern's competence is about the model, not the codebase. Shadow agreements from project A count toward promotion just as much as agreements from project B. Global state means the intern earns live mode faster across all your work.
Training data is dual-written: Both project-local (.nerd/intern/training-data/) AND global (~/.claude/plugins/nerd/intern/training-data/). The global corpus includes a project field for traceability. This means the intern's aptitude test and auto-eval can draw from all prior research runs across all projects.
Global config (~/.claude/plugins/nerd/intern/config.yaml):

```yaml
provider: ollama
model: qwen3:4b
endpoint: http://localhost:11434
confidence_threshold: 0.8
collect_training_data: true
```
Global state (~/.claude/plugins/nerd/intern/state.json):

```json
{
  "tasks": {
    "parameter-detection": {
      "mode": "shadow",
      "accuracy": 0.96,
      "shadow_window": [true, true, false, true, true],
      "promoted_at": null
    },
    "result-classification": { "mode": "shadow", "accuracy": 0.60, "shadow_window": [], "promoted_at": null },
    "context-extraction": { "mode": "disabled", "accuracy": 0.46, "shadow_window": [], "promoted_at": null },
    "perf-area-mapping": { "mode": "disabled", "accuracy": 0.0, "shadow_window": [], "promoted_at": null },
    "perf-classification": { "mode": "disabled", "accuracy": 0.0, "shadow_window": [], "promoted_at": null }
  },
  "last_run": {
    "delegated": 3,
    "fallbacks": 1,
    "total_intern_time_ms": 7200
  },
  "lifetime_claude_calls_saved": 0
}
```
Project override (.claude/nerd.local.md):

```yaml
intern:
  enabled: false  # Disable intern for this project
  # Or override specific settings:
  # model: qwen3:1b            # Use a smaller model for this project
  # confidence_threshold: 0.9  # Be more conservative here
```
State migration: When adding a project-local intern config for the first time, the orchestrator should copy the current global state.json to .nerd/intern/state.json as a starting point. This preserves accumulated shadow history. If .nerd/intern/state.json already exists, do not overwrite it.
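The copy-once migration rule can be sketched as follows. This is a minimal sketch with a hypothetical function name, and the paths are passed in rather than hard-coded so the logic can be exercised against temporary directories.

```python
import shutil
from pathlib import Path

def migrate_state(global_state, project_state):
    """Copy global state into the project once; return True if a copy was made."""
    global_state, project_state = Path(global_state), Path(project_state)
    # Never overwrite existing project state; nothing to copy if no global state
    if project_state.exists() or not global_state.exists():
        return False
    project_state.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(global_state, project_state)
    return True
```

Calling it again after the project state exists is a no-op, which preserves any project-local history accumulated since the first copy.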
To disable the intern for a project, set the following in .claude/nerd.local.md:

```yaml
intern:
  enabled: false
```
When the intern is configured (global or local), it ALWAYS shadows Claude on research tasks — even if all task modes are disabled. The shadow comparison is free (the local model runs on your hardware, Claude is already running for the research job).
The behavior per mode:

- live: Intern goes first. If confident enough, use its result. Otherwise fall back to Claude.
- shadow: Both run. Use Claude's result. Compare and log agreement.
- disabled: Both run. Use Claude's result. Compare and log, but don't count toward promotion. This is passive observation that builds training data without affecting promotion thresholds.

Why always-shadow for disabled tasks: The intern needs volume to improve. Waiting for a task to be manually promoted to shadow before collecting any data wastes every research run in between. Passive shadowing on disabled tasks builds training data and lets the user see improvement trends in /nerd-intern status before deciding to promote.
The only time the intern doesn't run: When there is no intern configured at all (no global config, no project config), or when the endpoint health check fails at Phase 0.