From brain-os
Analyzes skill outcome logs and user corrections, then proposes self-improvements to skill behavior and description hygiene.
Slash command: `/brain-os:improve`
```
/improve <skill-name>   # Improve a specific skill (Phase 1–5, from outcome logs + description hygiene)
/improve                # Scan all outcome logs, rank by error rate, suggest which to improve
/improve memory         # Triage memory feedback inbox via 5-tier rubric (Phase 0: Step 0.0/0.5/0.7 only)
```
Description hygiene is integrated into the per-skill flow — Phase 1 detects bloated frontmatter as one signal source, Phase 4 generates a "description trim" mutation strategy alongside the other strategies, and the existing semantic-judge gate verifies trigger preservation. There is no separate /improve descriptions sub-mode; every per-skill run includes the hygiene check.
Three modes:
- Automatic: `/improve {skill}` runs after an outcome log line with `result != pass`. No user confirmation.
- Daily cron: runs Phase 0 (`/improve memory`), then scans logs, picks the top candidate, and runs full Phase 1–5.
- Manual: `/improve <skill>` ad hoc. `/improve memory` runs Phase 0 (Step 0.0 + 0.5 + 0.7) only.

Six sources, processed in Phase 1 in this order:

1. Outcome log: `corrections=N` and `interrupt="..."` are the strongest signals.
2. Trace files.
3. Grill sessions.
4. Aha moments.
5. User corrections detected via git diff.
6. Frontmatter description bloat: static check via `scripts/scan-skill-descriptions.py`. Always runs; no outcome data needed.

Full auto. No human-in-loop by default. Safety relies on the Phase 4 eval gate. Manual review mode is NOT supported; the eval gate is the safety net.
Description trim variants follow the same Phase 4 gate as every other variant. The semantic-preservation judge verifies trigger routing via the trigger-preservation gate. No separate carve-out — the existing eval gate is the safety net.
- `/improve <skill>` → full Phase 1–5 on that skill only.
- `/improve` (no arg) → Phase 1 rank-only: scans all logs, recommends the top candidate, stops and reports. Does NOT auto-run Phase 2–5.
- To do both, run `/improve` with no argument first to pick a skill, then run `/improve <top>` separately.

Outcome log lines have 7 required pipe-delimited fields plus optional trailing `key=value` pairs. The parser ignores unknown keys (forward-compat). Never change the 7 required fields or their order; that breaks every prior log line.
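To make the log-line contract concrete, here is a minimal parsing sketch. It assumes each trailing `key=value` pair sits in its own pipe-delimited segment and that values never contain `|`; neither is confirmed by this document. `OutcomeEntry`, `parse_outcome_line`, and the `KNOWN_KEYS` set are hypothetical names, and the known keys are limited to the four optional fields listed for Phase 1.

```python
from dataclasses import dataclass, field

# The 7 required fields, in the fixed order the log format prescribes.
REQUIRED = ("date", "skill", "action", "source_repo", "output_path", "commit", "result")
# Optional keys named for Phase 1; any other key is ignored (forward-compat).
KNOWN_KEYS = {"corrections", "args", "score", "interrupt"}


@dataclass
class OutcomeEntry:
    date: str
    skill: str
    action: str
    source_repo: str
    output_path: str
    commit: str
    result: str
    extras: dict = field(default_factory=dict)


def parse_outcome_line(line: str) -> OutcomeEntry:
    """Split one log line into required fields plus recognised key=value extras."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) < len(REQUIRED):
        raise ValueError(f"expected {len(REQUIRED)} required fields, got {len(parts)}")
    values = dict(zip(REQUIRED, parts))
    # The sixth field is written as "commit:<hash>"; keep only the hash.
    values["commit"] = values["commit"].removeprefix("commit:")
    extras = {}
    for chunk in parts[len(REQUIRED):]:
        key, sep, value = chunk.partition("=")
        if sep and key.strip() in KNOWN_KEYS:
            extras[key.strip()] = value.strip().strip('"')
    return OutcomeEntry(**values, extras=extras)
```

Because unknown keys are dropped rather than rejected, older parsers keep working when new optional fields are appended, which is the forward-compatibility property described above.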
Phase 0: memory triage (`/improve memory`)

Phase 0 runs before Phase 1 in the daily cron and when invoked via `/improve memory`. Empty memory dir → no-op.
Step 0.0 (triage): glob `~/.claude*/projects/*/memory/feedback_*.md` across all Claude account dirs, then realpath-dedupe so symlinks between accounts don't double-process the same physical file. Zero feedback files → skip Step 0.0; proceed to Step 0.5.
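The glob-then-dedupe step can be sketched as follows. This is an illustration, not the skill's actual script; `find_feedback_files` and its `home` parameter are hypothetical names introduced so the function can be pointed at a test directory.

```python
import glob
import os


def find_feedback_files(home=None):
    """Return each physical feedback file once, even when account dirs are symlinked."""
    home = home or os.path.expanduser("~")
    pattern = os.path.join(home, ".claude*", "projects", "*", "memory", "feedback_*.md")
    seen = set()
    unique = []
    for path in sorted(glob.glob(pattern)):
        real = os.path.realpath(path)  # resolves symlinks to the canonical path
        if real not in seen:
            seen.add(real)
            unique.append(real)
    return unique
```

An empty result corresponds to the "zero feedback files" case, where triage is skipped.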
Walk the rubric per file (top-to-bottom, first match wins):
Deterministic tool-input/output violation? → HOOK (PreToolUse/PostToolUse). Write ~/.claude/hooks/<name>.sh + register in ~/.claude/settings.json. Compliance: hard.
Multi-step workflow or behavioral procedure? → SKILL (with evals). Edit skills/{name}/SKILL.md + add matching evals. Compliance: encoded steps.
Path-scoped rule (only relevant in certain dirs/file types)? → .claude/rules/<topic>.md with paths: frontmatter. Compliance: soft.
Cross-cutting global guidance, long-term? → CLAUDE.md (project or user). Compliance: soft.
Default — none of the above? → MEMORY-STAY. Leave file in place; stamp last_validated: <today>.
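The "first match wins" ordering of the rubric can be illustrated with a small sketch. The yes/no answers themselves are judgment calls made while reading the feedback file; the code only shows how the ordering resolves them. `RubricAnswers`, `pick_tier`, and the `RULES` label for the path-scoped tier are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RubricAnswers:
    # One flag per rubric question, in rubric order. All names are illustrative.
    deterministic_tool_violation: bool = False
    multi_step_workflow: bool = False
    path_scoped: bool = False
    cross_cutting_global: bool = False


# (question, tier) pairs in top-to-bottom rubric order.
RUBRIC = [
    ("deterministic_tool_violation", "HOOK"),
    ("multi_step_workflow", "SKILL"),
    ("path_scoped", "RULES"),
    ("cross_cutting_global", "CLAUDE.md"),
]


def pick_tier(answers):
    """Return (tier, trace); the trace records every question checked before the match."""
    trace = []
    for question, tier in RUBRIC:
        hit = getattr(answers, question)
        trace.append(f"{question}={hit}")
        if hit:
            return tier, trace
    return "MEMORY-STAY", trace
```

The returned trace is the kind of record a rubric trace in a human-review issue could be built from.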
Confidence routing: output either `tier: <tier>` with `confidence: high` (encode immediately) or `tier: ambiguous` (file a `type:human-review` issue with a rubric trace and a proposal). Treat a case as ambiguous when two tiers both fit, the rule is too vague, or encoding requires design judgment.
Encoding execution: stage the encoding artifact, then delete the source memory feedback file (`rm <canonical_path>`). Commit with `improve: encoded memory feedback — <slug> → <tier>`. The rationale survives source deletion because the commit body quotes the original rule + Why + How-to-apply.
Failure handling: encoding error → leave source file in place; file type:human-review issue; continue.
Step 0.5 (index reconcile): idempotent. Rebuilds `MEMORY.md` from disk: for each `*.md` in the canonical memory dir (excluding `MEMORY.md`), read name + description and emit `- [<name>](<filename>) — <description>`, sorted by filename. Drop entries for files that no longer exist. Replace the content while preserving any standing footer note beginning with `_Memory feedback is`.
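A minimal sketch of the index rebuild, assuming each memory file carries `name:` and `description:` lines in its frontmatter (the document does not spell out where those two values are stored). `rebuild_index` and `front_matter_value` are hypothetical helper names.

```python
from pathlib import Path

FOOTER_PREFIX = "_Memory feedback is"


def front_matter_value(text, key):
    """Return the value of the first 'key: value' line, or an empty string."""
    for line in text.splitlines():
        if line.startswith(f"{key}:"):
            return line.split(":", 1)[1].strip()
    return ""


def rebuild_index(memory_dir: Path) -> str:
    index_path = memory_dir / "MEMORY.md"
    footer_lines = []
    if index_path.exists():
        old_lines = index_path.read_text(encoding="utf-8").splitlines()
        for i, line in enumerate(old_lines):
            if line.startswith(FOOTER_PREFIX):
                footer_lines = old_lines[i:]  # keep the standing footer note
                break
    entries = []
    for path in sorted(memory_dir.glob("*.md"), key=lambda p: p.name):
        if path.name == "MEMORY.md":
            continue
        text = path.read_text(encoding="utf-8")
        name = front_matter_value(text, "name") or path.stem
        description = front_matter_value(text, "description")
        entries.append(f"- [{name}]({path.name}) — {description}")
    content = "\n".join(entries)
    if footer_lines:
        content += "\n\n" + "\n".join(footer_lines)
    content += "\n"
    index_path.write_text(content, encoding="utf-8")
    return content
```

Since the index is regenerated from the files on disk each time, entries for deleted files disappear without extra bookkeeping, and a second run produces identical output.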
Step 0.7 (expiry): surface stale memory entries. Default `EXPIRY_DAYS=90`. Files without `last_validated:` are SKIPPED (lazy migration).
For each `*.md`: parse `last_validated:` (ISO date). If it is more than `EXPIRY_DAYS` days old, file a `type:human-review` issue titled `Memory expiry: <filename> ({N} days since last_validated)`. After the user closes the issue, the next run deletes the file.
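The expiry check reduces to a date subtraction. The sketch below assumes `last_validated:` appears on its own line as `YYYY-MM-DD`; `days_since_validated` and `expiry_issue_title` are hypothetical names, and filing the issue itself is left out.

```python
import re
from datetime import date
from typing import Optional

EXPIRY_DAYS = 90
_LAST_VALIDATED = re.compile(r"^last_validated:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def days_since_validated(text: str, today: date) -> Optional[int]:
    """Return the age in days, or None when the file has no last_validated stamp."""
    match = _LAST_VALIDATED.search(text)
    if match is None:
        return None
    return (today - date.fromisoformat(match.group(1))).days


def expiry_issue_title(filename: str, text: str, today: date,
                       expiry_days: int = EXPIRY_DAYS) -> Optional[str]:
    """Return the issue title for a stale file, or None if it is fresh or unstamped."""
    age = days_since_validated(text, today)
    if age is None or age <= expiry_days:
        return None
    return f"Memory expiry: {filename} ({age} days since last_validated)"
```

Returning `None` for unstamped files mirrors the lazy-migration rule: a file is never flagged until something has stamped it at least once.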
Telegram alert: post digest at end of each daily run via send_telegram().
When invoked without a skill name: scan all outcome logs, rank by partial + fail rate, recommend top candidate. Stop and report.
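Ranking by partial + fail rate might look like the sketch below. It takes the already-parsed `result` values per skill as input; `error_rate` and `rank_skills` are hypothetical names, and breaking ties alphabetically is an assumption of this sketch.

```python
def error_rate(results):
    """Fraction of outcomes that were 'partial' or 'fail'."""
    if not results:
        return 0.0
    bad = sum(1 for result in results if result in ("partial", "fail"))
    return bad / len(results)


def rank_skills(results_by_skill):
    """Return (skill, rate) pairs, highest error rate first; skills with no entries are skipped."""
    ranked = [(skill, error_rate(results))
              for skill, results in results_by_skill.items() if results]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

The first element of the returned list is the candidate to recommend.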
When invoked with a skill name:
Read outcome log at {vault}/daily/skill-outcomes/{skill}.log
Format: `date | skill | action | source_repo | output_path | commit:hash | result [| key=value ...]`

- `result`: `pass`, `partial`, or `fail`
- `key=value` fields: `corrections=N`, `args="..."`, `score=N.N`, `interrupt="..."`

Read trace files at `{vault}/daily/skill-traces/{skill}.jsonl`. Supplementary; skip if absent.
Scan grill sessions — grep {vault}/daily/grill-sessions/ for skill mentions. Clarifying-question detection: for each match, check whether ? appears within 2 lines. Each detected question = an underspec signal. Meta-case: questions about /improve's own mechanism → extend Clarifications section inline.
Scan aha moments — grep {vault}/thinking/aha/ for skill mentions. Role: supplementary context, NOT a priority hint. Can surface test cases, priority signals, or framing context.
Detect user corrections via git diff — for each log entry with a commit hash, check if output file was modified in a later user commit. Filter cosmetic diffs; keep substantive content changes only.
Frontmatter description bloat — static check on skills/{skill}/SKILL.md via scripts/scan-skill-descriptions.py. Always runs; no outcome data needed. Emits "description is bloated" pattern if flagged.
Six sources total. Weight: direct corrections > interrupts > partial+low-score > trace patterns. Trace patterns must co-occur with an outcome-log signal to justify a SKILL.md edit.
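The clarifying-question check from the grill-session scan can be sketched as below. "Within 2 lines" is read here as the matching line plus up to two lines on either side, which is one plausible interpretation rather than a confirmed one. `count_question_signals` is a hypothetical name.

```python
def count_question_signals(lines, skill, window=2):
    """Count skill mentions that have a '?' on the same line or within `window` lines."""
    signals = 0
    for index, line in enumerate(lines):
        if skill not in line:
            continue
        start = max(0, index - window)
        end = min(len(lines), index + window + 1)
        if any("?" in nearby for nearby in lines[start:end]):
            signals += 1
    return signals
```

Each counted mention corresponds to one underspec signal.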
Synthesize signals: what does the skill consistently get wrong? What do corrections share? Weight by corrections=N; use interrupt="..." as explicit failure reason; use score=N.N to distinguish severity.
When trace data exists, run these detectors on the JSONL tool-call sequence. Trace-derived patterns are supplementary — require co-occurrence with an outcome-log signal to justify a SKILL.md edit. Skip if no trace file.
Detector 1 — Skipped vault-first:
IF WebSearch/WebFetch before any Grep/Read targeting vault paths
→ Pattern: makes external calls before grepping vault → must search vault first
→ also: skill searches web before checking vault → should grep vault first per brain-first lookup rule
Detector 2 — Over-read context:
IF ≥10 Read calls in session, OR any Read >500 lines, OR same file read repeatedly
→ Pattern: reads same file repeatedly → should use targeted offset/limit reads
→ also: skill reads excessive context → should read targeted sections only
Detector 3 — Skipped grill:
IF skill SKILL.md defines a grill phase AND no Skill call with "grill" in input appears before Write/Edit
→ Pattern: produces output without grill → must run grill before generating
→ also: skill jumps to editing without reading specs/references → should consult skill references before producing output
Detector 4 — Missing spec consultation:
IF skill has references/ files AND trace shows zero Read calls to any references/ file
→ Pattern: skill ignores its own reference files → should read references/ before generating output
Write each pattern as: Pattern: <what happens> → <what user wants instead>
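As an illustration of Detector 1, the sketch below walks a trace in order and reports whether a web call happened before any vault lookup. The trace schema is not given in this document, so the `tool` and `input` keys, and the idea that a vault path appears as a string value inside `input`, are assumptions. All function names are hypothetical.

```python
import json

WEB_TOOLS = {"WebSearch", "WebFetch"}
VAULT_TOOLS = {"Grep", "Read"}


def load_trace(jsonl_text):
    """Parse a JSONL trace into a list of tool-call dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def targets_vault(call, vault_root):
    """True if any string value in the call input starts with the vault root."""
    values = call.get("input", {}).values()
    return any(isinstance(value, str) and value.startswith(vault_root) for value in values)


def skipped_vault_first(calls, vault_root):
    """True if a web call occurs before the first Grep/Read that targets the vault."""
    for call in calls:
        tool = call.get("tool")
        if tool in VAULT_TOOLS and targets_vault(call, vault_root):
            return False
        if tool in WEB_TOOLS:
            return True
    return False


def format_pattern(what_happens, wanted):
    return f"Pattern: {what_happens} → {wanted}"
```

A detector hit on its own is not enough to justify an edit; it still has to co-occur with an outcome-log signal.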
For each substantive user correction from Phase 1:
- Use `args="..."` from the log; fall back to the git log message.

Append to `evals/evals.json` in the skill's source directory. Don't overwrite existing evals.

- Read `source_repo` from the outcome log. SKILL.md: `{source_repo}/skills/{skill}/SKILL.md`.
- `pre-improve-vN` in source repo.

Generate 3-5 candidate edits addressing the Phase 2 patterns via different mutation strategies. Hold each in memory as complete SKILL.md text. Do NOT write candidates to disk until a winner is selected.
Mutation strategies:
| Strategy | What it does |
|---|---|
| Narrow rule | Add one specific rule targeting the exact failure pattern |
| Broad principle | Add a general principle covering this and adjacent failures |
| Gotcha addition | Add to Gotchas section as warning/anti-pattern |
| Example-based | Add concrete before/after example |
| Restructure | Move/merge existing rules to better cover the gap |
| Description trim | Replace frontmatter description with tighter version ≤ 300 chars preserving routing keywords. Verified by semantic-judge (Gate 2). Body unchanged — frontmatter only. |
Each candidate must use a meaningfully different strategy. Minimum 3 candidates required for behavioral patterns. Description-trim variants are the exception: when the only signal is description bloat (no corrections, no traces, no grill questions), Phase 4 generates a single description-trim candidate and runs it through the gates.
Trim guidance for the Description trim strategy:
- Extract every single-quoted ('…'), double-quoted ("…"), and backtick-quoted (`…`) token from the pre-edit description. Verify each appears verbatim in your candidate before writing; the semantic-judge enforces preservation of all routing keywords by returning FAIL on any missing token.

For each candidate:

- Eval check: read `evals/evals.json` and check each expectation string against the candidate text (text search). Record the pass count.
- Size check: compute `byte-size(candidate)`. Disqualify any variant exceeding 20480 bytes.

Gate 2: Semantic preservation (LLM judge)
Short-circuit (deterministic): if candidate's full frontmatter block is byte-identical to pre-edit frontmatter → mark as PASS without spawning Agent. Record as Semantic: PASS (short-circuit — frontmatter unchanged).
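The two deterministic checks, the 20480-byte size limit and the frontmatter short-circuit, can be sketched together. The sketch assumes frontmatter is the block delimited by the first two `---` lines of SKILL.md; all function names are hypothetical.

```python
MAX_SKILL_BYTES = 20480


def frontmatter_block(text):
    """Return the leading '---' ... '---' block including delimiters, or '' if absent."""
    lines = text.split("\n")
    if not lines or lines[0].strip() != "---":
        return ""
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "\n".join(lines[: index + 1])
    return ""


def within_size_limit(candidate):
    """A candidate is disqualified only when it exceeds the limit."""
    return len(candidate.encode("utf-8")) <= MAX_SKILL_BYTES


def can_short_circuit(pre_edit, candidate):
    """True when both texts have frontmatter and the two blocks are byte-identical."""
    before = frontmatter_block(pre_edit)
    after = frontmatter_block(candidate)
    return before != "" and before.encode("utf-8") == after.encode("utf-8")
```

When `can_short_circuit` returns true, the judge can be skipped and the variant recorded as a short-circuit PASS.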
Full judge (when frontmatter changed): spawn Agent (subagent_type: general-purpose, model: sonnet):
You are judging whether a proposed SKILL.md edit still matches its original trigger conditions. The pre-edit frontmatter `description` is the canonical spec.

Routing-surface preservation rule (non-negotiable): Extract every single-quoted phrase ('…'), double-quoted phrase ("…"), and backtick-quoted token (`…`) from the pre-edit description. These — and ONLY these — are the routing surface. If ANY token is missing from the candidate description (verbatim substring match), return FAIL. Unquoted prose, decorative adjectives, and explanatory clauses can be paraphrased, truncated, or dropped freely. Semantic equivalence does not save a missing quoted token.

Also FAIL if candidate drifts into different skill territory.

Reply: `PASS` or `FAIL` on first line, one-sentence rationale on second. No other output.

Pre-edit frontmatter description:
{pre_edit_description}

Candidate SKILL.md:
{candidate_full_text}
Output handling: first line must be exactly PASS or FAIL. Anything else → treat as FAIL. Conservative default protects skill routing.
First candidate returning PASS → select as winner. On FAIL, drop and re-judge next-ranked.
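The routing-token pre-check, the conservative verdict parsing, and the "first PASS wins" selection can be sketched as follows. The regex is a simplification: an apostrophe in ordinary prose (for example in "don't") can be misread as an opening single quote. Matching on the inner token text without its quote characters is also an assumption. The `judge` argument stands in for the LLM call and is hypothetical, as are all function names.

```python
import re

# One alternative per quoting style; exactly one group matches per hit.
_QUOTED = re.compile(r"'([^'\n]+)'|\"([^\"\n]+)\"|`([^`\n]+)`")


def routing_tokens(description):
    """Return quoted tokens in order of appearance, without their quote characters."""
    return [next(group for group in match.groups() if group is not None)
            for match in _QUOTED.finditer(description)]


def missing_tokens(pre_description, candidate_description):
    """Tokens from the pre-edit description that are not verbatim substrings of the candidate."""
    return [token for token in routing_tokens(pre_description)
            if token not in candidate_description]


def is_pass(reply):
    """Only an exact 'PASS' first line counts; anything else is treated as FAIL."""
    lines = reply.splitlines()
    return bool(lines) and lines[0] == "PASS"


def select_winner(ranked_candidates, judge):
    """Return the first candidate the judge passes, or None if every one fails."""
    for candidate in ranked_candidates:
        if is_pass(judge(candidate)):
            return candidate
    return None
```

Running `missing_tokens` before writing a description-trim candidate catches the most common failure locally, before the judge is invoked.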
- Winner found: commit with `improve: {skill} — {one-line summary}`.
- No variant passes: revert with `git checkout -- skills/{skill}/SKILL.md`. Report which gate rejected which variant. Same revert protocol regardless of which gate tripped.

Write to `{vault}/daily/improve-reports/{date}-{skill}.md`:
# Improve Report: {skill} — {date}
## Signals analyzed
- {N} outcome log entries ({X} partial, {Y} fail, {Z} pass)
- {N} grill sessions mentioning {skill}
- {N} aha moments mentioning {skill}
- {N} user corrections detected
## Patterns found
- Pattern: {description} ({N} occurrences)
## Changes made
- {what was added/modified in SKILL.md}
## Variant evaluation
| Variant | Strategy | Eval pass | Diff lines | Patterns covered | Size (bytes) | Size OK? | Semantic | Selected |
|---------|----------|-----------|------------|------------------|--------------|----------|----------|----------|
| V1 | {strategy} | {X}/{Y} | {N} | {N}/{total} | {bytes} | {yes/no} | {PASS/FAIL + rationale or short-circuit} | {yes/no} |
## Eval results
- Before: {X}/{Y} pass ({percent}%)
- After (winner): {X}/{Y} pass ({percent}%)
- Candidates tested: {N}
- Decision: KEPT (V{N}) / REVERTED (all failed)
## Auto-generated evals
- {N} new eval cases added from user corrections
## Memory triage (Phase 0)
| Slug | Tier | Confidence | Encoding artifact | Source deleted? | Commit |
|------|------|------------|-------------------|-----------------|--------|
Commit and push the report to the vault.
- Do not edit the installed copy under `~/.claude/plugins/`. Always edit the source repo. The `source_repo` field in the outcome log points to the right place.
- When a question is about `/improve`'s own mechanism (cadence/signals/auto-apply/scope/log fields), that's a signal the Clarifications section is underspecified. Patch the answer inline in SKILL.md; don't route to a reference. Every question answered inline = one future user doesn't need to ask.
- Check `evals/evals.json` for coverage gaps. Add missing evals in the SAME commit as the SKILL.md change; don't split. `pre-commit-eval-gap.sh` enforces this at repo level.

Append to `{vault}/daily/skill-outcomes/improve.log`:
{date} | improve | {action} | ~/work/brain-os-plugin | {output_path} | commit:{hash} | {result}
- `action`: `improve` (Phase 1–5 single-skill), `memory` (Phase 0 only: triage + index reconcile + expiry), `rank` (no-arg rank-only)
- `output_path`: `daily/improve-reports/{date}-{skill}.md` for improve/rank; `N/A` for memory
- `result`: `pass` if evals improved and changes kept, OR ≥1 memory file processed cleanly, OR no new patterns found (`score=100%`, `variants=0`); no-new-patterns supersedes size-gate-blocked. `partial` if some variants reverted or some triages ambiguous. `fail` if all reverted or no input data. `commit:pending` does not change `result`, which reflects improvement quality, not commit status.
- Optional fields: `args="{skill-name}"`, `score={after_pass}/{after_total}`, `desc_trim_applied=1`, `triaged={N}`, `encoded={N}`, `deleted={N}`, `ambiguous={N}`, `expired={N}`