From learning-system
Use to autonomously improve skills, memory files, agents, documentation, or conversation-derived workflows based on recent evidence. Runs eval-loop style review and mutation, keeps score and change logs, preserves reusable lessons, and chooses the highest-leverage improvement targets without waiting for explicit user instruction.
Slash command
/learning-system:auto-improve
Most skills, agents, and documentation work about 70% of the time. The other 30% produces inconsistent, shallow, or wrong output. The fix is not a full rewrite — it is letting an autonomous loop run the target repeatedly, score every output against binary criteria, tighten the prompt until that 30% disappears, and keep a complete research log of every mutation attempted.
Memories are different: they degrade silently. Facts go stale, gaps accumulate, entries duplicate. The fix is a structured audit followed by targeted rewrites.
This skill handles both patterns under one entry point.
It is not request-routed. The trigger is what the agents actually did: files changed, mistakes repeated, user directions clarified, workflows that felt awkward, docs that were missing, and gaps between expected behavior and actual behavior.
It should behave like a lightweight hyperagent, not a one-shot optimizer:
STOP. Do not touch any file until you have completed this discovery pass:
Inspect the latest evidence from the work that just happened:
Build a candidate improvement list across these target types:
- skill — a SKILL.md file whose instructions caused weak or awkward execution
- agent — an agent definition under agents/ whose routing, trigger text, or workflow was off
- documentation — repo docs like AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, DESIGN.md, README.md, ARCHITECTURE.md, TESTS.md, SETUP.md, RUNBOOK.md, CHANGELOG.md, SECURITY.md, OVERVIEW.md, FAQ.md, DECISIONS.md, DEPENDENCIES.md, CONTRIBUTING.md, TESTING.md, writing-style-guide.md, logs/, lessons/, items/, fixes/, audits/, raw/, plans/, specs/, sources/, lib/, references/, cookbook/, knowledge/, runbooks/, research/, official-documentation/, context/, or domain-specific AFS doc trees
- memory — memory files under ~/.claude/projects/*/memory/ that are stale, contradictory, or missing key durable facts
- conversation — harvest durable memory or reusable workflows from the current conversation
Rank candidates by:
Choose the smallest set of targets that fixes the real problem:
- AGENTS.md, PLAN.md, SPEC.md, DESIGN.md, or a runbook.
- writing-style-guide.md plus the consuming skill, not just more drafting rules.
For each selected target that is a skill, agent, or documentation file, determine its source origin:
- Source-tracked: the file lives in the alvarovillalbaa/agent-suite working directory (i.e. its absolute path starts with the agent-suite repo root, typically /Users/alvipe/Desktop/agent-suite/skills/). Improvements must be proposed as a GitHub PR to alvarovillalbaa/agent-suite, not saved directly. The local file is still mutated during the eval loop (the iteration speed benefit must be preserved), but the final write-back and commit/push/PR happen only after the loop ends.
- External: the file lives elsewhere (~/.claude/plugins/cache/ or another installed plugin path). Improve it locally, no PR needed.
Mark each selected target with its origin before starting any sub-flow.
Route each selected target to the appropriate sub-flow below. Process multiple targets one at a time, highest leverage first.
If the evidence shows repeated voice mismatch, brand-tone drift, or user feedback like "too AI", "too salesy", or "this doesn't sound like me/us", route through sub-flow: writing-style capture and refresh before or alongside the eval loop.
If the evidence does not justify any durable improvement, stop and make no changes.
Find all memory files:
- ~/.claude/projects/*/memory/MEMORY.md
- ~/.claude/projects/*/memory/*.md (exclude MEMORY.md itself)
Read every file in full before forming any judgment.
For every memory file, check:
Staleness — Does the file contain relative dates ("last Thursday", "recently", "this quarter") that have no absolute anchor? Does it describe a state (a role, a project, a decision) that may no longer be true? Flag these for verification or removal.
Gaps — Is there a category of user knowledge (role, preferences, project context, key decisions) that the memory system clearly should have but doesn't? Note what is missing and why it matters.
Redundancy — Do two or more files encode the same fact? Is any file a strict subset of another? Mark duplicates for consolidation.
Inconsistencies — Do any two files contradict each other? Does a file's type field mismatch its content? Does the MEMORY.md index reference a file that no longer exists, or omit one that does? Flag every conflict.
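The staleness check above can be partly mechanized. The following is a minimal sketch, assuming that an ISO date (YYYY-MM-DD) on the same line counts as an absolute anchor. The phrase list and the function name `find_unanchored_dates` are hypothetical, not part of the skill.

```python
import re

# Hypothetical phrase list; extend it to match your own memory files.
RELATIVE_DATE = re.compile(
    r"\b(yesterday|today|tomorrow|recently|currently|"
    r"(?:last|next|this)\s+(?:week|month|quarter|year|"
    r"monday|tuesday|wednesday|thursday|friday|saturday|sunday))\b",
    re.IGNORECASE,
)
# Assumption: an ISO date on the same line anchors any relative phrase on it.
ABSOLUTE_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_unanchored_dates(text):
    """Return (line_number, phrase) pairs for relative dates with no ISO date on the line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if ABSOLUTE_DATE.search(line):
            continue  # the line carries its own absolute anchor
        for match in RELATIVE_DATE.finditer(line):
            findings.append((number, match.group(0)))
    return findings
```

A scan like this only produces candidates for the Staleness section of the report. Whether a described state is still true requires reading the file.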
Create auto-improve-memory/audit-report.md with this structure:
# Memory Audit Report — YYYY-MM-DD
## Summary
- Files reviewed: N
- Issues found: N (staleness: N, gaps: N, redundancy: N, inconsistencies: N)
## Staleness
- [filename]: [what is stale and why]
## Gaps
- [what is missing]: [why it matters]
## Redundancy
- [filename A] duplicates [filename B]: [what overlaps]
## Inconsistencies
- [filename A] vs [filename B]: [what conflicts]
## Recommended actions
1. [action] — [file]
2. ...
For each issue found, take the recommended action:
Rewrite one file at a time. After each rewrite, update the MEMORY.md index if needed.
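To confirm that the index still matches the directory after a rewrite, a small drift check may help. This sketch assumes MEMORY.md points at memory files through Markdown links such as `[title](file.md)`. That link format and the helper name `index_drift` are assumptions, so adjust the pattern if the index is laid out differently.

```python
import re
from pathlib import Path

# Assumption: index entries are Markdown links ending in .md.
LINK = re.compile(r"\]\(([^)#\s]+\.md)\)")


def index_drift(memory_dir):
    """Compare MEMORY.md links with the .md files actually present in memory_dir."""
    memory_dir = Path(memory_dir)
    index_text = (memory_dir / "MEMORY.md").read_text(encoding="utf-8")
    indexed = set(LINK.findall(index_text))
    present = {p.name for p in memory_dir.glob("*.md") if p.name != "MEMORY.md"}
    return {
        "dangling": sorted(indexed - present),   # indexed but missing on disk
        "unindexed": sorted(present - indexed),  # on disk but not in the index
    }
```

Both lists being empty means the index and the directory agree.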
Create auto-improve-memory/changelog.md:
## [filename] — [action taken]
**Issue:** [what was wrong]
**Change:** [what was rewritten or added]
**Reason:** [why this improves the memory system]
Report:
This sub-flow implements the Hermes-style self-improving memory pattern inspired by the NousResearch agent. Instead of waiting for explicit memory commands, the system periodically intercepts a user turn, spawns a background job, reviews the conversation, and saves only the durable things worth keeping — so the agent grows with the user without distracting from the main task.
Trigger this sub-flow automatically every 10 conversation turns. The trigger point is the user's message on turn 10, 20, 30, etc. That user message is intercepted and a new background job is spawned.
The main response continues normally. The review must happen asynchronously so it does not delay, distract, or derail the agent handling the active task.
Also trigger this sub-flow whenever the latest work reveals durable preferences, workflow expectations, or project facts that should persist beyond the current task, even if the 10-turn cadence has not been hit yet.
If the discovery pass above identifies conversation-derived memory as a target, run it immediately on the full conversation so far.
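The trigger rules above reduce to a short predicate. This is a sketch only: `should_review` and its `durable_signal` flag are hypothetical names, and how the host counts turns or detects a durable fact is left open.

```python
REVIEW_EVERY = 10  # cadence stated in the text above


def should_review(user_turn, durable_signal=False):
    """True on user turns 10, 20, 30, ... or as soon as a durable fact surfaces."""
    if durable_signal:
        return True  # the early trigger overrides the cadence
    return user_turn > 0 and user_turn % REVIEW_EVERY == 0
```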
Read the full conversation above. Apply this exact review prompt:
Review the conversation above and consider saving to memory if appropriate.
Focus on:
- Has the user revealed things about themselves — their persona, desires, preferences, or personal details worth remembering?
- Has the user expressed expectations about how you should behave, their work style, or ways they want you to operate?
If something stands out, save it using the memory tool. If nothing is worth saving, just say "Nothing to save." and stop.
Do not paraphrase or soften the prompt. Use it as written.
For each piece of information worth saving, decide which durable destination it belongs to:
| Finding type | Target |
|---|---|
| User persona, goals, background, personal preferences | Personal memory (USER.md-equivalent) |
| Work style expectations, behavior instructions, corrections | Personal/operating memory (USER.md-equivalent) |
| Project context, technical decisions, timelines, constraints, repo facts | Technical memory (MEMORY.md-equivalent) |
| Repeated workflow or technique the user keeps applying | New skill file |
In this repository's memory system, that means:
- USER.md.
- MEMORY.md.
Do not save what is already in memory. Before writing, check existing memory files for duplicates or superseded entries.
For memory targets — preserve the Hermes destination model:
- USER.md-equivalent destination
- MEMORY.md-equivalent destination
When this repo uses discrete memory files instead of literal USER.md / MEMORY.md, map the content into the equivalent structure without losing the distinction between personal memory and technical memory.
For structured memory files — write using the standard frontmatter format:
---
name: [descriptive name]
description: [one-line description for MEMORY.md index]
type: user | feedback | project | reference
---
[content — for feedback/project types: rule/fact, then **Why:** and **How to apply:** lines]
Then add or update the pointer in MEMORY.md.
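The frontmatter format above can be produced by a small renderer. This is a minimal sketch under two assumptions: values are written as plain unquoted YAML scalars (so a value containing a colon would need quoting), and the helper name `render_memory_file` is hypothetical.

```python
VALID_TYPES = {"user", "feedback", "project", "reference"}


def render_memory_file(name, description, memory_type, body):
    """Render a memory file: frontmatter block followed by the body text."""
    if memory_type not in VALID_TYPES:
        raise ValueError(f"unknown memory type: {memory_type}")
    # Collapse whitespace so the description stays on one line for the index.
    description = " ".join(description.split())
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        f"type: {memory_type}\n"
        "---\n"
        f"{body.rstrip()}\n"
    )
```

For feedback and project types the body would carry the rule or fact followed by the **Why:** and **How to apply:** lines.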
For skill targets — only create a new skill if the pattern is genuinely reusable across sessions (not just a one-off technique). Use the standard SKILL.md frontmatter with name and description.
When running as a background job, do not interrupt the main conversation. Silently save the files. After the main response is delivered, append a one-line status:
(Background memory review: saved [N] items — [brief list of what was saved])
If nothing was saved: no status line needed.
- USER.md-equivalent memory; technical/project facts belong in MEMORY.md-equivalent memory; reusable procedures belong in skills.
Use this sub-flow when the real failure is not missing task logic but missing first-party voice grounding.
Load references/writing-style-learning.md for the source-order and diagnosis heuristics before changing the guide or the consuming skill.
Trigger when one or more of these are true:
- writing-style-guide.md but the agent is repeatedly drafting on behalf of the same sender, founder, exec, team, or brand
Pin down:
Do not merge multiple voices into one guide unless the evidence says they are intentionally shared.
Look for the strongest available sources in this order:
- writing-style-guide.md or brand voice docs
Do not learn voice from inbound messages, quoted reply text, signatures, disclaimers, or auto-generated boilerplate.
Bias recent content higher than old content. Separate the analysis by channel.
For each channel, capture:
Also write a General layer that captures patterns common across all channels.
Create or update writing-style-guide.md using templates/writing-style-guide.md.
The guide must:
- General section plus one section per meaningful channel
Update the relevant drafting skill, agent, or root documentation so future writing reads the guide first and uses General + closest channel section.
Prefer changing the consumer instructions once over repeating style notes in every future task response.
When the user corrects a draft, decide whether the correction belongs in:
Do not keep stuffing voice rules into the drafting skill if the durable fix is really a better style artifact.
STOP. Do not run any experiments until all fields below are grounded in evidence from the latest work. Derive them; only ask the user if a critical ambiguity cannot be resolved from context.
- .md, or documentation .md
The loop is not only trying to improve the current target. It is also trying to improve the improvement process itself.
That means:
Do not assume the evaluation task and the self-modification task are perfectly aligned. A good output on one task does not automatically imply a good mutation strategy. Explicitly reason about the meta-level process.
Before changing anything, read and understand the target completely:
- references/ that the skill links to
- writing-style-guide.md or equivalent voice doc when one exists
Do NOT skip this. You need to understand what the target does before you can improve it.
Convert the intended behavior plus the evidence package into a structured test. Every check must be binary — pass or fail, no scales.
Format each eval as:
EVAL [N]: [Short name]
Question: [Yes/no question about the output]
Pass condition: [What "yes" looks like — be specific]
Fail condition: [What triggers a "no"]
Max score: [number of evals] × [test inputs] × [runs per input]
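As a worked instance of the max-score formula, here is a short sketch. The function names are illustrative, and the numbers mirror the 14/20 baseline row used in the examples elsewhere in this document.

```python
def max_score(num_evals, num_inputs, runs_per_input):
    """Max score = evals x test inputs x runs per input."""
    return num_evals * num_inputs * runs_per_input


def pass_rate(results):
    """results: flat list of booleans, one per (eval, input, run) check."""
    if not results:
        return 0.0
    return round(100.0 * sum(results) / len(results), 1)


# 4 evals, 5 test inputs, 1 run each: 20 binary checks in total.
total = max_score(4, 5, 1)
baseline = pass_rate([True] * 14 + [False] * 6)
```

Because every check is a boolean, the score is a plain count of passes and needs no weighting.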
Create auto-improve-[name]/dashboard.html — a self-contained HTML file with inline CSS and JS. Open it immediately: open auto-improve-[name]/dashboard.html.
The dashboard must:
- results.json
Update auto-improve-[name]/results.json after every experiment. Format:
{
  "target": "[name]",
  "status": "running",
  "current_experiment": 3,
  "baseline_score": 70.0,
  "best_score": 90.0,
  "memory_summary": [
    "Mutation ordering mattered more than instruction count",
    "Experiment 2 overcorrected and hurt routing precision"
  ],
  "experiments": [
    {
      "id": 0,
      "score": 14,
      "max_score": 20,
      "pass_rate": 70.0,
      "status": "baseline",
      "description": "original — no changes"
    }
  ],
  "eval_breakdown": [
    {"name": "Eval name", "pass_count": 8, "total": 10}
  ]
}
When the run ends, set "status": "complete".
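One way to keep results.json current after each experiment is sketched below. It assumes the file already exists with the fields shown above, that experiment ids are sequential, and that only kept experiments may raise best_score. The status strings "keep" and "discard" follow the changelog headings, and `record_experiment` is a hypothetical helper.

```python
import json
from pathlib import Path


def record_experiment(results_path, score, max_score, status, description):
    """Append one experiment to results.json and refresh the summary fields."""
    path = Path(results_path)
    data = json.loads(path.read_text(encoding="utf-8"))
    pass_rate = round(100.0 * score / max_score, 1)
    experiment = {
        "id": len(data["experiments"]),  # assumption: ids are sequential from 0
        "score": score,
        "max_score": max_score,
        "pass_rate": pass_rate,
        "status": status,
        "description": description,
    }
    data["experiments"].append(experiment)
    data["current_experiment"] = experiment["id"]
    # Assumption: a discarded mutation never becomes the best score.
    if status == "keep":
        data["best_score"] = max(data["best_score"], pass_rate)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data
```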
- auto-improve-[name]/results.tsv with header row
- auto-improve-[name]/[filename].baseline
- self_improvement_memory.md and archive/self_improvement_memory.md with any relevant lessons from prior runs on similar targets
results.tsv format (tab-separated):
experiment score max_score pass_rate status description
0 14 20 70.0% baseline original — no changes
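The score log can be appended with the standard `csv` module using a tab delimiter. This is a sketch: `append_result` is a hypothetical helper, and the column order is taken from the header row above.

```python
import csv
from pathlib import Path

HEADER = ["experiment", "score", "max_score", "pass_rate", "status", "description"]


def append_result(tsv_path, experiment, score, max_score, status, description):
    """Append one row to results.tsv, writing the header first if the file is new."""
    path = Path(tsv_path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(HEADER)
        rate = f"{100.0 * score / max_score:.1f}%"  # e.g. 14/20 becomes "70.0%"
        writer.writerow([experiment, score, max_score, rate, status, description])
```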
IMPORTANT: After baseline, do not pause for approval. Continue automatically if the target still shows meaningful failures or if the issue is high leverage. If baseline is already 90%+ and the remaining gap is minor, skip this target and move to the next candidate instead of optimizing for noise.
LOOP AUTONOMOUSLY. Do not pause between experiments unless you hit a real blocker or need unavailable information.
Each iteration:
1. Analyze failures. Which evals fail most? Read the actual outputs that failed. Identify the pattern: formatting issue, missing instruction, ambiguous directive?
2. Consult memory. Read self_improvement_memory.md and the best archive entries before proposing the next change. Check whether the current failure resembles a prior one, whether a previous mutation overcorrected, and whether two stepping stones should be combined.
3. Form ONE hypothesis. Pick one thing to change. Never change multiple things at once — you will not know what helped.
4. Make the mutation. Edit the target file with one targeted change. See target-specific mutation guide below.
5. Run all test inputs. Score every output against every eval.
6. Decide:
7. Update persistent memory. After every experiment, write down:
8. Archive stepping stones. Save every kept improvement and any especially informative near-miss into archive/ with a short note explaining why it matters. A stepping stone is any variant that teaches something reusable, not just the current winner.
9. Log the result in results.tsv and results.json.
10. Go back to step 1.
Stop only when:
If stuck: Re-read failing outputs. Try combining two previous near-miss mutations. Try removing things instead of adding. Simplification that maintains the score is a win.
After every experiment (kept or discarded), append to auto-improve-[name]/changelog.md:
## Experiment [N] — [keep/discard]
**Score:** [X]/[max] ([percent]%)
**Change:** [One sentence describing what was changed]
**Reasoning:** [Why this change was expected to help]
**Result:** [Which evals improved or declined]
**Remaining failures:** [What still fails, if anything]
Also update auto-improve-[name]/self_improvement_memory.md with synthesized memory entries, not raw logs. Each entry should capture:
## Memory [N] — [short title]
**Context:** [target type + failure pattern]
**Hypothesis:** [causal belief about what would help]
**Outcome:** [what happened]
**Interpretation:** [why this likely happened]
**Transferability:** [where else this lesson should apply]
**Next move:** [forward-looking plan]
This file is the persistent memory of the improvement process itself. It must be actively consulted in later experiments and in later runs on related targets.
When the loop ends, present:
If more selected targets remain, continue to the next one instead of treating the first target as the whole job.
Before moving to the next target, review whether any lessons from the finished target should transfer. If yes, write them into the next target's seeded memory and cite the source archive entry.
Only run this step if the target is source-tracked (i.e. it belongs to alvarovillalbaa/agent-suite).
Do not run this step for external targets. For external targets, the local file is already the authoritative copy — no PR is needed.
After the loop ends and the final improved version is in place on disk:
Create a branch in the local repo:
git -C <repo-root> checkout -b auto-improve/<skill-name>-<YYYY-MM-DD>
Use the skill or agent name (lowercased, hyphens) and today's date.
Stage only the improved file (never use git add -A):
git -C <repo-root> add <relative-path-to-improved-file>
Commit with a descriptive message summarising the top mutation that improved the score:
git -C <repo-root> commit -m "auto-improve(<skill-name>): <one-line description of top change>"
Push the branch:
git -C <repo-root> push origin auto-improve/<skill-name>-<YYYY-MM-DD>
Open a PR against main using gh:
gh pr create \
--repo alvarovillalbaa/agent-suite \
--base main \
--head auto-improve/<skill-name>-<YYYY-MM-DD> \
--title "auto-improve(<skill-name>): <one-line description>" \
--body "$(cat <<'EOF'
## Summary
- Baseline score: <X>%
- Final score: <Y>%
- Experiments run: <N> (<kept> kept, <discarded> discarded)
## Top changes
- <change 1>
- <change 2>
- <change 3>
## Remaining failures
<list or "none">
---
_Generated by auto-improve eval loop. Changelog and results: `auto-improve-<name>/`_
EOF
)"
Report the PR URL to the user.
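The branch naming rule from the first step (name lowercased with hyphens, plus today's date) might be implemented as follows. This is a sketch: `branch_name` is a hypothetical helper, and collapsing every non-alphanumeric run into a single hyphen is an assumption about what "hyphens" means here.

```python
import re
from datetime import date


def branch_name(target_name, today=None):
    """Build auto-improve/<skill-name>-<YYYY-MM-DD> from a skill or agent name."""
    today = today or date.today()
    # Lowercase, then turn each run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", target_name.lower()).strip("-")
    return f"auto-improve/{slug}-{today.isoformat()}"
```

The returned string can be passed to `git checkout -b` and reused as the `--head` value for `gh pr create`.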
Important constraints:
- main.
- -v2) rather than overwriting.
- If gh is not authenticated or the push fails, report the error clearly and leave the improved file in place on the branch for the user to push manually.
Good mutations:
Bad mutations:
Mutation scope: the body text of SKILL.md — instructions, anti-patterns, examples, ordering.
Dynamic content injection: If the skill depends on context that changes per repo or session, you can embed !`command` placeholders in SKILL.md to inject live shell output at invocation time; the model only ever sees the result, not the raw placeholder. This requires the skill's frontmatter to declare allowed-tools for every tool the command needs (e.g. allowed-tools: Bash(git branch --show-current)). Treat injected commands with the same scrutiny as postinstall scripts, because they run with full shell permissions. Prefer read-only introspection (git, cat, jq) over commands with side effects.
Good mutations:
- description trigger phrases so the agent is invoked at the right moments
- When to use section to reduce false positives and false negatives
Bad mutations:
Mutation scope: the frontmatter description, When to use, commands/skills tables, workflow steps.
AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, DESIGN.md, README.md, ARCHITECTURE.md, TESTS.md, SETUP.md, RUNBOOK.md, CHANGELOG.md, SECURITY.md, OVERVIEW.md, FAQ.md, DECISIONS.md, DEPENDENCIES.md, CONTRIBUTING.md, TESTING.md, logs/, lessons/, items/, fixes/, audits/, raw/, plans/, specs/, sources/, lib/, references/, cookbook/, knowledge/, runbooks/, research/, official-documentation/, context/, runbooks/**/*.md)
Good mutations:
- code-documentation contract consistently: Core docs (README.md, ARCHITECTURE.md, TESTS.md), Conditional docs (SETUP.md, RUNBOOK.md, CHANGELOG.md, SECURITY.md), Rare docs (OVERVIEW.md, FAQ.md, DECISIONS.md, DEPENDENCIES.md), root instruction docs, timestamped AFS docs on */YYYY/YYYY-MM-DD/*.md, and living AFS docs in specs/, sources/, lib/, references/, cookbook/, knowledge/, runbooks/, research/, official-documentation/, and context/
- runbooks/ or RUNBOOK.md when they are currently scattered across long docs
- AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, and DESIGN.md as first-class documentation targets that should improve automatically when the repo's operating model changes
- writing-style-guide.md when repeated draft corrections point to missing voice guidance rather than missing workflow guidance
Bad mutations:
- AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, or DESIGN.md stale after the repo's operating model changes
Mutation scope: headings, ordering, wording, examples, checklists, cross-links, and stale-content removal inside the target documentation file.
- AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, or DESIGN.md when the missing rule is repo-wide rather than skill-specific.
- runbooks/ or RUNBOOK.md when the missing content is an operational workflow rather than a policy or concept.
Every eval must be a yes/no question. Not a scale. Binary.
Why: Scales compound variability. Binary evals give a reliable signal across runs.
Good evals:
Bad evals:
Sweet spot: 3–6 evals. More than 6 and the target starts gaming the criteria instead of actually improving.
Max evals per target type:
All eval-loop targets produce files in auto-improve-[name]/:
auto-improve-[name]/
├── archive/ # stepping-stone variants and notes
├── dashboard.html # live browser dashboard (auto-refreshes every 10s)
├── self_improvement_memory.md # synthesized insights, causal hypotheses, transfer notes, next moves
├── results.json # data file powering the dashboard
├── results.tsv # score log (tab-separated)
├── changelog.md # mutation-by-mutation research log
└── [original-filename].baseline # original file before any changes
Memory audit produces:
auto-improve-memory/
├── audit-report.md # findings across all four audit dimensions
└── changelog.md # what was rewritten, created, or deleted
The improved target file is always saved back to its original location during the loop.
For source-tracked targets (files belonging to alvarovillalbaa/agent-suite), the final improved file is also committed to a new branch and submitted as a GitHub PR — the local file on main is never directly committed. For external targets, the local file is the final destination with no PR needed.
Use this sub-flow when the improvement target is not a skill, agent, or documentation file but a code file, asset, or content file where quality is measured by a concrete external metric (execution time, bundle size, test pass rate, CTR score).
This is distinct from the eval loop because:
- pytest bench.py
- vite.config.ts; eval: npm run build && du -sb dist/
- content/titles.md; eval: LLM judge script
If .autoresearch/{domain}/{name}/config.cfg does not exist, run /ar:setup first.
Requirements:
- metric_name: value to stdout
.autoresearch/{domain}/{name}/
├── config.cfg # target, eval cmd, metric, direction, time_budget_minutes
├── program.md # objectives, constraints, strategy notes
└── results.tsv # commit | metric | status | description
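The requirement that the eval command print `metric_name: value` to stdout suggests a parser along these lines. It is a sketch, not the behavior of scripts/run_experiment.py: `parse_metric` is a hypothetical helper, and taking the last matching line when the metric is printed more than once is an assumption.

```python
import re


def parse_metric(stdout, metric_name):
    """Extract the numeric value from a 'metric_name: value' line in eval output."""
    pattern = re.compile(
        rf"^{re.escape(metric_name)}:[ \t]*(-?\d+(?:\.\d+)?)[ \t]*$",
        re.MULTILINE,
    )
    matches = pattern.findall(stdout)
    if not matches:
        raise ValueError(f"metric {metric_name!r} not found in output")
    return float(matches[-1])  # assumption: the last occurrence wins
```

Raising on a missing metric keeps a crashed eval from being recorded as a valid measurement.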
- results.tsv for history — what worked, what failed, what has not been tried
- git add {target} && git commit -m "experiment: {description}"
- python scripts/run_experiment.py --experiment {domain}/{name} --single
- git reset --hard HEAD~1

| Runs | Approach | Risk |
|---|---|---|
| 1–5 | Low-hanging fruit (obvious improvements, simple cache/index/IO changes) | Low |
| 6–15 | Systematic exploration (vary one parameter at a time) | Medium |
| 16–30 | Structural changes (algorithm swaps, architecture shifts) | High |
| 30+ | Radical experiments (completely different approaches) | Very High |
If no improvement after 20 consecutive runs → update the Strategy section of program.md.
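The stagnation rule above can be checked from the metric history. This is a minimal sketch: the `direction` values "lower" and "higher" are hypothetical stand-ins for whatever config.cfg actually stores, and both function names are illustrative.

```python
def runs_since_improvement(metrics, direction="lower"):
    """Count consecutive trailing runs that failed to beat the best value so far."""
    best = None
    stale = 0
    for value in metrics:
        if best is None:
            improved = True  # the first run sets the baseline
        elif direction == "lower":
            improved = value < best
        else:
            improved = value > best
        if improved:
            best, stale = value, 0
        else:
            stale += 1
    return stale


def should_update_strategy(metrics, direction="lower", patience=20):
    """True once `patience` consecutive runs have brought no improvement."""
    return runs_since_improvement(metrics, direction) >= patience
```

Ties count as no improvement here, which is a design choice: a run that only matches the best value does not reset the counter.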
After every 10 experiments, review results.tsv for patterns and update program.md:
- program.md
- config.cfg reached
- /ar:resume)
Autoresearch complete — {domain}/{name}
Target: {file}
Metric: {name} ({direction})
Baseline: {start value}
Best: {best value} ({delta}% improvement, run N)
Experiments: {total} ({kept} kept, {discarded} discarded, {crashed} crashed)
Branch: autoresearch/{domain}/{name}
A good auto-improve run:
npx claudepluginhub alvarovillalbaa/plugins --plugin learning-system