GROOM
Gated Refresh of Organizational Memory
A self-maintaining knowledge base for AI agents — consulting it is the act that keeps it current.


Install
In Claude Code, add the marketplace and install the plugin with two commands:
/plugin marketplace add beconfident-ai/groom
/plugin install groom@groom
You get the harness-wiki skill, which consults the bundled knowledge base (consulting it
is what triggers a gated background refresh), and the groom subagent, which returns a
structured, cited brief. Nothing else to wire up.
Other agents (Codex, Gemini CLI, Cursor, Windsurf, Cline, Copilot) read a ready-made
rules file that ships in this repo. See Using it as agent context.
To run the maintenance pipeline or reproduce the benchmarks yourself, clone the repo instead
and follow the Quickstart.
The problem
An LLM agent is only as current as the text it reads. Production agents ground on curated
corpora — internal wikis, convention docs, runbooks, retrieval indices — and those corpora
rot: the field moves, the text does not, and every agent that loads a stale page is
silently degraded. Context engineering manages the window (what reaches the model at
inference time); almost nobody maintains the source.
We made the cost concrete. When a consuming agent treats a corpus as authoritative, injecting
staleness into five facts dropped its answer accuracy on those facts from 100% to 0% while
untouched controls held at 100%. Corpus correctness is load-bearing — and maintaining it is
nobody's immediate job, so it doesn't happen.
What GROOM does
GROOM makes consulting the knowledge base the act that maintains it. A consuming agent
reads the corpus through a skill; that fires a gated launcher which returns in tens of
milliseconds and, when a refresh is due, spawns a detached agent to run one bounded
maintenance operation (lint, prune, expand, research, or iterate). The read never blocks; the
next reader gets the benefit (stale-while-revalidate, for knowledge).
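The read-path trigger described above can be sketched roughly as follows. This is a minimal illustration, not GROOM's actual launcher: the stamp file, the `minIntervalMs` parameter, and the function names `refreshDue`, `readStamp`, and `launchIfDue` are all hypothetical. It uses the simple read-then-write stamp check, which is the form the Results table reports as racy under concurrent triggers.

```javascript
// Sketch only: file layout, names, and the interval are illustrative assumptions.
const fs = require('node:fs');
const path = require('node:path');
const { spawn } = require('node:child_process');

// Pure debounce gate: due when no run is recorded or the last run is old enough.
function refreshDue(lastRunMs, nowMs, minIntervalMs) {
  return lastRunMs === null || nowMs - lastRunMs >= minIntervalMs;
}

// Read the last-run timestamp from a stamp file; null if missing or unparsable.
function readStamp(stampPath) {
  try {
    const n = Number(fs.readFileSync(stampPath, 'utf8').trim());
    return Number.isFinite(n) ? n : null;
  } catch {
    return null;
  }
}

// Called on every read. Returns immediately, whether or not a worker was started.
function launchIfDue({ stampPath, minIntervalMs, command, args, now = Date.now() }) {
  if (!refreshDue(readStamp(stampPath), now, minIntervalMs)) return false;
  fs.mkdirSync(path.dirname(stampPath), { recursive: true });
  fs.writeFileSync(stampPath, String(now));
  // detached + ignored stdio + unref() lets the caller finish without waiting on the child.
  const child = spawn(command, args, { detached: true, stdio: 'ignore' });
  child.on('error', () => {}); // a failed spawn must never break the read
  child.unref();
  return true;
}
```

The reader that triggers the refresh still sees the old text; only later readers see the refreshed corpus, which is the stale-while-revalidate trade the paragraph describes.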
Autonomous edits to a live corpus are the real risk, so every operation is wrapped in a git
checkpoint behind a deterministic, token-free validator. An edit "counts" only if it reports
terminal success, passes structural and fact-level validation, satisfies its postcondition,
and touched nothing outside the corpus — otherwise the working tree is reset to the
pre-operation commit. A bad edit becomes a recoverable no-op, never a committed corruption.
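As a rough sketch of that checkpoint-and-gate flow, assuming the corpus lives in one directory of a git repository and `repoDir` is the repository root: the result shape, the canary format (`file` plus `mustContain`), and the names `accept`, `failedCanaries`, `touchedPaths`, and `runGated` are hypothetical, and only the canary check stands in for the structural and fact-level validation.

```javascript
// Sketch only: result shape, canary format, and function names are assumptions.
const { execFileSync } = require('node:child_process');
const fs = require('node:fs');
const path = require('node:path');

const git = (cwd, ...args) => execFileSync('git', args, { cwd, encoding: 'utf8' }).trim();

// Fact-level canaries: each names a file and a string that must still appear in it.
function failedCanaries(repoDir, canaries) {
  return canaries.filter(({ file, mustContain }) => {
    const p = path.join(repoDir, file);
    return !fs.existsSync(p) || !fs.readFileSync(p, 'utf8').includes(mustContain);
  });
}

// Paths that differ from the checkpoint commit, plus new untracked files.
function touchedPaths(repoDir, checkpoint) {
  const changed = git(repoDir, 'diff', '--name-only', checkpoint);
  const untracked = git(repoDir, 'ls-files', '--others', '--exclude-standard');
  return [changed, untracked].join('\n').split('\n').filter(Boolean);
}

// An edit counts only if all four conditions from the text hold.
function accept({ terminalSuccess, validationOk, postconditionOk, touched, corpusDir }) {
  const prefix = corpusDir.endsWith('/') ? corpusDir : corpusDir + '/';
  return Boolean(terminalSuccess && validationOk && postconditionOk) &&
    touched.every((p) => p.startsWith(prefix));
}

// operation() is expected to return { terminalSuccess, postconditionOk }.
function runGated(repoDir, corpusDir, canaries, operation) {
  const checkpoint = git(repoDir, 'rev-parse', 'HEAD');
  let outcome = { terminalSuccess: false, postconditionOk: false };
  try { outcome = operation(); } catch { /* a throw is treated as a failed operation */ }
  const touched = touchedPaths(repoDir, checkpoint);
  const validationOk = failedCanaries(repoDir, canaries).length === 0;
  const ok = accept({ ...outcome, validationOk, touched, corpusDir });
  if (ok && touched.length > 0) {
    git(repoDir, 'add', '-A');
    git(repoDir, 'commit', '-m', 'groom: accepted operation');
  } else if (!ok) {
    // Rejected: restore tracked files and remove anything the operation created.
    git(repoDir, 'reset', '--hard', checkpoint);
    git(repoDir, 'clean', '-fd');
  }
  return ok;
}
```

Keeping `accept` a pure function of already-collected facts is one way to keep the decision deterministic and token-free: nothing in it calls a model.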
GROOM is content-agnostic (point it at any markdown knowledge base, or scaffold a fresh
one) and retrieval-agnostic (it maintains clean markdown; how an agent retrieves —
progressive disclosure, full-context, BM25, dense — is a pluggable layer, not GROOM's concern).
Results
Every number below is reproduced by the harness in eval/, with no agent calls and no
network access (single laptop, Node 22; timings are load-sensitive).
| Property | Result |
|---|---|
| Staleness matters | A consuming agent's accuracy on affected facts collapses 100% → 0% under corpus staleness; controls hold at 100%. |
| Safety | Across 9 fault classes, the gate rejects every one and restores the corpus byte-identically to the checkpoint (n=450, ~13 ms median). A no-gate baseline that commits unconditionally corrupts the corpus 9/9. |
| Concurrency | The naive debounce stamp is a TOCTOU race: it resolves an 8-way trigger to one run only 28–59% of the time. An atomic mkdir claim raises that to 500/500. |
| Cost | The validation gate is linear (tens of µs/page, ~14–27 ms at 400 pages, load-sensitive); the read path adds a warm ~50 ms and never blocks. |
| Canaries | Structural validation alone misses 5/5 semantic-loss injections; fact-level canaries catch all 5 — at zero token cost. |
| Generalization | Across 3 unrelated agent-KB domains (an internal API/SDK reference, an SRE runbook, a SaaS support KB) and 2 retrievers (BM25 + dense), grooming yields a 45–51% relative gain in recall@1 (BM25 0.52→0.78, dense 0.56→0.81); a groomed corpus is ~40% smaller. |
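The atomic mkdir claim from the Concurrency row can be illustrated with a short sketch. The names `tryClaim` and `release` and the lock-directory path are hypothetical; the point is that a non-recursive `mkdir` either creates the directory or fails with `EEXIST`, so there is no gap between checking and claiming.

```javascript
// Sketch only: tryClaim, release, and the lock path are illustrative assumptions.
const fs = require('node:fs');

// Returns true for exactly one caller among concurrent triggers; the rest get false.
function tryClaim(lockDir) {
  try {
    // Non-recursive on purpose: with { recursive: true } mkdir does not fail
    // when the directory already exists, which would defeat the claim.
    fs.mkdirSync(lockDir);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // someone else holds the claim
    throw err; // any other failure (permissions, missing parent) is a real error
  }
}

// Release the claim once the maintenance run has finished.
function release(lockDir) {
  fs.rmdirSync(lockDir);
}
```

A stamp file, by contrast, is read and then written in two separate steps, so several triggers can all observe "due" before any of them records the run.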
Quickstart
npm install
npm test # 11-test behavior suite — free, no agent calls
node eval/fault-matrix.mjs # reproduce the safety benchmark — also free