From ecc
Diagnoses 12-layer agent stack failures including wrapper regression, memory pollution, tool discipline issues, hidden repair loops, and rendering corruption. Produces severity-ranked findings with code fixes.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ecc:agent-architecture-auditThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A diagnostic workflow for agent systems that hide failures behind wrapper layers, stale memory, retry loops, or transport/rendering mutations.
A diagnostic workflow for agent systems that hide failures behind wrapper layers, stale memory, retry loops, or transport/rendering mutations.
MANDATORY for:
Especially critical when:
Do not use for:
agent-introspection-debuggingsecurity-review or security-review/scanagent-evalEvery agent system has these layers. Any of them can corrupt the answer:
| # | Layer | What Goes Wrong |
|---|---|---|
| 1 | System prompt | Conflicting instructions, instruction bloat |
| 2 | Session history | Stale context injection from previous turns |
| 3 | Long-term memory | Pollution across sessions, old topics in new conversations |
| 4 | Distillation | Compressed artifacts re-entering as pseudo-facts |
| 5 | Active recall | Redundant re-summary layers wasting context |
| 6 | Tool selection | Wrong tool routing, model skips required tools |
| 7 | Tool execution | Hallucinated execution — claims to call but doesn't |
| 8 | Tool interpretation | Misread or ignored tool output |
| 9 | Answer shaping | Format corruption in final response |
| 10 | Platform rendering | Transport-layer mutation (UI, API, CLI mutates valid answers) |
| 11 | Hidden repair loops | Silent fallback/retry agents running second LLM pass |
| 12 | Persistence | Expired state or cached artifacts reused as live evidence |
The base model produces correct answers, but the wrapper layers make it worse.
Symptoms:
Old topics leak into new conversations through history, memory retrieval, or distillation.
Symptoms:
Tools are declared in the prompt but not enforced in code. The model skips them or hallucinates execution.
Symptoms:
The agent's internal answer is correct, but the platform layer mutates it during delivery.
Symptoms:
Silent repair, retry, summarization, or recall agents run without explicit contracts.
Symptoms:
Define what you're auditing:
Gather evidence from the codebase:
Use rg to search for anti-patterns:
# Tool requirements expressed only in prompt text (not code)
rg "must.*tool|必须.*工具|required.*call" --type md
# Tool execution without validation
rg "tool_call|toolCall|tool_use" --type py --type ts
# Hidden LLM calls outside main agent loop
rg "completion|chat\.create|messages\.create|llm\.invoke"
# Memory admission without user-correction priority
rg "memory.*admit|long.*term.*update|persist.*memory" --type py --type ts
# Fallback loops that run additional LLM calls
rg "fallback|retry.*llm|repair.*prompt|re-?prompt" --type py --type ts
# Silent output mutation
rg "mutate|rewrite.*response|transform.*output|shap" --type py --type ts
For each finding, document:
Default fix order (code-first, not prompt-first):
| Level | Meaning | Action |
|---|---|---|
critical | Agent can confidently produce wrong operational behavior | Fix before next release |
high | Agent frequently degrades correctness or stability | Fix this sprint |
medium | Correctness usually survives but output is fragile or wasteful | Plan for next cycle |
low | Mostly cosmetic or maintainability issues | Backlog |
Present findings to the user in this order:
Do not lead with compliments or summaries. If the system is broken, say so directly.
When auditing an agent system, answer these:
| # | Question | If Yes → |
|---|---|---|
| 1 | Can the model skip a required tool and still answer? | Tool not code-gated |
| 2 | Does old conversation content appear in new turns? | Memory contamination |
| 3 | Is the same info in system prompt AND memory AND history? | Context duplication |
| 4 | Does the platform run a second LLM pass before delivery? | Hidden repair loop |
| 5 | Does the output differ between internal generation and user delivery? | Rendering corruption |
| 6 | Are "must use tool X" rules only in prompt text? | Tool discipline failure |
| 7 | Can the agent's own monologue become persistent memory? | Memory poisoning |
Audits should produce structured reports following this shape:
{
"schema_version": "ecc.agent-architecture-audit.report.v1",
"executive_verdict": {
"overall_health": "high_risk",
"primary_failure_mode": "string",
"most_urgent_fix": "string"
},
"scope": {
"target_name": "string",
"model_stack": ["string"],
"layers_to_audit": ["string"]
},
"findings": [
{
"severity": "critical|high|medium|low",
"title": "string",
"mechanism": "string",
"source_layer": "string",
"root_cause": "string",
"evidence_refs": ["file:line"],
"confidence": 0.0,
"recommended_fix": "string"
}
],
"ordered_fix_plan": [
{ "order": 1, "goal": "string", "why_now": "string", "expected_effect": "string" }
]
}
agent-introspection-debugging — Debug agent runtime failures (loops, timeouts, state errors)agent-eval — Benchmark agent performance head-to-headsecurity-review — Security audit for code and configurationautonomous-agent-harness — Set up autonomous agent operationsagent-harness-construction — Build agent harnesses from scratchnpx claudepluginhub burgebj/agentharnessDiagnoses 12-layer agent stack failures including wrapper regression, memory pollution, tool discipline issues, hidden repair loops, and rendering corruption. Produces severity-ranked findings with code fixes.
Diagnoses failures in LLM agent systems across 12 stack layers including wrapper regression, memory pollution, tool discipline failures, hidden retry loops, and rendering corruption. Generates severity-ranked findings.
Audits Claude Code agents for violations, gaps, and improvements across 7 dimensions like description quality and frontmatter, outputting structured repair plans.