Use when recovering from agent failures or coordinating agent replacements. Trigger with failure events.
Slash command: `/emasoft-chief-of-staff:ecos-failure-recovery`
This skill teaches the Emasoft Chief of Staff (ECOS) how to detect, classify, and recover from agent failures in a multi-agent system coordinated via AI Maestro messaging.
Reference files:
- references/agent-replacement-protocol.md
- references/examples.md
- references/failure-classification.md
- references/failure-detection.md
- references/op-classify-failure-severity.md
- references/op-detect-agent-failure.md
- references/op-emergency-handoff.md
- references/op-execute-recovery-strategy.md
- references/op-replace-agent.md
- references/op-route-task-blocker.md
- references/recovery-operations.md
- references/recovery-strategies.md
- references/troubleshooting.md
- references/work-handoff-during-failure.md
When to use this skill:
Before using this skill, ensure:
Copy this checklist and track your progress:
## ECOS Failure Response Checklist
Agent: _______________
Failure detected: _______________
### Detection
- [ ] Heartbeat status checked
- [ ] AI Maestro agent status queried
- [ ] Message delivery verified
- [ ] Task progress reviewed
### Classification
- [ ] Failure type determined: [ ] Transient [ ] Recoverable [ ] Terminal
- [ ] Evidence documented
- [ ] Incident logged
### Response (choose path)
#### If Transient:
- [ ] Waited for auto-recovery (< 5 min)
- [ ] Verified agent responsive
- [ ] Resumed normal monitoring
#### If Recoverable:
- [ ] Manager notified
- [ ] Recovery strategy selected
- [ ] Recovery attempted
- [ ] Recovery verified OR escalated to replacement
#### If Terminal:
- [ ] Manager notified
- [ ] Replacement approval requested
- [ ] Artifacts preserved
- [ ] Replacement agent created
- [ ] Orchestrator notified
- [ ] Handoff documentation sent
- [ ] New agent acknowledged
- [ ] Incident closed
### Emergency Handoff (if deadline critical):
- [ ] Critical tasks identified
- [ ] Orchestrator notified
- [ ] Receiving agent assigned
- [ ] Handoff documentation created
- [ ] Work transferred
- [ ] Deadline met OR escalated
| Recovery Type | Output |
|---|---|
| Agent restart | Agent back online, state restored |
| Communication | Message queue cleared, connection restored |
| State | Corrupted state replaced with backup |
DETECT    -->  CLASSIFY     -->  RESPOND
  |               |                 |
  v               v                 v
Heartbeat     Transient?        Wait & Retry
timeout?  -->   Yes  -->        (auto-recover)
  |               |
Message           No
delivery          |
failed?           v
  |           Recoverable?
Agent     -->   Yes  -->        Restart / Wake
offline?          |             (intervention needed)
                  |
                  No
                  |
                  v
              Terminal   -->    Replace Agent
                                (full protocol)
| Phase | Action | Reference Document |
|---|---|---|
| 1 | Detect failure | failure-detection.md |
| 2 | Classify severity | failure-classification.md |
| 3 | Attempt recovery | recovery-strategies.md |
| 4 | Replace if terminal | agent-replacement-protocol.md |
| 5 | Emergency handoff | work-handoff-during-failure.md |
Before responding to a failure, ECOS must first detect that a failure has occurred.
Read references/failure-detection.md for:
| Mechanism | Signal | Response Time |
|---|---|---|
| Heartbeat timeout | Missed pings | 30-60 seconds |
| Message delivery failure | API error | Immediate |
| Message acknowledgment timeout | No ACK | 5-15 minutes |
| Task completion timeout | Stalled progress | Variable |
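The timeouts in the table above can be expressed as a simple check. This is a minimal sketch, not the skill's actual implementation: the `AgentObservation` fields, the function name `detect_failure_signals`, and the choice of the upper bound of each range as the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed thresholds: the upper bounds of the ranges in the table above.
HEARTBEAT_TIMEOUT_S = 60
ACK_TIMEOUT_S = 15 * 60


@dataclass
class AgentObservation:
    """Hypothetical snapshot of what ECOS knows about one agent."""
    seconds_since_heartbeat: float
    delivery_error: bool = False                  # messaging API returned an error
    seconds_awaiting_ack: Optional[float] = None  # None means no ACK is pending


def detect_failure_signals(obs: AgentObservation) -> List[str]:
    """Return the detection mechanisms from the table that have fired."""
    signals = []
    if obs.seconds_since_heartbeat > HEARTBEAT_TIMEOUT_S:
        signals.append("heartbeat-timeout")
    if obs.delivery_error:
        signals.append("message-delivery-failure")
    if obs.seconds_awaiting_ack is not None and obs.seconds_awaiting_ack > ACK_TIMEOUT_S:
        signals.append("ack-timeout")
    return signals
```

An empty list means no mechanism has fired. Task completion timeout is left out because the table gives it a variable response time.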
Once detected, classify severity to determine response.
Read references/failure-classification.md for:
| Category | Severity | Recovery | Example |
|---|---|---|---|
| Transient | Low | Automatic (< 5 min) | Network hiccup, API rate limit |
| Recoverable | Medium | With intervention | Session hibernated, out of memory |
| Terminal | High | Replacement required | Host crash, disk corruption |
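The classification table can be sketched as a lookup. The cause labels below are hypothetical identifiers for the examples in the table. Two rules are assumptions of this sketch and are not stated in the source: unknown causes default to recoverable, and a transient failure that lasts 5 minutes or more is promoted to recoverable.

```python
from enum import Enum


class FailureCategory(Enum):
    TRANSIENT = "transient"      # low severity, automatic recovery in < 5 min
    RECOVERABLE = "recoverable"  # medium severity, needs intervention
    TERMINAL = "terminal"        # high severity, replacement required


# Example causes from the table; the string keys are hypothetical labels.
CAUSE_TO_CATEGORY = {
    "network-hiccup": FailureCategory.TRANSIENT,
    "api-rate-limit": FailureCategory.TRANSIENT,
    "session-hibernated": FailureCategory.RECOVERABLE,
    "out-of-memory": FailureCategory.RECOVERABLE,
    "host-crash": FailureCategory.TERMINAL,
    "disk-corruption": FailureCategory.TERMINAL,
}


def classify_failure(cause: str, minutes_unresponsive: float) -> FailureCategory:
    """Map a cause to a category; unknown causes default to RECOVERABLE (assumption)."""
    category = CAUSE_TO_CATEGORY.get(cause, FailureCategory.RECOVERABLE)
    # Assumption: a transient failure that outlasts the 5-minute auto-recovery
    # window is promoted to recoverable so that someone intervenes.
    if category is FailureCategory.TRANSIENT and minutes_unresponsive >= 5:
        return FailureCategory.RECOVERABLE
    return category
```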
For transient and recoverable failures, attempt recovery before escalating.
Read references/recovery-strategies.md for:
| Strategy | When to Use | Time to Recover |
|---|---|---|
| Wait and Retry | Transient failures | 1-5 minutes |
| Restart | Hung/crashed agent | 5-15 minutes |
| Hibernate-Wake | Idle/suspended session | 2-5 minutes |
| Resource Adjustment | Memory/disk exhaustion | 15-60 minutes |
| Replace | All above failed | 30-120 minutes |
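One way to read the strategy table is as an escalation ladder: try the applicable strategies from cheapest to most expensive and replace the agent only when all of them have failed. The sketch below assumes that reading. The strategy identifiers and the `attempt` callback are hypothetical stand-ins for the real recovery operations.

```python
from typing import Callable, List

# Table order, cheapest first. "replace" is the last resort and is not listed here.
STRATEGY_ORDER = ["wait-and-retry", "restart", "hibernate-wake", "resource-adjustment"]


def run_recovery(candidates: List[str], attempt: Callable[[str], bool]) -> str:
    """Try each applicable strategy in table order; return the one that worked.

    `attempt` is a hypothetical callback that executes a strategy and reports
    whether the agent came back. If every candidate fails, return "replace".
    """
    for strategy in STRATEGY_ORDER:
        if strategy in candidates and attempt(strategy):
            return strategy
    return "replace"
```

For a hung agent, `run_recovery(["wait-and-retry", "restart"], attempt)` waits first, then restarts, and returns `"replace"` only if both fail.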
When recovery fails or failure is terminal, create a replacement agent.
Read references/agent-replacement-protocol.md for:
ECOS detects terminal failure
|
v
ECOS notifies EAMA (manager) --> EAMA approves
|
v
ECOS coordinates new agent creation
|
v
ECOS notifies EOA (orchestrator) to:
- Generate handoff document
- Update GitHub Project kanban
|
v
ECOS sends handoff docs to new agent
|
v
New agent acknowledges and begins work
CRITICAL: The replacement agent has NO MEMORY of the old agent.
The new agent does not know what tasks were assigned, what work was in progress, or the project context. Therefore:
ROLE BOUNDARY: ECOS creates agents and sends context. EOA owns task assignment.
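The replacement flow can be sketched as an ordered sequence that stops at the first step that does not complete, for example when EAMA does not approve. The step identifiers and the `execute` callback are hypothetical. The real operations are described in references/agent-replacement-protocol.md.

```python
from typing import Callable, List

# Steps in the order shown in the flow above (identifiers are hypothetical).
REPLACEMENT_STEPS = [
    "notify-eama",
    "await-eama-approval",
    "create-replacement-agent",
    "notify-eoa",               # EOA generates the handoff and updates the kanban
    "send-handoff-docs",
    "await-new-agent-ack",
]


def run_replacement(execute: Callable[[str], bool]) -> List[str]:
    """Run the steps in order and return those that completed.

    `execute` is a hypothetical callback that performs one step and returns
    True on success. The sequence halts at the first failure, so nothing
    downstream of a missing approval is ever executed.
    """
    completed = []
    for step in REPLACEMENT_STEPS:
        if not execute(step):
            break
        completed.append(step)
    return completed
```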
Use an emergency handoff when critical work cannot wait for the full replacement protocol.
Read references/work-handoff-during-failure.md for:
| Aspect | Regular Handoff | Emergency Handoff |
|---|---|---|
| Timing | After replacement ready | Immediately |
| Completeness | Full context | Minimum viable |
| Recipient | Replacement agent | Any available agent |
| Duration | Permanent | Temporary |
ECOS handles TWO types of escalations differently:
An agent failure occurs when an agent crashes, becomes unresponsive, or repeatedly fails. ECOS handles this by:
A task blocker occurs when work cannot proceed due to missing information, access, or a decision that only the user can make. When ECOS receives a task blocker escalation from EOA:
Note: Use the `agent-messaging` skill to send messages. The JSON structure below shows the message content.
{
  "from": "ecos-chief-of-staff",
  "to": "eama-assistant-manager",
  "subject": "BLOCKER: Task requires user decision",
  "priority": "high",
  "content": {
    "type": "blocker-escalation",
    "message": "A task is blocked and requires user input. EOA has escalated this after determining the blocker cannot be resolved by agents.",
    "task_uuid": "[task-uuid]",
    "issue_number": "[GitHub issue number of the blocked task]",
    "blocker_issue_number": "[GitHub issue number tracking the blocker problem]",
    "blocker_type": "user-decision",
    "blocker_description": "[What is blocking and why agents cannot resolve it]",
    "impact": "[Affected agents and tasks]",
    "options": ["[Options if available]"],
    "escalated_from": "eoa-[project-name]",
    "original_blocker_time": "[ISO8601 timestamp]"
  }
}
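The template above can be filled programmatically before it is handed to the `agent-messaging` skill. The helper below is a sketch only: the function name and its parameters are hypothetical, the field names and fixed values are copied from the JSON structure, and the helper only builds the message content without sending anything.

```python
from typing import Any, Dict, List


def build_blocker_escalation(
    task_uuid: str,
    issue_number: str,
    blocker_issue_number: str,
    blocker_description: str,
    impact: str,
    options: List[str],
    project_name: str,
    original_blocker_time: str,  # ISO8601 timestamp supplied by the caller
) -> Dict[str, Any]:
    """Fill the blocker-escalation template; keys match the JSON structure."""
    return {
        "from": "ecos-chief-of-staff",
        "to": "eama-assistant-manager",
        "subject": "BLOCKER: Task requires user decision",
        "priority": "high",
        "content": {
            "type": "blocker-escalation",
            "message": (
                "A task is blocked and requires user input. EOA has escalated "
                "this after determining the blocker cannot be resolved by agents."
            ),
            "task_uuid": task_uuid,
            "issue_number": issue_number,
            "blocker_issue_number": blocker_issue_number,
            "blocker_type": "user-decision",
            "blocker_description": blocker_description,
            "impact": impact,
            "options": options,
            "escalated_from": f"eoa-{project_name}",
            "original_blocker_time": original_blocker_time,
        },
    }
```

The returned dictionary contains only strings and lists, so it can be serialized with `json.dumps` as-is.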
ECOS receives escalation
│
├─ Is it an agent failure? (crash, unresponsive, repeated failure)
│ └─ YES → Handle via failure recovery workflow (this skill)
│
├─ Is it a task blocker that ECOS can resolve?
│ ├─ Agent reassignment → Handle directly
│ └─ Permission within authority → Handle directly
│
└─ Is it a task blocker requiring user input?
└─ YES → Route to EAMA using blocker-escalation template above
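The decision tree above reduces to a small routing function. This is a sketch: the escalation and blocker type strings are hypothetical labels, except `user-decision`, which is the `blocker_type` used in the escalation template. Raising on unrecognized input is a design choice of this sketch so that unexpected escalations are not silently misrouted.

```python
# Blockers ECOS may resolve itself, per the decision tree (labels are hypothetical).
ECOS_RESOLVABLE_BLOCKERS = {"agent-reassignment", "permission-within-authority"}


def route_escalation(escalation_type: str, blocker_type: str = "") -> str:
    """Return where an incoming escalation should be handled."""
    if escalation_type == "agent-failure":
        return "failure-recovery-workflow"
    if escalation_type == "task-blocker":
        if blocker_type in ECOS_RESOLVABLE_BLOCKERS:
            return "handle-directly"
        if blocker_type == "user-decision":
            return "route-to-eama"
    raise ValueError(f"unrecognized escalation: {escalation_type!r} / {blocker_type!r}")
```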
Copy this checklist and track your progress:
`blocker_issue_number` in the message (the GitHub issue tracking the blocker problem)
| Data | Location |
|---|---|
| Heartbeat configuration | $CLAUDE_PROJECT_DIR/.ecos/agent-health/heartbeat-config.json |
| Task tracking | $CLAUDE_PROJECT_DIR/.ecos/agent-health/task-tracking.json |
| Incident log | $CLAUDE_PROJECT_DIR/.ecos/agent-health/incident-log.jsonl |
| Recovery log | $CLAUDE_PROJECT_DIR/.ecos/agent-health/recovery-log.jsonl |
| Handoff documents | $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/AGENT_NAME/ |
| Emergency handoffs | $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/emergency/ |
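The `.jsonl` logs in the table are append-only files with one JSON object per line. The sketch below appends an incident record to `incident-log.jsonl` under the path shown above. The record fields (`timestamp`, `agent`, `category`, `evidence`) are assumptions, since the source does not specify the log schema, and the `project_dir` override exists only to make the helper easy to try outside a real project.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def health_dir(project_dir: str = "") -> Path:
    """Resolve .ecos/agent-health under $CLAUDE_PROJECT_DIR (or an explicit override)."""
    root = project_dir or os.environ.get("CLAUDE_PROJECT_DIR", ".")
    return Path(root) / ".ecos" / "agent-health"


def log_incident(agent: str, category: str, evidence: str, project_dir: str = "") -> Path:
    """Append one incident record as a single JSON line and return the log path."""
    log_path = health_dir(project_dir) / "incident-log.jsonl"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = {  # hypothetical schema, not defined by the source
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "category": category,
        "evidence": evidence,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return log_path
```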
| Situation | Priority | Message Type |
|---|---|---|
| Transient failure (pattern) | normal | escalation |
| Recoverable failure detected | high | failure-report |
| Recovery attempt failed | high | failure-report |
| Terminal failure detected | urgent | replacement-request |
| Emergency handoff initiated | urgent | emergency-handoff-notification |
| Replacement complete | normal | replacement-complete |
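Keeping the table above as data avoids choosing the wrong priority by hand. In this sketch the dictionary keys are hypothetical identifiers, while the priority and message type values are copied from the table.

```python
from typing import Dict, Tuple

# (priority, message type) per situation, copied from the table above.
NOTIFICATION_RULES: Dict[str, Tuple[str, str]] = {
    "transient-failure-pattern": ("normal", "escalation"),
    "recoverable-failure-detected": ("high", "failure-report"),
    "recovery-attempt-failed": ("high", "failure-report"),
    "terminal-failure-detected": ("urgent", "replacement-request"),
    "emergency-handoff-initiated": ("urgent", "emergency-handoff-notification"),
    "replacement-complete": ("normal", "replacement-complete"),
}


def notification_header(situation: str) -> Dict[str, str]:
    """Look up the priority and message type; raises KeyError for unknown situations."""
    priority, message_type = NOTIFICATION_RULES[situation]
    return {"priority": priority, "type": message_type}
```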
Step-by-step runbooks for executing individual failure recovery operations. Use these when performing a specific operation within the failure recovery workflow.
Common issues when recovering from agent failures.
Read references/troubleshooting.md for:
Before sending any handoff document (regular or emergency), validate using this checklist:
### Handoff Validation Checklist
Before sending handoff:
- [ ] All required fields present (from/to/type/UUID/task)
- [ ] UUID is unique (check existing handoffs: `ls $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/`)
- [ ] Target agent exists and is alive (use the `ai-maestro-agents-management` skill to list agents and verify the target is online)
- [ ] All referenced files exist (`test -f <path> && echo "EXISTS" || echo "MISSING"`)
- [ ] No placeholder [TBD] markers (`grep -r "\[TBD\]" handoff.md`)
- [ ] Document is valid markdown (no broken links, proper formatting)
- [ ] Acceptance criteria clearly defined
- [ ] Current state accurately reflects reality
- [ ] Contact information for questions provided
Required fields for failure recovery handoffs:
| Field | Description | Example |
|---|---|---|
| `from` | Sending agent name | ecos-chief-of-staff |
| `to` | Target agent name | replacement-agent-001 |
| `type` | Handoff type | emergency-handoff, replacement-handoff |
| `UUID` | Unique handoff identifier | EH-20250204-svgbbox-001 |
| `task` | Task being handed off | Implement bounding box calculation |
| `failed_agent` | Name of failed agent | libs-svg-svgbbox |
| `failure_reason` | Why agent failed | Terminal crash - disk corruption |
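Part of the validation checklist can be automated. The sketch below covers three items: required fields present, no `[TBD]` markers, and UUID uniqueness. It assumes the handoff metadata is available as a dictionary and the document body as a string, which the source does not specify. The function name and return format are hypothetical.

```python
from typing import Dict, Iterable, List

# Required fields from the table above.
REQUIRED_FIELDS = ["from", "to", "type", "UUID", "task", "failed_agent", "failure_reason"]


def validate_handoff(
    handoff: Dict[str, str],
    body: str,
    existing_uuids: Iterable[str] = (),
) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not handoff.get(field, "").strip():
            problems.append(f"missing field: {field}")
    if "[TBD]" in body:
        problems.append("placeholder [TBD] marker present")
    if handoff.get("UUID") in set(existing_uuids):
        problems.append("UUID is not unique")
    return problems
```

The remaining checklist items, such as whether the target agent is alive or whether the current state reflects reality, still need to be verified manually.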
| Error | Cause | Resolution |
|---|---|---|
| Agent unresponsive | Network issue or crash | Send ping, wait 30s, then classify |
| Recovery failed | State corrupted | Escalate to terminal, request replacement |
| Handoff rejected | Target agent busy | Queue handoff, retry in 5 minutes |
| AI Maestro unavailable | Server down | Use fallback file-based communication |
Recovery scenarios with step-by-step commands.
Read references/examples.md for:
npx claudepluginhub emasoft/emasoft-plugins --plugin emasoft-chief-of-staff