Skill

ecos-failure-recovery

From emasoft-chief-of-staff

Use when recovering from agent failures or coordinating agent replacements. Trigger with failure events.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/emasoft-chief-of-staff:ecos-failure-recovery

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This skill teaches the Emasoft Chief of Staff (ECOS) how to detect, classify, and recover from agent failures in a multi-agent system coordinated via AI Maestro messaging.

Supporting Files

references/agent-replacement-protocol.mdreferences/examples.mdreferences/failure-classification.mdreferences/failure-detection.mdreferences/op-classify-failure-severity.mdreferences/op-detect-agent-failure.mdreferences/op-emergency-handoff.mdreferences/op-execute-recovery-strategy.mdreferences/op-replace-agent.mdreferences/op-route-task-blocker.mdreferences/recovery-operations.mdreferences/recovery-strategies.mdreferences/troubleshooting.mdreferences/work-handoff-during-failure.md

SKILL.md

472 lines · ~4.7k tokens

Stats

LanguagePython

Stars0

MaintenanceExcellent

Last CommitFeb 15, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

ECOS Failure Recovery Skill

Overview

This skill teaches the Emasoft Chief of Staff (ECOS) how to detect, classify, and recover from agent failures in a multi-agent system coordinated via AI Maestro messaging.

When to use this skill:

When monitoring remote agent health
When an agent becomes unresponsive
When an agent crashes or terminates unexpectedly
When work must be transferred due to agent failure
When a replacement agent must be created and onboarded

Prerequisites

Before using this skill, ensure:

AI Maestro is running locally
Agent registry is accessible
Recovery scripts are available in scripts/

Instructions

Detect the failure using heartbeat monitoring, message delivery status, or task completion timeouts
Classify the failure as transient, recoverable, or terminal using the classification criteria
Apply the appropriate recovery strategy based on failure type
If recovery fails or failure is terminal, initiate agent replacement protocol
For critical deadlines, use emergency handoff to transfer work immediately
Document all incidents in the incident log and notify EAMA of outcomes

Checklist

Copy this checklist and track your progress:

## ECOS Failure Response Checklist

Agent: _______________
Failure detected: _______________

### Detection
- [ ] Heartbeat status checked
- [ ] AI Maestro agent status queried
- [ ] Message delivery verified
- [ ] Task progress reviewed

### Classification
- [ ] Failure type determined: [ ] Transient [ ] Recoverable [ ] Terminal
- [ ] Evidence documented
- [ ] Incident logged

### Response (choose path)

#### If Transient:
- [ ] Waited for auto-recovery (< 5 min)
- [ ] Verified agent responsive
- [ ] Resumed normal monitoring

#### If Recoverable:
- [ ] Manager notified
- [ ] Recovery strategy selected
- [ ] Recovery attempted
- [ ] Recovery verified OR escalated to replacement

#### If Terminal:
- [ ] Manager notified
- [ ] Replacement approval requested
- [ ] Artifacts preserved
- [ ] Replacement agent created
- [ ] Orchestrator notified
- [ ] Handoff documentation sent
- [ ] New agent acknowledged
- [ ] Incident closed

### Emergency Handoff (if deadline critical):
- [ ] Critical tasks identified
- [ ] Orchestrator notified
- [ ] Receiving agent assigned
- [ ] Handoff documentation created
- [ ] Work transferred
- [ ] Deadline met OR escalated

Output

Recovery Type	Output
Agent restart	Agent back online, state restored
Communication	Message queue cleared, connection restored
State	Corrupted state replaced with backup

Quick Reference: Failure Response Workflow

DETECT --> CLASSIFY --> RESPOND
   |           |           |
   v           v           v
Heartbeat    Transient?    Wait & Retry
timeout?     --> Yes -->   (auto-recover)
   |              |
Message          No
delivery         |
failed?          v
   |         Recoverable?
Agent        --> Yes -->   Restart / Wake
offline?          |        (intervention needed)
                  |
                  No
                  |
                  v
              Terminal -->  Replace Agent
                           (full protocol)

Procedures Summary

Phase	Action	Reference Document
1	Detect failure	failure-detection.md
2	Classify severity	failure-classification.md
3	Attempt recovery	recovery-strategies.md
4	Replace if terminal	agent-replacement-protocol.md
5	Emergency handoff	work-handoff-during-failure.md

Phase 1: Failure Detection

Before responding to a failure, ECOS must first detect that a failure has occurred.

Read references/failure-detection.md for:

1.1-1.2 Overview and when to use
1.3 Heartbeat monitoring via AI Maestro
1.4 Message delivery failure detection
1.5 Task completion timeout detection
1.6 Agent status API queries
1.7 Decision flowchart

Mechanism	Signal	Response Time
Heartbeat timeout	Missed pings	30-60 seconds
Message delivery failure	API error	Immediate
Message acknowledgment timeout	No ACK	5-15 minutes
Task completion timeout	Stalled progress	Variable

Phase 2: Failure Classification

Once detected, classify severity to determine response.

Read references/failure-classification.md for:

2.1-2.2 Overview and categories
2.3 Transient failures (auto-recover < 5 min)
2.4 Recoverable failures (intervention needed)
2.5 Terminal failures (replacement required)
2.6-2.8 Decision matrix and escalation

Category	Severity	Recovery	Example
Transient	Low	Automatic (< 5 min)	Network hiccup, API rate limit
Recoverable	Medium	With intervention	Session hibernated, out of memory
Terminal	High	Replacement required	Host crash, disk corruption

Phase 3: Recovery Strategies

For transient and recoverable failures, attempt recovery before escalating.

Read references/recovery-strategies.md for:

3.1-3.2 Overview and strategy selection
3.3 Wait and Retry (transient)
3.4 Restart Agent (soft/hard procedures)
3.5 Hibernate-Wake Cycle
3.6 Resource Adjustment
3.7-3.8 When to replace and flowchart

Strategy	When to Use	Time to Recover
Wait and Retry	Transient failures	1-5 minutes
Restart	Hung/crashed agent	5-15 minutes
Hibernate-Wake	Idle/suspended session	2-5 minutes
Resource Adjustment	Memory/disk exhaustion	15-60 minutes
Replace	All above failed	30-120 minutes

Phase 4: Agent Replacement Protocol

When recovery fails or failure is terminal, create a replacement agent.

Read references/agent-replacement-protocol.md for:

4.1-4.2 Overview
4.3 Phase 1: Failure confirmation and artifact preservation
4.4 Phase 2: Manager notification and approval
4.5 Phase 3: Creating replacement agent
4.6 Phase 4: Orchestrator notification
4.7 Phase 5: Work handoff to new agent
4.8-4.9 Cleanup and complete checklist

Replacement Protocol Summary

ECOS detects terminal failure
         |
         v
ECOS notifies EAMA (manager) --> EAMA approves
         |
         v
ECOS coordinates new agent creation
         |
         v
ECOS notifies EOA (orchestrator) to:
  - Generate handoff document
  - Update GitHub Project kanban
         |
         v
ECOS sends handoff docs to new agent
         |
         v
New agent acknowledges and begins work

Key Consideration: Memory Loss

CRITICAL: The replacement agent has NO MEMORY of the old agent.

The new agent does not know what tasks were assigned, what work was in progress, or the project context. Therefore:

Orchestrator (EOA) must generate handoff documentation
EOA must reassign tasks in GitHub Project kanban
ECOS must send handoff docs to new agent

ROLE BOUNDARY: ECOS creates agents and sends context. EOA owns task assignment.

Phase 5: Emergency Work Handoff

When critical work cannot wait for full replacement protocol.

Read references/work-handoff-during-failure.md for:

5.1-5.2 Overview and when to use
5.3 Triggering emergency handoff
5.4 Creating emergency handoff documentation
5.5 Reassigning work during failure
5.6 Emergency handoff message formats
5.7 Post-failure work reconciliation

Aspect	Regular Handoff	Emergency Handoff
Timing	After replacement ready	Immediately
Completeness	Full context	Minimum viable
Recipient	Replacement agent	Any available agent
Duration	Permanent	Temporary

Task Blockers vs Agent Failures

ECOS handles TWO types of escalations differently:

Agent Failures (ECOS resolves directly)

An agent failure occurs when an agent crashes, becomes unresponsive, or repeatedly fails. ECOS handles this by:

Detecting the failure (heartbeat, message timeout, task timeout)
Attempting recovery (wake, restart)
If unrecoverable: replacing the agent and handing off work
Notifying EAMA of the emergency handoff (notification only, no approval needed)

Task Blockers (ECOS routes to EAMA for user decision)

A task blocker occurs when work cannot proceed due to missing information, access, or a decision that only the user can make. When ECOS receives a task blocker escalation from EOA:

Determine if ECOS can resolve the blocker:
- Agent reassignment needed → ECOS handles directly
- Permission or access issue within ECOS authority → ECOS handles directly
- User decision or input needed → Route to EAMA
If routing to EAMA, use this template:

Note: Use the agent-messaging skill to send messages. The JSON structure below shows the message content.

{
  "from": "ecos-chief-of-staff",
  "to": "eama-assistant-manager",
  "subject": "BLOCKER: Task requires user decision",
  "priority": "high",
  "content": {
    "type": "blocker-escalation",
    "message": "A task is blocked and requires user input. EOA has escalated this after determining the blocker cannot be resolved by agents.",
    "task_uuid": "[task-uuid]",
    "issue_number": "[GitHub issue number of the blocked task]",
    "blocker_issue_number": "[GitHub issue number tracking the blocker problem]",
    "blocker_type": "user-decision",
    "blocker_description": "[What is blocking and why agents cannot resolve it]",
    "impact": "[Affected agents and tasks]",
    "options": ["[Options if available]"],
    "escalated_from": "eoa-[project-name]",
    "original_blocker_time": "[ISO8601 timestamp]"
  }
}

Track the blocker in ECOS records
When EAMA responds with user's decision, route it back to EOA

Decision Tree

ECOS receives escalation
  │
  ├─ Is it an agent failure? (crash, unresponsive, repeated failure)
  │   └─ YES → Handle via failure recovery workflow (this skill)
  │
  ├─ Is it a task blocker that ECOS can resolve?
  │   ├─ Agent reassignment → Handle directly
  │   └─ Permission within authority → Handle directly
  │
  └─ Is it a task blocker requiring user input?
      └─ YES → Route to EAMA using blocker-escalation template above

Checklist: Routing a Task Blocker Escalation

Copy this checklist and track your progress:

Checklist: When EAMA Returns a Blocker Resolution

Copy this checklist and track your progress:

Receive blocker-resolution message from EAMA
Verify the resolution includes the user's exact decision (RULE 14)
Route the resolution to EOA via AI Maestro
Verify EOA acknowledges receipt
Note: EOA will close the blocker issue, restore the task to its previous column, and notify the agent
Update ECOS records to mark the blocker as resolved

File Locations

Data	Location
Heartbeat configuration	`$CLAUDE_PROJECT_DIR/.ecos/agent-health/heartbeat-config.json`
Task tracking	`$CLAUDE_PROJECT_DIR/.ecos/agent-health/task-tracking.json`
Incident log	`$CLAUDE_PROJECT_DIR/.ecos/agent-health/incident-log.jsonl`
Recovery log	`$CLAUDE_PROJECT_DIR/.ecos/agent-health/recovery-log.jsonl`
Handoff documents	`$CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/AGENT_NAME/`
Emergency handoffs	`$CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/emergency/`

Manager Notification Priorities

Situation	Priority	Message Type
Transient failure (pattern)	`normal`	`escalation`
Recoverable failure detected	`high`	`failure-report`
Recovery attempt failed	`high`	`failure-report`
Terminal failure detected	`urgent`	`replacement-request`
Emergency handoff initiated	`urgent`	`emergency-handoff-notification`
Replacement complete	`normal`	`replacement-complete`

Operational Procedures

Step-by-step runbooks for executing individual failure recovery operations. Use these when performing a specific operation within the failure recovery workflow.

op-detect-agent-failure.md - Detect Agent Failure: Check heartbeat status, query AI Maestro agent status, verify message delivery, and determine if agent is unresponsive or crashed
op-classify-failure-severity.md - Classify Failure Severity: Classify a detected failure as transient, recoverable, or terminal using the decision matrix to determine the appropriate response strategy
op-execute-recovery-strategy.md - Execute Recovery Strategy: Apply the appropriate recovery strategy (wait and retry, soft/hard restart, hibernate-wake cycle, or resource adjustment) based on failure classification
op-replace-agent.md - Replace Agent: Create a replacement agent when recovery fails or failure is terminal, including failure confirmation, manager approval, agent creation, orchestrator notification, and work handoff
op-emergency-handoff.md - Emergency Work Handoff: Transfer critical work immediately when deadlines cannot wait for the full agent replacement protocol
op-route-task-blocker.md - Route Task Blocker: Determine how to handle task blocker escalations -- resolve directly if within ECOS authority, or route to EAMA if user decision is required

Troubleshooting

Common issues when recovering from agent failures.

Read references/troubleshooting.md for:

Agent shows online but unresponsive -> Verify hooks and send status inquiry
Cannot determine failure type -> Default to recoverable, attempt strategies in order
Manager does not respond -> Wait 15 min, send reminder, escalate to user if needed
New replacement agent fails to register -> Verify AI Maestro health and hooks
Emergency handoff deadline missed -> Document, notify stakeholders, conduct post-mortem

Handoff Validation Checklist

Before sending any handoff document (regular or emergency), validate using this checklist:

### Handoff Validation Checklist

Before sending handoff:
- [ ] All required fields present (from/to/type/UUID/task)
- [ ] UUID is unique (check existing handoffs: `ls $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/`)
- [ ] Target agent exists and is alive (use the `ai-maestro-agents-management` skill to list agents and verify the target is online)
- [ ] All referenced files exist (`test -f <path> && echo "EXISTS" || echo "MISSING"`)
- [ ] No placeholder [TBD] markers (`grep -r "\[TBD\]" handoff.md`)
- [ ] Document is valid markdown (no broken links, proper formatting)
- [ ] Acceptance criteria clearly defined
- [ ] Current state accurately reflects reality
- [ ] Contact information for questions provided

Required fields for failure recovery handoffs:

Field	Description	Example
`from`	Sending agent name	`ecos-chief-of-staff`
`to`	Target agent name	`replacement-agent-001`
`type`	Handoff type	`emergency-handoff`, `replacement-handoff`
`UUID`	Unique handoff identifier	`EH-20250204-svgbbox-001`
`task`	Task being handed off	`Implement bounding box calculation`
`failed_agent`	Name of failed agent	`libs-svg-svgbbox`
`failure_reason`	Why agent failed	`Terminal crash - disk corruption`

Error Handling

Error	Cause	Resolution
Agent unresponsive	Network issue or crash	Send ping, wait 30s, then classify
Recovery failed	State corrupted	Escalate to terminal, request replacement
Handoff rejected	Target agent busy	Queue handoff, retry in 5 minutes
AI Maestro unavailable	Server down	Use fallback file-based communication

Examples

Recovery scenarios with step-by-step commands.

Read references/examples.md for:

Agent crash recovery (recoverable -> restart -> verify)
Terminal failure with replacement (3 crashes -> replace -> handoff)
Transient network failure (wait -> auto-recover)
Emergency handoff with deadline (immediate reassignment)
Quick command reference (heartbeat, status, restart, approval, handoff)

Resources

references/failure-detection.md - Detection procedures
references/failure-classification.md - Classification criteria
references/recovery-strategies.md - Recovery steps
references/recovery-operations.md - Recovery operations procedures
references/agent-replacement-protocol.md - Replacement workflow
references/work-handoff-during-failure.md - Work transfer procedures
references/troubleshooting.md - Common issues and solutions
references/examples.md - Complete recovery examples

ecos-failure-recovery

Invocation

Context Preview

Supporting Files

SKILL.md

ecos-failure-recovery

Invocation

Context Preview

Supporting Files

SKILL.md

ECOS Failure Recovery Skill

Overview

Prerequisites

Instructions

Checklist

Output

Quick Reference: Failure Response Workflow

Procedures Summary

Phase 1: Failure Detection

Phase 2: Failure Classification

Phase 3: Recovery Strategies

Phase 4: Agent Replacement Protocol

Replacement Protocol Summary

Key Consideration: Memory Loss

Phase 5: Emergency Work Handoff

Task Blockers vs Agent Failures

Agent Failures (ECOS resolves directly)

Task Blockers (ECOS routes to EAMA for user decision)

Decision Tree

Checklist: Routing a Task Blocker Escalation

Checklist: When EAMA Returns a Blocker Resolution

File Locations

Manager Notification Priorities

Operational Procedures

Troubleshooting

Handoff Validation Checklist

Error Handling

Examples

Resources

Similar Skills

ECOS Failure Recovery Skill

Overview

Prerequisites

Instructions

Checklist

Output

Quick Reference: Failure Response Workflow

Procedures Summary

Phase 1: Failure Detection

Phase 2: Failure Classification

Phase 3: Recovery Strategies

Phase 4: Agent Replacement Protocol

Replacement Protocol Summary

Key Consideration: Memory Loss

Phase 5: Emergency Work Handoff

Task Blockers vs Agent Failures

Agent Failures (ECOS resolves directly)

Task Blockers (ECOS routes to EAMA for user decision)

Decision Tree

Checklist: Routing a Task Blocker Escalation

Checklist: When EAMA Returns a Blocker Resolution

File Locations

Manager Notification Priorities

Operational Procedures

Troubleshooting

Handoff Validation Checklist

Error Handling

Examples

Resources

Similar Skills