Error Recovery Protocol | agent-guardrails

Stats

Actions

Tags

Error Recovery Protocol | agent-guardrails

Error Recovery Protocol

How to recover from failures, errors, and unexpected states without making things worse.

Recovery Principles

Stop the bleeding first
Understand before fixing
Never make two changes at once
Have a rollback plan

Recovery Steps

Step 1: Stop and Assess

Do NOT immediately try another fix
Note the exact error message and context
Determine if the failure is isolated or cascading
Check if any data was corrupted or lost

Step 2: Log the Failure

Document what was attempted
Document the exact error
Document the state of the system before and after
This feeds into the failure registry for regression prevention

Step 3: Identify Root Cause

Read the error carefully (stack traces, logs, output)
Look for the FIRST error, not the last (cascading errors hide the root)
Check for recent changes that may have triggered it
Verify environment state (files, configs, dependencies)

Step 4: Choose Recovery Path

Situation	Action
Clear cause, safe fix	Apply targeted fix
Unclear cause	HALT and ask user for guidance
Data corruption	Restore from backup or rollback
Environment issue	Rebuild/reset environment
Dependency issue	Update, downgrade, or pin dependency

Step 5: Verify Recovery

Reproduce the original scenario
Confirm the fix resolves the issue
Check no new issues were introduced
Run relevant tests

Forbidden Recovery Patterns

NEVER do these when recovering from errors:

Blind retry — Trying the same thing again without understanding why it failed
Shotgun debugging — Changing multiple things at once hoping one works
Comment-and-forget — Commenting out broken code instead of fixing it
Production hotfix without testing — Pushing fixes directly to production
Ignoring test failures — "It's just a flaky test" without investigation

Rollback Checklist

Before any significant change, know your rollback plan:

Can you revert the change with a single git command?
Is the previous version known to work?
Can you restore data if needed?
Is there a database migration to undo?
Will rollback affect other users/systems?

Escalation Criteria

Escalate to user when:

Root cause cannot be determined in 2 attempts
Recovery requires destructive operations (delete, reset, restore from backup)
Error affects production data
You're on your 3rd recovery attempt (Three Strikes Rule)

Pi Enforcement

When running in pi, error recovery is supported by the @architectit/pi-guardrails extension:

Three Strikes: guardrail_record_attempt tracks failures automatically — at 3 strikes, halting is recommended
Violation logging: guardrail_log_violation records failure context for post-mortem analysis
Sandbox: guardrail_mcp with action sandbox_run provides isolated execution for testing recovery steps
Bash safety: Prevents destructive commands during recovery (no rm -rf, sudo, etc.)

See [[sandbox-isolation]] and [[guardrails-core]] for details.

Task

Apply the Error Recovery Protocol to the current failure or error situation. Guide the user through stopping, assessing, understanding, and fixing the problem without making it worse. Follow the recovery steps above and escalate when the escalation criteria are met.

References

docs/workflows/ROLLBACK_PROCEDURES.md — Detailed rollback procedures
docs/workflows/AGENT_ESCALATION.md — When and how to escalate
docs/standards/TEST_PRODUCTION_SEPARATION.md — Environment isolation