Autonomous close-out orchestrator — 4-phase pipeline with worktree health sweep, full merge-sweep with DIRTY PR triage and queue stall detection, infra health gate, quality sweeps (dod-sweep with per-ticket verification, aislop-sweep, bus-audit, gap detect), integration-sweep hard gate, Playwright regression gate, release, redeploy, and post-release verification (verify-plugin, dashboard-sweep, container health). Compounds — each cycle's merged infrastructure makes the next cycle's gate stricter.
Slash command: /onex:autopilot
**Skill ID**: `onex:autopilot`
Version: 3.0.0
Owner: omniclaude
Ticket: OMN-6872
Epic: OMN-5431
Autopilot runs as independent headless claude -p invocations via scripts/cron-closeout.sh:
- cron-closeout.sh invokes claude -p per work unit (one phase per invocation)
- .onex_state/pipeline_checkpoints/ and .onex_state/autopilot/cycle-state.yaml
- scripts/headless-emit-wrapper.sh (unified team event schema)
- /agent-coordination

No CronCreate. The CronCreate / /loop pattern is retired for autopilot. CronCreate
fires within a single session, causing context accumulation that exhausts the context window
after 2-3 passes (9 recorded friction events). Headless claude -p eliminates this by design.
No poly dispatch. Autopilot phases execute directly via claude -p with scoped tool
allowlists. The polymorphic-agent indirection is unnecessary for headless invocations where
each phase has a fixed prompt and tool set.
Checkpoint-resume: Each phase writes its result to {run_dir}/{phase_name}.txt. If a
claude -p invocation is interrupted (rate limit, network drop, process kill), re-running
cron-closeout.sh starts a new run from Phase A. Individual phase outputs from previous
runs are preserved in .onex_state/autopilot/runs/ for audit and debugging.
Required environment:
| Variable | Purpose |
|---|---|
| ONEX_RUN_ID | Auto-generated by cron-closeout.sh per run |
| ONEX_UNSAFE_ALLOW_EDITS | Set to 1 by the script for phases that need write access |
| ANTHROPIC_API_KEY | Required for claude -p (sourced from ~/.omnibase/.env) |
| GITHUB_TOKEN | Required for PR operations via gh CLI |
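
A preflight check for these variables could look like the sketch below. The variable names come from the table; the helper `missing_env` is hypothetical and is not part of cron-closeout.sh. ONEX_UNSAFE_ALLOW_EDITS is left out because the script sets it only for write phases.

```python
import os

# Names taken from the table above. The helper itself is a hypothetical
# illustration, not something the source says cron-closeout.sh contains.
REQUIRED_VARS = ("ONEX_RUN_ID", "ANTHROPIC_API_KEY", "GITHUB_TOKEN")


def missing_env(env=None):
    """Return the required variables that are unset or empty, in table order."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit("missing required environment: " + ", ".join(missing))
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise without touching the real environment.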
Top-level autonomous close-out orchestrator.
In --mode close-out, autopilot executes the full pipeline in 4 phases:
Phase A — Prepare (sequential):
Phase B — Quality Gate (B1-B4 parallel, B4b data verification parallel advisory, B5-B6 sequential hard gates):
- /database-sweep --dry-run — projection table health
- /data-flow-sweep --dry-run --skip-playwright — end-to-end pipeline check
- /runtime-sweep --dry-run — node registration and wiring integrity

Findings appended to close-day report. Non-blocking — does NOT halt pipeline.

Runs three checks with content assertions:
- node_service_registry — assert no UUID-only names, no test-* entries

Halt policy:
This gate runs AFTER B1-B4 (infrastructure gates) and BEFORE B5 (integration sweep). It catches the class of bug where infrastructure is healthy but data is garbage.
Phase 1 data-content gating uses coarse severity classes for operational simplicity. Future refinement should distinguish dominant user-facing corruption (hard fail) from isolated or lower-confidence content anomalies (warn or quarantine). For example: one or two garbage rows in a large table may warrant quarantine rather than halt, while dominant UUID-only names across the registry is a clear halt.
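
The refinement described above (quarantine isolated anomalies, halt on dominant corruption) might be sketched as follows. The function names and the 0.5 `dominant_ratio` threshold are assumptions made for illustration; the source does not define a cutoff.

```python
import re

# Matches a name that is nothing but a canonical UUID string.
UUID_ONLY = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def is_garbage(name):
    """True for UUID-only names and test-* entries (the registry assertions)."""
    return bool(UUID_ONLY.match(name)) or name.startswith("test-")


def classify(names, dominant_ratio=0.5):
    """Return 'pass', 'quarantine', or 'halt' for a list of registry names.

    dominant_ratio is a hypothetical threshold: at or above it, garbage is
    treated as dominant corruption and the gate halts.
    """
    if not names:
        return "pass"
    bad = sum(1 for name in names if is_garbage(name))
    if bad == 0:
        return "pass"
    if bad / len(names) >= dominant_ratio:
        return "halt"
    return "quarantine"
```

With this shape, two bad rows among a hundred yield `quarantine`, while a registry made up entirely of UUID-only names yields `halt`.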
B1-B4 are read-only audits, safe to parallelize. Failures in B1-B4 are logged and increment the circuit breaker but do NOT halt the pipeline. B5 and B6 have halt authority.
Phase C — Ship (sequential):
Phase D — Verify (D1-D3 parallel, D4 sequential):
D1-D3 are read-only verification. Failures are logged with warnings but do NOT halt — the release and redeploy already completed successfully.
Note: This is a 20-step pipeline (A0-A3 including A1b, B1-B8, C1-C2, D1-D5). Internal step IDs use the
{phase}{ordinal} scheme for stable naming in cycle records, circuit breaker logs, and
downstream debugging.
Compounding principle: Step A2 (deploy-local-plugin) ensures that quality sweeps in Phase B run with the latest enforcement tools. Each cycle's merged infrastructure makes the next cycle's gate stricter.
In --mode build (default), autopilot queries Linear for unblocked Todo tickets and
dispatches onex:ticket-pipeline for each. Full build-mode spec is in OMN-5120.
/autopilot
/autopilot --mode close-out
/autopilot --mode close-out --require-gate
/autopilot --mode build
Autopilot is invoked exclusively via headless claude -p through scripts/cron-closeout.sh.
Each phase runs as a separate claude -p invocation with a fresh context window.
Architecture follows the headless decomposition pattern from
omnibase_infra/docs/patterns/headless_decomposition.md:
- scripts/headless-emit-wrapper.sh (unified team events to Kafka)

# Direct invocation (one full close-out cycle)
./scripts/cron-closeout.sh
# Dry run — prints phases without executing claude -p
./scripts/cron-closeout.sh --dry-run
# Via crontab (every 30 minutes)
*/30 * * * * $OMNI_HOME/omniclaude/scripts/cron-closeout.sh >> /tmp/cron-closeout.log 2>&1 # local-path-ok: crontab example
# Via launchd (macOS)
# Create ~/Library/LaunchAgents/com.omninode.cron-closeout.plist
State layout:
.onex_state/autopilot/
  cycle-state.yaml                    # Cross-run state (deployed versions, strikes)
  cron-closeout.lock                  # Concurrency guard (auto-removed on exit)
  runs/
    closeout-2026-03-28T22-00-00Z/    # Per-run directory
      A1_merge_sweep.txt              # Phase output
      A2_deploy_plugin.txt
      A3_start_env.txt
      B5_integration.txt              # Hard gate output
      C1_release_check.txt
      C2_redeploy_check.txt
      D3_dashboard_sweep.txt
      pending_redeploys.txt           # F30 detection result
      summary.txt                     # Run summary
.onex_state/pipeline_checkpoints/     # Checkpoint-resume state
  autopilot/
    {run_id}.yaml                     # Per-run checkpoint with completed phases
Phases executed (each a separate claude -p invocation):
| Phase | Name | Gate? | Description |
|---|---|---|---|
| A0 | worktree-health | No | prune-worktrees.sh --execute — clean merged worktrees, skip unpushed/dirty [OMN-7021] |
| A1 | merge-sweep | No | Drain open PRs with passing CI |
| A2 | deploy-plugin | No | Copy plugin to cache |
| A3 | infra-health | No | Verify postgres, redpanda, valkey |
| B1 | runtime-sweep | Hard | Containers healthy, node dispatch alive [OMN-7002] |
| B2 | data-flow-sweep | Hard | Kafka consumers active, projections populated [OMN-7002] |
| B3 | database-sweep | Hard | Projection tables have data [OMN-7002] |
| B5 | integration-gate | Hard | Postgres + Redpanda must be healthy |
| C1 | release-check | No | Report unreleased commits per repo |
| C2 | redeploy-check | Conditional | Only if F30 detects version drift |
| D3 | dashboard-sweep | No | Non-blocking health check |
F30 pending redeploy detection: Before Phase C, the script compares git tags
in each repo against last_deploy_version in cycle-state.yaml. If any tag has
advanced beyond the recorded version, the repo is flagged for redeploy.
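
The F30 comparison can be illustrated with a small sketch. It assumes tags look like `v1.2.3`, and that a repo with no recorded version is flagged; neither detail is stated in the source, and the function names are hypothetical.

```python
def parse_version(tag):
    """Parse 'v1.2.3' or '1.2.3' into a tuple of ints for ordered comparison."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def pending_redeploys(latest_tags, last_deployed):
    """Return repos whose latest tag is newer than the recorded deploy version.

    latest_tags:   {repo: newest git tag}
    last_deployed: {repo: last_deploy_version as recorded in cycle-state.yaml}
    Assumption: a repo missing from last_deployed is flagged for redeploy.
    """
    flagged = []
    for repo, tag in sorted(latest_tags.items()):
        recorded = last_deployed.get(repo)
        if recorded is None or parse_version(tag) > parse_version(recorded):
            flagged.append(repo)
    return flagged
```

Comparing integer tuples rather than raw strings avoids ordering `v1.10.0` before `v1.9.3`.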
Circuit breaker: 3 consecutive phase failures → pipeline halts with exit code 2. Resets on any successful integration gate pass.
Lock timeout: 45 minutes. If a previous run's lock is older than this, it is treated as stale and removed.
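
A stale-lock check along these lines is one way to read the rule above. It assumes lock age is measured from the lock file's modification time, which the source does not specify; the function names are hypothetical.

```python
import os
import time

LOCK_TIMEOUT_SECONDS = 45 * 60  # 45 minutes, per the lock timeout above


def is_stale(lock_mtime, now):
    """True when the lock is older than the timeout."""
    return (now - lock_mtime) > LOCK_TIMEOUT_SECONDS


def lock_is_stale(lock_path):
    """Check a lock file on disk; a missing lock is not stale, just absent.

    Assumption: age is taken from the file's mtime.
    """
    try:
        mtime = os.path.getmtime(lock_path)
    except FileNotFoundError:
        return False
    return is_stale(mtime, time.time())
```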
| overall_status | reason | Action |
|---|---|---|
| FAIL | any | HALT — report failed surface(s), do NOT proceed to release |
| UNKNOWN | NO_CONTRACT | HALT — contract missing; cannot verify integration |
| UNKNOWN | INCONCLUSIVE | HALT — ambiguous probe result; cannot verify integration |
| UNKNOWN | PROBE_UNAVAILABLE | CONTINUE with warning — tool not available |
| UNKNOWN | NOT_APPLICABLE | CONTINUE — surface not touched |
| PASS | — | CONTINUE |
There is no soft-warning path for FAIL or contract UNKNOWN. The pipeline stops.
--require-gate does NOT change this behaviour — it adds an opt-in Slack gate
after integration-sweep passes, before release begins.
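
The integration-sweep halt policy reduces to a small lookup. The sketch below encodes the documented status and reason pairs; failing closed on unlisted values is an assumption, and `gate_action` is a hypothetical name.

```python
HALT = "HALT"
CONTINUE = "CONTINUE"
CONTINUE_WITH_WARNING = "CONTINUE_WITH_WARNING"

# Actions for overall_status == "UNKNOWN", keyed by reason.
_UNKNOWN_ACTIONS = {
    "NO_CONTRACT": HALT,
    "INCONCLUSIVE": HALT,
    "PROBE_UNAVAILABLE": CONTINUE_WITH_WARNING,
    "NOT_APPLICABLE": CONTINUE,
}


def gate_action(overall_status, reason=None):
    """Map an integration-sweep result to a pipeline action."""
    if overall_status == "PASS":
        return CONTINUE
    if overall_status == "FAIL":
        return HALT  # any reason halts
    if overall_status == "UNKNOWN":
        # Assumption: an unlisted reason fails closed.
        return _UNKNOWN_ACTIONS.get(reason, HALT)
    # Assumption: an unrecognized status also fails closed.
    return HALT
```

Note that `--require-gate` does not appear in the function: per the text above, it adds a Slack gate after a pass and leaves this mapping unchanged.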
After infrastructure foundation tests, run golden path declarations from
plugins/onex/skills/_golden_path_validate/declarations/close_out_smoke.json
against real Kafka (KAFKA_BOOTSTRAP_SERVERS=localhost:19092).
This proves: event published → handler processes → output event appears. Infrastructure tests alone cannot prove this.
The close-out Kafka golden path is a transport-and-handler proof, not a complete downstream content proof by itself. It complements, not replaces, database/API/rendered-output verification. The declarations should cover at least three distinct pipeline paths: one classification path, one projection path, and one display-facing path.
Failure policy: WARN (not halt) for Phase 1 rollout. Promote to hard gate after 5 consecutive passing cycles. Owner: close-out skill maintainer. Review date: 2 weeks after first deployment.
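
The promotion rule can be checked mechanically from a cycle history. This is a minimal sketch that counts the trailing run of passes; whether promotion stays in place after a later failure is not stated in the source, so the function only answers whether the latest streak qualifies. The name `ready_to_promote` is hypothetical.

```python
def ready_to_promote(history, required_passes=5):
    """Return True when the most recent cycles form a long enough pass streak.

    history: list of booleans, oldest first (True = golden path passed).
    """
    streak = 0
    for passed in reversed(history):
        if not passed:
            break
        streak += 1
    return streak >= required_passes
```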
3 consecutive step failures (across Steps A0–D5) → stop immediately + Slack notify.
Halt authority vs circuit breaker:
Parallel failure counting: B1-B4 run concurrently. For circuit-breaker purposes, the entire parallel batch counts as one evaluation window, not four consecutive failures. Individual sweep failures are recorded for metrics, but the breaker evaluates "did the Phase B advisory batch fail" as a single event. This prevents a single noisy parallel batch from tripping the breaker on its own.
Advisory accumulation doctrine: Advisory sweeps may contribute to the circuit breaker only as evidence of broad workflow instability, not as substitutes for hard-gate authority. Breaker behavior should not allow one noisy advisory class to dominate release control unintentionally.
Failures are tracked per run. The circuit breaker does NOT persist across runs.
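
The counting rules above can be sketched as a small in-memory breaker. Two points are assumptions rather than documented behavior: any passing step resets the consecutive count, and a parallel batch counts as failed if any sweep in it failed. The class name is hypothetical.

```python
class CircuitBreaker:
    """Per-run breaker: trips after `threshold` consecutive failed evaluations."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def tripped(self):
        return self.consecutive_failures >= self.threshold

    def record(self, passed):
        """Record one sequential step; return True if the breaker is tripped."""
        if passed:
            self.consecutive_failures = 0  # assumption: any pass resets the count
        else:
            self.consecutive_failures += 1
        return self.tripped

    def record_batch(self, results):
        """Record a parallel batch (e.g. B1-B4) as a single evaluation window.

        Assumption: the batch fails if any sweep in it failed.
        """
        return self.record(all(results))
```

Because the object lives only for the duration of a run, nothing carries over to the next run, matching the per-run tracking described above.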
| Flag | Default | Description |
|---|---|---|
| --mode | build | build or close-out |
| --autonomous | true | No human gates in close-out sequence |
| --require-gate | false | Opt into Slack HIGH_RISK gate before release |
Each headless claude -p phase inherits authorization from cron-closeout.sh:
- ONEX_RUN_ID is set per run and passed to all claude -p invocations for audit trail correlation.
- ONEX_UNSAFE_ALLOW_EDITS=1 is set by the script for phases that need write access.
- --autonomous semantics: The cron script itself is the autonomous authority. Individual claude -p invocations do not need to re-request approval; the decision to run unattended was made at the cron/launchd level.
- Each phase gets an --allowedTools set matching its needs (e.g., read-only phases get Bash,Read; write phases get Bash,Read,Write,Edit,Glob,Grep).

NEVER dequeue a PR from the merge queue. If a PR is in the merge queue (mergeStateStatus: QUEUED):
- Do not run gh pr merge --disable-auto-merge.

Rationale: Dequeuing and re-enqueuing creates a second CI run. The concurrency group has cancel-in-progress: false, so both runs execute sequentially, wasting ~10 min per unnecessary dequeue.
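
The scoped allowlists and per-run environment can be pictured with the sketch below. It only assembles an argument list and environment; how cron-closeout.sh actually builds its command is not shown in the source, and both helper names are hypothetical.

```python
# Tool sets as given in the text above.
READ_ONLY_TOOLS = "Bash,Read"
WRITE_TOOLS = "Bash,Read,Write,Edit,Glob,Grep"


def build_phase_command(prompt, needs_write):
    """Assemble a headless claude -p argv with a scoped tool allowlist."""
    tools = WRITE_TOOLS if needs_write else READ_ONLY_TOOLS
    return ["claude", "-p", prompt, "--allowedTools", tools]


def build_phase_env(base_env, run_id, needs_write):
    """Copy the environment, adding the run ID and, for write phases, the edit flag."""
    env = dict(base_env)
    env["ONEX_RUN_ID"] = run_id
    if needs_write:
        env["ONEX_UNSAFE_ALLOW_EDITS"] = "1"
    return env
```

Both results could then be handed to `subprocess.run(cmd, env=env)`, one call per phase, so each phase starts with a fresh context window.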
Phase A — Prepare:
- scripts/prune-worktrees.sh --execute: auto-clean merged worktrees, skip worktrees with unpushed commits or dirty state, skip detached HEAD and missing upstream [OMN-6867, OMN-7021]

Phase B — Quality Gate:
Phase C — Ship:
Phase D — Verify:
npx claudepluginhub omninode-ai/omniclaude --plugin onex