From factory-ops
Debugs factory production issues by checking status, pulling logs, diagnosing root causes, and suggesting fixes. Auto-activates on phrases like 'debug factory' or 'factory down'.
Slash command: /factory-ops:factory-debug
Diagnose factory issues on production (roxabituwer). Runs from the local machine
(~/projects/roxabi-factory) — all production access is via make remote and SSH.
Production runs Podman Quadlet (rootless, systemd --user units).
Let:
H := DEPLOY_HOST (from ~/projects/roxabi-factory/.env)
units := {factory-hub, factory-telegram, factory-discord, factory-nats}
Σ := severity (🔴 down | 🟡 degraded | 🟢 healthy)
pat := known error patterns (see §Known Patterns)
## Known Patterns
| Pattern | Root Cause | Fix |
|---|---|---|
| `OperationalError: database is locked` | Hub + adapter race on same SQLite DB at startup | Stagger restarts or add busy_timeout |
| `suspiciously fast.*dead backend` | Claude CLI pool not responding | Restart factory-hub |
| `backend is dead — skipping guard` | Stale session with dead CLI process | Restart factory-hub |
| `dead_backend_hits > 0` in /health/detail | Backend silently failing (fast empty returns) | Restart factory-hub (counter resets on restart) |
| `start-limit-hit` / unit in failed | systemd gave up after 5 restarts in 60s | Inspect journal, fix root cause, systemctl --user reset-failed |
| `CancelledError` in starlette | Normal shutdown noise, not a root cause | Ignore unless paired with other errors |
| `Rate limit` / `429` | Anthropic API rate limit | Wait or check API key quota |
| `NATS.*connection` / `no servers available` | Hub ↔ adapter NATS transport broken or factory-nats.service down | Restart factory-nats first, then factory-hub, then adapters |
| `IsADirectoryError.*config.toml` | Quadlet bind-mount downgrade (see commit c9187fb) | Ensure inline Volume=%h/.roxabi/factory/config.toml:/app/config.toml:ro,z |
| `permission denied.*\.lyra` | UserNS mapping mismatch (ADR-054) | Verify UserNS=keep-id:uid=1500,gid=1500 in container unit |
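The pattern table can be applied mechanically to a single log line. The following is a minimal sketch, not part of the skill itself: `suggest_fix` is a hypothetical helper name, the regexes are simplified versions of the table rows (the `/health/detail` row is omitted because it is not a log pattern), and the `429` match is deliberately loose.

```shell
# suggest_fix LINE
# Hypothetical helper: prints the suggested fix for the first known pattern
# that matches LINE and returns 0; returns 1 if nothing matches.
# Each heredoc row is "<extended regex>;<fix>", adapted from the table above.
suggest_fix() {
    line=$1
    while IFS=';' read -r pattern fix; do
        if printf '%s\n' "$line" | grep -Eq -- "$pattern"; then
            printf '%s\n' "$fix"
            return 0
        fi
    done <<'EOF'
OperationalError: database is locked;Stagger restarts or add busy_timeout
suspiciously fast.*dead backend;Restart factory-hub
backend is dead.*skipping guard;Restart factory-hub
start-limit-hit;Inspect journal, fix root cause, systemctl --user reset-failed
CancelledError;Ignore unless paired with other errors
Rate limit|429;Wait or check API key quota
NATS.*connection|no servers available;Restart factory-nats first, then factory-hub, then adapters
IsADirectoryError.*config.toml;Ensure inline Volume=%h/.roxabi/factory/config.toml:/app/config.toml:ro,z
permission denied.*\.lyra;Verify UserNS=keep-id:uid=1500,gid=1500 in container unit
EOF
    return 1
}
```

Rows are checked top to bottom, so a line that matches several patterns reports only the first. A caller could feed it journal output one line at a time.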
cd ~/projects/roxabi-factory && make remote status
Also inspect containers + nats directly:
ssh $H "podman ps -a --format '{{.Names}}\t{{.Status}}\t{{.Image}}' | grep -E 'factory-|nats'"
ssh $H "systemctl --user status factory-hub factory-telegram factory-discord factory-nats --no-pager"
∀ unit ∈ units: record state (active/running + uptime | failed | inactive). All active → Σ := 🟢; ∃ failed → Σ := 🔴; else 🟡.
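The severity rule above can be sketched as a small shell function. This is an illustration only: `classify_severity` is a hypothetical name, and it assumes each unit state has already been reduced to a single word such as `active`, `failed`, or `inactive`.

```shell
# classify_severity STATE...
# Hypothetical helper implementing the rule above:
#   any "failed" state       -> 🔴
#   every state is "active"  -> 🟢
#   anything else            -> 🟡
classify_severity() {
    all_active=1
    for state in "$@"; do
        case "$state" in
            failed) echo "🔴"; return 0 ;;
            active) ;;
            *) all_active=0 ;;
        esac
    done
    if [ "$all_active" -eq 1 ]; then
        echo "🟢"
    else
        echo "🟡"
    fi
}
```

One way to feed it, assuming `systemctl --user is-active` prints one state per unit, is `classify_severity $(ssh $H "systemctl --user is-active factory-hub factory-telegram factory-discord factory-nats")`.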
Health is published on host loopback at 127.0.0.1:8443 (see deploy/quadlet/factory-hub.container PublishPort):
ssh $H "curl -s -H 'Authorization: Bearer $(cat ~/.roxabi/factory/secrets/health_secret)' http://localhost:8443/health/detail"
Parse JSON. Key fields:
- dead_backend_hits > 0 → 🔴 backend silently failing
- queue_size > 10 → 🟡 queue backing up
- circuits with non-closed state → 🔴 circuit breaker tripped
- reaper_alive = false → 🟡 CLI pool reaper dead
- reaper_last_sweep_age > 120 → 🟡 reaper stalled

If curl fails: the hub container is either not running, crash-looping before bind, or FACTORY_HEALTH_HOST wasn't set to 0.0.0.0 inside the container (see commits c3e3d03/b2fc3bc).
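The key-field thresholds can be folded into one severity value. The sketch below assumes the five values have already been extracted from the `/health/detail` JSON (for example with `jq`) and are passed as plain arguments. `health_severity` and its argument order are hypothetical, and the exact JSON layout of `circuits` is not specified here, so the function takes a count of non-closed circuits instead.

```shell
# health_severity DEAD_BACKEND_HITS QUEUE_SIZE NON_CLOSED_CIRCUITS REAPER_ALIVE SWEEP_AGE
# Hypothetical helper: applies the thresholds listed above.
# A 🔴 condition wins over any 🟡 condition.
health_severity() {
    dead_hits=$1
    queue_size=$2
    open_circuits=$3
    reaper_alive=$4
    # Drop any fractional part so the integer test works (300.4 -> 300).
    sweep_age=${5%%.*}
    sweep_age=${sweep_age:-0}

    if [ "$dead_hits" -gt 0 ] || [ "$open_circuits" -gt 0 ]; then
        echo "🔴"
    elif [ "$queue_size" -gt 10 ] || [ "$reaper_alive" = "false" ] || [ "$sweep_age" -gt 120 ]; then
        echo "🟡"
    else
        echo "🟢"
    fi
}
```

Because the age is truncated, a value such as 120.5 is treated as 120 and does not trip the stalled-reaper check. Use a floating-point comparison if that boundary matters.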
Quadlet logs go to the user journal (systemd captures stdout/stderr). Pull the last 200 lines per unit (100 for factory-nats), running the commands in parallel:
ssh $H "journalctl --user -u factory-hub -n 200 --no-pager"
ssh $H "journalctl --user -u factory-hub -n 200 -p err --no-pager"
ssh $H "journalctl --user -u factory-telegram -n 200 --no-pager"
ssh $H "journalctl --user -u factory-discord -n 200 --no-pager"
ssh $H "journalctl --user -u factory-nats -n 100 --no-pager"
Equivalent via Makefile (foreground tail): make remote hub logs / telegram logs / discord logs / hub errors.
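The per-unit journal pulls can be run concurrently with background jobs. This is a rough sketch under stated assumptions: `journal_cmd` and `pull_logs` are hypothetical helper names, and the output directory layout is invented for illustration. Only the `journalctl` flags come from the commands above.

```shell
# journal_cmd UNIT [LINES] [PRIORITY]
# Hypothetical helper: prints the remote journalctl command for one unit.
journal_cmd() {
    unit=$1
    lines=${2:-200}
    prio=${3:-}
    printf 'journalctl --user -u %s -n %s' "$unit" "$lines"
    if [ -n "$prio" ]; then
        printf ' -p %s' "$prio"
    fi
    printf ' --no-pager\n'
}

# pull_logs HOST OUTDIR
# Hypothetical helper: starts every pull in the background, then waits for
# all of them, writing one file per pull into OUTDIR.
pull_logs() {
    host=$1
    outdir=$2
    mkdir -p "$outdir"
    for u in factory-hub factory-telegram factory-discord; do
        ssh "$host" "$(journal_cmd "$u" 200)" > "$outdir/$u.log" 2>&1 &
    done
    ssh "$host" "$(journal_cmd factory-nats 100)" > "$outdir/factory-nats.log" 2>&1 &
    ssh "$host" "$(journal_cmd factory-hub 200 err)" > "$outdir/factory-hub.err.log" 2>&1 &
    wait
}
```

A call such as `pull_logs "$H" /tmp/factory-logs` would leave five files to inspect. Writing to files rather than the terminal keeps the parallel outputs from interleaving.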
In-container structured logs (if the hub writes files to the logs volume):
ssh $H "podman exec factory-hub ls -t /home/factory/.local/state/factory/logs/ | head -10"
ssh $H "podman exec factory-hub tail -200 /home/factory/.local/state/factory/logs/<file>"
If a unit is failed, check systemctl --user status <unit> for exit code + recent invocations.

Present diagnosis as:
## Diagnosis
**Severity:** {Σ}
**Root cause:** {one-line summary}
**Causal chain:**
1. {first event + timestamp}
2. {cascade effect}
3. {current state}
**Evidence:**
- {log line 1}
- {log line 2}
**Health endpoint:**
- dead_backend_hits: {N}
- circuits: {state}
**Affected:** {which units / threads / users}
Present fix options via DP(A) (load ${CLAUDE_PLUGIN_ROOT}/../shared/references/decision-presentation.md). Common fixes:
| Fix | Command | When |
|---|---|---|
| Restart hub only | make remote hub reload | Dead backend, stale CLI pool |
| Restart all factory | make remote reload | DB locked, NATS broken |
| Restart specific adapter | make remote discord reload / make remote telegram reload | Single adapter failed |
| Restart NATS | ssh $H "systemctl --user restart factory-nats" | NATS connection errors |
| Clear failed state | ssh $H "systemctl --user reset-failed factory-hub" | Unit stuck in failed after start-limit-hit |
| Check DB locks | ssh $H "podman exec factory-hub fuser /home/factory/.roxabi/factory/*.db" | Persistent DB locked errors |
| Reinstall Quadlet units | make quadlet-install then ssh $H "systemctl --user daemon-reload" | Unit file drift |
| Full deploy | make deploy | Code fix needed on production |
| Rebuild + push image | make build && make push && make remote reload | Image-level fix needed |
After the user picks a fix, execute it and re-run Phase 1 (status check) and Phase 2 (health endpoint) to confirm recovery.
Verify dead_backend_hits is 0 after restart.
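The recovery check can be expressed as a single pass/fail test. The sketch below is illustrative only: `is_recovered` is a hypothetical name, and it assumes the caller has already collected the `dead_backend_hits` value and one state word per unit.

```shell
# is_recovered DEAD_BACKEND_HITS STATE...
# Hypothetical helper: succeeds only if dead_backend_hits is 0 and every
# unit state passed in is "active". Fails if no states are given.
is_recovered() {
    dead_hits=$1
    shift
    [ "$dead_hits" -eq 0 ] || return 1
    [ "$#" -gt 0 ] || return 1
    for state in "$@"; do
        [ "$state" = "active" ] || return 1
    done
    return 0
}
```

For example, `is_recovered 0 active active active active && echo recovered` prints `recovered`, while any failed unit or a non-zero counter makes the check fail.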
If the same pattern was seen before (check conversation context or memory), flag it as recurring and suggest a code-level fix:
- database is locked: add a busy_timeout pragma or serialize startup
- Check the last deployed commits (ssh $H "cd $DEPLOY_DIR && git log --oneline -5") and last image (ssh $H "podman images localhost/factory --format '{{.Created}}\t{{.ID}}'")
- start-limit-hit: adjust StartLimitIntervalSec / StartLimitBurst in the .container unit, or fix the underlying crash

$ARGUMENTS