From workbuddy
Monitor the workbuddy agent pipeline, detect stuck issues, zombie processes, and missed redispatches. Works for both serve mode and distributed (coordinator+worker) mode. Use when the user says 'monitor the pipeline', 'check workbuddy status', '监工', '看看跑完了没', or asks to watch issue progress.
How this skill is triggered — by the user, by Claude, or both
Slash command
/workbuddy:pipeline-monitorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Interactive skill that watches the workbuddy pipeline, diagnoses common
Interactive skill that watches the workbuddy pipeline, diagnoses common failure modes, and applies safe fixes. Works for both serve mode (single process) and distributed mode (coordinator + workers).
Before monitoring, determine which mode workbuddy is running in:
# Check for coordinator process
ps aux | grep "workbuddy coordinator" | grep -v grep
# Check for serve process
ps aux | grep "workbuddy serve" | grep -v grep
# Check for worker processes
ps aux | grep "workbuddy worker" | grep -v grep
workbuddy serve process → use CLI tools + serve logworkbuddy coordinator + one or more workbuddy worker → use HTTP API + DB queries + process monitoringGitHub issue labels — the ground truth of pipeline state
gh issue view N -R Owner/Repo --json labels,state,comments \
--jq '{labels: [.labels[].name], state: .state, comments: (.comments | length)}'
Agent processes
ps aux | grep -E "codex.*exec|claude.*exec" | grep -v grep
Session logs — what the agent is actually doing
# Find session directories
ls -lt .workbuddy/sessions/
# Check latest activity in a session
tail -3 .workbuddy/sessions/session-<ID>/codex-exec.jsonl | \
python3 -c "import sys,json; [print(json.loads(l).get('item',{}).get('command','')[:100] or json.loads(l).get('item',{}).get('type','')) for l in sys.stdin]"
Task queue (CLI)
./workbuddy status --tasks
Recent events (CLI)
./workbuddy status --events --since 10m
Automated diagnosis (CLI)
./workbuddy diagnose
./workbuddy diagnose --fix # auto-apply safe fixes
Coordinator health
curl -s http://coordinator:8081/health
# Returns: {"repos":N,"status":"ok"}
Registered repos
curl -s -H "Authorization: Bearer $TOKEN" http://coordinator:8081/api/v1/repos | python3 -m json.tool
Task queue via DB — most reliable way to see task state
sqlite3 .workbuddy/workbuddy.db \
"SELECT agent_name, status, worker_id, lease_expires_at, created_at
FROM task_queue WHERE repo='Owner/Repo' ORDER BY created_at DESC LIMIT 10;"
Events via DB — the most complete audit trail
sqlite3 .workbuddy/workbuddy.db \
"SELECT type, issue_num, substr(payload,1,80), ts
FROM events WHERE repo='Owner/Repo'
ORDER BY ts DESC LIMIT 15;"
Key event types: state_entry, dispatch, completed, worker_registered,
dispatch_blocked_by_dependency, poll_cycle_done
Coordinator log — check for state transitions and dispatch events
# If running in background, check the output file
cat /path/to/coordinator-output.log
# Key patterns to look for:
# [statemachine] state entry detected
# [statemachine] clearing prior inflight group
# [coordinator] task heartbeat DB update failed
Worker log — check for task pickup and agent launch
cat /path/to/worker-output.log
# Key patterns:
# [codex-debug] args=... env=... prompt=... (agent started)
# [worker] received signal terminated (clean shutdown)
workbuddy labelstatus:developing but no task in queue.gh issue edit N -R Owner/Repo --add-label workbuddy
running with expired lease (distributed mode)status=running, lease_expires_at in the past, no codex process.sqlite3 .workbuddy/workbuddy.db \
"SELECT id, agent_name, status, lease_expires_at FROM task_queue
WHERE status='running' AND lease_expires_at < datetime('now');"
sqlite3 .workbuddy/workbuddy.db \
"UPDATE task_queue SET status='completed', completed_at=CURRENT_TIMESTAMP
WHERE id='<task-id>';"
# Then restart the worker process
status=pending task but worker is idle.running task with earlier created_at blocks the claim query.sqlite3 .workbuddy/workbuddy.db \
"SELECT id, agent_name, status, created_at FROM task_queue
WHERE repo='Owner/Repo' AND status IN ('pending','running')
ORDER BY created_at;"
status:blocked.## Acceptance Criteria section, then:
gh issue edit N -R Owner/Repo --remove-label 'status:blocked' --add-label 'status:developing'
gh issue edit N -R Owner/Repo --remove-label 'status:developing' --add-label 'status:reviewing'
reviewing → developing../workbuddy cache-invalidate --repo Owner/Repo --issue N
# Check JSONL staleness
for f in $(find .workbuddy -name "codex-exec.jsonl" -newer /tmp/workbuddy-worker*.log); do
age=$(( $(date +%s) - $(stat -c '%Y' "$f") ))
if [ $age -gt 600 ]; then echo "STALE ($age s): $f"; fi
done
# Confirm no child processes
pstree -p <codex-pid> # only threads = stuck
kill -9), clean up stale running tasks in DB, start new worker.sqlite3 .workbuddy/workbuddy.db \
"INSERT INTO task_queue (id, repo, issue_num, agent_name, role, runtime, status, created_at, updated_at)
VALUES ('retry-N-$(date +%s)', 'Owner/Repo', N, 'dev-agent', 'dev', 'codex', 'pending', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);"
go build -o workbuddy . — new workers auto-create worktrees at .workbuddy/worktrees/issue-N/.# Poll issue state every 60s
watch -n 60 "gh issue view N -R Owner/Repo --json labels --jq '[.labels[].name]'"
# Track codex activity
watch -n 10 "wc -l .workbuddy/sessions/session-*/codex-exec.jsonl 2>/dev/null"
until echo "$(gh issue view N -R Owner/Repo --json labels --jq '[.labels[].name]')" | grep -q "status:done"; do
sleep 30
done
echo "ISSUE_COMPLETED"
watch -n 15 "sqlite3 .workbuddy/workbuddy.db \
'SELECT agent_name, status, worker_id FROM task_queue
WHERE repo=\"Owner/Repo\" ORDER BY created_at DESC LIMIT 5;'"
Detect mode (serve vs distributed)
│
├─ Check GitHub labels for target issues
│ ├─ All status:done → Report completion. Stop.
│ ├─ status:developing/reviewing with active agent process → Normal. Monitor.
│ └─ Label unchanged for >10 min with no agent process → Investigate below
│
├─ Check task queue (CLI or DB)
│ ├─ Task running + codex process alive → Normal. Wait.
│ ├─ Task running + no codex process → Result submission failed (fix B)
│ ├─ Task pending + worker alive → Worker may be re-claiming stale task (fix C)
│ └─ No pending tasks + labels unchanged → Cache stale or missing workbuddy label
│
└─ Check logs (coordinator/worker/serve)
├─ "state entry detected" → Dispatch happened, check worker
├─ "heartbeat DB update failed" → Task already completed/released
└─ No recent log entries → Poller or worker may be stuck
## Pipeline Status Report
| Issue | Repo | Status | Agent | Retry | Notes |
|-------|------|--------|-------|-------|-------|
| #37 | AegisLab | reviewing | review-agent | 2/3 | In progress |
| #88 | workbuddy | developing | dev-agent | 0/3 | Started |
**Mode:** distributed (coordinator:8081 + 2 workers)
**Repos:** Lincyaw/workbuddy, OperationsPAI/AegisLab
**Agent processes:** 1 (codex)
**Interventions:** Fixed stale task for #37 dev-agent (bug #88)
Optional: <owner/repo> <issue-numbers...>
Example: /pipeline-monitor OperationsPAI/AegisLab 37 42
If omitted, monitor all repos and all open workbuddy-labeled issues.
Provides behavioral guidelines to reduce common LLM coding mistakes, focusing on simplicity, surgical changes, assumption surfacing, and verifiable success criteria.
Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.
npx claudepluginhub lincyaw/workbuddy --plugin workbuddy