From codex-collaboration
Sequential drive/validate/act collaboration between Claude and Codex. Claude analyzes, Codex validates each finding, and both models must agree before any action is taken. Use when you want iterative improvement with bilateral consensus.
Slash command
/codex-collaboration:collaborative-loop
opus
Sequential pair-programming loop between Claude and Codex CLI. Claude PRODUCES an analysis with numbered findings, Codex VALIDATES each finding individually (CONFIRM/REJECT), Claude RE-EVALUATES the validation decisions, and only findings where both models agree proceed to implementation. After fixes, Codex reviews the changes and the loop repeats until clean.
Core principle: Claude never acts on its own unvalidated output. Every finding passes through bilateral consensus before implementation.
"collaborate with codex", "have codex review my changes", "drive and review loop", "iterative improvement", "produce-validate-act", "collaborative loop"
collaborative-loop [--max-rounds N] [--type code|plan|architecture|design] [target files...]
| Argument | Default | Description |
|---|---|---|
| --max-rounds | 3 | Maximum fix-review iteration rounds |
| --type | (auto-detected) | Artifact type override |
| target files | (branch diff) | Specific files to analyze |
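The argument table above can be sketched as a small parser. This is only an illustration of the documented defaults, not the skill's actual argument handling; the function name `parse_loop_args` and the attribute name `artifact_type` are hypothetical.

```python
import argparse

# Artifact types listed in the usage line above.
ARTIFACT_TYPES = ("code", "plan", "architecture", "design")


def parse_loop_args(argv):
    """Parse collaborative-loop arguments (hypothetical helper, for illustration)."""
    parser = argparse.ArgumentParser(prog="collaborative-loop")
    # Default of 3 rounds comes from the table above.
    parser.add_argument("--max-rounds", type=int, default=3)
    # None stands for "auto-detect"; storing it as artifact_type is an assumption.
    parser.add_argument("--type", choices=ARTIFACT_TYPES, default=None,
                        dest="artifact_type")
    # An empty list stands for "use the branch diff".
    parser.add_argument("target_files", nargs="*")
    return parser.parse_args(argv)


args = parse_loop_args(["--max-rounds", "5", "--type", "plan", "docs/rollout-plan.md"])
print(args.max_rounds, args.artifact_type, args.target_files)
```

With no arguments, the sketch yields 3 rounds, no type override, and an empty file list, which corresponds to the defaults in the table.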
Read ${CLAUDE_PLUGIN_ROOT}/skills/shared/prerequisites.md and execute the preflight check.
Invoke /codex:setup via the Skill tool. Two failure modes:
- /codex:setup and ABORT.
On success, proceed to Step 2.
After /codex:setup succeeds, verify the Codex broker can actually complete work (not just accept connections). Run a trivial Codex task:
node "${CLAUDE_PLUGIN_ROOT}/../codex/scripts/codex-companion.mjs" task --fresh "Reply with OK"
If this hangs for >60 seconds or returns empty output, the broker's app-server subprocess is likely dead while the broker process stays alive on the named pipe. Recovery:
- cat <session-dir>/broker.pid (or check the broker.json state file)
- taskkill /PID <pid> /T /F (Windows) or kill <pid> (Unix)
- /codex:setup -- the companion will spawn a fresh broker with a live app-server
Only proceed to Step 2 after the liveness check passes.
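The liveness rule above (fail on a hang longer than 60 seconds or on empty output) could be sketched roughly as follows. The helper name `broker_is_live` is hypothetical, and the example uses a stand-in Python command in place of the real companion script so that it runs anywhere.

```python
import subprocess
import sys


def broker_is_live(command, timeout_seconds=60):
    """Return True if the command finishes in time and prints something.

    Hypothetical helper: mirrors the two failure signals named in the text
    (hang > 60 seconds, empty output) and nothing else.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False  # treated as a hang
    return bool(result.stdout.strip())  # empty output counts as a dead broker


# Stand-in for the real check; in the workflow the command would be the
# codex-companion.mjs "task --fresh" invocation shown above.
print(broker_is_live([sys.executable, "-c", "print('OK')"]))
```

The 60-second limit applies only to this trivial task. Real Codex tasks are monitored by log staleness rather than a hard timeout.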
The collaborative loop requires BOTH collaborators. If Codex fails at any point during the workflow (script error, empty output):
- /codex:setup, check auth, verify CLI installation)
Codex tasks can legitimately run for 20+ minutes on complex analyses. Do NOT use a hard timeout. Instead, detect hangs via log staleness:
- /codex:rescue, note the task's log file path from the job record
- starting or has not advanced, the task is stalled
- verifying, reading files), the task is healthy -- keep waiting
When a stall is detected:
- /codex:cancel or kill its PID
Codex hung twice consecutively. The broker's app-server subprocess may be dead.
Remediation: kill the broker process and re-run /codex:setup.
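A minimal sketch of the stale-log decision, assuming the caller already knows the log file's last-modified time and whether the task's phase has advanced. How the phase is read from the log is not specified here, so `phase_advanced` is a plain boolean and `classify_task` is a hypothetical name. The 2-minute and 5-minute figures are the ones this skill states.

```python
CHECK_INTERVAL_SECONDS = 2 * 60  # check the log every 2 minutes
STALE_AFTER_SECONDS = 5 * 60     # act only if the log is stale for 5 minutes


def classify_task(log_mtime, now, phase_advanced):
    """Return 'stalled' or 'healthy' for a running Codex task.

    log_mtime and now are timestamps in seconds (e.g. from os.path.getmtime
    and time.time). phase_advanced is supplied by the caller; deriving it
    from the log is outside this sketch.
    """
    log_is_stale = (now - log_mtime) >= STALE_AFTER_SECONDS
    if log_is_stale and not phase_advanced:
        return "stalled"
    return "healthy"


# A log untouched for 6 minutes with no phase progress is treated as stalled.
print(classify_task(log_mtime=0, now=6 * 60, phase_advanced=False))
```

Because the decision depends only on staleness and phase progress, a task that keeps writing to its log is never killed, however long it runs.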
Read ${CLAUDE_PLUGIN_ROOT}/skills/shared/artifact-detection.md and follow the detection procedure.
- --max-rounds N (default: 3)
- --type code|plan|architecture|design (default: auto-detect from files)
If no target files provided:
- main, then master, then git remote show origin default
- git diff <base>...HEAD --name-only to find changed files
Follow the rules in artifact-detection.md:
| Signal | Type |
|---|---|
| Source code extensions (.ts, .py, .js, .go, .rs, .cs, .java, etc.) | code |
| *-plan*, *-tasks*, *implementation-plan* | plan |
| *-architecture*, *-spec* | architecture |
| Other .md in docs/ or plans/ | design |
| Mixed | default to code |
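The signal table above can be approximated with filename matching. This is a sketch only: the authoritative rules live in artifact-detection.md, and the helper names, the precedence order of the checks, the treatment of "docs/" or "plans/" as any parent directory, and the fallback for an empty or unrecognized file list are all assumptions.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Only the extensions named in the table; the "etc." is not expanded here.
CODE_EXTENSIONS = {".ts", ".py", ".js", ".go", ".rs", ".cs", ".java"}
PLAN_PATTERNS = ("*-plan*", "*-tasks*", "*implementation-plan*")
ARCHITECTURE_PATTERNS = ("*-architecture*", "*-spec*")


def classify_file(path):
    """Classify one path, or return None if no signal matches (hypothetical helper)."""
    p = PurePosixPath(path)
    if p.suffix in CODE_EXTENSIONS:
        return "code"
    if any(fnmatchcase(p.name, pattern) for pattern in PLAN_PATTERNS):
        return "plan"
    if any(fnmatchcase(p.name, pattern) for pattern in ARCHITECTURE_PATTERNS):
        return "architecture"
    # Assumption: "in docs/ or plans/" means any parent directory with that name.
    if p.suffix == ".md" and any(part in ("docs", "plans") for part in p.parts[:-1]):
        return "design"
    return None


def detect_artifact_type(paths):
    """Return the single detected type, or 'code' when the signals are mixed."""
    kinds = {classify_file(path) for path in paths} - {None}
    if len(kinds) == 1:
        return kinds.pop()
    return "code"  # mixed defaults to code; also used here when nothing matches


print(detect_artifact_type(["docs/rollout-plan.md"]))
```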
ROUND = 0
MAX_ROUNDS = <from args or 3>
ARTIFACT_TYPE = <detected>
TARGET_FILES = <resolved list>
BASE_BRANCH = <detected>
Claude analyzes target files and produces numbered findings with severity. No implementation yet -- analysis only. Do not modify any files.
For code artifacts, search available skills for the best match:
- coder, code-review, feature-dev
- unity-coder, python-coder) then general-purpose
For plan, architecture, and design artifacts, perform structured analysis against focus areas from ${CLAUDE_PLUGIN_ROOT}/skills/shared/review-domains.md. Read the file and use the focus areas for the detected artifact type.
Produce numbered findings, globally ordered by severity:
[1] [critical] [security] src/auth/login.ts:42 -- SQL injection via unsanitized input
Suggested fix: Use parameterized query instead of string concatenation
[2] [high] [correctness] src/api/handler.ts:87 -- Missing null check on response.data
Suggested fix: Add guard clause before accessing .data property
[3] [medium] [performance] lib/cache.ts:15 -- Cache has no TTL, grows unbounded
Suggested fix: Add maxAge option to cache constructor
Every finding must:
- [N]
- critical, high, medium, or minor
- file:line
Hold findings in conversation context. Do not write intermediate files.
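The finding format above is regular enough to check mechanically. The following sketch parses one finding header line; the `Finding` dataclass and `parse_finding` are hypothetical names, and the regular expression is inferred from the three sample findings rather than from a published grammar.

```python
import re
from dataclasses import dataclass

# Inferred from the samples: [N] [severity] [category] file:line -- description
FINDING_RE = re.compile(
    r"^\[(?P<number>\d+)\] "
    r"\[(?P<severity>critical|high|medium|minor)\] "
    r"\[(?P<category>[^\]]+)\] "
    r"(?P<file>\S+):(?P<line>\d+) -- (?P<description>.+)$"
)


@dataclass
class Finding:
    number: int
    severity: str
    category: str
    file: str
    line: int
    description: str


def parse_finding(text):
    """Parse one finding header line; return None if it does not match."""
    match = FINDING_RE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return Finding(
        number=int(fields["number"]),
        severity=fields["severity"],
        category=fields["category"],
        file=fields["file"],
        line=int(fields["line"]),
        description=fields["description"],
    )


sample = "[1] [critical] [security] src/auth/login.ts:42 -- SQL injection via unsanitized input"
finding = parse_finding(sample)
print(finding.severity, finding.file, finding.line)
```

A line with an unknown severity or a missing file:line returns None, which is one way to catch malformed findings before they are sent for validation.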
Send Claude's numbered findings to Codex for per-finding validation. Codex independently evaluates each finding and returns CONFIRM or REJECT with evidence.
IMPORTANT: Use /codex:rescue --fresh via the Skill tool -- NOT /codex:adversarial-review. Adversarial review produces its own independent findings and does not map back per-finding. Only /codex:rescue with a custom prompt supports per-finding CONFIRM/REJECT. Always pass --fresh to prevent the codex plugin from prompting the user about resuming a previous thread.
Monitor the task using the Hang Detection (Stale-Log Monitor) procedure from Step 1. Do not poll with rapid bash commands -- check every 2 minutes, and only act if the log is stale for 5 minutes.
Invoke the gpt-5-4-prompting skill patterns when composing the prompt. Structure the prompt with XML-tagged blocks:
<task> block:
Validate each of Claude's findings independently. For each finding, determine
whether it is a genuine issue (CONFIRM) or a false positive (REJECT). You must
provide evidence from the actual code or specification for each decision.
Also report any issues you find that Claude missed entirely.
Include in the prompt:
- ${CLAUDE_PLUGIN_ROOT}/skills/shared/review-domains.md for the detected artifact type
- ${CLAUDE_PLUGIN_ROOT}/skills/shared/validation-format.md as <structured_output_contract>
Codex returns per validation-format.md:
## Confirmed Findings
- [1] CONFIRM -- evidence from code/spec
- [3] CONFIRM -- evidence from code/spec
## Rejected Findings
- [2] REJECT -- evidence why this is a false positive
## New Findings (Missed by Driver)
- [high] [error-handling] src/api/handler.ts:91 -- Unhandled promise rejection
Suggested fix: Add try/catch wrapper
## Status
VALIDATED | PARTIALLY_VALIDATED | REJECTED
## Summary
Brief assessment (confirmed X, rejected Y, found Z new)
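As a rough illustration of how the per-finding decisions in this format could be extracted, the sketch below collects the CONFIRM/REJECT lines into a dictionary keyed by finding number. The line pattern is inferred from the sample output above, not from validation-format.md itself, and `parse_decisions` is a hypothetical helper.

```python
import re

# Inferred from the sample: "- [N] CONFIRM -- evidence" / "- [N] REJECT -- evidence"
DECISION_RE = re.compile(r"^- \[(\d+)\] (CONFIRM|REJECT) -- (.+)$")


def parse_decisions(response):
    """Map finding number -> (decision, evidence) for every decision line."""
    decisions = {}
    for line in response.splitlines():
        match = DECISION_RE.match(line.strip())
        if match:
            decisions[int(match.group(1))] = (match.group(2), match.group(3))
    return decisions


sample_response = """## Confirmed Findings
- [1] CONFIRM -- evidence from code/spec
- [3] CONFIRM -- evidence from code/spec
## Rejected Findings
- [2] REJECT -- evidence why this is a false positive
"""
decisions = parse_decisions(sample_response)
print(sorted(decisions))
```

New findings reported by Codex carry no number, so they do not match this pattern and would need separate handling.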
Claude reviews Codex's CONFIRM/REJECT decisions against the original analysis. This is the bilateral consensus gate.
| Codex Says | Claude Agrees? | Action |
|---|---|---|
| CONFIRMED | Yes | Proceed -- act on this finding in Step 5 |
| CONFIRMED | No | Flag for user -- present disagreement, ask for mediation |
| REJECTED | Yes | Drop -- both agree this is not a real issue |
| REJECTED | No | Flag for user -- present disagreement, ask for mediation |
For each CONFIRMED finding: Claude reviews whether the confirmation is sound. If Claude still agrees the finding is real, mark as proceed. If Claude now thinks Codex's confirmation was based on a misunderstanding, mark as flag.
For each REJECTED finding: Claude reviews whether the rejection is justified. If Claude agrees the finding was a false positive, mark as drop. If Claude still believes the finding is valid despite Codex's rejection, mark as flag.
For new findings from Codex: Claude evaluates each. If Claude agrees it is a real issue, mark as proceed. If Claude disagrees, mark as flag.
Flagged disagreements: Present ALL flagged items to the user with both models' reasoning. Wait for user mediation before continuing. The user may:
Only findings marked proceed (including user-mediated ones) advance to Step 5.
Never silently resolve disagreements. The entire point of bilateral consensus is that ambiguous cases get human judgment.
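The decision matrix reduces to a two-input function. The sketch below is one way to express it; `consensus_action` is a hypothetical name, and it uses the CONFIRM/REJECT spellings from the validation output.

```python
def consensus_action(codex_decision, claude_agrees):
    """Apply the bilateral consensus gate to one finding.

    Returns 'proceed', 'drop', or 'flag' (flag means: ask the user to mediate).
    """
    if codex_decision not in ("CONFIRM", "REJECT"):
        raise ValueError(f"unexpected decision: {codex_decision!r}")
    if not claude_agrees:
        return "flag"  # any disagreement goes to the user
    return "proceed" if codex_decision == "CONFIRM" else "drop"


for decision in ("CONFIRM", "REJECT"):
    for agrees in (True, False):
        print(decision, agrees, consensus_action(decision, agrees))
```

Note that no branch resolves a disagreement automatically: both disagreement rows of the matrix map to the same flag outcome.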
Filter to confirmed-and-agreed findings only. Claude implements the fixes.
- run_in_background: true) when parallelism adds value -- specifically when fixes are independent and span different modules/files
Log dropped findings for the summary but do not act on them:
Dropped (both models agreed not real):
- [2] [high] [correctness] src/api/handler.ts:87 -- reason for drop
After Claude implements fixes, Codex reviews the resulting changes.
Invoke /codex:review --base <ref> via the Skill tool, where <ref> is the commit or ref before Step 5's changes. This uses the codex plugin's native review capability.
Invoke /codex:rescue --fresh via the Skill tool with a review prompt. Compose the prompt using gpt-5-4-prompting patterns:
- <task>: Review the changes made to these artifacts. Evaluate whether the fixes correctly address the validated findings without introducing new issues.
- ${CLAUDE_PLUGIN_ROOT}/skills/shared/verdict-format.md as <structured_output_contract>
Codex returns per verdict-format.md:
## Status
APPROVED | MINOR_ISSUES | CHANGES_REQUESTED
## Findings
- [severity] [category] file:line -- description
Fix: concrete suggested fix
## Summary
Brief overall assessment
Parse the verdict status from Codex's response.
APPROVED -- All issues resolved. Present final summary and stop.
MINOR_ISSUES -- Only minor/informational findings remain. Log them in the summary and stop.
CHANGES_REQUESTED -- Actionable findings remain. Continue to next round:
- ROUND
Track findings across rounds. If more than 50% of findings persist (same file:line, same category) across 2 consecutive rounds:
Stall detected: X of Y findings persisted across rounds N and N+1.
These findings may require architectural changes or manual intervention:
- [finding details]
- [finding details]
Please advise: continue with modified approach, or stop here?
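The persistence rule can be sketched with set intersection. Findings are keyed by (file, line, category) as the rule states. Measuring the 50% against the earlier round's findings is an assumption, since the denominator is not spelled out, and the function names are hypothetical.

```python
def persisted_findings(previous_round, current_round):
    """Return the (file, line, category) keys present in both rounds."""
    return set(previous_round) & set(current_round)


def round_is_stalled(previous_round, current_round):
    """True if more than 50% of the previous round's findings persist.

    Assumption: the percentage is taken relative to the earlier round.
    """
    previous = set(previous_round)
    if not previous:
        return False
    persisted = persisted_findings(previous, current_round)
    return len(persisted) / len(previous) > 0.5


round_1 = {("src/api/handler.ts", 87, "correctness"),
           ("lib/cache.ts", 15, "performance"),
           ("src/auth/login.ts", 42, "security")}
round_2 = {("src/api/handler.ts", 87, "correctness"),
           ("lib/cache.ts", 15, "performance")}
print(round_is_stalled(round_1, round_2))
```

Here two of the three findings persist, which exceeds the threshold. Exactly half persisting would not, because the rule says "more than 50%".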
When ROUND >= MAX_ROUNDS:
Max rounds (N) reached. Remaining issues:
- [finding details]
- [finding details]
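Taken together, the verdict statuses and the round limit define a simple exit rule, sketched below. It assumes ROUND has already been incremented after a CHANGES_REQUESTED verdict before it is compared with MAX_ROUNDS; the name `next_step` and its return strings are hypothetical.

```python
def next_step(status, completed_rounds, max_rounds):
    """Decide what the loop does after a Codex verdict.

    completed_rounds is the round counter after incrementing (an assumption
    about when the increment happens).
    """
    if status in ("APPROVED", "MINOR_ISSUES"):
        return "stop"  # clean exit, or only minor findings to log
    if status != "CHANGES_REQUESTED":
        raise ValueError(f"unexpected verdict status: {status!r}")
    if completed_rounds >= max_rounds:
        return "max_rounds_reached"  # report remaining issues and stop
    return "continue"


print(next_step("CHANGES_REQUESTED", completed_rounds=1, max_rounds=3))
print(next_step("CHANGES_REQUESTED", completed_rounds=3, max_rounds=3))
```

The stall check on persisted findings sits alongside this rule and can pause the loop for user input before the round limit is reached.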
Present at the end of every loop run (whether clean exit, stall, or max rounds):
## Collaborative Loop Summary
Rounds completed: N
Findings resolved: X
Findings dropped (consensus reject): Y
Findings remaining: Z
### Resolved
- [1] [severity] [category] file:line -- fixed in round N
### Dropped
- [2] [severity] [category] file:line -- both models agreed not real
### Remaining (if any)
- [5] [severity] [category] file:line -- description
Claude never acts on unvalidated output. Step 3 produces analysis; Step 4 validates it; Step 4.5 requires bilateral agreement. Only then does Step 5 implement.
Both models must agree before action. The Step 4.5 decision matrix ensures that disagreements are surfaced to the user, never silently resolved.
No intermediate files. Claude holds all state (findings, validations, round tracking) in conversation context. No temp files to manage or clean up.
No self-review fallback. If Codex fails at any point, STOP. Do not substitute Claude reviewing its own work -- that defeats the purpose of bilateral validation.
Parallelism is organic. Claude uses Agent tool subagents when fixes are independent. Codex uses multi_agent = true internally when the prompt includes parallelism instructions. Neither is forced.
Disagreements go to the user. When Claude and Codex disagree on a finding's validity, the user mediates. No model overrides the other.
Do NOT use /codex:adversarial-review for validation. It produces its own independent findings instead of CONFIRM/REJECT on Claude's findings. Only /codex:rescue with a custom prompt provides per-finding validation.
Do NOT act on unvalidated output. Claude's Step 3 analysis is a proposal, not a mandate. Every finding must pass through Steps 4 and 4.5 before implementation.
Do NOT fall back to Claude-only if Codex fails. Self-review provides no additional signal. Stop the loop and report the failure.
Do NOT silently resolve disagreements. If Claude and Codex disagree on a finding, present both perspectives to the user. Never auto-resolve in favor of either model.
Do NOT forget gpt-5-4-prompting patterns. When composing any prompt for Codex (Steps 4, 6), invoke the gpt-5-4-prompting skill to use proper XML-tagged block structure. Codex performs significantly better with structured prompts.
Do NOT skip Step 4.5. The re-evaluation gate is what distinguishes this from a simple "send to Codex and trust the result" workflow. Claude's independent review of Codex's decisions catches validation errors.
Do NOT poll Codex status rapidly. Check the log file every 2 minutes, not every few seconds. 50+ bash commands polling status wastes context and provides no value. Use the Stale-Log Monitor procedure: check timestamp + tail, decide healthy or stalled, act accordingly.
Do NOT use hard timeouts for Codex tasks. Complex validations legitimately take 15-25 minutes. A hard timeout would kill healthy tasks. Instead, detect hangs via log staleness (no modification for 5 minutes while phase hasn't advanced).
npx claudepluginhub dmitriyyukhanov/claude-plugins --plugin codex-collaboration