From arcforge
Use when optimizing any measurable metric through autonomous hypothesis-driven experimentation — build times, algorithm efficiency, prompt quality, model performance, or any target with a numeric signal.
Slash command: /arcforge:arc-researching
Autonomous iterative research: define a measurable optimization target, establish a baseline, then run a hypothesis-driven experiment loop until interrupted.
Core principle: "Fixed judge + free player" — the evaluation method is immutable (the judge), while the implementation is free to change (the player). By locking what you measure, you prevent moving goalposts during optimization.
- git reset --hard HEAD~1 immediately.

Agent proposes, human reacts, refine iteratively, then lock.
Step 1: Analyze Target
Step 2: Propose Draft Contract
- research-config.md covering all 6 sections (below)

Step 3: Refine with Human
Step 4: Lock the Contract
- research-config.md to disk

# Research Config: {target}
## Scope
CAN modify: {files/dirs the agent may change}
CANNOT modify: {files/dirs that are off-limits}
## Goal
Metric: {name, e.g., "build_time_seconds", "val_bpb"}
Direction: {lower-is-better | higher-is-better}
Target: {optional, e.g., "< 30s" or "none"}
## Strategy
Hypothesis playbook: {domain-specific approaches, ordered by likelihood}
Research sources: {docs URLs, reference implementations, config files}
First moves: {2-3 concrete experiments after baseline}
## Evaluation
Run command: {exact shell command, e.g., "npm run build 2>&1"}
Extract metric: {grep pattern, e.g., "grep -oP 'Time: \K[\d.]+' build.log"}
Timeout: {seconds per experiment}
Trials: {1 | 3 | 5 — runs per experiment; default 1 if omitted}
Aggregation: {median | mean — default median}
## Constraints
{secondary considerations, e.g., "keep memory under 4GB"}
## Autonomy
Mode: {run-until-interrupted | run-N-times | run-until-target}
## Simplicity Criterion
{Prefer simpler code when results are similar. Removing code for equal results is a win. "0.1% + 20 hacky lines? No." "0.1% from deleting code? Yes." "No improvement but simpler? Keep."}
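
The Scope section of the contract can be checked mechanically before each commit. The following is a minimal sketch, not part of arcforge: the helper names (`is_within`, `out_of_scope`) and the example paths are hypothetical, and it assumes scope entries are plain file or directory paths rather than glob patterns.

```python
from pathlib import PurePosixPath


def is_within(path: str, root: str) -> bool:
    """True if `path` equals `root` or lies underneath it."""
    p, r = PurePosixPath(path), PurePosixPath(root)
    return p == r or r in p.parents


def out_of_scope(changed, can_modify, cannot_modify):
    """Return the changed paths that violate the declared scope.

    A path is a violation if it falls under any CANNOT entry, or if it
    is not covered by any CAN entry (assumption: CANNOT wins on overlap).
    """
    violations = []
    for path in changed:
        allowed = any(is_within(path, root) for root in can_modify)
        forbidden = any(is_within(path, root) for root in cannot_modify)
        if forbidden or not allowed:
            violations.append(path)
    return violations


# Hypothetical example: the eval script is the judge, so it is off-limits.
changed = ["src/build/cache.js", "scripts/eval.sh", "README.md"]
print(out_of_scope(changed, ["src"], ["scripts/eval.sh"]))
```

A non-empty result would mean the experiment touched files outside the contract and should not be committed.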
| Judge Type | Signal Stability | Recommended Trials |
|---|---|---|
| Deterministic (build time, algorithm) | Stable ±2% | 1 |
| Semi-stochastic (E2E tests, flaky metrics) | Varies ±10% | 3 |
| Stochastic (LLM-graded eval, model behavior) | Varies ±30% | 5 with median |
The contract author decides at lock time, not the loop at runtime. If Trials is omitted from an existing contract, default to 1.
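
One way to pick a row from the table at lock time is to run the evaluation a few times on unchanged code and look at the spread. The sketch below is an illustration only: `relative_spread` and `recommend_trials` are hypothetical helpers, and the 2% and 10% cutoffs between rows are an assumption read off the table, not thresholds defined by the skill.

```python
from statistics import median


def relative_spread(samples):
    """Half the min-max range, as a fraction of the median.

    Assumes a non-zero median (a metric centred on zero would need a
    different normalisation).
    """
    mid = median(samples)
    return (max(samples) - min(samples)) / (2 * abs(mid))


def recommend_trials(samples):
    """Map observed pilot-run variation to a Trials value for the contract."""
    spread = relative_spread(samples)
    if spread <= 0.02:   # roughly "Stable ±2%"
        return 1
    if spread <= 0.10:   # roughly "Varies ±10%"
        return 3
    return 5             # noisier signals: 5 trials, aggregated by median


# Three pilot runs of an unchanged build, in seconds (made-up numbers).
print(recommend_trials([10.0, 10.1, 9.9]))
```

The result is written into the contract once; the loop itself never re-derives it.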
- git checkout -b research/{tag}
- results.tsv with status baseline. Do NOT commit results.tsv (keep it untracked so experiment history survives resets)
- node "${ARCFORGE_ROOT}/scripts/cli.js" research dashboard --results results.tsv --config research-config.md

This is the heart of the skill. NEVER STOP: run until interrupted or the stop condition from the contract is met.
LOOP (until stop condition):
1. READ STATE — git log, results.tsv, research-config.md
2. HYPOTHESIZE — pick a direction based on results so far
3. IMPLEMENT — modify files within declared scope only
4. COMMIT — git commit with descriptive message
5. RUN — execute command `trials` times → run-1.log, run-2.log, ... (never tee or raw stdout)
6. EXTRACT — grep metric from each log, compute aggregation (median/mean)
7. DECIDE — aggregated value improved? keep. Same/worse? revert. Crash? log + revert.
8. LOG — append row to results.tsv (every experiment, no exceptions)
9. ANALYZE — 3+ failures in same direction? change direction entirely
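
Steps 5 and 6 of the loop (one log per trial, then extract and aggregate) can be sketched as below. This is a hedged illustration, not the skill's implementation: `extract_metric` and `aggregate` are hypothetical names, the log contents are made up, and treating a missing metric as a failed experiment is an assumption. Python's `re` module does not support `\K`, so the sketch uses a capture group as the equivalent of the `grep -oP 'Time: \K[\d.]+'` example.

```python
import re
from statistics import mean, median


def extract_metric(log_text, pattern):
    """Return the first captured number in the log, or None if absent."""
    match = re.search(pattern, log_text)
    return float(match.group(1)) if match else None


def aggregate(values, how="median"):
    """Combine per-trial values; a missing trial makes the result None."""
    if not values or any(v is None for v in values):
        return None
    return median(values) if how == "median" else mean(values)


# Contents of run-1.log, run-2.log, run-3.log (made-up examples).
logs = ["compiled ok\nTime: 41.2\n", "Time: 39.8\n", "Time: 40.5\n"]
pattern = r"Time: ([\d.]+)"

values = [extract_metric(text, pattern) for text in logs]
print(aggregate(values, "median"))
```

Only the aggregated number reaches the decision step, which keeps a single noisy trial from deciding the outcome.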
| Outcome | Action | Git | results.tsv Status |
|---|---|---|---|
| Metric improved | Keep the change | Keep commit | keep |
| Metric same or worse | Discard the change | git reset --hard HEAD~1 | discard |
| Command crashed/timed out | Log and discard | git reset --hard HEAD~1 | crash |
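
The table above reduces to a small pure function. The following sketch assumes a crash or timeout is represented as `None` and that `direction` uses the contract's `lower-is-better` / `higher-is-better` strings; the function name `decide` is hypothetical.

```python
def decide(candidate, best, direction):
    """Map an aggregated metric to (results.tsv status, revert the commit?)."""
    if candidate is None:
        # Crash or timeout: log it, then git reset --hard HEAD~1.
        return ("crash", True)
    if direction == "lower-is-better":
        improved = candidate < best
    else:
        improved = candidate > best
    # "Same" counts as not improved, matching the binary rule.
    return ("keep", False) if improved else ("discard", True)


print(decide(0.891, 0.997, "lower-is-better"))
print(decide(0.912, 0.891, "lower-is-better"))
```

Strict comparison is deliberate here: an equal metric is discarded, so the branch only ever accumulates commits that moved the number.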
If 3 or more consecutive experiments fail in the same direction (e.g., all trying to reduce allocations):
Idea generation when stuck:
Two types of crashes — handle differently:
Dumb bug (typo, missing import, syntax error, off-by-one):
Fundamentally broken idea (OOM, algorithm doesn't converge, approach is wrong):
- crash with the error in description
- git reset --hard HEAD~1

Timeout: If the run exceeds the timeout, kill it and treat as a fundamentally broken idea.
Never count crashes toward the "3 failures → change direction" rule — crashes indicate broken code, not a bad hypothesis.
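
The "3 failures, change direction" rule with crashes excluded could be checked as follows. This is a sketch under stated assumptions: results.tsv has no direction column, so the per-experiment direction label here is hypothetical bookkeeping kept by the agent, and letting crashes neither count toward nor break a streak is one reading of the rule.

```python
def should_change_direction(history, threshold=3):
    """history: list of (direction_label, status) tuples, oldest first."""
    streak = 0
    current = None
    for direction, status in reversed(history):
        if status == "crash":
            continue  # assumption: crashes are skipped entirely
        if status != "discard":
            break  # a keep or baseline ends the failure streak
        if current is None:
            current = direction
        elif direction != current:
            break  # failures in a different direction do not count
        streak += 1
    return streak >= threshold


history = [
    ("cache", "keep"),
    ("alloc", "discard"),
    ("alloc", "crash"),
    ("alloc", "discard"),
    ("alloc", "discard"),
]
print(should_change_direction(history))
```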
Long-running research burns context. Protect it:
- command > run.log 2>&1 (never tee, never raw stdout)
- grep "metric_pattern" run.log (never cat run.log)
- tail -n 50 run.log (read stack traces, not full logs)

When the loop ends (interrupted, target reached, or max iterations):
- results.tsv

Tab-separated values with header row:
commit	metric_value	status	description
a1b2c3d	0.997	baseline	Initial baseline measurement
b2c3d4e	0.891	keep	Reduced learning rate by 50%
c3d4e5f	0.912	discard	Added dropout layer 0.3 (regression from 0.891)
d4e5f6g	NaN	crash	Segfault in custom allocator (timeout after 300s)
- NaN for crashes
- baseline, keep, discard, crash

Git status: Keep results.tsv untracked. If committed, git reset after failed experiments will erase the log. The TSV is your persistent memory, so it must survive resets.
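
Reading results.tsv back (for example at the READ STATE step) might look like the sketch below. It assumes the four-column tab-separated layout shown above; `load_results` and `best_row` are hypothetical helpers, and the sample rows are abbreviated copies of the example rows.

```python
import csv
import io
import math

SAMPLE = (
    "commit\tmetric_value\tstatus\tdescription\n"
    "a1b2c3d\t0.997\tbaseline\tInitial baseline measurement\n"
    "b2c3d4e\t0.891\tkeep\tReduced learning rate by 50%\n"
    "c3d4e5f\t0.912\tdiscard\tAdded dropout layer 0.3\n"
    "d4e5f6g\tNaN\tcrash\tSegfault in custom allocator\n"
)


def load_results(text):
    """Parse TSV text into dict rows with metric_value as float."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = list(reader)
    for row in rows:
        # float("NaN") yields nan, so crash rows parse without special-casing.
        row["metric_value"] = float(row["metric_value"])
    return rows


def best_row(rows, direction="lower-is-better"):
    """Best row among those still on the branch (baseline or keep)."""
    scored = [
        r for r in rows
        if r["status"] in ("baseline", "keep")
        and not math.isnan(r["metric_value"])
    ]
    pick = min if direction == "lower-is-better" else max
    return pick(scored, key=lambda r: r["metric_value"])


rows = load_results(SAMPLE)
print(best_row(rows)["commit"])
```

Discarded and crashed rows are skipped when picking the best because their commits were reset and no longer exist on the branch.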
If the agent is interrupted and resumes in a new session:
- research-config.md → if exists, contract is already locked (skip Phase 1)
- results.tsv → understand all prior experiments
- git log → understand current code state
- research/ → confirm research context

Never:
If results are suspicious:
| Rationalization | What to Do Instead |
|---|---|
| "The eval has a bug, let me fix it" | You're the player, not the judge. Stop and tell the human. |
| "The metric barely regressed, I'll keep it" | Binary rule: improved or not. Revert. |
| "I should ask the human about this" | You are autonomous. Decide, log reasoning, keep going. |
✓ RESEARCH COMPLETE
Target: {target name}
Baseline: {baseline value}
Best: {best value} ({improvement}% {direction})
Experiments: {total} ({kept} kept, {discarded} discarded, {crashed} crashed)
Branch: research/{tag}
Best commit: {hash}
✗ RESEARCH BLOCKED
Reason: {why the loop cannot continue}
Last experiment: {commit hash}
Suggestion: {what the human should investigate}
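
The numeric fields of the RESEARCH COMPLETE block above can be derived from the baseline, the best value, and the status counts. The sketch below is illustrative only and rests on two assumptions that the templates do not settle: `{direction}` is rendered as "lower" or "higher", and the baseline row is not counted as an experiment. The function names are hypothetical.

```python
def improvement_percent(baseline, best):
    """Absolute change from baseline to best, as a percentage of baseline."""
    return abs(best - baseline) / abs(baseline) * 100


def summary_lines(baseline, best, kept, discarded, crashed):
    """Build the Baseline / Best / Experiments lines of the summary."""
    # Assumption: {direction} is rendered as "lower" or "higher".
    word = "lower" if best < baseline else "higher"
    # Assumption: the baseline measurement is not counted as an experiment.
    total = kept + discarded + crashed
    pct = improvement_percent(baseline, best)
    return [
        f"Baseline: {baseline}",
        f"Best: {best} ({pct:.1f}% {word})",
        f"Experiments: {total} ({kept} kept, {discarded} discarded, {crashed} crashed)",
    ]


for line in summary_lines(0.997, 0.891, kept=1, discarded=1, crashed=1):
    print(line)
```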
Before: arc-brainstorming → explore what to optimize and identify measurable targets
During: arc research dashboard for live monitoring
After: Review research/{tag} branch, cherry-pick or merge to main, run project tests
npx claudepluginhub gregoryho/arcforge --plugin arcforge