From copeee-skills
Run, monitor, recover, and babysit ML evaluation experiments end-to-end. Covers the full lifecycle: intent alignment, dry-run validation, auto-launch, checkpoint recovery, real-time/cron-based self-healing monitoring, and post-completion validation. TRIGGER when: user wants to run/start/launch/rerun experiments or evaluations, monitor running experiments or check experiment logs, recover from failures like 429 quota errors or process death, set up monitoring for long-running tasks, or do a dry-run before committing to a full run. Also trigger for 'babysit', 'experiment-runner', and when user provides monitor CLI args like 'monitor experiments/X.json --log Y --expected N'. DO NOT trigger for: analyzing/comparing existing results (tables, charts, LaTeX), writing scripts to parse result files, code review or code modification, dataset creation or preprocessing, GPU/system monitoring, training job submission (use amlt-run-job), or autonomous development tasks.
Slash command: /copeee-skills:experiment-runner
Path patterns: **/experiments/**, **/scripts/**, streaming-agent/**
> **Environment placeholders** (replace these in your shell/env, or override per-call):
- `${PROJECT_ROOT}`: your repo root containing the eval scripts and `experiments/` dir
- `${VENV_DIR}`: your Python virtualenv directory (must contain `bin/activate`)
- `${ML_PROXY_BASE}`: your Azure / OpenAI-compatible proxy base URL (e.g. `http://your-proxy:9999`)
- `${USER_HOME}`: typically `$HOME`

This skill was generalized from a streaming-video-agent project; commands and `phase0c_regression.py`-style script names are illustrative, so substitute your own.
Manages the full experiment lifecycle in a single flow. Four modes:
- [...] /loop

Core principles. These are distilled from the D-mode refactor failure of 2026-04-14 and Anthropic's multi-agent guidance. Violate at your peril.
Any field that affects Phase A cache key (fps, resize_mode, clip_delta thresholds, summary_prompt_variant, cache_version) MUST match the baseline unless intentionally changed. A single silent drift → cache_hit_rate = 0% → 2h+ wasted. Always run Preflight Config Drift Check before launching an ablation experiment.
Metric-only evaluators miss "code was changed but plan was not realized" (e.g. observe() stub not wired into agent_loop). Before judging a Δmacro, verify the code diff against the plan: does the specific function referenced in the plan actually contain the intended logic? grep -n + short targeted unit test beats reading a 1000-line PR.
Subagents are stateless per invocation. They cannot hold "E0-rerun showed 1.3pp is noise, so 2pp is our floor" across turns. Decisions that require cross-file/cross-session context (go/halt, verdict thresholds, root-cause diagnosis) belong to the main agent or the human. Subagents should run narrow, assertion-backed tasks: run-this-config, parse-this-log, grep-this-function. Input/output types should be JSON-serializable.
Anthropic's own multi-agent research system parallelizes independent search/exploration; the lead agent synthesizes and decides. An ablation pipeline (E0→E1→E2…) is serial with cross-phase memory — spawning Executor/Evaluator/Orchestrator loops for it is an antipattern. For ablation: single main agent + preflight scripts + checkpoint reviews.
The plan doc was correct at write time. Before starting any phase, run: git log -5, verify referenced files still exist at the assumed paths, read the actual Segment/config.py signatures. Do NOT code from plan alone. Survey → reconcile → implement.
nohup python3 ... inside .claude/worktrees/<branch>/ dies when the worktree is cleaned or the session ends. Use setsid nohup ... < /dev/null & disown with log paths outside the worktree. Or use tmux/screen for >1h jobs.
If an experiment crashed, a second launch with the same config is almost always wrong. Diagnose first: log tail, ps, du on cache dir, process kill reason (dmesg | tail, OOM killer, SIGTERM). Rule: same error 2 times in a row → stop and investigate, do not attempt a third run.
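All defaults are managed by layered YAML; copying fields by hand into each experiment YAML is no longer allowed. Source of truth: streaming-agent/experiments/configs/_defaults/
Merge order (low → high priority):
global.yaml::mode_defaults[MODE]
  → _defaults/{benchmark}.yaml
  → _defaults/models/{model}.yaml   # qwen special case → online
  → user YAML (optional)
  → CLI flags (--protocol etc. / --set)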
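The merge order above amounts to a recursive dictionary merge in which later layers win. A minimal sketch, assuming each layer has already been loaded into a plain nested dict; `deep_merge` and `resolve_config` are hypothetical names rather than functions from `run_experiment.py`, and the key paths and sample values are illustrative only:

```python
from functools import reduce


def deep_merge(low: dict, high: dict) -> dict:
    """Return a new dict where `high` overrides `low`, recursing into nested dicts."""
    merged = dict(low)
    for key, value in high.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: dict) -> dict:
    """Merge layers passed in low-to-high priority order."""
    return reduce(deep_merge, layers, {})


# Illustrative layers only; the real values live in the _defaults/ YAML files.
mode_defaults = {"mode": {"max_rounds": 5}, "protocol": "offline"}
benchmark_defaults = {"protocol": "offline", "execution": {"workers": 8}}
model_defaults = {"protocol": "online", "execution": {"workers": 1}}  # qwen special case
cli_flags = {"protocol": "offline"}  # e.g. --protocol offline

config = resolve_config(mode_defaults, benchmark_defaults, model_defaults, cli_flags)
print(config)
```

Because the CLI layer is merged last, an explicit `--protocol offline` wins over the qwen model override, while `workers=1` from the model layer survives since no higher layer sets it.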
| Mode | Sampling | Frames | Tools |
|---|---|---|---|
| A (raw_frames) | uniform | 48 (hard cap) | - |
| B (clip_keyframes) | CLIP keyframes (all injected per segment) | dynamic | - |
| C (segment) | 1 fps, stepping back 5 seconds from the question point | 5 | - |
| D (segment_tools) | same as C | 5 | recall+diff+timeline, max_rounds=5 |

| Benchmark | Default protocol | Notes |
|---|---|---|
| OVO | offline | online requires an explicit --protocol online |
| RTV | offline | fixes the 2026-04-21 bug where runs mistakenly used online |
| LV | offline | - |

| Model | Override | Reason |
|---|---|---|
| qwen | protocol=online, workers=1 | local GPU inference |
# Minimal command: benchmark + mode + model + n
python3 streaming-agent/scripts/run_experiment.py \
  --benchmark rtv --mode D --model gpt4o --n 140
# Explicitly override protocol
python3 streaming-agent/scripts/run_experiment.py \
  --benchmark ovo --mode D --model qwen --protocol offline --n 100
# Validate only (print the merged config, no API calls)
python3 streaming-agent/scripts/run_experiment.py \
  --benchmark rtv --mode D --model gpt4o --n 3 --validate-only
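- `--validate-only`: the main agent must run validate-only once first and grep to confirm that key fields such as protocol / max_context_frames / max_tool_calls equal the expected values (matching the tables above).
- `_defaults/*.yaml`: do not let default values drift inside experiment YAMLs, otherwise the next comparison experiment is poisoned.
- `output.root` must start with experiments/{ovo,rtv,lv}/ and must not be scattered at the top level (drift from before 2026-04-22 has been archived).

Trigger examples:
- [...] "Run Gemini D on RTV", "run segment_tools on OVO"
- monitor -- "monitor experiments/X.json --log experiments/X.log --expected 140"
- validate -- "validate experiments/result.json"
- [...] "run E3 from phase-L1a.md", "run ablation E1→E5", "launch next phase of d-mode refactor". See Mode 5 below.

Create a task list at the start to track progress: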
Tasks:
- [ ] Phase 1: Align experiment intent
- [ ] Phase 2: Dry-run validation
- [ ] Phase 3: Launch experiment
- [ ] Phase 4: Real-time monitoring (or hand off to /loop for long runs)
- [ ] Phase 5: Post-completion validation
Extract from user's description:
- [...] `GPT4O_USE_TRAPI=1`), or dual (proxy+TRAPI parallel, `dual_backend: true`) -- check proxy health first
- [...] `vlm_backend=claude`)
- [...] `http://127.0.0.1:4142`
- [...] `claude-sonnet-4.6`
- [...] image blocks and not Copilot SDK attachments
- [...] `max_context_frames <= 64`

If ambiguous, ask one clarifying question. Then confirm:
Experiment Plan:
Gemini + segment_tools on RTV-Bench
Items: 140 | Script: rtv_bench_eval.py
Output: experiments/phase2_rtv_D_segment_tools_gemini.json
Log: experiments/gemini_D_rtv.log
API: proxy (default) | TRAPI fallback: GPT4O_USE_TRAPI=1 | dual: execution.dual_backend=true
-> Proceed with dry-run?
Before launching any experiment using Azure proxy backends, verify the proxy is alive:
curl -s -o /dev/null -w "%{http_code}" --max-time 5 ${ML_PROXY_BASE}/v1/models 2>/dev/null || echo "TIMEOUT"
- 200: Proxy healthy, proceed normally
  - Optional: set `dual_backend: true` in YAML (or `GPT4O_DUAL_BACKEND=1`) to run proxy+TRAPI in parallel for ~1.5x Phase A throughput
  - Optional: set `max_concurrent_vlm` to 30 (auto-splits: proxy=20, TRAPI=10)
- TIMEOUT / non-200: Proxy down. Options:
  - `GPT4O_USE_TRAPI=1`: single TRAPI endpoint (lower burst limit, run serially)
  - `dual_backend: true`: still works, auto circuit-breaks proxy and routes all to TRAPI
  - TRAPI uses `AzureCliCredential`, model remapped to `gpt-4o_2024-11-20`

Run the exact command with --limit 2 to a temp file:
cd ${PROJECT_ROOT}
source ml_env/bin/activate
# Recommended (since 2026-04-22): minimal CLI
python3 streaming-agent/scripts/run_experiment.py \
  --benchmark BENCH --mode MODE --model BACKEND \
  --n 2 --validate-only
# Legacy shim still works
python3 streaming-agent/scripts/SCRIPT.py \
--vlm-backend BACKEND --ablation-mode MODE \
DATASET_FLAGS \
--output /tmp/dryrun_$(date +%s).json \
--limit 2 --verbose 2>&1 | tail -30
Validate output:
python3 -c "
import json, sys, glob
f = sorted(glob.glob('/tmp/dryrun_*.json'))[-1]
d = json.load(open(f))
items = d.get('per_item', [])
if not items: print('FAIL: no items'); sys.exit(1)
for it in items:
    if it.get('response') is None: print(f'FAIL: null response id={it.get(\"id\")}'); sys.exit(1)
print(f'PASS: {len(items)} items OK. Sample: {items[0].get(\"response\",\"?\")[:80]}')
"
After dry-run, verify backward tasks use realtime as max_time (data leak fix from commit 44d4924):
python3 -c "
import json, glob
f = sorted(glob.glob('/tmp/dryrun_*.json'))[-1]
d = json.load(open(f))
BACKWARD = {'EPM', 'ASI', 'HLD'}
for it in d.get('per_item', []):
    task = it.get('task_type', '')
    if task in BACKWARD:
        mt = it.get('max_time')
        rt = it.get('realtime')
        if mt is None or (rt is not None and abs(mt - rt) > 0.1):
            print(f'FAIL: Backward task {task} id={it.get(\"id\")} max_time={mt} realtime={rt}')
            break
else:
    # for/else: runs only when the loop finished without hitting break
    print('PASS: All backward tasks use realtime as max_time')
"
If this check fails, ensure you are running code from commit 44d4924 or later.
If dry-run fails -> diagnose, fix, re-run. Do not proceed until it passes.
Use setsid + log outside worktree so the process survives worktree cleanup and session close:
# Log path MUST be outside any .claude/worktrees/ directory
LOG_PATH=${PROJECT_ROOT}/logs/experiment_$(date +%Y%m%d_%H%M%S).log
setsid nohup python3 scripts/SCRIPT.py \
--vlm-backend BACKEND --ablation-mode MODE \
DATASET_FLAGS \
--output OUTPUT_PATH --verbose \
>> $LOG_PATH 2>&1 < /dev/null &
PID=$!
disown $PID
echo "PID: $PID LOG: $LOG_PATH"
Verify: sleep 3 && ps -p $PID -o pid,ppid,pgid,sid,stat,cmd
The SID != parent-shell-SID proves setsid worked — the process won't die with the shell/worktree.
Why:
nohup alone only makes the process ignore SIGHUP; the process stays in the launching shell's session and process group. When a .claude/worktrees/<branch>/ is cleaned up or the Claude Code session ends, that process group gets SIGTERM. setsid moves the process into its own session so the signal doesn't propagate. This burned 2h in the d-mode refactor of 2026-04-14.
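When the launch is driven from Python rather than a shell, the same session detachment can be approximated with `subprocess.Popen(start_new_session=True)`, which makes the child call `setsid()` before exec. This is a sketch on POSIX only; `launch_detached` is a hypothetical helper, not part of the skill's scripts, and it covers the `setsid ... < /dev/null` part of the shell command (it does not install a SIGHUP handler the way `nohup` does):

```python
import os
import subprocess
import sys


def launch_detached(cmd: list, log_path: str) -> subprocess.Popen:
    """Start cmd in its own session, appending stdout and stderr to log_path."""
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,        # same role as `< /dev/null`
            stdout=log_file,
            stderr=subprocess.STDOUT,        # same role as `2>&1`
            start_new_session=True,          # child calls setsid() before exec
        )
    finally:
        log_file.close()  # the child keeps its own copy of the descriptor


if __name__ == "__main__":
    proc = launch_detached([sys.executable, "-c", "import time; time.sleep(2)"], os.devnull)
    # A session id different from ours shows the child left our session.
    print("child sid:", os.getsid(proc.pid), "our sid:", os.getsid(0))
```

As with the shell version, the log path passed in should live outside any `.claude/worktrees/` directory.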
Workers recommendation (updated Round 15):
| Backend | precompute_workers | qa_workers | Notes |
|---|---|---|---|
| gpt4o | 20 (not 50!) | 8-12 | Actual concurrency = precompute_workers x segment_parallel_workers(8) |
| gpt4o (dual) | 25-30 | 12-15 | With dual_backend: true + max_concurrent_vlm: 30 |
| o4mini | 10-20 | 5-10 | Reasoning model, keep qa_workers <= 10 |
| o4mini (dual) | 20-25 | 5-10 | Phase A dual, Phase B TRAPI-only (keeps thinking_log) |
| o3 | 5 | 1-5 | High cost |
| gpt52 | 10-20 | 5-10 | Latest reasoning model |
| gemini | 10-20 | 10-30 | Flash/Pro |
| qwen | 1 | 1 | GPU-bound, forced serial |
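The "20 (not 50!)" note follows from the multiplication in the gpt4o row. A small worked example, assuming the factor of 8 for `segment_parallel_workers` quoted in the table; `effective_concurrency` is a hypothetical helper name:

```python
SEGMENT_PARALLEL_WORKERS = 8  # value quoted in the table's Notes column


def effective_concurrency(precompute_workers: int,
                          segment_parallel_workers: int = SEGMENT_PARALLEL_WORKERS) -> int:
    """Actual concurrency = precompute_workers x segment_parallel_workers."""
    return precompute_workers * segment_parallel_workers


for workers in (20, 50):
    print(f"precompute_workers={workers} -> actual concurrency {effective_concurrency(workers)}")
# precompute_workers=20 -> actual concurrency 160
# precompute_workers=50 -> actual concurrency 400
```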
Use the Monitor tool to stream the background process stdout in real-time. This is the preferred approach for most experiments because:
1. Launch experiment with Bash(run_in_background: true):
python3 scripts/SCRIPT.py ... 2>&1 | tee LOG_PATH
2. Immediately use Monitor tool on the background process
-> Each stdout line streams as an event
-> Watch for errors, progress updates, completion
3. On error -> diagnose, fix, restart (same as Mode 2 recovery logic)
4. On completion -> run Phase 5 validation, then report results
When to use: Single experiment, expected runtime < 60 minutes, user is present.
Hand off to /loop for periodic cron-based monitoring. Use this when:
Experiment launched (PID XXXXX).
Set up monitoring:
/loop 5m /experiment-runner monitor OUTPUT_PATH --log LOG_PATH --expected N
Decision rule: Default to Strategy A. Only suggest Strategy B if the user says "I'll do something else", "babysit it for me", "unattended", or if estimated runtime > 60 min.
CRITICAL: Always run this after experiment completes, before declaring success.
This phase catches silent data quality issues that caused wrong conclusions in the past (Round 4/5: null pollution; Round 13: backward data leak; Round 14: cache stale warnings).
python3 -c "
import json, sys
f = 'RESULT_FILE'
d = json.load(open(f))
items = d.get('per_item', [])
total = len(items)
# 1. Null rate check (with null_reason breakdown)
nulls = [i for i in items if i.get('response') is None]
null_rate = len(nulls) / max(1, total) * 100
print(f'[Validate] null_rate = {len(nulls)}/{total} ({null_rate:.1f}%)')
if nulls:
    from collections import Counter
    reasons = Counter(i.get('null_reason', 'unknown') for i in nulls)
    for reason, count in reasons.most_common():
        print(f'  {reason}: {count}')
if null_rate > 5:
    print('  WARNING: null_rate > 5% -- results may be unreliable!')
    print('  Must distinguish Raw Acc vs Valid Acc in any report.')
# 2. Accuracy (raw and valid)
correct_raw = sum(1 for i in items if i.get('correct'))
valid_items = [i for i in items if i.get('response') is not None]
correct_valid = sum(1 for i in valid_items if i.get('correct'))
raw_acc = correct_raw / max(1, total) * 100
valid_acc = correct_valid / max(1, len(valid_items)) * 100
print(f'[Validate] Raw Acc = {correct_raw}/{total} ({raw_acc:.1f}%)')
print(f'[Validate] Valid Acc = {correct_valid}/{len(valid_items)} ({valid_acc:.1f}%)')
if null_rate > 5:
    print(f'  Delta: {valid_acc - raw_acc:.1f}pp (Valid - Raw)')
# 3. Backward max_time verification (OVO-Bench)
BACKWARD = {'EPM', 'ASI', 'HLD'}
backward_items = [i for i in items if i.get('task_type', '') in BACKWARD]
if backward_items:
    leaked = [i for i in backward_items if i.get('max_time') is None]
    if leaked:
        print(f'[Validate] CRITICAL: {len(leaked)} backward items have max_time=None (DATA LEAK!)')
    else:
        print(f'[Validate] Backward max_time: {len(backward_items)} items all using realtime cutoff')
# 4. Per-task-type accuracy breakdown
from collections import defaultdict
task_acc = defaultdict(lambda: [0, 0])
for i in valid_items:
    t = i.get('task_type', 'unknown')
    task_acc[t][1] += 1
    if i.get('correct'): task_acc[t][0] += 1
print('[Validate] Per-task accuracy:')
for t, (c, n) in sorted(task_acc.items()):
    print(f'  {t}: {c}/{n} ({c/max(1,n)*100:.1f}%)')
"
# grep -c prints 0 and exits 1 when nothing matches, so only exit status 2
# (log missing or unreadable) needs a fallback message.
# 5. Cache stale warnings (from log)
echo "[Validate] Cache stale warnings:"
grep -c "v2 cache stale\|cache miss\|stale" LOG_FILE 2>/dev/null || [ $? -eq 1 ] || echo "  log not readable"
# 6. 429 / rate limit errors
echo "[Validate] 429 errors:"
grep -c "429\|rate.limit\|RateLimitError" LOG_FILE 2>/dev/null || [ $? -eq 1 ] || echo "  log not readable"
# 7. Content policy errors
echo "[Validate] Content policy errors:"
grep -c "content_policy\|ContentPolicyError\|CONTENT_POLICY_BLOCKED" LOG_FILE 2>/dev/null || [ $? -eq 1 ] || echo "  log not readable"
# 8. VLM rate limiter stats (if present)
echo "[Validate] Rate limiter activity:"
grep -c "VLM rate limiter\|semaphore\|backoff" LOG_FILE 2>/dev/null || [ $? -eq 1 ] || echo "  log not readable"
Validation criteria:
Update task list, report completion of setup.
This runs inside /loop via cron OR when user manually asks to check status. Be fast when healthy, thorough when not.
Run these bash commands to collect experiment state:
# 1. Progress -- count completed items
python3 -c "
import json, os
for f in ['RESULT_FILE', 'CHECKPOINT_FILE']:
    if os.path.exists(f):
        d = json.load(open(f))
        items = d.get('per_item', [])
        correct = sum(1 for i in items if i.get('correct'))
        nulls = sum(1 for i in items if i.get('response') is None)
        null_reasons = {}
        for i in items:
            if i.get('response') is None:
                r = i.get('null_reason', 'unknown')
                null_reasons[r] = null_reasons.get(r, 0) + 1
        nr_str = ', '.join(f'{k}={v}' for k,v in null_reasons.items()) if null_reasons else 'none'
        print(f'Progress: {len(items)}/EXPECTED | Accuracy: {correct}/{len(items)} ({correct/max(1,len(items))*100:.1f}%) | Nulls: {nulls} ({nr_str})')
        break
else:
    # for/else: reached only when neither file exists
    print('No result file yet')
"
# 2. Process alive?
ps aux | grep -E "(phase0c_regression|rtv_bench_eval|run_experiment)" | grep -v grep | head -3
# 3. Log freshness + recent content
stat -c 'Log modified: %Y' LOG_FILE 2>/dev/null
echo "Now: $(date +%s)"
tail -30 LOG_FILE
# 4. Cache hit ratio (CRITICAL for ablation — if <80% in first 5min, config drifted)
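# grep -c already prints 0 (and exits 1) when nothing matches, so adding "|| echo 0"
# would yield two lines and break the arithmetic below; default only the empty case.
HITS=$(grep -c 'Cache HIT' LOG_FILE 2>/dev/null); HITS=${HITS:-0}
MISS=$(grep -c 'Cache MISS\|v2 cache stale' LOG_FILE 2>/dev/null); MISS=${MISS:-0}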
TOTAL=$((HITS+MISS))
[ $TOTAL -gt 0 ] && echo "Cache: HIT=$HITS MISS=$MISS ratio=$((100*HITS/TOTAL))%"
Cache ratio rule (ablation only): If after 5 minutes the ratio is <80% and this is an incremental phase that should reuse baseline cache → kill process + diff config vs baseline + fix + restart. Do not let it run 2h to find out.
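The cache ratio rule can be written down as a small pure function, which makes the threshold and the five-minute grace period explicit. A sketch only: `cache_hit_ratio` and `should_abort` are hypothetical names, and the log markers are the ones used by the grep commands in this section:

```python
def cache_hit_ratio(log_lines):
    """Return the hit percentage, or None when the log has no cache lines yet."""
    hits = sum("Cache HIT" in line for line in log_lines)
    misses = sum(("Cache MISS" in line) or ("v2 cache stale" in line) for line in log_lines)
    total = hits + misses
    return None if total == 0 else 100.0 * hits / total


def should_abort(ratio, elapsed_s, reuses_baseline_cache, threshold=80.0, grace_s=300):
    """Apply the ablation-only rule: below threshold after the grace period means stop."""
    if ratio is None or not reuses_baseline_cache:
        return False
    return elapsed_s >= grace_s and ratio < threshold


sample_log = ["Cache HIT seg=1", "Cache HIT seg=2", "Cache HIT seg=3", "Cache MISS seg=4"]
ratio = cache_hit_ratio(sample_log)
print(ratio, should_abort(ratio, elapsed_s=360, reuses_baseline_cache=True))
# 75.0 True
```

A `True` result corresponds to the "kill process + diff config vs baseline + fix + restart" path; the function itself never kills anything.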
Read the raw data above. Use your intelligence to classify -- don't rely on regex patterns. Consider:
- [...] null_reason fields? (content_policy vs retry_exhausted vs unknown -- different recovery strategies)

HEALTHY -- Progress advancing, no errors, process alive.
[Monitor] HEALTHY | 85/140 (60.7%) | Acc: 55.3% | Nulls: 0 | Log: 30s ago
Done. Exit. Cron will invoke again in N minutes.
COMPLETE -- Items reached expected count.
[Monitor] COMPLETE | 140/140 | Accuracy: 65.7%
Run Phase 5 validation before declaring success.
Stop the monitoring loop -- experiment is done.
STALE -- Log not updated >20 min, process still alive (likely hung). -> Kill process, restart from checkpoint.
DEAD -- Process gone, items < expected. -> Restart from checkpoint. The checkpoint mechanism auto-skips completed items.
API QUOTA/RATE LIMIT -- 429, quota exhausted, rate limit errors in log. -> The VLM rate limiter (added Round 14) handles most 429s automatically with jittered backoff. -> If rate limiter is overwhelmed (429 count still rising fast), reduce workers and restart. -> Wait 120 seconds, then restart from checkpoint.
PROXY DOWN -- TCP closed, connection refused, or massive 403 errors from proxy.
-> Check proxy health: curl -s -o /dev/null -w "%{http_code}" --max-time 5 ${ML_PROXY_BASE}/v1/models
-> If proxy down: kill experiment, set GPT4O_USE_TRAPI=1, restart from checkpoint.
-> TRAPI caution: run ONE experiment at a time (low burst limit).
CODE BUG -- Traceback in log.
-> This is the expensive path. Read the traceback, find the source file, understand the bug, fix it minimally, verify with py_compile, restart.
-> Constraints: fix ONLY the crash bug. Do NOT refactor, change logic, or "improve" code.
OOM -- CUDA out of memory, killed by OOM killer.
-> Kill GPU processes (nvidia-smi -> find PID -> kill), wait 30s, restart.
HIGH NULL RATE -- Nulls rising faster than 5% of total.
-> Check null_reason field in results:
   - content_policy: Expected for certain datasets (Ego4D). VLM rate limiter uses placeholder summaries. Log and continue.
   - retry_exhausted: VLM backend failures after max retries. Check API health, reduce workers.
   - unknown / missing: Legacy code path. Check log for actual error.
-> If Azure Content Policy, this is expected -- will need Valid Acc analysis post-completion.

UNKNOWN -- Something you don't recognize. -> Print the relevant log snippet. Ask the user for guidance. Do not guess.
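Only the progress and liveness states (COMPLETE, DEAD, STALE, HEALTHY) can be derived mechanically from the collected numbers; the error states still require reading the log tail, as stated above. A sketch of that mechanical part, assuming a precedence order (COMPLETE before DEAD before STALE) that the text does not spell out; `Snapshot` and `coarse_state` are hypothetical names:

```python
from dataclasses import dataclass

STALE_AFTER_S = 20 * 60  # "Log not updated >20 min"


@dataclass
class Snapshot:
    items_done: int      # completed items in the result or checkpoint file
    expected: int        # value passed as --expected
    process_alive: bool  # result of the ps check
    log_age_s: float     # now minus the log's modification time


def coarse_state(s: Snapshot) -> str:
    """Classify progress/liveness only; errors in the log are triaged separately."""
    if s.items_done >= s.expected:
        return "COMPLETE"
    if not s.process_alive:
        return "DEAD"
    if s.log_age_s > STALE_AFTER_S:
        return "STALE"
    return "HEALTHY"  # provisional until the log tail has been read


print(coarse_state(Snapshot(items_done=85, expected=140, process_alive=True, log_age_s=30)))
# HEALTHY
```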
To reconstruct the restart command, read the config field in the result JSON or the first lines of the log -- they contain the original parameters.
cd ${PROJECT_ROOT}/streaming-agent
source ${PROJECT_ROOT}/ml_env/bin/activate
kill PID 2>/dev/null; sleep 2
# Checkpoint auto-resumes -- just rerun the same command
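# Relaunch with setsid and stdin from /dev/null, as at first launch, so the restart survives session or worktree cleanup
setsid nohup python3 scripts/SCRIPT.py ORIGINAL_ARGS >> LOG_PATH 2>&1 < /dev/null &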
echo "[Monitor] Restarted PID $! from checkpoint (DONE/EXPECTED)"
- [...] null_reason for triage.

Run this on any completed experiment result to check for known data quality issues.
python3 -c "
import json, sys
from collections import Counter, defaultdict
f = sys.argv[1]
d = json.load(open(f))
items = d.get('per_item', [])
total = len(items)
issues = []
# 1. Null rate (with reason breakdown)
nulls = [i for i in items if i.get('response') is None]
null_rate = len(nulls) / max(1, total) * 100
if null_rate > 5: issues.append(f'HIGH null_rate={null_rate:.1f}%')
if nulls:
    reasons = Counter(i.get('null_reason', 'unknown') for i in nulls)
    reason_str = ', '.join(f'{k}={v}' for k,v in reasons.most_common())
else:
    reason_str = 'none'
# 2. Accuracy
correct_raw = sum(1 for i in items if i.get('correct'))
valid = [i for i in items if i.get('response') is not None]
correct_valid = sum(1 for i in valid if i.get('correct'))
raw_acc = correct_raw / max(1, total) * 100
valid_acc = correct_valid / max(1, len(valid)) * 100
# 3. Answer distribution (detect answer bias)
answers = Counter(i.get('response', '') for i in items if i.get('response'))
if answers:
    most_common = answers.most_common(1)[0]
    if most_common[1] / max(1, len(valid)) > 0.6:
        issues.append(f'Answer bias: {most_common[0]} appears {most_common[1]}/{len(valid)} ({most_common[1]/len(valid)*100:.0f}%)')
# 4. Backward data leak check (OVO-Bench)
BACKWARD = {'EPM', 'ASI', 'HLD'}
backward_items = [i for i in items if i.get('task_type', '') in BACKWARD]
if backward_items:
    leaked = [i for i in backward_items if i.get('max_time') is None]
    if leaked:
        issues.append(f'DATA LEAK: {len(leaked)} backward items have max_time=None')
# 5. Per-task-type accuracy
task_acc = defaultdict(lambda: [0, 0])
for i in valid:
    t = i.get('task_type', 'unknown')
    task_acc[t][1] += 1
    if i.get('correct'): task_acc[t][0] += 1
print(f'Items: {total} | Nulls: {len(nulls)} ({null_rate:.1f}%) [{reason_str}]')
print(f'Raw Acc: {raw_acc:.1f}% | Valid Acc: {valid_acc:.1f}%')
if task_acc:
    print('Per-task:')
    for t, (c, n) in sorted(task_acc.items()):
        print(f'  {t}: {c}/{n} ({c/max(1,n)*100:.1f}%)')
if issues:
    print('ISSUES:')
    for iss in issues:
        print(f'  - {iss}')
else:
    print('PASS: No data quality issues detected')
" RESULT_FILE
Use this mode when the user is running a phase-based ablation (e.g. d-mode refactor E0→E5, any series inheriting from a baseline config). Serial, cross-phase memory required.
- plan_doc: path to the phase plan (e.g. docs/plans/d-mode-refactor/phase-L1a.md)
- baseline_config: YAML of the E0/baseline run
- baseline_result: result JSON of the baseline (for Δ computation)
- criteria_file: path to a persistent verdict-criteria.md with the Δmacro threshold and noise floor

Before writing any code, verify the plan is still valid against current HEAD:
# 1. Commits since plan was written
git log --since="$(stat -c %y PLAN_DOC)" --oneline
# 2. Files referenced in plan — do they exist at assumed paths?
# The character class includes digits and hyphens so names like phase0c_regression.py are matched
grep -oE '(src|streaming-agent)/[a-zA-Z0-9_./-]+\.py' PLAN_DOC | sort -u | while read -r p; do
    [ -f "$p" ] && echo "OK: $p" || echo "MISSING: $p"
done
# 3. Assumed signatures: does the function the plan wants to modify still exist?
grep -n 'def observe\|def recall\|class Segment' REFERENCED_FILES
If anything is MISSING or changed, STOP, update the plan, re-confirm with user. Do not "adapt silently" — that's how drift starts.
Compare the new experiment's config against baseline, listing every field that affects cache key:
python3 -c "
import yaml, sys
cur = yaml.safe_load(open('CURRENT_CONFIG.yaml'))
base = yaml.safe_load(open('BASELINE_CONFIG.yaml'))
CACHE_KEY_FIELDS = ['model.backend','pipeline.fps','pipeline.resize_mode',
                    'pipeline.clip_delta_high','pipeline.clip_delta_low',
                    'mode.summary_prompt_variant']  # execution.cache_version is deprecated (2026-04-21)
def get(d,p):
    # Walk a dotted path; a missing level yields None instead of raising
    for k in p.split('.'): d = (d or {}).get(k)
    return d
diffs = [(f, get(base,f), get(cur,f)) for f in CACHE_KEY_FIELDS if get(base,f) != get(cur,f)]
if diffs:
    print('CACHE-KEY DRIFT (Phase A will re-run):')
    # repr() rather than the !r conversion: an interactive bash would apply history expansion to '!' inside double quotes
    for f,b,c in diffs: print(f'  {f}: {repr(b)} -> {repr(c)}')
    print('\\nIntentional? If yes, explicitly accept the cache miss. If no, revert.')
else:
    print('PASS: cache keys match baseline. Phase A should hit cache 100%.')
"
Any diff → user must confirm it's intentional. The cache is now split into per-benchmark directories (cache/{benchmark}/segments/) and is no longer isolated via a cache_version prefix.
Before running the 581-item experiment, verify the code actually realizes the plan:
# Pick 2-3 key claims from the plan, grep for them in the diff
git diff BASELINE_COMMIT..HEAD -- STREAMING_AGENT_SRC | grep -A5 -B2 'observe\|num_frames\|dedup'
# Run the plan's stated behavior on ONE item (smoke)
python3 scripts/run_experiment.py --config NEW_CONFIG.yaml --limit 1 --verbose 2>&1 \
| grep -iE 'observe|num_frames|tool_call' | head -20
If the diff doesn't show the plan's claimed changes, or the smoke log shows unchanged behavior → executor work is incomplete, send back. Metric pass alone is not proof.
Before trusting any Δ, run baseline twice with identical code/config. Record variance:
baseline_variance.json:
E0: macro=0.626
E0-rerun: macro=0.613
variance_pp: 1.3
noise_floor: 2.0 # conservative — any Δ < 2pp is noise
Save to experiments/metadata/baseline_variance_<study>.json. Subsequent verdicts MUST read this file; never re-derive threshold.
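The variance figure in the example is simply the absolute gap between the two baseline runs, expressed in percentage points. A sketch of producing that record, assuming a flat JSON schema with the same keys as the example above; `baseline_variance_record` is a hypothetical helper, and the check that the noise floor is not below the observed variance is an added assumption, not a stated rule:

```python
import json


def baseline_variance_record(macro_first: float, macro_rerun: float,
                             noise_floor_pp: float) -> dict:
    """Build the record for two identical-config baseline runs (macro scores in [0, 1])."""
    variance_pp = round(abs(macro_first - macro_rerun) * 100, 1)
    if noise_floor_pp < variance_pp:
        # Assumption: a floor below the observed run-to-run gap would be self-contradictory.
        raise ValueError("noise floor is below the observed run-to-run variance")
    return {
        "E0": macro_first,
        "E0-rerun": macro_rerun,
        "variance_pp": variance_pp,
        "noise_floor": noise_floor_pp,
    }


record = baseline_variance_record(0.626, 0.613, noise_floor_pp=2.0)
print(json.dumps(record, indent=2))
```

The noise floor stays a human-chosen input here; the function only records it next to the measured variance so later verdicts read one file.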
- [...] criteria_file for threshold. Decision table:

| Condition | Verdict |
|---|---|
| \|Δ\| < noise_floor | noise: not a signal, log and continue or halt by policy |
| \|Δ\| >= noise_floor, positive | signal gain: proceed to next phase |
| \|Δ\| >= noise_floor, negative | regression: investigate before continuing |
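The decision table reduces to a three-way comparison against the stored noise floor. A minimal sketch, assuming Δ and the noise floor are both expressed in percentage points; `verdict` is a hypothetical function name:

```python
def verdict(delta_pp: float, noise_floor_pp: float) -> str:
    """Map a macro-score delta (in pp) to the table's three outcomes."""
    if abs(delta_pp) < noise_floor_pp:
        return "noise"
    return "signal gain" if delta_pp > 0 else "regression"


for delta in (1.3, 2.5, -3.0):
    print(f"delta={delta:+.1f}pp -> {verdict(delta, noise_floor_pp=2.0)}")
# delta=+1.3pp -> noise
# delta=+2.5pp -> signal gain
# delta=-3.0pp -> regression
```

The noise floor should come from the persisted criteria or variance file rather than being re-derived per call.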
- [...] /loop agents. Ablation is serial; loops don't share memory. See Core Principle P4.
- [...] verdict-criteria.md, committed to git.
- [...] cache_version. Silent cache miss will waste hours.

Working directory: ${PROJECT_ROOT}/streaming-agent
Virtual env: source ${PROJECT_ROOT}/ml_env/bin/activate
Python: python3
Checkpoint files: {output_stem}_checkpoint.json, auto-skip via completed_ids set.
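Resuming from a checkpoint means collecting the ids already present in the checkpoint file and skipping them. A sketch under assumptions: the checkpoint is taken to share the result file's `per_item` list with an `id` per entry (as the monitoring snippet reads it), and `checkpoint_path_for`, `load_completed_ids`, and `pending_items` are hypothetical names rather than the project's functions:

```python
import json
import os


def checkpoint_path_for(output_path: str) -> str:
    """Derive {output_stem}_checkpoint.json from the output path."""
    stem, _ext = os.path.splitext(output_path)
    return f"{stem}_checkpoint.json"


def load_completed_ids(checkpoint_path: str) -> set:
    """Return ids already processed; an absent checkpoint means a fresh run."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as fh:
        data = json.load(fh)
    return {item["id"] for item in data.get("per_item", []) if "id" in item}


def pending_items(all_items: list, completed_ids: set) -> list:
    """Keep only items whose id is not in the completed set."""
    return [item for item in all_items if item["id"] not in completed_ids]


print(checkpoint_path_for("experiments/phase2_rtv_D_segment_tools_gemini.json"))
```

This is why rerunning the same command after a crash only processes the remaining items.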
| Endpoint | Usage | Env var |
|---|---|---|
| ${ML_PROXY_BASE} | Default Azure proxy (gpt4o/o4mini/o3/gpt52) | None (default) |
| TRAPI (trapi.research.microsoft.com) | Fallback when proxy down | GPT4O_USE_TRAPI=1 |
| Dual (proxy + TRAPI parallel) | Phase A throughput boost (~1.5x) | GPT4O_DUAL_BACKEND=1 or YAML execution.dual_backend: true |
TRAPI notes: Uses AzureCliCredential (run az login first), model remapped to gpt-4o_2024-11-20. Lower burst limit -- run experiments serially.
Dual-backend notes (added Round 15):
- [...] max_concurrent_vlm: 30 with dual (vs default 20) to leverage both endpoints

execution:
  dual_backend: true
  max_concurrent_vlm: 30
The system has a 5-layer defense against API failures:
- [...] (src/common/rate_limiter.py): Semaphore + jittered backoff, max_retries=6, 429 gets 3x backoff
- [...] (src/vlm/base.py DualEndpointMixin): proxy+TRAPI parallel with least-loaded routing, circuit breaker (60s recovery), auto-fallback when one endpoint dies
- [...] _execute_with_retry(), ContentPolicyError -> immediate (None, "content_policy")
- [...] max_concurrent_vlm=20, vlm_max_retries=6, vlm_retry_base_delay=2.0, vlm_retry_max_delay=60, dual_backend=false

Qwen backend auto-clamps max_concurrent_vlm=1 (GPU-bound).
o4-mini/o3 dual mode: Phase A uses both endpoints, Phase B locks to TRAPI (preserves thinking_log via Responses API).
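To make the quoted defaults concrete (base delay 2.0 s, cap 60 s, 6 retries, 3x for 429), here is one way a jittered backoff delay could be computed. This is an illustration under assumptions, not the code in `src/common/rate_limiter.py`: the exponential growth and the half-to-full jitter range are guesses, and `retry_delay` is a hypothetical name:

```python
import random

MAX_RETRIES = 6  # vlm_max_retries default quoted above


def retry_delay(attempt, status_code=None, base_delay=2.0, max_delay=60.0, rng=None):
    """Delay in seconds before retry number `attempt` (0-based)."""
    rng = rng or random.Random()
    delay = base_delay * (2 ** attempt)  # assumed exponential growth
    if status_code == 429:
        delay *= 3                       # "429 gets 3x backoff"
    delay = min(delay, max_delay)        # cap at vlm_retry_max_delay
    return rng.uniform(delay / 2, delay) # assumed jitter: between half and full delay


rng = random.Random(0)
for attempt in range(MAX_RETRIES):
    print(attempt, round(retry_delay(attempt, status_code=429, rng=rng), 1))
```

With these assumptions a 429 reaches the 60 s cap by the fourth retry, which is one reason a still-rising 429 count calls for fewer workers rather than more retries.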
- scripts/run_experiment.py --benchmark {ovo,rtv,lv} --mode {A,B,C,D} --model BACKEND --n N
- scripts/run_experiment.py --config YAML_FILE (legacy YAML compatibility path; does not trigger forced injection of the benchmark prefix)
- scripts/phase0c_regression.py, scripts/rtv_bench_eval.py (accept legacy flags such as --vlm-backend and --ablation-mode)
- --limit N / --n N for dry-run, --verbose for detailed logging.

Defaults are merged in layers from streaming-agent/experiments/configs/_defaults/{global,ovo,rtv,lv,models/<model>}.yaml; hard-coding default fields in user YAML is forbidden.
When diagnosing issues, check /experiment-guide skill for the full 35+ error case library. Top patterns:
| Pattern | Symptom | Fix | Round |
|---|---|---|---|
| Backward data leak | backward acc too high; "v2 cache stale" warnings | Use code after commit 44d4924; backward uses realtime as max_time | R13-14 |
| Null pollution | C/D mode accuracy << A mode | Check null_rate; report Valid Acc; VLM rate limiter now handles most cases | R4-5 |
| Workers too high | Massive 429 errors in log | precompute_workers <= 20 for gpt4o (actual = workers x 8) | R12-13 |
| Cache key mismatch | Phase A success but Phase B re-processes | Check if pipeline params changed (resize_mode, fps, summary_prompt_variant) | R13 |
| as_completed trap | Phase A hangs indefinitely | Use wait() + cancel (already fixed R12) | R8-12 |
| Proxy down | TCP closed, massive 403 | Use dual_backend: true (auto circuit-breaks proxy); or GPT4O_USE_TRAPI=1 (serial) | R14-15 |
| Content policy abort | Single segment kills entire video | ContentPolicyError + placeholder summary (fixed R12) | R12-13 |
| Cross-backend cache miss | Qwen online C re-runs Phase A | Cross-backend exclusive key lookup (fixed R14) | R14 |
| Appearance serialization loss | Entity appearance always None | Fixed R11; must regenerate Phase A cache with new code | R4-11 |
| Phase A fixed timeout | Large video sets always timeout | Dynamic timeout: ceil(n/w) x per_video_timeout + 120s (fixed R12) | R12 |
| TRAPI burst limit | 12,000+ 403 errors from TRAPI | Use dual_backend: true (auto-limits TRAPI share to 35%); or run ONE experiment at a time | R14-15 |
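The dynamic timeout in the "Phase A fixed timeout" row is plain arithmetic: videos are processed in waves of `w` workers, each wave gets one per-video timeout, and 120 s of slack is added. A worked sketch; `phase_a_timeout` is a hypothetical name and the 600 s per-video timeout is an assumed value for illustration:

```python
import math


def phase_a_timeout(n_videos: int, workers: int,
                    per_video_timeout_s: float, slack_s: float = 120.0) -> float:
    """ceil(n/w) x per_video_timeout + 120s, as given in the table."""
    return math.ceil(n_videos / workers) * per_video_timeout_s + slack_s


print(phase_a_timeout(140, 20, 600))  # 7 waves  -> 4320.0
print(phase_a_timeout(141, 20, 600))  # 8 waves  -> 4920.0
```

One extra video past a multiple of the worker count adds a full wave, which is why a fixed timeout fails on large video sets.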
- [...] claude.
- streaming-agent/src/vlm/claude.py aligned with GPT-4o-style OpenAI chat-completions payloads against localhost:4142.