From agent-benchmark
Use when the user wants to extract benchmark test cases from Claude Code or Codex JSONL conversation history files. Triggered by phrases like "extract test case", "generate benchmark", "parse conversation history", or providing JSONL file paths for test case extraction.
How this skill is triggered — by the user, by Claude, or both
Slash command
/agent-benchmark:extract-testcaseThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
从对话历史中提取理论物理 benchmark test case。
从对话历史中提取理论物理 benchmark test case。
用户提供一个或多个 JSONL 对话历史文件路径作为参数: $ARGUMENTS
使用 Python 脚本解析对话历史:
python3 -c "
import json, sys
def _read_jsonl(path):
lines = []
with open(path) as f:
for line in f:
line = line.strip()
if line:
lines.append(json.loads(line))
return lines
def detect_format(lines):
for entry in lines:
t = entry.get('type', '')
if t in ('user', 'assistant') and 'message' in entry:
return 'claude_code'
if t in ('response_item', 'session_meta'):
return 'codex'
raise ValueError('Cannot detect conversation format')
def _parse_claude_code(lines):
msgs = []
for entry in lines:
t = entry.get('type', '')
if t not in ('user', 'assistant'):
continue
content = entry.get('message', {}).get('content')
if content is None:
continue
if isinstance(content, str):
msgs.append({'role': t, 'text': content})
elif isinstance(content, list):
parts, has_only_tool = [], True
for block in content:
if isinstance(block, dict):
if block.get('type') == 'text':
parts.append(block['text']); has_only_tool = False
elif isinstance(block, str):
parts.append(block); has_only_tool = False
if parts and not has_only_tool:
msgs.append({'role': t, 'text': chr(10).join(parts)})
return msgs
def _parse_codex(lines):
msgs = []
for entry in lines:
if entry.get('type') != 'response_item':
continue
payload = entry.get('payload', {})
role = payload.get('role')
if role not in ('user', 'assistant'):
continue
parts = []
for block in payload.get('content', []):
if isinstance(block, dict) and block.get('type') in ('input_text', 'output_text'):
parts.append(block['text'])
if parts:
msgs.append({'role': role, 'text': chr(10).join(parts)})
return msgs
def parse_conversation(*paths):
all_msgs = []
for p in paths:
lines = _read_jsonl(p)
fmt = detect_format(lines)
all_msgs.extend(_parse_claude_code(lines) if fmt == 'claude_code' else _parse_codex(lines))
return all_msgs
def format_messages(msgs):
parts = []
for m in msgs:
label = 'User' if m['role'] == 'user' else 'Assistant'
parts.append(f\"{label}:\n{m['text']}\")
return (chr(10)*2 + '---' + chr(10)*2).join(parts)
paths = '''$ARGUMENTS'''.strip().split()
if not paths:
print('Error: 请提供至少一个 JSONL 文件路径', file=sys.stderr)
sys.exit(1)
messages = parse_conversation(*paths)
print(format_messages(messages))
"
将上述脚本的输出保存下来供后续分析使用。
阅读清洗后的对话文本,识别其中的物理问题和研究任务。判断:
concept — 概念理解(对称性、相变、重整化等)derivation — 理论推导(从假设到结果的解析推导)literature — 文献理解(论文中的物理论证)simulation — 模拟代码(蒙特卡洛、分子动力学等)transfer — 方法迁移(将方法适配到另一个模型)multi_step — 综合任务(涉及多种能力的复合任务)根据判断结果,为每个 test case 生成 YAML 文件。
单问题模板 (question):
id: "Q_XXX"
title: "简明标题"
category: "concept|derivation|literature|simulation|transfer|multi_step"
difficulty: "L1|L2|L3"
test_mode: "llm_single|agent_multi_step"
problem: |
从对话中提炼的核心问题描述。
包含足够背景信息,但不包含答案。
明确约束条件和期望输出格式。
input: |
对话中提供的公式、代码片段、参数等辅助材料。
如无则删除此字段。
context_files:
- file: "参考文件名.md"
instruction: "阅读指引"
source:
research_area: "研究领域"
when_encountered: "YYYY-MM"
real_scenario: "从对话中推断的实际场景"
tags: ["tag1", "tag2"]
复合任务模板 (composite):
id: "COMP_XXX"
title: "任务标题"
type: "composite"
category: "multi_step"
difficulty: "L3"
problem: |
完整的复合任务描述。明确约束条件和期望输出格式。
context_files:
- file: "参考文件名.md"
instruction: "阅读指引"
steps:
- id: "step1"
title: "子任务标题"
problem: |
子任务描述。
test_mode: "llm_single|agent_multi_step"
timeout: 600
- id: "step2"
title: "子任务标题"
problem: |
子任务描述。
handoff:
from: "step1"
inject: "{{step1.answer}}"
files: ["output_file.py"]
test_mode: "agent_multi_step"
timeout: 900
tags: ["tag1", "composite"]
expected_output 和 evaluation 字段(这些后续人工添加)problem 字段应是从对话中提炼的干净问题描述,不包含模型的回答input 字段放对话中提供给模型的辅助材料(公式、代码等)context_files 中标注difficulty 判断标准:
test_mode 判断: 如果问题需要写代码或多步操作,用 agent_multi_step;纯理论问答用 llm_singleexamples/ 目录examples/<id小写>.yaml 或 examples/<id小写>/task.yaml(如需参考材料)examples/<id小写>/references/ 并在其中放置参考文件向用户展示提取结果摘要:
expected_output 和 evaluation 字段npx claudepluginhub rainshed/agent-benchmark --plugin agent-benchmarkParses Claude Code JSONL conversation transcripts to extract signal from user/assistant messages, filter noise entries, link subagent files, detect session boundaries, and understand storage format.
Extracts and classifies dialog from Claude Code or Codex CLI history across 6 academic dimensions (Bloom's, Graesser, Paul & Elder, Walton, Long & Sato, Graesser generation). Useful for conversation pattern analysis.
Converts Claude Code JSONL conversation logs into Markdown with project-aware indexing and retrospective analysis guidance. Supports incremental or full re-conversion by project.