From autoforge
Run AutoForge harness datasets and export OpenAI-friendly eval bundles from datasets or completed harness runs.
Slash command
/autoforge:harness-evals-export
Use this skill when the user wants benchmarking, dataset runs, or eval handoff artifacts.
AutoForge supports:
Preferred commands:
autoforgeai harness run <dataset.jsonl>
autoforgeai harness prewarm <dataset.jsonl>
autoforgeai harness openai-export <dataset-or-run-path>
Expected exported artifacts:
items.jsonl
item_schema.json
bundle_manifest.json
When a harness run has already completed, expect AutoForge to emit:
<run_dir>/openai_eval_bundle/
Always surface:
npx claudepluginhub alyciabhz/autoforge --plugin autoforge
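The expected artifacts listed above can be checked with a short script once a run has finished. The following is a minimal Python sketch, not part of AutoForge: the helper names are hypothetical, and it assumes the three files sit directly inside <run_dir>/openai_eval_bundle/, which the listing implies but does not state.

```python
from __future__ import annotations

from pathlib import Path

# Artifact names as listed above. Assumption: they live directly inside
# the bundle directory rather than in a nested subfolder.
EXPECTED_ARTIFACTS = ("items.jsonl", "item_schema.json", "bundle_manifest.json")


def export_command(dataset_or_run_path: str) -> list[str]:
    """Build the argv for the export command listed above (hypothetical helper)."""
    return ["autoforgeai", "harness", "openai-export", dataset_or_run_path]


def missing_bundle_artifacts(run_dir: str) -> list[str]:
    """Return expected artifact names not found in <run_dir>/openai_eval_bundle/."""
    bundle_dir = Path(run_dir) / "openai_eval_bundle"
    return [name for name in EXPECTED_ARTIFACTS if not (bundle_dir / name).is_file()]
```

An empty list from missing_bundle_artifacts means all three files are present. The argv from export_command could be passed to subprocess.run if the autoforgeai CLI is installed.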