Scaffolds Dokimos Experiments wiring datasets, tasks, evaluators, and reporters for LLM evaluation pipelines, model testing, and end-to-end eval workflows.
Slash command: `/create-experiment:create-experiment`
Scaffold a Dokimos Experiment. The user will describe the evaluation goal via `$ARGUMENTS`.
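Before writing code, read these files to understand the API:

- `dokimos-core/src/main/java/dev/dokimos/core/Experiment.java`: the orchestrator
- `dokimos-core/src/main/java/dev/dokimos/core/Task.java`
- `dokimos-examples/src/main/java/dev/dokimos/examples/`
- `dokimos-examples/src/main/java/dev/dokimos/examples/basic/BasicEvaluationExample.java`: simplest example

An experiment consists of four parts:

1. Dataset: the test dataset (see the `create-dataset` skill)
2. Task: takes an `Example` and returns a `Map<String, Object>` of actual outputs
3. Evaluators: `Evaluator` implementations that score the outputs
4. Reporter: the result reporter (the builder defaults to `NoOpReporter`)

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
// Dokimos imports omitted; check the files listed above for the exact packages.

// Load the test dataset from a JSON file.
Dataset dataset = Dataset.fromJson(Path.of("src/test/resources/datasets/my-dataset.json"));

// The task is the system under test: it maps one example to its actual outputs.
Task task = example -> {
    String input = example.input();
    String output = callYourLLM(input); // placeholder: replace with your own model call
    return Map.of("output", output);
};

// Evaluators score the task outputs.
List<Evaluator> evaluators = List.of(
    ExactMatchEvaluator.builder()
        .name("Exact Match")
        .threshold(1.0)
        .build()
);

// Wire the parts together and run the experiment.
ExperimentResult result = Experiment.builder()
    .name("My Evaluation")
    .description("Evaluating my LLM on QA tasks")
    .dataset(dataset)
    .task(task)
    .evaluators(evaluators)
    .build()
    .run();

System.out.println("Pass rate: " + result.passRate());
```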
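To make the data flow concrete, here is a minimal plain-Java sketch of the same loop with no Dokimos dependency: each input goes through a task, an exact-match check scores the output, and the pass rate is the fraction of rows that pass. The `Row` record, `stubLlm`, and the pass-rate formula are assumptions made for illustration. They are not Dokimos types, and the library's own `passRate()` may be computed differently.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only; none of these names are Dokimos APIs.
class ExperimentFlowSketch {

    // Hypothetical stand-in for a dataset row: an input and its expected output.
    record Row(String input, String expected) {}

    // Hypothetical stand-in for callYourLLM: a canned lookup so the sketch runs offline.
    static String stubLlm(String input) {
        Map<String, String> canned = Map.of(
                "2+2", "4",
                "capital of France", "Paris");
        return canned.getOrDefault(input, "unknown");
    }

    // Runs the task on every row and returns the fraction whose output
    // exactly equals the expected value (0.0 for an empty list).
    static double passRate(List<Row> rows, Function<String, String> task) {
        if (rows.isEmpty()) {
            return 0.0;
        }
        long passed = rows.stream()
                .filter(row -> task.apply(row.input()).equals(row.expected()))
                .count();
        return (double) passed / rows.size();
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("2+2", "4"),
                new Row("capital of France", "Paris"),
                new Row("capital of Peru", "Lima")); // not in the canned answers, so it fails
        System.out.println("Pass rate: " + passRate(rows, ExperimentFlowSketch::stubLlm));
    }
}
```

A stub like this can also stand in for `callYourLLM` while the scaffold is being wired up, so the experiment can run before a real model client exists.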
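The same evaluation can be written as a JUnit parameterized test:

```java
import java.util.List;
import org.junit.jupiter.params.ParameterizedTest;
// Dokimos imports (including the static import for assertEval) omitted.

public class MyEvaluationTest {

    // Each example in the dataset becomes one test invocation.
    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/my-dataset.json")
    void testMyLLM(Example example) {
        String actualOutput = callYourLLM(example.input()); // placeholder: your own model call
        EvalTestCase testCase = example.toTestCase(actualOutput);

        List<Evaluator> evaluators = List.of(
            ExactMatchEvaluator.builder().build()
        );

        assertEval(testCase, evaluators);
    }
}
```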
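To report results to a Dokimos server, pass a reporter to the builder:

```java
// Point the reporter at the server's base URL.
Reporter reporter = DokimosServerReporter.builder()
    .baseUrl("http://localhost:8080")
    .build();

ExperimentResult result = Experiment.builder()
    .name("My Evaluation")
    .dataset(dataset)
    .task(task)
    .evaluators(evaluators)
    .reporter(reporter)
    .build()
    .run();
```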
`Experiment.builder()` options:

| Method | Description | Default |
|---|---|---|
| `.name(String)` | Experiment name | `"unnamed"` |
| `.description(String)` | Description | `""` |
| `.dataset(Dataset)` | Test dataset | required |
| `.task(Task)` | System under test | required |
| `.evaluator(Evaluator)` | Add a single evaluator | none |
| `.evaluators(List)` | Add multiple evaluators | none |
| `.reporter(Reporter)` | Result reporter | `NoOpReporter` |
| `.parallelism(int)` | Concurrent examples | 1 |
| `.runs(int)` | Repeat count for variance reduction | 1 |
| `.metadata(String, Object)` | Add experiment metadata | none |
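The `.runs(int)` option is described as a repeat count for variance reduction. As a rough illustration of that idea in plain Java (not Dokimos code), the sketch below repeats a noisy evaluation and averages the per-run pass rates. `RepeatedRunsSketch`, `meanPassRate`, and the alternating pass rates in `main` are hypothetical, and how Dokimos itself aggregates repeated runs is not specified here.

```java
import java.util.function.IntToDoubleFunction;
import java.util.stream.IntStream;

// Illustration only: averaging repeated runs of a nondeterministic evaluation.
class RepeatedRunsSketch {

    // Calls singleRun once per run index (0 .. runs-1) and returns the mean pass rate.
    static double meanPassRate(int runs, IntToDoubleFunction singleRun) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        return IntStream.range(0, runs)
                .mapToDouble(singleRun)
                .average()
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Made-up pass rates that alternate between runs to mimic a nondeterministic model.
        IntToDoubleFunction noisyRun = run -> run % 2 == 0 ? 1.0 : 0.5;
        System.out.println("Single run:         " + meanPassRate(1, noisyRun));
        System.out.println("Mean over 4 runs:   " + meanPassRate(4, noisyRun));
    }
}
```

A single run reports whichever value the model happened to produce that time, while the mean over several runs sits between the extremes. That is the motivation for raising `.runs(int)` above its default of 1 when the system under test is nondeterministic.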
Take what the user wants to evaluate from `$ARGUMENTS` and build the experiment with `Experiment.builder()`.

Install: `npx claudepluginhub dokimos-dev/dokimos --plugin create-experiment`