Dokimos

LLM Evaluation Framework for Java

Dokimos is an evaluation framework for LLM applications in Java and Kotlin. It helps you evaluate responses, track quality over time, and catch regressions before they reach production.

It integrates with JUnit, LangChain4j, Spring AI, and Koog, so you can run evaluations as part of your existing test suite and CI/CD pipeline.

Why Dokimos?

JUnit integration: Run evaluations as parameterized tests in your existing test suite.
Framework agnostic: Works with LangChain4j, Spring AI, Koog, or any other LLM client, and can be powered by any LLM.
Built-in evaluators: Hallucination detection, faithfulness, contextual relevance, LLM as a judge, and more.
Agent evaluation: Evaluate AI agents with tool call validation, task completion, argument hallucination detection, and tool reliability checks.
Custom evaluators: Build your own metrics by extending BaseEvaluator or using LLMJudgeEvaluator.
Dataset support: Load test cases from JSON, CSV, or define them programmatically.
CI/CD ready: Runs locally or in any CI/CD environment. Fail builds when quality drops.
Kotlin as a first-class citizen: Compose all tests with a convenient Kotlin DSL.

Quick Start

Add the dependency to your pom.xml (check Maven Central for the latest version):

<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>

Run a standalone evaluator

Evaluate a single response directly:

Java

Evaluator evaluator = ExactMatchEvaluator.builder()
    .name("Exact Match")
    .threshold(1.0)
    .build();

EvalTestCase testCase = EvalTestCase.of("What is 2+2?", "4", "4");
EvalResult result = evaluator.evaluate(testCase);

System.out.println("Passed: " + result.success());  // true
System.out.println("Score: " + result.score());     // 1.0

Kotlin

val evaluator = exactMatch {
    name = "Exact Match"
    threshold = 1.0
}

val testCase = EvalTestCase.of("What is 2+2?", "4", "4")
val result = evaluator.evaluate(testCase)

println("Passed: ${result.success()}")  // true
println("Score: ${result.score()}")     // 1.0
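The examples above rely on two ideas: an evaluator produces a numeric score, and the threshold decides whether that score counts as a pass. The following self-contained Java sketch illustrates that relationship with hypothetical stand-in types (`ThresholdSketch`, `Result`, `exactMatch`). These are not Dokimos classes, and the `score >= threshold` rule is an assumption about how success is derived rather than something confirmed by the library. It needs Java 16 or later for records.

```java
import java.util.Objects;

// Hypothetical stand-ins for illustration only; these are NOT the Dokimos classes.
class ThresholdSketch {

    /** A scored outcome. Assumption: success means score >= threshold. */
    record Result(double score, double threshold) {
        boolean success() {
            return score >= threshold;
        }
    }

    /** Exact-match scoring: 1.0 when the strings are equal, otherwise 0.0. */
    static Result exactMatch(String expected, String actual, double threshold) {
        double score = Objects.equals(expected, actual) ? 1.0 : 0.0;
        return new Result(score, threshold);
    }

    public static void main(String[] args) {
        Result match = exactMatch("4", "4", 1.0);
        Result mismatch = exactMatch("4", "four", 1.0);

        System.out.println(match.score() + " " + match.success());       // 1.0 true
        System.out.println(mismatch.score() + " " + mismatch.success()); // 0.0 false
    }
}
```

With a threshold of 1.0, only a perfect score passes, which is the natural setting for an exact-match check. Evaluators that return graded scores would typically use a lower threshold.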

Write a JUnit test

Use @DatasetSource to run evaluations as parameterized tests:

Java

class QaTests {

    // openAiClient and assistant are assumed to be defined elsewhere in your test setup.
    JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

    // The judge scores each response against the criteria, using only the listed fields.
    Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
        .name("Correctness")
        .criteria("Is the answer correct and complete?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judgeLM)
        .build();

    // Runs once per example in the dataset.
    @ParameterizedTest
    @DatasetSource("classpath:datasets/qa.json")
    void testQAResponses(Example example) {
        String response = assistant.chat(example.input());
        EvalTestCase testCase = example.toTestCase(response);

        Assertions.assertEval(testCase, correctnessEvaluator);
    }
}

Kotlin

val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

val correctnessEvaluator = llmJudge(judgeLM) {
    name = "Correctness"
    criteria = "Is the answer correct and complete?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
}

class QaTests {
    @ParameterizedTest
    @DatasetSource("classpath:datasets/qa.json")
    fun testQAResponses(example: Example) {
        val response = assistant.chat(example.input())
        val testCase = example.toTestCase(response)

        Assertions.assertEval(testCase, correctnessEvaluator)
    }
}
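For intuition about what an LLM-as-a-judge evaluator does, here is a rough, self-contained sketch of the general pattern: build a prompt from the criteria and the selected test case fields, send it to a judge function, and parse a score from the reply. Everything in it (`JudgeSketch`, the prompt wording, the `SCORE:` reply format) is hypothetical and chosen for illustration; the prompt and parsing logic Dokimos actually uses may differ. The judge is a plain `Function<String, String>` stubbed with a canned reply, so the example runs without an LLM.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the LLM-as-a-judge pattern; not the Dokimos implementation.
class JudgeSketch {

    // Assumed reply format: a line such as "SCORE: 0.8".
    private static final Pattern SCORE = Pattern.compile("SCORE:\\s*(\\d+(?:\\.\\d+)?)");

    /** Builds the judge prompt from the criteria and the fields under evaluation. */
    static String buildPrompt(String criteria, String input, String actualOutput) {
        return "Criteria: " + criteria + "\n"
             + "Input: " + input + "\n"
             + "Actual output: " + actualOutput + "\n"
             + "Reply with a line of the form 'SCORE: <number between 0 and 1>'.";
    }

    /** Extracts the score from the judge's reply and clamps it to [0, 1]. */
    static double parseScore(String reply) {
        Matcher matcher = SCORE.matcher(reply);
        if (!matcher.find()) {
            throw new IllegalArgumentException("No score found in judge reply: " + reply);
        }
        double value = Double.parseDouble(matcher.group(1));
        return Math.min(1.0, Math.max(0.0, value));
    }

    /** Sends the prompt to the judge and returns the parsed score. */
    static double judge(Function<String, String> judgeLM,
                        String criteria, String input, String actualOutput) {
        return parseScore(judgeLM.apply(buildPrompt(criteria, input, actualOutput)));
    }

    public static void main(String[] args) {
        // Stub judge with a canned reply; a real judge would call an LLM client here.
        Function<String, String> stubJudge = prompt -> "The answer is correct.\nSCORE: 0.9";

        double score = judge(stubJudge, "Is the answer correct and complete?", "What is 2+2?", "4");
        System.out.println("Score: " + score); // Score: 0.9
    }
}
```

Keeping the judge behind a single-method function type is what makes the pattern client-agnostic: any code that maps a prompt string to a reply string can serve as the judge.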

Evaluate a dataset in bulk

Run experiments across entire datasets with aggregated metrics:

Java

JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);
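
As a rough illustration of what aggregated metrics can mean, the sketch below computes a pass rate and a mean score over a list of per-example outcomes, then compares the pass rate to a minimum, which is one way to fail a build when quality drops. It does not use the Dokimos experiment API; `AggregateSketch`, `Outcome`, `Summary`, and the 0.8 minimum are hypothetical names and values picked for the example.

```java
import java.util.List;

// Hypothetical sketch of aggregating evaluation outcomes; not the Dokimos experiment API.
class AggregateSketch {

    /** One evaluated example: its score and whether it passed its threshold. */
    record Outcome(double score, boolean success) {}

    /** Aggregated view over all outcomes. */
    record Summary(int total, double passRate, double meanScore) {}

    /** Computes the fraction of passing outcomes and the mean score. */
    static Summary summarize(List<Outcome> outcomes) {
        if (outcomes.isEmpty()) {
            return new Summary(0, 0.0, 0.0);
        }
        long passed = outcomes.stream().filter(Outcome::success).count();
        double mean = outcomes.stream().mapToDouble(Outcome::score).average().orElse(0.0);
        return new Summary(outcomes.size(), (double) passed / outcomes.size(), mean);
    }

    public static void main(String[] args) {
        List<Outcome> outcomes = List.of(
            new Outcome(1.0, true),
            new Outcome(1.0, true),
            new Outcome(1.0, true),
            new Outcome(0.0, false));

        Summary summary = summarize(outcomes);
        double minPassRate = 0.8; // assumed quality bar for this example

        System.out.println("Pass rate: " + summary.passRate());   // Pass rate: 0.75
        System.out.println("Mean score: " + summary.meanScore()); // Mean score: 0.75
        System.out.println("Gate passed: " + (summary.passRate() >= minPassRate)); // Gate passed: false
    }
}
```

In a CI job, a gate like this would usually be expressed as a test assertion or a non-zero exit code so that the build fails when the pass rate falls below the chosen minimum.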
