By dokimos-dev
Set up Dokimos evaluations for AI agents to validate tool calls, check correctness and task completion, detect argument hallucinations, and assess tool definition quality across any agent framework.
npx claudepluginhub dokimos-dev/dokimos --plugin evaluate-agent
Set up evaluation of Spring AI applications using Dokimos. Provides judge creation and type conversion via SpringAiSupport, with @SpringBootTest integration for evaluations in CI.
Create evaluation datasets for the Dokimos LLM evaluation framework in JSON, CSV, or JSONL format. Supports simple and structured example formats with inputs, expected outputs, and metadata.
Scaffold a new Evaluator implementation for the Dokimos LLM evaluation framework. Creates evaluator classes extending BaseEvaluator with the builder pattern, supporting both simple evaluators and LLM-judged evaluators using JudgeLM.
Set up evaluation of LangChain4j applications and RAG pipelines using Dokimos. Provides task and judge creation via LangChain4jSupport, with evaluators for faithfulness, contextual relevance, and hallucination.
Scaffold eval-driven tests using dokimos-junit. Creates JUnit parameterized tests with @DatasetSource and Assertions.assertEval() for running Dokimos evaluations as unit tests in CI.
Set up evaluation of Koog AI agents using Dokimos. Wires Koog agents as the system under test or as LLM judges via KoogSupport utilities, with Kotlin DSL support.
Open-source testing and regression detection framework for AI agents. Golden baseline diffing, CI/CD integration, works with LangGraph, CrewAI, OpenAI, Anthropic Claude, HuggingFace, Ollama, and MCP.
SDK Usability Benchmark — generate, execute, judge, and analyze AI agent benchmark suites
Teaches AI coding agents to create promptfoo eval suites with deterministic assertions, provider configs, and best practices
Skills for adding DeepEval evaluations, tracing, datasets, Confident AI reports, and iterative improvement loops to AI applications.
A CLI tool for validating AI coding agents