By dokimos-dev
Set up and run Dokimos evaluations for Koog AI agents in Kotlin projects. Wire agents as systems under test or as LLM judges using KoogSupport utilities, ExactMatchEvaluator, LLMJudgeEvaluator, or a custom Kotlin DSL for precise AI agent testing workflows.
Based on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
npx claudepluginhub dokimos-dev/dokimos --plugin evaluate-koog
Set up evaluation of Spring AI applications using Dokimos. Provides judge creation and type conversion via SpringAiSupport, with @SpringBootTest integration for evaluations in CI.
Create evaluation datasets for the Dokimos LLM evaluation framework in JSON, CSV, or JSONL format. Supports simple and structured example formats with inputs, expected outputs, and metadata.
Scaffold a new Evaluator implementation for the Dokimos LLM evaluation framework. Creates evaluator classes extending BaseEvaluator with the builder pattern, supporting both simple evaluators and LLM-judged evaluators using JudgeLM.
Set up evaluation of AI agents with tool call validation, correctness checks, task completion, and tool reliability using Dokimos. Framework-agnostic — works with any agent framework.
Set up evaluation of LangChain4j applications and RAG pipelines using Dokimos. Provides task and judge creation via LangChain4jSupport, with evaluators for faithfulness, contextual relevance, and hallucination.
Teaches AI coding agents to create promptfoo eval suites with deterministic assertions, provider configs, and best practices
Open-source testing and regression detection framework for AI agents. Golden baseline diffing, CI/CD integration, works with LangGraph, CrewAI, OpenAI, Anthropic Claude, HuggingFace, Ollama, and MCP.
SDK Usability Benchmark — generate, execute, judge, and analyze AI agent benchmark suites
Skills for adding DeepEval evaluations, tracing, datasets, Confident AI reports, and iterative improvement loops to AI applications.
A CLI tool for validating AI coding agents