From kibana-testing-tools
Review @kbn/evals code changes for alignment with the "Future of @kbn/evals" vision document. Checks that changes follow the trace-first, Elastic-native direction, use correct ownership boundaries, respect the data model and evaluation entry points, and avoid deepening Phoenix coupling. Use when reviewing PRs, commits, or planned work touching @kbn/evals, evaluation suites, evaluators, the evaluation data model, CI eval pipelines, or the golden cluster. Also triggers on "review evals", "check evals alignment", "evals vision review".
Slash command
/kibana-testing-tools:kbn-evals-vision-reviewer
Review any `@kbn/evals` work against the strategic vision to ensure changes move the
framework toward its intended future rather than away from it.
[...] `@kbn/evals` (grep for `kbn-evals`, `kbn/evals`, evaluation suite paths, evaluator files, executor clients, dataset definitions)
[...] `KBN_EVALS_EXECUTOR=phoenix` toggle, but new features should target the in-Kibana executor path.
[...] `@kbn/evals`) and solution-owned suites? The framework should provide primitives; solutions should own their evaluators, datasets, and reporting.
[...] `trace_id` fields in task and evaluator output.
[...] `kibana-evaluations` datastream with the documented schema? New fields should follow the established naming conventions.
[...] `@kbn/evals` package) should provide orchestration/runtime, data model, trace-first evaluator primitives. Solution-specific logic belongs in solution evaluation suites.
Produce a structured review with:
## @kbn/evals Vision Alignment Review
### Summary
[1-2 sentence overall assessment: aligned / partially aligned / misaligned]
### Aligned With Vision
- [List specific aspects that correctly follow the strategic direction]
### Concerns
- **[Category from checklist]**: [Description of the concern and which vision principle it conflicts with]
  - **Recommendation**: [Specific suggestion to realign]
### Opportunities
- [Optional: ways the change could go further in supporting the vision]
"The primary objective is to elevate @kbn/evals from an offline evaluation runner into the foundational layer for all LLM quality assurance in Kibana."
"Strategic objective: Treat OpenTelemetry traces in Elasticsearch as the primary evidence of agent/LLM behavior, and build evaluation orchestration, dataset management and reporting as part of workflows within the Elastic Stack."
"The evaluator contract is centered around OpenTelemetry traces stored in Elasticsearch. This aligns evaluation with how we already observe production behavior 'online'."
"Evaluation datasets define what we are explicitly measuring. For transparency and repeatability, the default should be that datasets are defined in code, versioned and reviewed in the repository alongside the suite."
"This layer should be independent of how an evaluation is triggered (CI/offline vs in-tool), so that evaluator behavior and stored results remain consistent across all use cases."
"We're proposing an Elastic-native evaluation solution that builds on top of our Observability product."
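To make the quoted principles concrete for a reviewer, here is a minimal sketch of what a code-defined dataset and a trace-first evaluator could look like. Every type and function name in it (`TraceSpan`, `DatasetExample`, `TraceEvaluator`, `expectedToolCalled`, the `tool.name` attribute) is hypothetical and chosen for illustration; none of it is the actual `@kbn/evals` API or the documented schema. Only the `trace_id` field name comes from the checklist above. An in-memory array stands in for traces stored in Elasticsearch.

```typescript
// Hypothetical shapes for illustration only; not the real @kbn/evals API.

/** Minimal span record, loosely modelled on an OpenTelemetry span. */
interface TraceSpan {
  traceId: string;
  name: string;
  attributes: Record<string, string | number | boolean>;
}

/** Dataset example defined in code, so it is versioned and reviewed with the suite. */
interface DatasetExample {
  id: string;
  input: string;
  expectedTool: string;
}

/** Task output carries a trace reference; the evaluator reads behaviour from the trace. */
interface TaskOutput {
  exampleId: string;
  trace_id: string;
}

interface EvaluationResult {
  evaluator: string;
  exampleId: string;
  trace_id: string;
  score: number;
}

type TraceLookup = (traceId: string) => TraceSpan[];

type TraceEvaluator = (
  example: DatasetExample,
  output: TaskOutput,
  lookup: TraceLookup
) => EvaluationResult;

// Dataset lives in the repository next to the suite (assumed layout).
const dataset: DatasetExample[] = [
  { id: 'ex-1', input: 'List my alerts', expectedTool: 'search_alerts' },
  { id: 'ex-2', input: 'Summarize this case', expectedTool: 'summarize_case' },
];

// In-memory stand-in for spans that would normally be queried from Elasticsearch.
const spans: TraceSpan[] = [
  { traceId: 't-1', name: 'tool_call', attributes: { 'tool.name': 'search_alerts' } },
  { traceId: 't-2', name: 'tool_call', attributes: { 'tool.name': 'search_alerts' } },
];

const lookupTrace: TraceLookup = (traceId) =>
  spans.filter((span) => span.traceId === traceId);

// Scores 1 when the trace contains a call to the tool the example expects, else 0.
const expectedToolCalled: TraceEvaluator = (example, output, lookup) => {
  const called = lookup(output.trace_id).some(
    (span) =>
      span.name === 'tool_call' && span.attributes['tool.name'] === example.expectedTool
  );
  return {
    evaluator: 'expected_tool_called',
    exampleId: example.id,
    trace_id: output.trace_id,
    score: called ? 1 : 0,
  };
};

// Task outputs as they might arrive from any trigger (CI/offline or in-tool).
const outputs: TaskOutput[] = [
  { exampleId: 'ex-1', trace_id: 't-1' },
  { exampleId: 'ex-2', trace_id: 't-2' },
];

const results: EvaluationResult[] = dataset.map((example) => {
  const output = outputs.find((candidate) => candidate.exampleId === example.id);
  if (!output) {
    throw new Error(`No task output for example ${example.id}`);
  }
  return expectedToolCalled(example, output, lookupTrace);
});
```

In this sketch the evaluator never inspects the task's return value directly. It receives only a `trace_id` and a lookup function, so the same evaluator could in principle run regardless of how the evaluation was triggered, which is the property the quotes above ask reviewers to protect.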
npx claudepluginhub patrykkopycinski/patryks-treadmill-claude-plugins --plugin kibana-testing-tools