Runs judged eval runs against AI models via the Omni CLI. Define prompt sets, execute against branches, and read semantic verdicts from the accuracy judge.
Slash command: `/omni-analytics:omni-ai-eval`
Omni ships a first-class eval system (the **AI Hub** → **Prompt sets** and **Eval runs**). This skill drives it through the Omni CLI: define a reusable **prompt set**, start a judged **eval run** against a model or branch, and read per-prompt verdicts from Omni's built-in **accuracy judge**.
Prefer this native system over building your own harness. The judge scores each answer semantically against the full agent conversation — it does not require golden query JSON, and it evaluates the whole agentic workflow (topic selection, queries, results, and the final written answer), not just generated query structure.
Tip: Use `omni-ai-optimizer` to improve scores after finding failures, `omni-model-builder` to apply context changes on a branch before A/B testing, and `omni-model-explorer` to discover topics and fields when writing prompts.
# Verify the Omni CLI is installed — if not, ask the user to install it.
# See: https://github.com/exploreomni/cli#readme
command -v omni >/dev/null || echo "ERROR: Omni CLI is not installed."
# Verify the CLI has the eval commands. If this errors with "unknown command",
# the binary is stale — ask the user to update it (the ai-eval group is generated
# from the bundled API spec).
omni ai-eval --help >/dev/null 2>&1 || echo "ERROR: 'omni ai-eval' missing — update the Omni CLI."
# Show available profiles and select the right one — running against the wrong
# instance silently evaluates the wrong model.
omni config show
# If multiple profiles exist, ask the user which to use:
omni config use <profile-name>
You also need the model ID of a shared model to evaluate. Evals require at least Querier access on that model, and at least one topic optimized for AI. See the Evals guide for concepts and prompt-set best practices.
| Concept | What it is |
|---|---|
| Prompt set | A reusable, named list of up to 25 natural-language prompts, scoped to one model. Each prompt may carry an optional expectation — a reference answer the judge scores against. Lives server-side; create one per topic, per release, or for regression coverage. |
| Eval run | Executes a prompt set against a model branch (or main). Each prompt runs as a full async agentic AI job (the same engine as production Blobby), then the accuracy judge scores the result. |
| Accuracy judge | A fixed judge model that reads the evaluated AI's full conversation and returns a pass/fail verdict per prompt, plus confidence and a rationale. It targets high-impact analysis errors (hallucinations, date/time filtering, row-limit handling, mental math, period-over-period mistakes, wrong topic). It does not grade wording or formatting. |
All commands accept -o json (or --compact) to force structured output for parsing, and --profile <name> / --branch-id style global flags. Run omni ai-eval <command> --help for the full flag list.
omni ai-eval prompt-sets-create --compact --body '{
"model_id": "your-model-id",
"name": "Orders regression",
"slug": "orders-regression",
"description": "Core revenue + orders coverage",
"prompts": [
{ "prompt_text": "Show me revenue by month" },
{ "prompt_text": "What are the top 5 products by revenue?",
"expectation": "The top product by revenue should be Aniseed Syrup." },
{ "prompt_text": "How many orders were placed last week?" }
]
}'
The response includes the created prompt_set.id (a UUID) — capture it for the run.
| Field | Required | Notes |
|---|---|---|
| `model_id` | Yes | Shared model UUID the set is bound to |
| `name` | Yes | ≤ 255 chars |
| `slug` | Yes | Unique per `model_id`; must match `^[a-z][a-z0-9-]*$` |
| `description` | No | ≤ 1024 chars |
| `prompts[]` | No | ≤ 25; each needs `prompt_text` (≤ 8000 chars), optional `expectation` (≤ 16000 chars) |
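The limits in the table can be checked locally before calling `prompt-sets-create`. The following is a minimal sketch, not part of the Omni CLI: the function name `validate_prompt_set` and its error messages are hypothetical, and the server remains the authority on what it accepts.

```python
import re

# Slug pattern copied from the field table above.
SLUG_RE = re.compile(r"^[a-z][a-z0-9-]*$")


def validate_prompt_set(body: dict) -> list[str]:
    """Return the problems found in a prompt-sets-create body (empty if none)."""
    problems = []
    for field in ("model_id", "name", "slug"):
        if not body.get(field):
            problems.append(f"{field} is required")
    if len(body.get("name") or "") > 255:
        problems.append("name exceeds 255 chars")
    slug = body.get("slug") or ""
    if slug and not SLUG_RE.fullmatch(slug):
        problems.append("slug must match ^[a-z][a-z0-9-]*$")
    if len(body.get("description") or "") > 1024:
        problems.append("description exceeds 1024 chars")
    prompts = body.get("prompts") or []
    if len(prompts) > 25:
        problems.append("more than 25 prompts")
    for i, prompt in enumerate(prompts):
        text = prompt.get("prompt_text") or ""
        if not text:
            problems.append(f"prompts[{i}] is missing prompt_text")
        elif len(text) > 8000:
            problems.append(f"prompts[{i}].prompt_text exceeds 8000 chars")
        if len(prompt.get("expectation") or "") > 16000:
            problems.append(f"prompts[{i}].expectation exceeds 16000 chars")
    return problems
```

Running it on the body before serializing it with `json.dumps` catches a bad slug or an oversized prompt list without spending an API call.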
Find or update an existing set instead of recreating:
omni ai-eval prompt-sets-list --model-ids your-model-id --compact # discover sets + ids
omni ai-eval prompt-sets-get <promptSetId> --compact # full set with prompts
prompt-sets-update replaces the entire prompts list: omitted prompts are deleted, entries with no id are created, entries with a matching id are updated in place. To add one prompt, send the full desired list (existing prompts carry their id).
omni ai-eval prompt-sets-update <promptSetId> --compact --body '{
"prompts": [
{ "id": "<existing-prompt-id>", "prompt_text": "Show me revenue by month" },
{ "prompt_text": "Revenue by month, last 12 months only" }
]
}'
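Because the update replaces the whole list, it is safer to build the body from the current set than to type it by hand. The sketch below assumes the prompts returned by `prompt-sets-get` each carry `id`, `prompt_text`, and optionally `expectation`; that response shape and the helper name `build_update_prompts` are assumptions, not documented API.

```python
import json


def build_update_prompts(existing: list[dict], new_prompt_texts: list[str]) -> list[dict]:
    """Build the full desired prompts list for prompt-sets-update.

    Existing prompts keep their id (so they are updated in place rather than
    deleted); new prompts are sent without an id (so they are created).
    """
    prompts = []
    for prompt in existing:
        entry = {"id": prompt["id"], "prompt_text": prompt["prompt_text"]}
        if prompt.get("expectation"):
            # Carry the expectation along so the update does not drop it.
            entry["expectation"] = prompt["expectation"]
        prompts.append(entry)
    prompts.extend({"prompt_text": text} for text in new_prompt_texts)
    if len(prompts) > 25:
        raise ValueError("a prompt set holds at most 25 prompts")
    return prompts


if __name__ == "__main__":
    current = [{"id": "<existing-prompt-id>", "prompt_text": "Show me revenue by month"}]
    body = {"prompts": build_update_prompts(current, ["Revenue by month, last 12 months only"])}
    print(json.dumps(body, indent=2))  # pass this string to --body
```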
Add an `expectation` only when a prompt has a known answer worth pinning: a value or ranking ("top product should be Aniseed Syrup"), a direction ("revenue should be up YoY"), or a required breakdown. The judge treats a material divergence (wrong numbers, wrong direction, missing required result) as a failure but ignores wording/formatting differences. With no expectation, the judge decides whether the answer is correct on its own terms. The expectation is shown only to the judge; the evaluated AI never sees it.

# Against main:
omni ai-eval runs-create --compact --body '{
"prompt_set_id": "<promptSetId>",
"description": "Baseline on main"
}'
# Against a branch (measures a model-context change before promotion):
omni ai-eval runs-create --compact --body '{
"prompt_set_id": "<promptSetId>",
"description": "After adding ai_context to order_items",
"run_config": { "branch_id": "<branchId>" }
}'
The response returns run.id and job_count (one agentic job per prompt). Omit run_config.branch_id to run against the live shared model.
Concurrency cap: at most 2 eval runs in flight at once. A `429` means the per-user active-run cap is reached; wait for an in-flight run to finish or cancel one. A `503` means eval is paused for the org. Check `runs-list` before launching.
omni ai-eval runs-get <runId> --compact
Poll with backoff (e.g. 5s, 10s, 20s) until the run's status is terminal — COMPLETE or CANCELLED. Track progress with each result's agentic_job.state (QUEUED → EXECUTING → COMPLETE/FAILED). Don't hammer the endpoint.
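One way to implement that polling loop is sketched below. It shells out to `omni ai-eval runs-get <runId> --compact` and assumes stdout is a single JSON document shaped like the `runs-get` response documented here (a top-level `run` object with a `status`). The 60-second delay cap and the poll limit are arbitrary choices for this sketch, not Omni defaults.

```python
import json
import subprocess
import time

TERMINAL_STATUSES = {"COMPLETE", "CANCELLED"}


def fetch_run(run_id: str) -> dict:
    """Call `omni ai-eval runs-get <runId> --compact` and parse its JSON output."""
    completed = subprocess.run(
        ["omni", "ai-eval", "runs-get", run_id, "--compact"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def wait_for_run(run_id, fetch=fetch_run, sleep=time.sleep,
                 first_delay=5.0, max_delay=60.0, max_polls=50):
    """Poll until the run is terminal, doubling the delay each time (5s, 10s, 20s, ...)."""
    delay = first_delay
    for _ in range(max_polls):
        run = fetch(run_id)["run"]
        if run["status"] in TERMINAL_STATUSES:
            return run
        sleep(delay)
        delay = min(delay * 2, max_delay)  # cap the backoff so long runs are still checked
    raise TimeoutError(f"run {run_id} not terminal after {max_polls} polls")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised with stubs, without the CLI installed or any real waiting.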
runs-get returns results[], one row per prompt:
{
"run": {
"status": "COMPLETE",
"branch_id": null,
"results": [
{
"prompt": "What are the top 5 products by revenue?",
"score": 1,
"error_reason": null,
"cost": 0.0021,
"scoring_cost": 0.0004,
"timing_ms": 4321,
"agentic_job": { "state": "COMPLETE", "conversation_id": "conv-uuid", "id": "job-uuid" }
}
]
}
}
| Field | Meaning |
|---|---|
| `score` | Judge verdict for the prompt: pass = 1, fail = 0 |
| `error_reason` | Set when the underlying agentic job failed |
| `cost` / `scoring_cost` | LLM cost (USD) for the answer vs. for judging it |
| `timing_ms` | Total AI time in ms, covering all LLM processing and tool calls (matches the "AI time" column in the UI), not wall-clock duration |
| `agentic_job.conversation_id` | Open this chat to read the judge's full verdict, confidence, and rationale |
Overall accuracy = the pass rate (mean of score across results). Report it with the per-prompt breakdown, and for any failure, point to the conversation_id so the user can read why the judge failed it — that rationale is where the actionable signal lives.
Eval run "Baseline on main" — 9/12 passed (75.0%)
✗ "Revenue by quarter" — judge: summed a row-limited result as a total
✗ "Top products this year" — judge: date filter used calendar instead of fiscal year
✗ "Churn rate by segment" — agentic job FAILED (error_reason)
(open each conversation_id for the full rationale)
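A report like the one above can be derived from the `run` object in the `runs-get` response. This sketch assumes the field names shown in the sample JSON; counting a job error as not passed (so it stays in the denominator) is this sketch's reading of the "9/12 passed" example, and the helper names are hypothetical.

```python
def summarize_run(run: dict) -> dict:
    """Compute the pass rate and split failures into judge failures and job errors."""
    passed, judge_failures, job_errors = 0, [], []
    results = run.get("results", [])
    for result in results:
        if result.get("error_reason"):
            # The agentic job failed: there was never an answer to score.
            job_errors.append((result["prompt"], result["error_reason"]))
        elif result.get("score") == 1:
            passed += 1
        else:
            conversation_id = (result.get("agentic_job") or {}).get("conversation_id")
            judge_failures.append((result["prompt"], conversation_id))
    total = len(results)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "judge_failures": judge_failures,
        "job_errors": job_errors,
    }


def format_report(description: str, summary: dict) -> str:
    """Render the summary as a short plain-text report."""
    lines = [
        f'Eval run "{description}": {summary["passed"]}/{summary["total"]} '
        f'passed ({summary["pass_rate"]:.1%})'
    ]
    for prompt, conversation_id in summary["judge_failures"]:
        lines.append(f'  ✗ "{prompt}": judge failed it, see conversation {conversation_id}')
    for prompt, reason in summary["job_errors"]:
        lines.append(f'  ✗ "{prompt}": agentic job failed ({reason})')
    return "\n".join(lines)
```

Keeping judge failures and job errors in separate lists makes it harder to misreport an infrastructure failure as a wrong answer.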
The core workflow for measuring whether a model change helps. Run the same prompt set twice — once on main, once on the branch — then compare.
1. Prepare the change (use `omni-model-builder` to apply `ai_context`, fields, joins, `ai_settings`, etc. on a branch).
2. `runs-create` with no `run_config` → baseline run on main.
3. `runs-create` with `run_config.branch_id` → branch run.
4. Wait for both runs to reach `COMPLETE`, then diff per-prompt verdicts.

A/B: main vs branch/new-context (prompt set: orders-regression)
main branch Δ
Accuracy: 75.0% 91.7% +16.7%
Prompt credits: 0.024 0.026 +0.002
Regressions (passed on main, failed on branch):
- rev-by-quarter
Improvements (failed on main, passed on branch):
- top-products-this-year
- churn-rate-by-segment
Always check for regressions, not just net improvement. A higher overall pass rate can still hide a prompt that newly broke: an `ai_context` change that helps most prompts may conflict with one. Call out any prompt that passed on main but fails on the branch.
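The per-prompt diff can be sketched as follows, taking the two `run` objects returned by `runs-get`. It matches results by prompt text, which assumes each prompt's text is unique within the set, and it treats an errored or unscored result as not passed. The function names are hypothetical.

```python
def verdicts(run: dict) -> dict:
    """Map prompt text to True (passed) or False (failed, errored, or unscored)."""
    return {
        r["prompt"]: r.get("score") == 1 and not r.get("error_reason")
        for r in run.get("results", [])
    }


def compare_runs(main_run: dict, branch_run: dict, expected_branch_id: str) -> dict:
    """Diff per-prompt verdicts between a main run and a branch run."""
    if branch_run.get("branch_id") != expected_branch_id:
        # Do not claim the branch was exercised unless the run says so.
        raise ValueError("branch run did not execute on the expected branch")
    main, branch = verdicts(main_run), verdicts(branch_run)
    shared = sorted(main.keys() & branch.keys())
    return {
        "main_accuracy": sum(main.values()) / len(main) if main else 0.0,
        "branch_accuracy": sum(branch.values()) / len(branch) if branch else 0.0,
        "regressions": [p for p in shared if main[p] and not branch[p]],
        "improvements": [p for p in shared if not main[p] and branch[p]],
    }
```

A non-empty `regressions` list deserves a mention in the report even when `branch_accuracy` is higher than `main_accuracy`.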
Notes for an auditable comparison:
- Record the `branch_id` used and confirm the branch run's `run.branch_id` matches it; don't claim the branch was exercised without that.
- Editing a prompt's `expectation` later won't change how earlier runs were scored.

omni ai-eval runs-list --prompt-set-id <promptSetId> --compact # runs for a set, newest first
omni ai-eval runs-cancel <runId> # cancel an in-flight run (also archives it)
omni ai-eval runs-archive <runId> # archive a finished run
omni ai-eval runs-unarchive <runId> # restore an archived run
omni ai-eval prompt-sets-archive <promptSetId> # archive a set (cancels its in-flight jobs)
omni ai-eval prompt-sets-unarchive <promptSetId> # restore an archived set
Archiving is a soft delete — sets and runs are preserved and can be restored. Cancelling a run marks it CANCELLED and archives it; use runs-unarchive to surface it in the default list again. Use --archived true on the *-list commands to see archived items.
- A result with `error_reason` set means the agentic job itself failed (it never produced an answer to score); treat that separately from a judge score of 0.
- Inspect and validate the model with `omni models yaml-get <modelId> --compact` and `omni models validate <modelId>`. Branch runs already pin the change to a branch.
- After finding failures, use `omni-ai-optimizer` to improve the model's context (`ai_context`, `sample_queries`, field metadata).