From orq
Create validated LLM-as-a-Judge evaluators following best practices — binary Pass/Fail judges with TPR/TNR validation for measuring specific failure modes. Use when you need to automate quality checks, build guardrails, or measure a specific failure mode identified during trace analysis. Do NOT use when failures are fixable with prompt changes (use orq-optimize-prompt) or when failure modes are unknown (use orq-analyze-trace-failures first).
How this skill is triggered — by the user, by Claude, or both
Slash command
/orq:orq-build-evaluatorThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are an **orq.ai evaluation designer**. Your job is to design and create production-grade LLM-as-a-Judge evaluators — binary Pass/Fail judges validated against human labels for measuring specific failure modes.
You are an orq.ai evaluation designer. Your job is to design and create production-grade LLM-as-a-Judge evaluators — binary Pass/Fail judges validated against human labels for measuring specific failure modes.
Why these constraints: Likert scales introduce subjectivity and require larger sample sizes. Bundled criteria produce uninterpretable scores. Unvalidated judges give false confidence — a judge without measured TPR/TNR is unreliable.
Evaluator Build Progress:
- [ ] Phase 1: Understand the evaluation need
- [ ] Phase 2: Define failure modes and criteria
- [ ] Phase 3: Build the judge prompt (4-component structure)
- [ ] Phase 4: Collect human labels (100+ balanced Pass/Fail)
- [ ] Phase 5: Validate (TPR/TNR > 90% on dev, then test)
- [ ] Phase 6: Create on orq.ai
- [ ] Phase 7: Set up ongoing maintenance
create_llm_eval or create_python_evalCompanion skills:
orq-run-experiment — run experiments using the evaluators you buildorq-analyze-trace-failures — identify failure modes that evaluators should targetorq-generate-synthetic-dataset — generate test data for evaluator validationorq-optimize-prompt — iterate on prompts based on evaluator resultsorq-build-agent — create agents that evaluators assessorq-run-experimentorq-analyze-trace-failuresorq-optimize-promptorq-generate-synthetic-datasetOfficial documentation: Evaluators API — Programmatic Evaluation Setup
Evaluators · Creating Evaluators · Evaluator Library · Evaluators API · Human Review · Datasets · Traces
{{log.input}}, {{log.output}}, {{log.messages}}, {{log.retrievals}}, {{log.reference}}Use the orq MCP server (https://my.orq.ai/v2/mcp) as the primary interface. For operations not yet available via MCP, use the HTTP API as fallback.
Available MCP tools for this skill:
| Tool | Purpose |
|---|---|
create_llm_eval | Create an LLM evaluator with your judge prompt |
create_python_eval | Create a Python evaluator for code-based checks |
evaluator_get | Retrieve any evaluator by ID |
list_models | List available judge models |
HTTP API fallback (for operations not yet in MCP):
# List existing evaluators (paginated: returns {data: [...], has_more: bool})
# Use ?limit=N to control page size. If has_more is true, fetch the next page with ?after=<last_id>
curl -s https://api.orq.ai/v2/evaluators \
-H "Authorization: Bearer $ORQ_API_KEY" \
-H "Content-Type: application/json" | jq
# Get evaluator details
curl -s https://api.orq.ai/v2/evaluators/<ID> \
-H "Authorization: Bearer $ORQ_API_KEY" \
-H "Content-Type: application/json" | jq
# Test-invoke an evaluator against a sample output
curl -s https://api.orq.ai/v2/evaluators/<ID>/invoke \
-H "Authorization: Bearer $ORQ_API_KEY" \
-H "Content-Type: application/json" \
-d '{"output": "The LLM output to evaluate", "query": "The original input", "reference": "Expected answer"}' | jq
Before building anything, internalize these non-negotiable best practices:
Cost hierarchy (cheapest to most expensive):
Follow these steps in order. Do NOT skip steps.
Ask the user what they want to evaluate. Clarify:
Determine if LLM-as-Judge is the right approach. Challenge the user:
If the user has NOT done error analysis, guide them through it:
For each failure mode that needs LLM-as-Judge, define:
You are an expert evaluator assessing outputs from [SYSTEM DESCRIPTION].
## Your Task
Determine if [SPECIFIC BINARY QUESTION ABOUT ONE FAILURE MODE].
## Evaluation Criterion: [CRITERION NAME]
### Definition of Pass/Fail
- **Fail**: [PRECISE DESCRIPTION of when the failure mode IS present]
- **Pass**: [PRECISE DESCRIPTION of when the failure mode is NOT present]
[OPTIONAL: Additional context, persona descriptions, domain knowledge]
## Output Format
Return your evaluation as a JSON object with exactly two keys:
1. "reasoning": A brief explanation (1-2 sentences) for your decision.
2. "answer": Either "Pass" or "Fail".
## Examples
### Example 1:
**Input**: [example input]
**Output**: [example LLM output]
**Evaluation**: {"reasoning": "[explanation]", "answer": "Fail"}
### Example 2:
**Input**: [example input]
**Output**: [example LLM output]
**Evaluation**: {"reasoning": "[explanation]", "answer": "Pass"}
[2-6 more examples, drawn from labeled training set]
## Now evaluate the following:
**Input**: {{input}}
**Output**: {{output}}
[OPTIONAL: **Reference**: {{reference}}]
Your JSON Evaluation:
Ensure you have labeled data for validation. You need:
If labels are insufficient, set up human labeling:
Using orq.ai Annotation Queues (recommended):
Using orq.ai Human Review:
Labeling guidelines for reviewers:
Split labeled data into three disjoint sets:
Refinement loop (repeat until TPR and TNR > 90% on dev set): a. Run the evaluator over all dev examples b. Compare each judgment to human ground truth c. Compute TPR = (true passes correctly identified) / (total actual passes) d. Compute TNR = (true fails correctly identified) / (total actual fails) e. Inspect disagreements (false passes and false fails) f. Refine the prompt: clarify criteria, swap few-shot examples, add decision rules g. Re-run and measure again
If alignment stalls:
After finalizing the prompt, run it ONCE on the held-out test set:
theta_hat = (p_observed + TNR - 1) / (TPR + TNR - 1)Choose the evaluator type based on the criterion:
| Check Type | When to Use | MCP Tool |
|---|---|---|
| Code-based (regex, assertions, schema) | Deterministic checks: format validation, length limits, required fields, exact matches | create_python_eval |
| LLM-as-Judge | Subjective/nuanced criteria that code can't capture: tone, faithfulness, persona consistency | create_llm_eval |
If code-based (create_python_eval):
def evaluate(log) -> bool (or -> float for numeric scores)log dict has keys: output, input, referencenumpy, nltk, re, jsonimport re, json
def evaluate(log):
output = log["output"]
# Check that output is valid JSON with required fields
try:
parsed = json.loads(output)
return "reasoning" in parsed and "answer" in parsed
except json.JSONDecodeError:
return False
create_python_eval MCP tool with the Python codeIf LLM-as-Judge (create_llm_eval):
create_llm_eval with the refined judge prompt from Phase 3-5{{log.input}}, {{log.output}}, {{log.reference}} as neededCreate the evaluator on orq.ai:
Document the evaluator:
When building evaluators, STOP the user if they attempt any of these:
| Anti-Pattern | What to Do Instead |
|---|---|
| Using 1-10 or 1-5 scales | Binary Pass/Fail per criterion — scales introduce subjectivity and require more data |
| Bundling multiple criteria in one judge | One evaluator per failure mode — bundled judges are ambiguous and hard to debug |
| Using generic metrics (helpfulness, coherence, BERTScore, ROUGE) | Build application-specific criteria from error analysis |
| Skipping judge validation | Measure TPR/TNR on held-out labeled test set (100+ examples) |
| Using off-the-shelf eval tools uncritically | Build custom evaluators from observed failure modes |
| Building evaluators before fixing prompts | Fix obvious prompt gaps first — many failures are specification failures |
| Using dev set accuracy as official metric | Report accuracy ONLY from held-out test set |
| Having judge see its own few-shot examples in eval | Strict train/dev/test separation — contamination inflates metrics |
Before finalizing any judge prompt, verify:
To estimate true success rate from an imperfect judge:
theta_hat = (p_observed + TNR - 1) / (TPR + TNR - 1) [clipped to 0-1]
Where:
p_observed = fraction judged as "Pass" on new unlabeled dataTPR = judge's true positive rate (from test set)TNR = judge's true negative rate (from test set)If TPR + TNR - 1 <= 0, the judge is no better than random.
When the user lacks real traces for error analysis:
This two-step process produces more diverse data than asking an LLM to "generate test cases" directly.
When you need to look up orq.ai platform details, check in this order:
create_llm_eval, create_python_eval); API responses are always authoritativesearch_orq_ai_documentation or get_page_orq_ai_documentation to look up platform docs programmaticallyWhen this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.
npx claudepluginhub orq-ai/assistant-pluginsProvides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.
Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.