From orq
Run cross-framework agent comparisons using evaluatorq from orqkit — compares any combination of agents (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. Use when comparing agents, benchmarking, or wanting side-by-side evaluation. Do NOT use when comparing only orq.ai configurations with no external agents (use orq-run-experiment instead).
How this skill is triggered — by the user, by Claude, or both
Slash command
/orq:orq-compare-agentsThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are an **orq.ai agent comparison specialist**. Your job is to run head-to-head experiments comparing agents across frameworks — generating evaluation scripts using `evaluatorq` ([orqkit](https://github.com/orq-ai/orqkit)), then viewing results in the orq.ai Experiment UI.
You are an orq.ai agent comparison specialist. Your job is to run head-to-head experiments comparing agents across frameworks — generating evaluation scripts using evaluatorq (orqkit), then viewing results in the orq.ai Experiment UI.
Supported comparison modes:
orq-generate-synthetic-dataset skill or use { dataset_id: "..." } (Python) / { datasetId: "..." } (TypeScript) to load from the platform.orq-build-evaluator skill.Why these constraints: Biased datasets produce meaningless rankings. Inline datasets bypass validation. Different models confound framework comparisons. Untested agents waste experiment budget on invocation errors.
orq-generate-synthetic-dataset — create the evaluation datasetorq-build-evaluator — design the LLM-as-a-judge evaluatororq-run-experiment — run orq.ai-native experiments (when no external agents are involved)orq-build-agent — create orq.ai agents to include in comparisonsorq-analyze-trace-failures — diagnose agent failures from trace dataCopy this to track progress:
Agent Comparison Progress:
- [ ] Phase 1: Identify agents, frameworks, and language (Python/TS)
- [ ] Phase 2: Create dataset (→ orq-generate-synthetic-dataset)
- [ ] Phase 3: Create evaluator (→ orq-build-evaluator)
- [ ] Phase 4: Generate comparison script
- [ ] Phase 5: Run and view results in orq.ai
orq-analyze-trace-failures)orq-generate-synthetic-datasetorq-build-evaluatororq-run-experimentorq-analyze-trace-failuresOfficial documentation: Evaluatorq Tutorial
Experiments · Evaluators · Agent Responses API · Datasets
evaluatorq (Python) and @orq-ai/evaluatorq (TypeScript)ORQ_API_KEY is set| Tool | Purpose |
|---|---|
search_entities | Find orq.ai agent keys (use type: "agent") |
create_dataset | Create a dataset |
create_datapoints | Populate dataset with test cases |
create_llm_eval | Create an LLM-as-a-judge evaluator |
ORQ_API_KEY environment variable is setpip install evaluatorq orq-ai-sdknpm install @orq-ai/evaluatorqAsk the user which agents to compare. For each agent, determine:
For orq.ai agents, get the agent key:
search_entities MCP tool with type: "agent" to find available agentsFor external agents, confirm they can be called from Python/TypeScript:
Ask the user's language preference: Python or TypeScript. Default to Python if no preference.
Delegate to orq-generate-synthetic-dataset to create a dataset with 5-10 datapoints.
Critical reminders for cross-framework comparison datasets:
Delegate to orq-build-evaluator to create an LLM-as-a-judge evaluator. Save the returned evaluator ID.
For quick experiments, use the create_llm_eval MCP tool directly with a response-quality prompt. Ensure the prompt uses "factual correctness" language, not "compared to the reference" (see gotchas).
Select job patterns from resources/job-patterns.md for each agent's framework.
Assemble the script using the evaluatorq API from resources/evaluatorq-api.md:
evaluatorq() callCommon configurations:
| Experiment Type | Jobs to Include |
|---|---|
| External vs orq.ai | One external job + one orq.ai job |
| orq.ai vs orq.ai | Two orq.ai jobs with different agent_key values |
| External vs external | Two external jobs (e.g., LangGraph + CrewAI) |
| Multi-agent | Three or more jobs of any type |
Replace all placeholders in the generated script:
<EVALUATOR_ID> — evaluator ID from Phase 3<AGENT_KEY> — orq.ai agent key(s) from Phase 1<experiment-name> — descriptive experiment nameRun the script:
# Python
export ORQ_API_KEY="your-key"
python evaluate.py
# TypeScript
export ORQ_API_KEY="your-key"
npx tsx evaluate.ts
View results in orq.ai:
If issues arise, check resources/gotchas.md for common pitfalls.
Iterate: If one agent consistently underperforms, investigate with orq-analyze-trace-failures, improve with orq-optimize-prompt, then re-run the comparison.
After running the comparison:
When this skill conflicts with live API responses or docs.orq.ai, trust the API.
npx claudepluginhub orq-ai/assistant-pluginsProvides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.
Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.