From oxy-skills
Use when an Oxy agent is giving wrong, incomplete, or inconsistent answers — whether the user reports failing/flaky tests, shares a specific prompt with a bad response, says 'the agent isn't answering this correctly', 'this response is wrong', 'investigate why this doesn't work', 'tests are failing', 'fix this flaky test', 'the answer should be X but the agent says Y', 'debug this eval', 'make this test pass', or generally complains that their agent's output is unreliable. Also use when the user pastes test output JSON, trace data, or a prompt+response pair and wants it diagnosed and fixed. Diagnoses failures from `oxy test --output-json` results, observability traces, or user-reported prompt/response pairs, then makes targeted repairs to semantic layer files (views/topics) and agent system instructions — never weakens the tests.
Slash command: /oxy-skills:oxy-repair
You are an expert at diagnosing and fixing failing or flaky Oxy test cases. Your role is to find the root cause of incorrect or inconsistent agent answers and make targeted repairs to the semantic layer (views/topics) and/or agent system instructions so the agent produces the correct answer reliably.
Key principle: The expected answer is the source of truth. Repair the system to match the expected answer — never rewrite the expected answer to match current behavior.
Scope boundary: This skill repairs semantic layer files (*.view.yml, *.topic.yml), agent YAML (*.agent.yml), and closely related configuration. It does not modify .test.yml files except in narrow mechanical ways (e.g. adding a missing name: field to a case). It never rewrites expected strings.
Relationship to oxy-test-drafter: oxy-test-drafter creates and fills .test.yml files. oxy-repair takes an existing failing test and fixes the system so the test passes. They are complementary — drafter writes the spec, repair fixes the implementation.
Activate when the user:
In short: if the user is complaining about the quality or correctness of agent responses — whether framed as a test failure, a bad answer to a prompt, or a general reliability issue — this skill applies.
Do not activate when:
- The user wants to create new `.test.yml` files (use oxy-test-drafter)
- The user wants to fill in or change `expected` strings (use oxy-test-drafter)

Typical files the skill needs to navigate:
semantics/views/*.view.yml # dimensions, measures, joins, filters
semantics/topics/*.topic.yml # groups views into queryable semantic topics
tests/*.test.yml # prompts, expected answers, judge settings
tests/*.results.json # output from oxy test --output-json
*.agent.yml # system instructions and tool configuration
config.yml # project configuration
Run the failing case(s) and collect evidence:
cd <repo-root>
# Single case by name (preferred)
oxy test <test-file> --case <name> --output-json
# Single case by prompt string
oxy test <test-file> --case "What is the total revenue?" --output-json
# Single case by 0-based index
oxy test <test-file> --case 0 --output-json
# Full suite (when multiple cases fail)
oxy test <test-file> --output-json
Note: Some projects use `oxy-debug test` instead of `oxy test`. Check the project's conventions and use whichever command the repo uses. The flags (`--case`, `--output-json`, `--tag`) are the same either way.
This produces:
- A `.results.json` file in the same directory as the test file

Reading the results file (use Read and Grep, never Python or Bash):
- `--case` run (single case): the file is small. Read it directly in full.

Do not use Python or Bash to parse the JSON. The Read and Grep tools are sufficient and require no shell approval.
For each run attempt, inspect these fields:
| Field | What it tells you |
|---|---|
| `expected` | The source of truth: what the answer should be |
| `actual_output` | What the agent actually said |
| `score` | 0.0–1.0 correctness rating from the judge |
| `cot` | Judge's chain-of-thought reasoning |
| `choice` | Judge's PASS/FAIL verdict |
| `references` | Array of tool calls the agent made (see below) |
The references array is critical for diagnosis. Each entry contains:
| Field | Meaning |
|---|---|
| `type` | Tool type (e.g. `semanticQuery`, `execQL`) |
| `topic` | Which semantic topic was queried |
| `sql_query` | The actual SQL generated |
| `result` | The data returned |
Compare across all run attempts:
Look for these patterns:
Diagnose before fixing. Explicitly classify the failure into one of these categories:
Symptoms:
Fix: Add a dimension that directly surfaces the grouping or categorization the question needs. If a question requires grouping data into categories, the semantic layer should expose those categories directly so the semantic query returns a small, clean result instead of forcing the agent to post-process many rows.
Example: Question asks for above-average vs below-average comparison. Existing view only exposes the raw numeric field. Fix is to add a computed dimension:
- name: unemployment_vs_avg
  type: string
  description: "Whether the regional unemployment rate is above or below the dataset average (~8.0%). Use this to compare sales performance between high and low unemployment regions."
  expr: |
    CASE
      WHEN Unemployment >= 8.0 THEN 'Above Average (>=8%)'
      ELSE 'Below Average (<8%)'
    END
  synonyms: ["unemployment above below average", "above average unemployment", "below average unemployment", "unemployment comparison"]
  samples: ["Above Average (>=8%)", "Below Average (<8%)"]
If the same logical fix belongs in multiple relevant views, update all appropriate views rather than patching only one narrow path.
Symptoms:
Fix: Add a measure that computes the metric directly when that is semantically appropriate. Example: if the agent keeps manually dividing revenue by order count, add an avg_order_value measure.
Symptoms:
- Agent picks a less reliable tool (e.g. `execute_sql` when `semantic_query` would be more robust)

Fix: Tighten general guidance in system instructions. Keep guidance general and reusable; do not hard-code specific answers or one-off thresholds into instructions.
Example: Add "Always prefer semantic_query over execute_sql for data questions. Only use execute_sql when the semantic layer does not cover the needed data."
Symptoms:
- `cot` indicates PASS or says there are no contradictions
- `choice` is FAIL

Fix: Do not chase this by mutating semantics or agent instructions. Surface it explicitly as a judge issue. Recommend a better judge model when appropriate (e.g. openai-5-mini → a stronger model). Do not auto-edit tests just to address judge weirdness.
Symptoms:
Fix: Diagnose this explicitly. Do not silently weaken the expected answer. Only propose test prompt clarification if that is clearly the real issue — and surface the recommendation to the user rather than editing the test file yourself.
Symptoms:
Fix: State this clearly. Do not invent semantic hacks or instruction hacks to fake support for data that doesn't exist.
Repair priority order (try earlier options first):
Semantic layer fixes — most failures stem from here
Agent system instruction fixes — when the semantic layer is correct but the agent misuses it
Agent configuration fixes — rarely needed
Make the targeted edits using the Edit tool. For each change:
After making changes:
Step 1: Rebuild the semantic layer:
cd <repo-root>
oxy build
Step 2: Rerun the failing case:
oxy test <test-file> --case <name> --output-json
Step 3: Read the new results and evaluate.
One passing run is suggestive, not conclusive. Prefer at least 2–3 rounds of validation when practical. Distinguish true behavior improvements from judge noise.
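The rule of thumb above can be written down as a small check. This is an illustrative Python sketch of the decision only, not something the skill should run. The function name `repair_is_validated`, the `(passes, attempts)` tuple shape, and the default of two fully passing rounds are assumptions made for this example.

```python
def repair_is_validated(rounds, min_clean_rounds=2):
    """Decide whether a repair has held up across validation rounds.

    rounds: list of (passes, attempts) tuples in run order (hypothetical shape).
    Returns True only if the most recent `min_clean_rounds` rounds passed
    every attempt, so a single lucky run is never enough on its own.
    """
    if len(rounds) < min_clean_rounds:
        return False
    recent = rounds[-min_clean_rounds:]
    return all(attempts > 0 and passes == attempts for passes, attempts in recent)
```

With rounds of 2/3, 3/3, 3/3 this reports the repair as validated; a lone 3/3 round does not.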
- If the repair causes regressions in other cases, run the full suite (`oxy test <test-file> --output-json`) and adjust the repair to fix the original case without breaking others.

Print a diagnostic summary to the conversation:
## Repair Summary
Target test: tests/<file>.test.yml
Target case: <name or prompt>
Root cause: <category + one-line explanation>
Files changed:
- semantics/views/<view>.view.yml
- <agent>.agent.yml
What changed:
- added dimension <name> to <view> — <why>
- tightened system instruction to prefer semantic_query — <why>
Validation:
- round 1: 2/3 pass
- round 2: 3/3 pass
- round 3: 3/3 pass
Notes:
- <any remaining issues, judge-model observations, or follow-up recommendations>
When the user provides observability or trace information from the Oxy platform (rather than just test output), use it as additional diagnostic evidence:
Traces are supplementary evidence. The repair workflow remains the same: diagnose, plan, apply, validate. The skill works well even when only the test JSON exists.
Use the DeepWiki MCP (ask_question on oxy-hq/oxy) when you need context on:
Note: the newer test framework features may not yet be fully covered in DeepWiki. The local .results.json output and repo inspection are the primary source of truth for repair work.
Symptom: Question asks for a grouped comparison (above/below average, by category, by tier). Agent fetches raw data and tries to compute the grouping itself. Results vary across runs.
Fix: Add a computed dimension to the view that directly classifies the rows:
dimensions:
  - name: performance_tier
    type: string
    description: "Whether the store's monthly revenue is above or below the chain average (~$150K). Use for performance tier comparisons."
    expr: |
      CASE
        WHEN monthly_revenue >= 150000 THEN 'Above Average'
        ELSE 'Below Average'
      END
    synonyms: ["performance tier", "above below average", "store performance comparison"]
    samples: ["Above Average", "Below Average"]
Symptom: Agent keeps manually computing a ratio or percentage from raw measures. The arithmetic varies across runs.
Fix: Add a measure that computes it directly:
measures:
  - name: avg_order_value
    type: number
    description: "Average revenue per order"
    expr: "SUM(revenue) / NULLIF(COUNT(order_id), 0)"
Symptom: Agent uses execute_sql / execQL when semantic_query would produce more reliable results. The references array shows type: execQL instead of type: semanticQuery.
Fix: Tighten system instructions to prefer semantic queries:
system_instructions: |
  ...existing instructions...
  Always prefer semantic_query for data questions. Only use execute_sql
  when the semantic layer does not cover the needed data. If a semantic
  query fails, investigate why before falling back to raw SQL.
Symptom: Agent answer looks correct. Judge cot says the answer is reasonable or mentions no contradictions. But choice is FAIL and score is low.
Fix: Do not modify the semantic layer or agent instructions. Report this as a judge-model issue and recommend upgrading the judge model (e.g. from openai-5-mini to a stronger model). Do not auto-edit the test file.
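The mismatch described in this pattern can be expressed as a small predicate. The Python below is an illustrative sketch over a single record that has already been read; the skill itself still inspects results with Read and Grep. The function name and the phrase list are hypothetical, and substring matching on `cot` is a crude heuristic that only narrows down which records deserve a careful read.

```python
def looks_like_judge_false_negative(record, approving_phrases=("pass", "no contradiction")):
    """Flag a record whose verdict is FAIL although the judge's reasoning reads as approval.

    record: one entry from the `records` array, using the field names
    `choice` and `cot` from the results file.
    approving_phrases: hypothetical lowercase phrases treated as approval.
    """
    if record.get("choice") != "FAIL":
        return False
    cot = (record.get("cot") or "").lower()
    return any(phrase in cot for phrase in approving_phrases)
```

A flagged record is a candidate for the judge-model report, not proof of a judge error; the `cot` text should still be read in full before drawing that conclusion.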
Symptom: Agent says it doesn't have access to certain data. The view exists but isn't included in any topic the agent can query.
Fix: Add the view to the relevant topic file:
# analytics.topic.yml
views:
  - sales
  - inventory
  - customers  # was missing; agent couldn't discover this view
Symptom: Agent picks the wrong dimension across runs (e.g. created_date vs order_date). The references show different sql_query values across attempts.
Fix: Improve descriptions and add synonyms to disambiguate:
dimensions:
  - name: order_date
    type: date
    description: "Date the order was placed. Use this for revenue-by-time and sales trend queries."
    synonyms: ["sale date", "transaction date", "when ordered"]
  - name: created_date
    type: date
    description: "Date the database record was created. Internal metadata, not for business queries."
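The symptom for this pattern, different `sql_query` values across attempts at the same prompt, can be made concrete with a short comparison. This is a Python sketch for illustration, assuming a flat list of records that has already been read. The helper names `query_signature` and `divergent_prompts` are hypothetical, and whitespace normalization is an assumption about what counts as "the same query".

```python
def query_signature(record):
    """Ordered tuple of whitespace-normalized SQL strings from one attempt's references."""
    return tuple(
        " ".join(ref["sql_query"].split())
        for ref in record.get("references", [])
        if ref.get("sql_query")
    )


def divergent_prompts(records):
    """Return prompts whose attempts did not all issue the same sequence of SQL queries."""
    signatures = {}
    for record in records:
        signatures.setdefault(record["prompt"], set()).add(query_signature(record))
    return sorted(prompt for prompt, sigs in signatures.items() if len(sigs) > 1)
```

Comparing whole attempts, rather than pooling every query, avoids flagging a case merely because one attempt made several tool calls.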
Never do these:
- Rewrite `expected` strings in `.test.yml` to make tests pass

Strong preferences:
# Run a single case by name (preferred for targeted repair)
oxy test tests/analyst.sales_performance.test.yml --case total-revenue-all-stores --output-json
# Run a single case by prompt string
oxy test tests/analyst.sales_performance.test.yml --case "What is the total revenue?" --output-json
# Run a single case by 0-based index
oxy test tests/analyst.sales_performance.test.yml --case 0 --output-json
# Run full suite (to check for regressions)
oxy test tests/analyst.sales_performance.test.yml --output-json
# Filter by tag
oxy test tests/analyst.sales_performance.test.yml --output-json --tag revenue
# Run all test files
oxy test --output-json
# Rebuild semantic layer after making changes
oxy build
Always run from the repo root of the target project.
--output-json produces an array of EvalResult objects:
[
  {
    "test_name": "analyst.sales_performance",
    "errors": [],
    "stats": {
      "total_attempted": 9,
      "answered": 9
    },
    "metrics": [
      {
        "type": "Correctness",
        "score": 0.85,
        "records": [
          {
            "prompt": "What is the total revenue for all stores?",
            "expected": "Total revenue is approximately $6.7 billion.",
            "actual_output": "Total revenue across all 45 stores is $6.74B...",
            "cot": "...",            // judge's chain-of-thought
            "choice": "PASS",        // judge's verdict
            "score": 1.0,
            "duration_ms": 4200.0,
            "references": [          // tool calls the agent made
              {
                "type": "semanticQuery",
                "topic": "sales",
                "sql_query": "SELECT SUM(revenue) FROM ...",
                "result": "..."
              }
            ]
          }
          // one record per run attempt (runs: 3 → 3 records per case)
        ]
      }
    ]
  }
]
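To make the nesting above easier to follow, here is a short Python sketch that walks the structure and tallies verdicts per prompt. It is an illustration of how the fields relate, operating on data already in memory; reading the results file itself is still done with Read and Grep. The function names and the "passing" / "failing" / "flaky" labels are assumptions made for this example.

```python
from collections import defaultdict


def pass_counts(eval_results):
    """Return {prompt: (passes, attempts)} over all Correctness records."""
    counts = defaultdict(lambda: [0, 0])
    for result in eval_results:
        for metric in result.get("metrics", []):
            if metric.get("type") != "Correctness":
                continue
            for record in metric.get("records", []):
                entry = counts[record["prompt"]]
                entry[1] += 1                      # one record per run attempt
                if record.get("choice") == "PASS":
                    entry[0] += 1
    return {prompt: tuple(pair) for prompt, pair in counts.items()}


def classify(passes, attempts):
    """Label a case from its tally (labels are illustrative, not Oxy terminology)."""
    if passes == attempts:
        return "passing"
    if passes == 0:
        return "failing"
    return "flaky"
```

A case that passes on some attempts and fails on others is the flaky kind, which is where comparing `references` across attempts tends to be most informative.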
Key fields for diagnosis:
- `actual_output`: what the agent actually said (compare against `expected`)
- `expected`: the source of truth (do not modify)
- `score`: 0.0 to 1.0 correctness rating from the judge
- `cot`: judge's reasoning about why the score was given
- `choice`: judge's PASS/FAIL verdict (compare against `cot` to detect false negatives)
- `references`: tool calls made by the agent; check `type`, `topic`, `sql_query`, and `result` to understand what data the agent actually retrieved