Author Elasticsearch relevance regression tests using the Ranking Evaluation API (`POST <index>/_rank_eval`) - judgment lists (query + expected docs at ranks), per-query metrics (Precision@K, Recall@K, MRR, DCG, ERR), reproducible test corpora; pair with Quepid + Splainer for interactive judgment authoring.
Slash command: `/qa-search-relevance:elasticsearch-relevance-tests`
Per the [Elasticsearch Rank Eval API], the `_rank_eval` endpoint
"evaluates search result quality across typical queries using
relevance metrics." This is the canonical IR-metrics-driven approach
to search QA - far better than spot-checking results.
A judgment is (query, doc_id, rating). Ratings: 0 = irrelevant,
1 = somewhat, 2 = relevant, 3 = highly relevant (4-point scale).
Build judgments via:
| Source | Method |
|---|---|
| Query logs + click data | Click model (clicked = ≥1, multi-click = ≥2) |
| Quepid (open source) | Interactive UI for judges to rate per-query results |
| Splainer | Diagnose why a doc ranked where it did |
| Domain SMEs | High-stakes queries; manual rating |
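As a rough illustration of the click-model row above, the sketch below turns raw click events into provisional ratings. The input format (one `(query, doc_id)` pair per click) and the name `judgments_from_clicks` are assumptions made for this example, and ratings are capped at 2 on the assumption that the top grade is left to human review.

```python
from collections import Counter


def judgments_from_clicks(click_log):
    """Turn click events into provisional judgment rows.

    click_log: iterable of (query, doc_id) pairs, one per click.
    This input format is hypothetical, not something Elasticsearch emits.
    """
    counts = Counter(click_log)
    rows = []
    for (query, doc_id), clicks in sorted(counts.items()):
        # One click -> rating 1, two or more clicks -> rating 2.
        # Clicks only give a lower bound, so nothing is promoted to 3 here.
        rating = 2 if clicks >= 2 else 1
        rows.append({"query": query, "doc_id": doc_id, "rating": rating})
    return rows


log = [
    ("running shoes", "sku-1234"),
    ("running shoes", "sku-1234"),
    ("running shoes", "sku-5678"),
]
rows = judgments_from_clicks(log)
```

Documents that were shown but never clicked do not appear in such a log, so irrelevant (rating 0) judgments still have to come from another source.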
Judgment list format (CSV is common):
query,doc_id,rating
"running shoes",sku-1234,3
"running shoes",sku-5678,2
"running shoes",sku-9999,0
"red dress",sku-2222,3
Per the Elasticsearch Rank Eval API:
| Metric | When to use |
|---|---|
| Precision@K | "Of the top K, how many relevant?" - flat scoring |
| Recall@K | "Of all relevant, how many in top K?" - completeness |
| MRR | "Where's the first relevant?" - search where one good answer suffices |
| DCG / NDCG | Graded relevance; rank-discounted; the default for graded judgments |
| ERR (Expected Reciprocal Rank) | User-stops-at-first-relevant model; rank-decay sensitive |
For e-commerce with graded judgments, use NDCG@10 plus MRR. For Q&A, use MRR.
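To make the metric choices concrete, here is a minimal sketch of NDCG@K and reciprocal rank computed by hand from the ratings of one ranked result list. It assumes the common exponential-gain form of DCG and normalizes against the returned list only, so its numbers may differ from what `_rank_eval` reports; the function names are illustrative.

```python
import math


def dcg_at_k(ratings, k):
    """DCG over the first k ratings (rank 1 first), using gain 2^rating - 1."""
    return sum(
        (2 ** rating - 1) / math.log2(rank + 1)
        for rank, rating in enumerate(ratings[:k], start=1)
    )


def ndcg_at_k(ratings, k):
    """DCG divided by the DCG of the same ratings sorted best-first."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(ratings, threshold=1):
    """1 / rank of the first result rated at or above threshold, else 0."""
    for rank, rating in enumerate(ratings, start=1):
        if rating >= threshold:
            return 1.0 / rank
    return 0.0


# Ratings of the results in ranked order; unrated results counted as 0.
ranked = [0, 2, 3]
ndcg = ndcg_at_k(ranked, k=10)
rr = reciprocal_rank(ranked)
```

MRR is the mean of `reciprocal_rank` over all queries, which is why it suits searches where one good answer is enough.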
Per the Elasticsearch Rank Eval API:
POST products/_rank_eval
{
  "requests": [
    {
      "id": "running_shoes_query",
      "request": {
        "query": { "match": { "name": "running shoes" } }
      },
      "ratings": [
        { "_index": "products", "_id": "sku-1234", "rating": 3 },
        { "_index": "products", "_id": "sku-5678", "rating": 2 },
        { "_index": "products", "_id": "sku-9999", "rating": 0 }
      ]
    },
    {
      "id": "red_dress_query",
      "request": { "query": { "match": { "name": "red dress" } } },
      "ratings": [
        { "_index": "products", "_id": "sku-2222", "rating": 3 }
      ]
    }
  ],
  "metric": {
    "dcg": { "k": 10, "normalize": true }
  }
}
Response shape:
{
  "metric_score": 0.84,
  "details": {
    "running_shoes_query": { "metric_score": 0.91, "unrated_docs": [...] },
    "red_dress_query": { "metric_score": 0.77, "unrated_docs": [...] }
  }
}
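One way to read a response of this shape is to list the weakest queries first, along with how many returned documents had no rating. The sketch below assumes only the fields shown above; `summarize` and the sample values are made up for illustration.

```python
def summarize(response):
    """Return (query_id, score, unrated_count) tuples, lowest score first."""
    rows = [
        (query_id, detail["metric_score"], len(detail.get("unrated_docs", [])))
        for query_id, detail in response["details"].items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Hand-written sample in the response shape shown above.
sample = {
    "metric_score": 0.84,
    "details": {
        "running_shoes_query": {
            "metric_score": 0.91,
            "unrated_docs": [{"_index": "products", "_id": "sku-4242"}],
        },
        "red_dress_query": {"metric_score": 0.77, "unrated_docs": []},
    },
}
summary = summarize(sample)
```

A high unrated count suggests the judgment list no longer covers what the query returns, so the score for that query is less trustworthy.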
import csv

import requests


def load_judgments(path):
    """Group judgment rows by query, in the shape _rank_eval expects for "ratings"."""
    by_query = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            by_query.setdefault(row["query"], []).append({
                "_index": "products",
                "_id": row["doc_id"],
                "rating": int(row["rating"]),
            })
    return by_query


def test_search_relevance_baseline():
    judgments = load_judgments("tests/judgments.csv")
    requests_payload = [
        {
            "id": q.replace(" ", "_"),
            "request": {"query": {"match": {"name": q}}},
            "ratings": ratings,
        }
        for q, ratings in judgments.items()
    ]
    body = {
        "requests": requests_payload,
        # normalize=True turns DCG into NDCG
        "metric": {"dcg": {"k": 10, "normalize": True}},
    }
    r = requests.post("http://localhost:9200/products/_rank_eval", json=body)
    r.raise_for_status()
    result = r.json()
    # Baseline NDCG must not regress vs known-good
    assert result["metric_score"] >= 0.80, f"NDCG@10 regressed: {result['metric_score']}"
The aggregate metric only catches large shifts, so track per-query scores as well:
import json
from pathlib import Path


def test_no_query_drops_more_than_10_percent():
    # run_rank_eval() is assumed to POST the rank-eval body and return the parsed JSON
    current = run_rank_eval()
    baseline = json.loads(Path("tests/baseline.json").read_text())
    for query_id, baseline_detail in baseline["details"].items():
        baseline_score = baseline_detail["metric_score"]
        current_score = current["details"][query_id]["metric_score"]
        delta = current_score - baseline_score
        assert delta >= -0.10, (
            f"Query {query_id} dropped {-delta:.2f} "
            f"(was {baseline_score:.2f}, now {current_score:.2f})"
        )
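A variation on the per-query check is to collect every regression before failing, so one CI run reports all affected queries and also flags queries that disappeared from the current results. This is a sketch under the assumption that both inputs have the `details` shape of a `_rank_eval` response; `find_regressions` is a hypothetical helper.

```python
def find_regressions(baseline, current, max_drop=0.10):
    """Return (query_id, old_score, new_score) for each regressed query.

    new_score is None when the query is missing from the current results.
    """
    problems = []
    for query_id, detail in baseline["details"].items():
        old = detail["metric_score"]
        current_detail = current["details"].get(query_id)
        if current_detail is None:
            problems.append((query_id, old, None))
            continue
        new = current_detail["metric_score"]
        if old - new > max_drop:
            problems.append((query_id, old, new))
    return problems


# Made-up scores for illustration.
baseline = {"details": {
    "a": {"metric_score": 0.9},
    "b": {"metric_score": 0.8},
    "c": {"metric_score": 0.7},
}}
current = {"details": {
    "a": {"metric_score": 0.7},
    "b": {"metric_score": 0.78},
}}
problems = find_regressions(baseline, current)
```

A test can then assert that `problems` is empty and print the whole list in the failure message.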
relevant_rating_threshold for binary metrics

Per the Elasticsearch Rank Eval API, Precision, Recall, and MRR accept
`relevant_rating_threshold` (default 1). For graded judgments:
"metric": {
  "precision": {
    "k": 10,
    "relevant_rating_threshold": 2,
    "ignore_unlabeled": false
  }
}
With this setting, ratings ≥ 2 count as "relevant" and lower ratings count as "not relevant".
The `ignore_unlabeled` flag controls whether unrated docs in the
results count against precision: when it is false (the default), they are treated as irrelevant.
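The effect of the two parameters can be seen in a small hand-rolled Precision@K. This is a sketch of the semantics described above, not Elasticsearch's implementation; `precision_at_k` and the sample IDs are illustrative.

```python
def precision_at_k(result_ids, ratings, k=10, threshold=1, ignore_unlabeled=False):
    """Fraction of the top-k results whose rating is at or above threshold.

    ratings maps doc id -> rating; ids missing from it are unrated.
    """
    top = result_ids[:k]
    if ignore_unlabeled:
        # Unrated docs are dropped and count neither for nor against.
        top = [doc_id for doc_id in top if doc_id in ratings]
    if not top:
        return 0.0
    # With ignore_unlabeled=False, unrated docs fall back to rating 0.
    relevant = sum(1 for doc_id in top if ratings.get(doc_id, 0) >= threshold)
    return relevant / len(top)


ratings = {"sku-1": 3, "sku-2": 1, "sku-3": 2}
results = ["sku-1", "sku-2", "sku-3", "sku-4"]  # sku-4 is unrated

strict = precision_at_k(results, ratings, threshold=2)
lenient = precision_at_k(results, ratings, threshold=2, ignore_unlabeled=True)
```

Here `strict` divides by all four results while `lenient` divides by the three rated ones, so the same ranking scores higher once unrated docs are ignored.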
Snapshot the index state used for tests:
PUT _snapshot/test_repo/baseline_2026_05_06
{
  "indices": "products",
  "include_global_state": false
}
Restore for each CI run:
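- name: Restore index snapshot
  run: |
    # An open index with the same name blocks the restore, so drop it first
    curl -X DELETE "localhost:9200/products?ignore_unavailable=true"
    curl -X POST "localhost:9200/_snapshot/test_repo/baseline_2026_05_06/_restore?wait_for_completion=true"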
Otherwise document changes (new docs, re-indexes) silently shift relevance baselines.
Quepid (open source, from OpenSource Connections) provides an interactive UI in which judges rate per-query results.
Splainer explains why a doc ranked where it did, which makes it useful for debugging unexpected results.
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Use binary judgments only | Loses graded info; NDCG can no longer separate relevant from highly relevant | 4-point scale (Step 1) |
| Rebuild judgments per test run | Bias from current ranking | Pinned judgment list (Step 1) |
| Track only aggregate NDCG | Hides per-query regressions | Per-query tracking (Step 5) |
| Test against changing index | Baselines move under your feet | Snapshot index (Step 7) |
| 100% click-derived judgments | Click bias to top results, position bias | Mix click + SME judgments |
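The fix in the last row, mixing click-derived and SME judgments, could look roughly like the sketch below. It assumes both sources use the `query,doc_id,rating` row shape from the judgment list and that an SME rating should override a click-derived one for the same pair; `merge_judgments` is a hypothetical helper.

```python
def merge_judgments(click_rows, sme_rows):
    """Combine two judgment lists; SME ratings win on conflicts (an assumption)."""
    merged = {(r["query"], r["doc_id"]): r["rating"] for r in click_rows}
    # Later update overrides earlier entries, so SME ratings take precedence.
    merged.update({(r["query"], r["doc_id"]): r["rating"] for r in sme_rows})
    return [
        {"query": query, "doc_id": doc_id, "rating": rating}
        for (query, doc_id), rating in sorted(merged.items())
    ]


click_rows = [
    {"query": "red dress", "doc_id": "sku-2222", "rating": 2},
    {"query": "running shoes", "doc_id": "sku-5678", "rating": 1},
]
sme_rows = [{"query": "red dress", "doc_id": "sku-2222", "rating": 3}]
merged = merge_judgments(click_rows, sme_rows)
```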
opensearch-relevance-tests - sister skill (compatible API)
vector-search-precision-tests - vector search analogue
relevance-regression-reviewer
npx claudepluginhub testland/qa --plugin qa-search-relevance