From qa-search-relevance
Bootstraps human-relevance judgment lists (query sets, grading scales, rater guidelines, inter-rater agreement, Quepid tooling, TREC-style pooling, and refresh cadence) that serve as ground truth for all three search-relevance skills and the relevance-regression-reviewer agent. Use when a team needs to create or refresh the judgment corpus before running NDCG / MRR / Recall@k evaluations.
Slash command
/qa-search-relevance:judgment-list-author
The other skills in this plugin (elasticsearch-relevance-tests,
opensearch-relevance-tests, vector-search-precision-tests) and the
relevance-regression-reviewer agent all require a judgment list - a set of
(query, document_id, grade) triples that define what "relevant" means for
your product. Nothing else in this plugin creates that corpus. This skill does.
Per [TREC's pooling methodology], "NIST pools the individual results, judges the retrieved documents for correctness, and evaluates the results" - the judgment list is the non-automated step that all automated metrics depend on.
- `_rank_eval` or recall@k for the first time: no judgment list means no metrics.
- `relevance-regression-reviewer` reports > 30% unrated docs across queries: the judgment pool is stale.

Not all queries deserve equal judgment effort. Sample across three tiers:
| Tier | Definition | Suggested count | Priority |
|---|---|---|---|
| Head | Top ~100-500 by traffic volume (covers ~80% of impressions) | 50-100 queries | Highest |
| Torso | Queries ranked ~500-5000 (navigational, faceted) | 100-200 queries | Medium |
| Tail | Low-frequency, long-tail queries | 50-100 queries | Lower |
Pull the sample from query logs, not from intuition. Use a 90-day window to avoid seasonal skew. Keep the raw log row for each sampled query - you will need it to assign traffic weight when computing weighted NDCG.
A starting corpus of roughly 200-400 queries across tiers gives sufficient coverage for NDCG-based gates. Per [Quepid docs], "100 judgments (10 pages of 10 search results) serves as a solid foundation" for initial evaluation projects.
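The tiered sampling described above can be sketched as follows. This is a minimal illustration, not the plugin's own tooling: the function name `sample_by_tier`, the rank cutoffs, and the per-tier counts are assumptions taken from the midpoints of the table, and the log is assumed to be a list of already-normalized query strings from the 90-day window.

```python
import random
from collections import Counter


def sample_by_tier(query_log, head_cutoff=500, torso_cutoff=5000,
                   per_tier=None, seed=0):
    """Rank distinct queries by frequency, then sample within each tier.

    query_log: iterable of raw query strings, one entry per logged search.
    Returns {tier: [(query, count), ...]} so the traffic count stays
    attached to every sampled query for later weighting.
    """
    # Hypothetical defaults: midpoints of the suggested counts per tier.
    per_tier = per_tier or {"head": 75, "torso": 150, "tail": 75}
    ranked = Counter(query_log).most_common()  # [(query, count)], busiest first
    tiers = {
        "head": ranked[:head_cutoff],
        "torso": ranked[head_cutoff:torso_cutoff],
        "tail": ranked[torso_cutoff:],
    }
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        name: rng.sample(rows, min(per_tier[name], len(rows)))
        for name, rows in tiers.items()
    }
```

Keeping the count next to each query is what later allows a traffic-weighted NDCG without going back to the logs.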
Two options are widely used:
Per [TREC qrels format], the classic qrels file uses two values:
0 (not relevant) and 1 (relevant). Simple to collect; sufficient for
Precision@k and Recall@k. Use binary when raters have low domain expertise
or when the query intent is unambiguous (navigational queries).
Qrels file format (4 columns, per [TREC qrels format]):

```
<topic_id> <iteration> <doc_id> <relevance>
1 0 doc-abc-123 0
1 0 doc-xyz-456 1
```
The iteration column is "almost always zero and not used" per TREC.
Unjudged documents are assumed irrelevant in evaluation.
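A small reader for this format might look like the sketch below. The helper names `parse_qrels` and `relevance_of` are hypothetical; the only behavior taken from the text is the 4-column layout and the default of 0 for unjudged documents.

```python
def parse_qrels(lines):
    """Parse 4-column qrels rows into {topic_id: {doc_id: relevance}}."""
    qrels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        topic_id, _iteration, doc_id, relevance = line.split()
        qrels.setdefault(topic_id, {})[doc_id] = int(relevance)
    return qrels


def relevance_of(qrels, topic_id, doc_id):
    """Look up a judgment; unjudged documents count as 0 (irrelevant)."""
    return qrels.get(topic_id, {}).get(doc_id, 0)


qrels = parse_qrels(["1 0 doc-abc-123 0", "1 0 doc-xyz-456 1"])
```

The iteration column is read and discarded, matching the note that it is almost always zero.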
Per [Quepid judgment rating best practices], the 0-3 scale maps to:
| Grade | Label | Rater cue |
|---|---|---|
| 3 | Perfect | "This is exactly what I am looking for." |
| 2 | Good | "Relevant - I want these results, but haven't found the exact one yet." |
| 1 | Fair | "I see the connection, but these are not what I am looking for." |
| 0 | Poor | "These are terrible - I would search elsewhere." |
Use graded when you need NDCG or DCG (metrics that reward highly relevant
results at higher ranks). Required by elasticsearch-relevance-tests Step 1
which uses a 4-point (0-3) scale.
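To see why graded labels matter for rank-sensitive metrics, here is a minimal NDCG sketch. It assumes the exponential gain `2^grade - 1` with a `log2(rank + 1)` discount, which is one common formulation and not necessarily the one every tool uses; it also computes the ideal ordering only from the grades passed in, a simplification of using all judged documents for the query.

```python
import math


def dcg(grades, k):
    """Discounted cumulative gain over the first k grades (rank 1 first)."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))


def ndcg(grades, k):
    """DCG divided by the DCG of the best possible ordering of these grades."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0


best_first = ndcg([3, 2, 1, 0], k=4)   # grades already in ideal order
worst_first = ndcg([0, 1, 2, 3], k=4)  # same documents, reversed ranking
```

With binary labels, a grade-3 and a grade-1 document would both collapse to "relevant", so swapping them would leave the score unchanged; the graded scale is what lets the metric penalize the reversed ranking.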
Grading guidelines prevent scale drift between raters and across time. A minimal guidelines document covers:
- The `unrateable` flag per [Quepid API] (`unrateable: boolean`). Use it for documents where the rater cannot determine relevance (e.g. page behind a login wall).

Store the guidelines in version control alongside the judgment file. When guidelines change, treat it as a new judgment round - old and new grades are not comparable.
[Quepid] (github.com/o19s/quepid) is the standard open-source tool for collaborative judgment authoring. It runs as a self-hosted Rails app and connects to Elasticsearch, OpenSearch, Solr, Algolia, and other backends.
Setup flow:
- Start Quepid with `docker-compose up` per the Quepid README.
- Quepid stores a `user_id` per judgment per [Quepid API], enabling per-rater analysis.
- The CSV export uses the columns `query_text, doc_id, <judge_1_name>, <judge_2_name>, ...` and the filename `book_{id}_judgements.csv`. JSON and Learning-to-Rank formats are also supported per [Quepid docs].

For teams without Quepid, a spreadsheet works for small sets (under 500
judgments): columns query_id, query_text, doc_id, grade, rater_id, notes.
Convert to qrels format before feeding _rank_eval.
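The spreadsheet-to-qrels conversion can be sketched as below. It assumes the sheet is exported as CSV with exactly the column names listed above and one row per (query, document) pair; if several raters judged the same pair, their grades would need to be aggregated first, which this sketch does not attempt. The function name `sheet_to_qrels` is hypothetical.

```python
import csv
import io


def sheet_to_qrels(csv_text):
    """Turn spreadsheet rows into 4-column qrels lines (iteration fixed at 0)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        f"{row['query_id']} 0 {row['doc_id']} {int(row['grade'])}"
        for row in reader
    ]


sheet = (
    "query_id,query_text,doc_id,grade,rater_id,notes\n"
    "q001,running shoes,doc-abc,3,r1,\n"
    "q001,running shoes,doc-xyz,1,r1,off-brand\n"
)
qrels_lines = sheet_to_qrels(sheet)
```

The `int(...)` cast doubles as a cheap validation step: a blank or non-numeric grade raises instead of silently producing a malformed row.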
Have at least two independent raters judge the same 10-20% overlap set. Compute Cohen's kappa to verify the scale is being applied consistently.
Formula per [Cohen's kappa, Wikipedia]:
kappa = (p_o - p_e) / (1 - p_e)
where p_o is observed agreement and p_e is expected chance agreement.
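For readers who want to see where `p_o` and `p_e` come from, the formula can be written out directly. This is an illustrative, unweighted implementation under the assumption that both raters graded the same items in the same order; the name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must grade the same non-empty item list")
    # p_o: share of items where the two raters gave the same grade
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement, from each rater's own label frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Binary example: raters agree on 3 of 4 items, so p_o = 0.75.
# Chance agreement is p_e = (2*1 + 2*3) / 16 = 0.5, giving kappa = 0.5.
example = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Note how 75% raw agreement shrinks to 0.5 once chance agreement is removed, which is why raw agreement alone overstates consistency.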
Interpretation thresholds (Landis and Koch, 1977, as cited in [Cohen's kappa, Wikipedia]):
| Kappa | Agreement |
|---|---|
| < 0.20 | Slight - raters are guessing; revise guidelines |
| 0.21-0.40 | Fair - guidelines unclear; calibrate with examples |
| 0.41-0.60 | Moderate - acceptable for exploratory work |
| 0.61-0.80 | Substantial - good for production gates |
| 0.81-1.00 | Almost perfect - target for high-stakes domains |
For binary judgments, kappa < 0.60 means your guidelines are ambiguous. Rewrite the edge-case section, run a calibration session with raters, and re-judge the overlap set before proceeding.
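```python
from sklearn.metrics import cohen_kappa_score

# Grades from two raters on the same ten (query, document) pairs, 0-3 scale
rater_a = [3, 2, 1, 0, 3, 2, 2, 1, 0, 3]
rater_b = [3, 2, 0, 0, 3, 1, 2, 1, 0, 2]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")
# This sample prints 0.600 (7/10 observed agreement, 0.25 expected by chance),
# right at the boundary between moderate and substantial agreement.
# 0.61-0.80 = substantial agreement; proceed to full rating round
```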
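When raters cover only partially overlapping document sets (common for large judgment pools), use Krippendorff's alpha instead - it handles missing ratings across raters. It still needs some documents rated by at least two raters; fully disjoint sets give no basis for measuring agreement.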
When you have results from more than one system (e.g. current BM25 + candidate neural re-ranker), use TREC-style depth-k pooling to maximize judgment coverage.
Per [TREC pooling, Wikipedia], the method "aggregates the top-ranked n documents from each participating system's results, creating a manageable subset for comprehensive judgment."
Practical pooling:
```python
def pool_results(system_results: dict[str, list[str]], depth: int = 10) -> set[str]:
    """
    system_results: { system_name: [doc_id, ...] } top-depth per query
    Returns the union of all doc IDs in the pool.
    """
    pool = set()
    for docs in system_results.values():
        pool.update(docs[:depth])
    return pool
```
Pool depth = 10 is standard for small collections (< 100k docs). Increase to 20-50 when you have > 3 candidate systems to avoid missing relevant documents that only appear in one system's lower ranks.
Documents outside the pool are unjudged. Per [TREC qrels format], unjudged documents are treated as irrelevant in metric computation. This is conservative but consistent across systems.
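In practice the pool is built per query and then compared against what has already been judged, so raters only see new documents. A minimal sketch, assuming runs are stored as `{system: {query_id: [doc_id, ...]}}` and existing judgments as `{query_id: set_of_doc_ids}` (both shapes and the name `docs_to_judge` are illustrative assumptions):

```python
def docs_to_judge(runs, judged, depth=10):
    """Return {query_id: pooled doc IDs that have no judgment yet}.

    runs:   {system_name: {query_id: [doc_id, ...]}} ranked results per system
    judged: {query_id: {doc_id, ...}} documents that already carry a grade
    """
    pools = {}
    for per_query in runs.values():
        for query_id, ranked in per_query.items():
            # Union of each system's top-`depth` documents for this query
            pools.setdefault(query_id, set()).update(ranked[:depth])
    return {q: pool - judged.get(q, set()) for q, pool in pools.items()}


runs = {
    "bm25": {"q1": ["a", "b", "c"]},
    "neural": {"q1": ["b", "d", "e"]},
}
todo = docs_to_judge(runs, judged={"q1": {"a", "b"}}, depth=2)
```

Here only `d` needs a rater: `a` and `b` are already judged, while `c` and `e` fall below the pool depth and stay unjudged.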
Judgment lists go stale when the document corpus changes substantially. Define explicit refresh triggers:
| Trigger | Action |
|---|---|
| Index schema change (new field, new analyzer) | Full re-pool + partial re-judge (re-rate 20% overlap) |
| Embedding model upgrade | Full re-pool for all affected query tiers |
| Corpus grows > 20% | Re-pool; re-judge new documents only |
| relevance-regression-reviewer flags > 30% unrated | Partial re-judge: new docs in the unrated set |
| New product category / language added | Add new query stratum; judge from scratch for that stratum |
As a minimum, run a lightweight staleness check monthly: for each query,
count the unrated_docs fraction in _rank_eval results. If the average
exceeds 20%, schedule a re-judging session.
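The staleness check can be sketched against an already-fetched `_rank_eval` response. The sketch assumes the response carries a `details` object keyed by query ID, each with a `hits` list and an `unrated_docs` list, as in the Elasticsearch ranking evaluation API; the function name and the sample payload are illustrative.

```python
def unrated_fraction(rank_eval_response):
    """Average share of returned hits without a rating, across queries."""
    fractions = []
    for detail in rank_eval_response["details"].values():
        hits = detail.get("hits", [])
        if hits:  # skip queries that returned nothing
            fractions.append(len(detail.get("unrated_docs", [])) / len(hits))
    return sum(fractions) / len(fractions) if fractions else 0.0


# Hand-built payload in the assumed response shape (not real output).
response = {
    "details": {
        "q001": {"hits": [{}, {}, {}, {}], "unrated_docs": [{"_id": "doc-new"}]},
        "q002": {"hits": [{}, {}], "unrated_docs": [{"_id": "doc-other"}]},
    }
}
needs_rejudging = unrated_fraction(response) > 0.20
```

Averaging per query rather than pooling all hits keeps one very large result list from masking stale coverage on the other queries.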
The judgment list consumed by elasticsearch-relevance-tests,
opensearch-relevance-tests, and relevance-regression-reviewer is a
JSON array:

```json
[
  {
    "query_id": "q001",
    "query": "running shoes",
    "ratings": [
      { "doc_id": "doc-abc", "rating": 3 },
      { "doc_id": "doc-xyz", "rating": 1 },
      { "doc_id": "doc-mno", "rating": 0 }
    ]
  }
]
```
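For binary cases the same judgments can be written as a TREC qrels file
(4 columns), which trec_eval reads directly. The Elasticsearch _rank_eval API
does not read qrels files: each judgment goes into the request body as a
ratings entry (_index, _id, rating), so map doc_id to the index's _id field
when building the request.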
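One way to turn the JSON judgment list into a `_rank_eval` request body is sketched below. The `requests`, `ratings`, and `metric` keys follow the Elasticsearch ranking evaluation API; the index name, the `match` query on a `title` field, and the function name `to_rank_eval_body` are placeholders that would need to be replaced with the product's real query template.

```python
def to_rank_eval_body(judgments, index, k=10):
    """Build a _rank_eval request body from the JSON judgment list."""
    return {
        "requests": [
            {
                "id": entry["query_id"],
                # Hypothetical query: substitute the real search template.
                "request": {"query": {"match": {"title": entry["query"]}}},
                "ratings": [
                    {"_index": index, "_id": r["doc_id"], "rating": r["rating"]}
                    for r in entry["ratings"]
                ],
            }
            for entry in judgments
        ],
        # Normalized DCG at depth k, i.e. NDCG@k
        "metric": {"dcg": {"k": k, "normalize": True}},
    }


judgments = [
    {
        "query_id": "q001",
        "query": "running shoes",
        "ratings": [
            {"doc_id": "doc-abc", "rating": 3},
            {"doc_id": "doc-xyz", "rating": 1},
        ],
    }
]
body = to_rank_eval_body(judgments, index="products")
```

The resulting dictionary can be serialized to JSON and sent to the index's `_rank_eval` endpoint with whatever HTTP client the team already uses.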
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Judge only the current system's top-10 | Biases the corpus toward the incumbent; new systems retrieve different docs | Pool from all candidate systems (Step 6) |
| One rater, no kappa check | Silent scale drift; metrics become meaningless over time | Require 10-20% overlap + kappa >= 0.60 (Step 5) |
| Reuse judgments after embedding model upgrade | Vector space changed; doc rankings shift entirely | Re-pool and re-judge after any embedding change |
| Judge head queries only | Tail queries drive long-tail revenue; regressions go undetected | Sample across head/torso/tail (Step 1) |
| Treat unjudged docs as relevant | Inflates recall metrics artificially | Default unjudged to irrelevant per TREC convention |
| No version control on guidelines | Raters from two periods use different scales; grades are incompatible | Store guidelines in git; treat guideline changes as a new round |
- [Quepid API] judgment fields (rating, unrateable, judge_later, explanation), CSV export structure book_{id}_judgements.csv: https://github.com/o19s/quepid/blob/main/app/controllers/api/v1/judgements_controller.rb

- elasticsearch-relevance-tests - consumes judgment lists for _rank_eval
- opensearch-relevance-tests - consumes judgment lists for OpenSearch rank eval
- vector-search-precision-tests - consumes judgment lists for recall@k evaluation
- relevance-regression-reviewer - reviewer that requires a judgment list as input
npx claudepluginhub testland/qa --plugin qa-search-relevance