From langfuse
Use when the user wants to compare two or more experiment runs, detect regressions, see score deltas between runs, or evaluate model performance differences. Trigger phrases include "compare runs", "compare experiments", "diff runs", "regression check", "which run is better", "model comparison", "A/B comparison".
Slash command: /langfuse:compare-experiments
Compare two or more dataset runs side by side: score deltas, per-item regressions, and aggregate performance differences.
Storage note: Run items, traces, and scores live in ClickHouse, not Postgres. Use the REST API or direct ClickHouse queries for analysis. Run metadata lives in Postgres.
If the user specifies run names, use them. Otherwise, list recent runs (via the list-dataset-runs skill) and let the user select two or more.
Collect run IDs from Postgres:
SELECT id, name, metadata, created_at
FROM dataset_runs
WHERE project_id = '{PROJECT_ID}'
AND dataset_id = '{DATASET_ID}'
AND name IN ('{RUN_A_NAME}', '{RUN_B_NAME}')
ORDER BY created_at;
Compare aggregate score statistics across runs via ClickHouse:
SELECT
dri.dataset_run_name AS run_name,
s.name AS score_name,
COUNT(*) AS count,
round(AVG(s.value), 4) AS mean,
round(quantile(0.5)(s.value), 4) AS median,
MIN(s.value) AS min,
MAX(s.value) AS max
FROM dataset_run_items_rmt dri
JOIN scores s ON dri.trace_id = s.trace_id AND dri.project_id = s.project_id
WHERE dri.project_id = '{PROJECT_ID}'
AND dri.dataset_id = '{DATASET_ID}'
AND dri.dataset_run_name IN ('{RUN_A_NAME}', '{RUN_B_NAME}')
AND s.data_type = 'NUMERIC'
GROUP BY dri.dataset_run_name, s.name
ORDER BY s.name, dri.dataset_run_name;
Run via: docker exec langfuse-clickhouse clickhouse-client --query "...".
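The placeholder substitution and the `docker exec` invocation above can be sketched in Python as follows. This is a minimal sketch, not part of the skill itself: the helper names `fill_placeholders`, `clickhouse_command`, and `run_query` are hypothetical, and the ID character whitelist is an assumption about what project, dataset, and run IDs look like (run names may need a looser check).

```python
import re
import subprocess

# Assumption: IDs consist only of letters, digits, underscores, and hyphens.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def fill_placeholders(sql, **values):
    """Replace {NAME} placeholders in a query template with validated values."""
    for key, value in values.items():
        if not _SAFE_ID.fullmatch(value):
            # Refuse anything that could break out of the quoted SQL literal.
            raise ValueError(f"unsafe value for {key}: {value!r}")
        sql = sql.replace("{" + key + "}", value)
    return sql


def clickhouse_command(query, container="langfuse-clickhouse"):
    """Build the argv list for running a query inside the ClickHouse container."""
    return ["docker", "exec", container, "clickhouse-client", "--query", query]


def run_query(query):
    """Execute the query and return the client's raw text output."""
    result = subprocess.run(
        clickhouse_command(query), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with multi-line queries; validating values before substitution keeps a stray quote in an ID from changing the query.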
Present as a comparison table:
| Score | Run A Mean | Run B Mean | Delta | Better |
|---|---|---|---|---|
Delta = Run B - Run A. Better: ↑ improved, ↓ regressed, = unchanged.
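One way to render that table from the aggregate query's results is sketched below. The function name `comparison_table` and its dict inputs (score name mapped to mean) are hypothetical, and the arrows assume higher scores are better, which matches how the regression query in this skill treats a lower Run B value.

```python
def comparison_table(means_a, means_b, tolerance=1e-9):
    """Render a Markdown table of per-score mean deltas (Run B minus Run A)."""
    lines = [
        "| Score | Run A Mean | Run B Mean | Delta | Better |",
        "|---|---|---|---|---|",
    ]
    # Only scores present in both runs can be compared.
    for name in sorted(set(means_a) & set(means_b)):
        a, b = means_a[name], means_b[name]
        delta = round(b - a, 4)
        if abs(delta) <= tolerance:
            marker = "="
        elif delta > 0:
            marker = "↑"  # assumes higher is better
        else:
            marker = "↓"
        lines.append(f"| {name} | {a} | {b} | {delta:+.4f} | {marker} |")
    return "\n".join(lines)
```

Scores that exist in only one run are skipped here rather than shown with a blank cell; whether to list them separately is a presentation choice.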
For the same dataset items, compare scores across runs via ClickHouse:
SELECT
dri_a.dataset_item_id AS item_id,
s_a.name AS score_name,
s_a.value AS run_a_score,
s_b.value AS run_b_score,
round(s_b.value - s_a.value, 4) AS delta
FROM dataset_run_items_rmt dri_a
JOIN scores s_a ON dri_a.trace_id = s_a.trace_id AND dri_a.project_id = s_a.project_id
JOIN dataset_run_items_rmt dri_b ON dri_a.dataset_item_id = dri_b.dataset_item_id
AND dri_b.project_id = dri_a.project_id
AND dri_b.dataset_run_id = '{RUN_B_ID}'
JOIN scores s_b ON dri_b.trace_id = s_b.trace_id AND dri_b.project_id = s_b.project_id
AND s_b.name = s_a.name
WHERE dri_a.project_id = '{PROJECT_ID}'
AND dri_a.dataset_run_id = '{RUN_A_ID}'
AND s_a.data_type = 'NUMERIC'
ORDER BY delta ASC;
Present items with largest regressions first:
| Item ID | Score | Run A | Run B | Delta |
|---|---|---|---|---|
Identify regressions: items where delta < 0 (or below a threshold). Count regressions, improvements, and unchanged scores per item:
SELECT dri_a.dataset_item_id,
countIf(s_b.value < s_a.value) AS regressions,
countIf(s_b.value > s_a.value) AS improvements,
countIf(s_b.value = s_a.value) AS unchanged
FROM dataset_run_items_rmt dri_a
JOIN scores s_a ON dri_a.trace_id = s_a.trace_id AND dri_a.project_id = s_a.project_id
JOIN dataset_run_items_rmt dri_b ON dri_a.dataset_item_id = dri_b.dataset_item_id
AND dri_b.project_id = dri_a.project_id
AND dri_b.dataset_run_id = '{RUN_B_ID}'
JOIN scores s_b ON dri_b.trace_id = s_b.trace_id AND dri_b.project_id = s_b.project_id
AND s_b.name = s_a.name
WHERE dri_a.project_id = '{PROJECT_ID}'
AND dri_a.dataset_run_id = '{RUN_A_ID}'
AND s_a.data_type = 'NUMERIC'
GROUP BY dri_a.dataset_item_id
HAVING countIf(s_b.value < s_a.value) > 0
ORDER BY regressions DESC;
Note: ClickHouse uses countIf() instead of Postgres COUNT(*) FILTER (WHERE ...).
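The "delta < 0 (or below a threshold)" check can also be applied client-side to the per-item comparison rows. The sketch below assumes each row is a tuple of (item_id, score_name, run_a_score, run_b_score); the function name `flag_regressions` and the `threshold` parameter are hypothetical.

```python
def flag_regressions(rows, threshold=0.0):
    """Return (item_id, score_name, delta) for rows whose delta is below threshold.

    Delta is Run B minus Run A. With the default threshold of 0.0 every drop
    counts; pass a negative value such as -0.05 to ignore small fluctuations.
    """
    flagged = []
    for item_id, score_name, run_a_score, run_b_score in rows:
        delta = round(run_b_score - run_a_score, 4)
        if delta < threshold:
            flagged.append((item_id, score_name, delta))
    # Largest regressions first, mirroring ORDER BY delta ASC.
    flagged.sort(key=lambda entry: entry[2])
    return flagged
```

A non-zero threshold is mainly useful for noisy evaluators, where a tiny negative delta may not reflect a real regression.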
Compare run metadata from Postgres to understand what changed:
SELECT name, metadata, created_at
FROM dataset_runs
WHERE project_id = '{PROJECT_ID}'
AND id IN ('{RUN_A_ID}', '{RUN_B_ID}')
ORDER BY created_at;
Common differences: model change, prompt version, temperature, provider.
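To surface such differences quickly, the two metadata values can be diffed key by key. This is a rough sketch that assumes each run's metadata has already been parsed into a flat Python dict; the function name `metadata_diff` and the example keys are hypothetical, since the actual keys depend on what was stored with each run.

```python
def metadata_diff(meta_a, meta_b):
    """Map each differing key to its (run_a_value, run_b_value) pair.

    A key missing from one run is reported with None on that side, so a
    stored null and an absent key are not distinguished.
    """
    changes = {}
    for key in sorted(set(meta_a) | set(meta_b)):
        value_a, value_b = meta_a.get(key), meta_b.get(key)
        if value_a != value_b:
            changes[key] = (value_a, value_b)
    return changes


# Example with made-up metadata:
# metadata_diff({"model": "m1", "temperature": 0.2},
#               {"model": "m2", "temperature": 0.2})
# returns {"model": ("m1", "m2")}
```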
Present:
Links:
{HOST}/project/{PROJECT_ID}/datasets/{DATASET_ID}/runs/{RUN_A_ID}
{HOST}/project/{PROJECT_ID}/datasets/{DATASET_ID}/runs/{RUN_B_ID}
npx claudepluginhub alex-kopylov/zweihander --plugin langfuse