From langfuse
Use when the user wants to analyze experiment results, inspect scores from a dataset run, check pass/fail rates, review per-item outputs, or deep-dive into experiment performance. Trigger phrases: "analyze results", "experiment scores", "how did the experiment perform", "show results", "inspect run", "experiment analysis".
Slash command: /langfuse:analyze-experiment-results
Analyze the results of a Langfuse experiment run: aggregate scores, per-item details, pass/fail rates, and output inspection.
If the user specifies a run name, use it. Otherwise, list recent runs (via list-dataset-runs skill) and let the user choose.
Get the run ID and dataset ID:
curl -s -u "$PUBLIC_KEY:$SECRET_KEY" \
"$HOST/api/public/datasets/{DATASET_NAME}/runs"
Then list the items in that run:
curl -s -u "$PUBLIC_KEY:$SECRET_KEY" \
"$HOST/api/public/dataset-run-items?runName={RUN_NAME}&datasetId={DATASET_ID}&limit=100"
Collect all traceId values — these link to the actual experiment traces.
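As a rough sketch of this collection step, the helper below pulls `traceId` values out of response pages that have already been parsed from JSON. It assumes each page is an object with a `data` list whose entries carry `datasetItemId` and `traceId`; that shape and the helper name `collect_trace_ids` are assumptions for illustration, so check them against your actual responses.

```python
from typing import Iterable


def collect_trace_ids(pages: Iterable[dict]) -> dict[str, str]:
    """Map datasetItemId -> traceId across parsed response pages.

    Assumes each page looks like {"data": [{"datasetItemId": ..., "traceId": ...}]}.
    """
    trace_by_item: dict[str, str] = {}
    for page in pages:
        for item in page.get("data", []):
            trace_id = item.get("traceId")
            if trace_id:  # skip run items that never produced a trace
                trace_by_item[item["datasetItemId"]] = trace_id
    return trace_by_item


if __name__ == "__main__":
    sample = [{"data": [{"datasetItemId": "item-1", "traceId": "trace-a"}]}]
    print(collect_trace_ids(sample))
```

Keeping the item ID next to each trace ID makes it easier to report results per dataset item later on.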
Important: Traces, scores, and run items live in ClickHouse, not Postgres; use the REST API or direct ClickHouse queries.
Fetch scores per trace:
curl -s -u "$PUBLIC_KEY:$SECRET_KEY" \
"$HOST/api/public/scores?traceId={TRACE_ID}"
Loop over all trace IDs from Step 2 and collect scores.
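The loop could look roughly like the following, using only the Python standard library. It reuses the `PUBLIC_KEY` and `SECRET_KEY` environment variables from the curl calls and assumes the scores come back under a top-level `data` key; the function names and the injectable `fetch` parameter are hypothetical conveniences, not part of Langfuse.

```python
import base64
import json
import os
import urllib.parse
import urllib.request
from typing import Callable, Iterable


def fetch_json(url: str) -> dict:
    """GET a URL with HTTP basic auth built from the same env vars as the curl calls."""
    credentials = f"{os.environ['PUBLIC_KEY']}:{os.environ['SECRET_KEY']}"
    token = base64.b64encode(credentials.encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def scores_by_trace(
    host: str,
    trace_ids: Iterable[str],
    fetch: Callable[[str], dict] = fetch_json,
) -> dict[str, list[dict]]:
    """Call the scores endpoint once per trace ID and group the results."""
    collected: dict[str, list[dict]] = {}
    for trace_id in trace_ids:
        query = urllib.parse.urlencode({"traceId": trace_id})
        page = fetch(f"{host}/api/public/scores?{query}")
        # Assumption: scores arrive under a top-level "data" key.
        collected[trace_id] = page.get("data", [])
    return collected
```

Passing `fetch` as a parameter lets you swap in a stub when trying the loop without a running Langfuse instance.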
Alternatively, join run items to scores directly in ClickHouse:
SELECT dri.dataset_item_id, dri.trace_id,
s.name AS score_name, s.value, s.data_type, s.source, s.comment
FROM dataset_run_items_rmt dri
JOIN scores s ON dri.trace_id = s.trace_id AND dri.project_id = s.project_id
WHERE dri.project_id = '{PROJECT_ID}'
AND dri.dataset_run_id = '{RUN_ID}'
ORDER BY dri.dataset_item_id, s.name;
Run via: docker exec langfuse-clickhouse clickhouse-client --query "...".
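If you want the rows back in a script rather than on the terminal, a minimal wrapper around that command might look like this. It assumes the container is named `langfuse-clickhouse` as above and asks `clickhouse-client` for `JSONEachRow` output; the helper names are hypothetical.

```python
import json
import subprocess


def clickhouse_argv(query: str, container: str = "langfuse-clickhouse") -> list[str]:
    """Build the docker exec command line; JSONEachRow yields one JSON object per row."""
    return [
        "docker", "exec", container,
        "clickhouse-client", "--format", "JSONEachRow", "--query", query,
    ]


def parse_rows(stdout: str) -> list[dict]:
    """Parse JSONEachRow output, ignoring blank lines."""
    return [json.loads(line) for line in stdout.splitlines() if line.strip()]


def run_query(query: str) -> list[dict]:
    """Run the query in the container and return rows as dicts."""
    result = subprocess.run(
        clickhouse_argv(query), capture_output=True, text=True, check=True
    )
    return parse_rows(result.stdout)
```

Passing the query as a single argv element avoids shell quoting problems. Depending on the ClickHouse version and settings, 64-bit integers such as `COUNT(*)` may come back as JSON strings, so convert them before doing arithmetic.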
For each numeric score name, compute via ClickHouse:
SELECT s.name,
COUNT(*) AS count,
round(AVG(s.value), 4) AS mean,
round(quantile(0.5)(s.value), 4) AS median,
MIN(s.value) AS min,
MAX(s.value) AS max,
round(stddevPop(s.value), 4) AS stddev
FROM dataset_run_items_rmt dri
JOIN scores s ON dri.trace_id = s.trace_id AND dri.project_id = s.project_id
WHERE dri.project_id = '{PROJECT_ID}'
AND dri.dataset_run_id = '{RUN_ID}'
AND s.data_type = 'NUMERIC'
GROUP BY s.name
ORDER BY s.name;
Note: Use ClickHouse quantile(0.5)() and stddevPop() instead of Postgres PERCENTILE_CONT and STDDEV.
Present as:
| Score Name | Count | Mean | Median | Min | Max | StdDev |
|---|---|---|---|---|---|---|
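When the scores were collected through the REST API instead of ClickHouse, the same summary can be approximated client-side. The sketch below assumes each score dict has `name`, `dataType`, and `value` keys (an assumption about the REST payload) and uses a hypothetical helper name.

```python
import statistics
from collections import defaultdict


def summarize_numeric(scores: list[dict]) -> list[dict]:
    """Aggregate NUMERIC scores by name: count, mean, median, min, max, population stddev."""
    values_by_name: dict[str, list[float]] = defaultdict(list)
    for score in scores:
        if score.get("dataType") == "NUMERIC" and score.get("value") is not None:
            values_by_name[score["name"]].append(float(score["value"]))

    rows = []
    for name in sorted(values_by_name):
        values = values_by_name[name]
        rows.append({
            "name": name,
            "count": len(values),
            "mean": round(statistics.fmean(values), 4),
            "median": round(statistics.median(values), 4),
            "min": min(values),
            "max": max(values),
            "stddev": round(statistics.pstdev(values), 4),  # population, like stddevPop
        })
    return rows
```

`statistics.median` is exact, whereas ClickHouse's `quantile(0.5)` is an approximate estimator, so the two medians can differ slightly on large runs.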
For categorical scores:
SELECT s.name, s.string_value, COUNT(*) AS count
FROM dataset_run_items_rmt dri
JOIN scores s ON dri.trace_id = s.trace_id AND dri.project_id = s.project_id
WHERE dri.project_id = '{PROJECT_ID}'
AND dri.dataset_run_id = '{RUN_ID}'
AND s.data_type = 'CATEGORICAL'
GROUP BY s.name, s.string_value
ORDER BY s.name, count DESC;
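A client-side equivalent of this grouping, for scores fetched over REST, could be sketched as follows. The `dataType` and `stringValue` keys are assumptions about the REST payload, and `categorical_counts` is a hypothetical name.

```python
from collections import Counter


def categorical_counts(scores: list[dict]) -> list[tuple]:
    """Return (score name, string value, count), ordered by name then count descending."""
    counter = Counter(
        (score["name"], score.get("stringValue"))
        for score in scores
        if score.get("dataType") == "CATEGORICAL"
    )
    return sorted(
        ((name, value, count) for (name, value), count in counter.items()),
        key=lambda row: (row[0], -row[2]),
    )
```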
For boolean scores, show pass/fail rates:
SELECT s.name,
countIf(s.value = 1) AS passed,
countIf(s.value = 0) AS failed,
COUNT(*) AS total,
round(100.0 * countIf(s.value = 1) / COUNT(*), 1) AS pass_rate
FROM dataset_run_items_rmt dri
JOIN scores s ON dri.trace_id = s.trace_id AND dri.project_id = s.project_id
WHERE dri.project_id = '{PROJECT_ID}'
AND dri.dataset_run_id = '{RUN_ID}'
AND s.data_type = 'BOOLEAN'
GROUP BY s.name
ORDER BY s.name;
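The pass-rate arithmetic is the same outside ClickHouse. This sketch mirrors the query above for scores held in memory, assuming `dataType` and `value` keys on each score dict; the function name is hypothetical.

```python
def pass_rates(scores: list[dict]) -> dict[str, dict]:
    """Tally BOOLEAN scores per name and compute the pass rate as a percentage."""
    tallies: dict[str, dict] = {}
    for score in scores:
        if score.get("dataType") != "BOOLEAN":
            continue
        tally = tallies.setdefault(
            score["name"], {"passed": 0, "failed": 0, "total": 0}
        )
        tally["total"] += 1
        if score.get("value") == 1:
            tally["passed"] += 1
        elif score.get("value") == 0:
            tally["failed"] += 1
    for tally in tallies.values():
        # total is at least 1 here, so the division is safe
        tally["pass_rate"] = round(100.0 * tally["passed"] / tally["total"], 1)
    return tallies
```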
Use the denormalized ClickHouse table (which has inline item data):
SELECT dri.dataset_item_id AS item_id,
substring(dri.dataset_item_input, 1, 100) AS input_preview,
dri.trace_id,
substring(t.output, 1, 100) AS output_preview,
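-- groupArrayIf skips the empty placeholder row that LEFT JOIN emits for items without scores
groupArrayIf(concat(s.name, '=', toString(round(s.value, 2))), s.name != '') AS scores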
FROM dataset_run_items_rmt dri
LEFT JOIN traces t ON dri.trace_id = t.id AND dri.project_id = t.project_id
LEFT JOIN scores s ON dri.trace_id = s.trace_id
AND dri.project_id = s.project_id
AND s.data_type = 'NUMERIC'
WHERE dri.project_id = '{PROJECT_ID}'
AND dri.dataset_run_id = '{RUN_ID}'
GROUP BY dri.dataset_item_id, dri.dataset_item_input, dri.trace_id, t.output
ORDER BY dri.dataset_item_id;
Note: Use ClickHouse groupArray() and substring() instead of Postgres STRING_AGG() and SUBSTRING(... FROM ... FOR ...).
| Item ID | Input Preview | Trace ID | Output Preview | Scores |
|---|---|---|---|---|
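To turn query rows into that table, something like the following could work. It assumes each row is a dict with the column aliases from the query above (`item_id`, `input_preview`, `trace_id`, `output_preview`, `scores`); the helper names are hypothetical.

```python
def preview(text, limit: int = 100) -> str:
    """Collapse whitespace, truncate to `limit` characters, and escape pipes."""
    flat = " ".join(str(text or "").split())
    return flat[:limit].replace("|", "\\|")


def render_item_table(rows: list[dict]) -> str:
    """Render per-item rows as a markdown table with the header shown above."""
    lines = [
        "| Item ID | Input Preview | Trace ID | Output Preview | Scores |",
        "|---|---|---|---|---|",
    ]
    for row in rows:
        scores = ", ".join(row.get("scores", []))
        lines.append(
            f"| {row['item_id']} | {preview(row.get('input_preview'))} "
            f"| {row['trace_id']} | {preview(row.get('output_preview'))} | {scores} |"
        )
    return "\n".join(lines)
```

Escaping `|` and flattening newlines matters because raw model inputs and outputs often contain both, which would otherwise break the table layout.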
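Find items with missing scores (the query below returns items whose trace has no scores at all; catching low scores needs an additional threshold filter on s.value):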
SELECT dri.dataset_item_id, dri.trace_id, t.name AS trace_name
FROM dataset_run_items_rmt dri
LEFT JOIN traces t ON dri.trace_id = t.id AND dri.project_id = t.project_id
LEFT JOIN scores s ON dri.trace_id = s.trace_id AND dri.project_id = s.project_id
WHERE dri.project_id = '{PROJECT_ID}'
AND dri.dataset_run_id = '{RUN_ID}'
GROUP BY dri.dataset_item_id, dri.trace_id, t.name
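-- join_use_nulls makes unmatched LEFT JOIN rows NULL instead of default values, so COUNT(s.id) can be 0
HAVING COUNT(s.id) = 0
SETTINGS join_use_nulls = 1;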
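For data gathered over REST, both checks (no scores, and scores below a cutoff) can be done in one pass. The sketch assumes a mapping of item ID to trace ID and a mapping of trace ID to score dicts with `name`, `dataType`, and `value` keys; the `threshold` default of 0.5 is an arbitrary illustrative value, not a Langfuse convention.

```python
def flag_items(
    trace_by_item: dict[str, str],
    scores_by_trace: dict[str, list[dict]],
    threshold: float = 0.5,  # hypothetical cutoff; pick one that fits your metric
) -> tuple[list[str], list[tuple]]:
    """Return (items with no scores, (item, score name, value) for low numeric scores)."""
    missing: list[str] = []
    low: list[tuple] = []
    for item_id, trace_id in trace_by_item.items():
        scores = scores_by_trace.get(trace_id, [])
        if not scores:
            missing.append(item_id)
            continue
        for score in scores:
            value = score.get("value")
            is_numeric = score.get("dataType") == "NUMERIC"
            if is_numeric and value is not None and value < threshold:
                low.append((item_id, score["name"], value))
    return missing, low
```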
Present:
Run: {HOST}/project/{PROJECT_ID}/datasets/{DATASET_ID}/runs/{RUN_ID}
Trace: {HOST}/project/{PROJECT_ID}/traces/{TRACE_ID}
Suggest relevant next actions: compare against another run with compare-experiments, add evaluators with langfuse-eval-manager, or update consistently failing items with langfuse-dataset-expert.
Refer to references/experiment-data-model-reference.md for the complete data model.
npx claudepluginhub alex-kopylov/zweihander --plugin langfuse