Wires Langfuse tracing into LLM apps for production observability and offline eval - instruments via `@observe` (Python) / `startActiveObservation` (TS) decorators that auto-capture inputs / outputs / timings / errors per generation; exposes `langfuse.update_current_span()` for metadata + cost / latency annotation; supports trace-bound scoring for eval datasets and prompt-as-code management. Use when the user needs production LLM observability beyond pre-deploy eval, or wants to ship traces from production to an eval dataset for offline regression testing.
Slash command: /qa-llm-evaluation:langfuse-tracing
[lf-gh]: https://github.com/langfuse/langfuse-python
Langfuse complements pre-deploy LLM eval (Promptfoo / DeepEval /
Ragas / Giskard) with production-side observability - captures every
LLM call as a trace containing nested observations (generations,
spans, events), with token / cost / latency metadata, scores, and
linked datasets for offline eval (per lf-gh).
Important version note (2026-05-06): per lf-gh, "The SDK was rewritten in v4 and released in March 2026" - this skill targets the v4 API. For v3 codebases, see the upstream migration guide.
Per lf-gh:
pip install langfuse
For TypeScript:
npm install @langfuse/tracing
Set the credentials for your self-hosted or cloud Langfuse
project as environment variables (LANGFUSE_PUBLIC_KEY,
LANGFUSE_SECRET_KEY, and LANGFUSE_HOST).
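A missing credential tends to surface late, as silently dropped traces, so a start-up check can help. The following is a minimal sketch that uses only the standard library; the helper name `missing_langfuse_vars` is hypothetical and not part of the Langfuse SDK.

```python
import os

# The three variable names come from the setup note above.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_langfuse_vars(env=None):
    """Return the required Langfuse variables that are unset or empty.

    `env` defaults to os.environ; pass a dict to check another mapping.
    (Hypothetical helper, not a Langfuse SDK function.)
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_langfuse_vars()
    if missing:
        print("Missing Langfuse settings: " + ", ".join(missing))
    else:
        print("All Langfuse settings are present.")
```

Calling the helper before creating the client lets the app fail fast with a readable message instead of discovering the gap in the Langfuse UI.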
@observe
Per langfuse.com/docs/sdk/python/decorators:
Python:
from langfuse import observe
@observe(name="llm-call", as_type="generation")
async def my_async_llm_call(prompt_text):
    return "LLM response"
The decorator "automatically captures inputs, outputs, timings, and errors without modifying function logic" (per lf-py-deco).
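To make "captures inputs, outputs, timings, and errors" concrete, here is a stand-in decorator written with the standard library only. It is an illustration of the idea, not Langfuse's implementation: `observe_sketch`, the in-memory `records` list, and `fake_llm_call` are all hypothetical, and the sketch wraps a synchronous function for brevity.

```python
import functools
import time


def observe_sketch(name):
    """Stand-in decorator (NOT Langfuse's code) that records inputs,
    outputs, timings, and errors for each call of the wrapped function."""
    def decorator(func):
        records = []  # hypothetical in-memory sink; the real SDK exports traces

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {"name": name,
                      "input": {"args": args, "kwargs": kwargs},
                      "output": None,
                      "error": None}
            start = time.perf_counter()
            try:
                record["output"] = func(*args, **kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise  # re-raise so the wrapped function's behavior is unchanged
            finally:
                record["duration_s"] = time.perf_counter() - start
                records.append(record)

        wrapper.records = records
        return wrapper
    return decorator


@observe_sketch(name="llm-call")
def fake_llm_call(prompt_text):
    if not prompt_text:
        raise ValueError("empty prompt")
    return "LLM response"
```

After `fake_llm_call("hi")`, `fake_llm_call.records[-1]` holds the arguments, the return value, and the elapsed time; a failing call is recorded with its error and still raises.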
TypeScript:
import { startActiveObservation, startObservation } from "@langfuse/tracing";
(Per lf-py-deco; the TS SDK uses an explicit
startObservation API rather than a decorator.)
Per lf-py-deco:
from langfuse import get_client
langfuse = get_client()
with langfuse.start_as_current_observation(as_type="span", name="data-processing"):
    langfuse.update_current_span(metadata={"step1_complete": True})
Common metadata fields used in production:
- model - the model name (e.g., claude-haiku-4-5)
- model_parameters - temperature / top_p / max_tokens
- usage - input/output tokens, cost
- tags - environment (prod / staging), feature flag, customer ID
- level - DEBUG / DEFAULT / WARNING / ERROR

Scores attach evaluation results to a trace or observation. Three data types per Langfuse: numeric (0 - 1), categorical (string), boolean (true/false). The full API and example invocation live at langfuse.com/docs/scores; the trace-side wiring is:
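# Note: Python SDK v3 names this method create_score(); confirm the
# method name for your SDK version at langfuse.com/docs/scores.
langfuse.score(
    trace_id="...",
    name="answer_relevance",
    value=0.87,  # numeric
    comment="Judged by GPT-4 rubric"
)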
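Because a score's value must match its data type, a small client-side check before sending can catch mistakes early. This is a sketch under assumptions: the function `check_score_value` and the type labels it accepts are hypothetical, and the 0 - 1 numeric range simply follows the description above (Langfuse itself may accept a wider range).

```python
def check_score_value(data_type, value):
    """Return True if `value` fits the named score data type.

    Hypothetical helper; the labels "numeric", "categorical", and
    "boolean" mirror the three data types described in the text.
    """
    if data_type == "numeric":
        # bool is a subclass of int in Python, so exclude it explicitly
        return (isinstance(value, (int, float))
                and not isinstance(value, bool)
                and 0.0 <= value <= 1.0)
    if data_type == "categorical":
        return isinstance(value, str)
    if data_type == "boolean":
        return isinstance(value, bool)
    raise ValueError(f"unknown data type: {data_type!r}")
```

For instance, `check_score_value("numeric", 0.87)` passes, while `check_score_value("numeric", True)` is rejected so that a boolean judgement is not stored as the number 1.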
Per lf-scores, the current Python SDK API is the source of truth; consult that page when wiring scores. Scores can come from:
[…] trace_id)
Langfuse datasets (collections of (input, expected_output) items)
can be:
Run a dataset:
items = langfuse.get_dataset_items(dataset_id="...")
for item in items:
    actual = my_llm_app(item.input)
    item.run(actual)  # links the run back to the dataset for diff vs baseline
(The exact API signature evolves; see langfuse.com/docs/datasets.)
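The shape of a dataset run, independent of any SDK version, can be sketched with plain Python. Everything here is an assumption for illustration: `DatasetItem`, `run_dataset`, and `pass_rate` are hypothetical names, exact string equality stands in for whatever comparison or scorer a real run would use, and nothing is sent to Langfuse.

```python
from dataclasses import dataclass


@dataclass
class DatasetItem:
    """Hypothetical stand-in for a dataset item: (input, expected_output)."""
    input: str
    expected_output: str


def run_dataset(items, app):
    """Call `app` on every item and compare each output to the baseline."""
    results = []
    for item in items:
        actual = app(item.input)
        results.append({
            "input": item.input,
            "expected": item.expected_output,
            "actual": actual,
            "passed": actual == item.expected_output,  # exact match, for simplicity
        })
    return results


def pass_rate(results):
    """Fraction of items whose output matched; 0.0 for an empty run."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0


# Usage with a toy app that only knows one answer.
items = [DatasetItem("2+2", "4"), DatasetItem("capital of France", "Paris")]
toy_app = lambda question: {"2+2": "4"}.get(question, "unknown")
results = run_dataset(items, toy_app)
```

Comparing `pass_rate(results)` between two runs of the same items gives a simple regression signal before and after a prompt or model change.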
Pin prompt versions in code; iterate prompts in the Langfuse UI;
roll out new prompt versions per environment (production /
staging labels) without code deploys. The langfuse.get_prompt()
API fetches the current production prompt at runtime.
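The label-based rollout described above can be modeled in a few lines. This is a hypothetical in-memory sketch, not the Langfuse client: `PromptRegistry` and its methods are invented for illustration, and only the idea (code asks for a label, and the label is repointed to a new version without a deploy) comes from the text.

```python
class PromptRegistry:
    """Hypothetical in-memory model of labeled prompt versions."""

    def __init__(self):
        self._versions = {}  # (name, version) -> template text
        self._labels = {}    # (name, label) -> version number

    def add_version(self, name, version, template):
        self._versions[(name, version)] = template

    def set_label(self, name, label, version):
        """Point a label such as "production" at an existing version."""
        if (name, version) not in self._versions:
            raise KeyError(f"{name} v{version} does not exist")
        self._labels[(name, label)] = version

    def get_prompt(self, name, label="production"):
        """Return (version, template) for whatever the label points at now."""
        version = self._labels[(name, label)]
        return version, self._versions[(name, version)]


registry = PromptRegistry()
registry.add_version("summarize", 1, "Summarize: {text}")
registry.add_version("summarize", 2, "Summarize in one sentence: {text}")
registry.set_label("summarize", "production", 1)
registry.set_label("summarize", "staging", 2)
```

Application code that always calls `get_prompt("summarize")` picks up version 2 as soon as the production label is moved, with no change to the calling code. Returning the version alongside the template is what allows a trace to record which prompt produced a given output.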
Langfuse is observability-side, not pre-deploy CI-side. CI integration patterns:
[…] answer_relevancy score over a rolling window.

These are dashboard / alerting wires (Langfuse → PagerDuty / Slack / Datadog), not CI-pipeline assertions.
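A rolling-window check on a score can be sketched as follows. The class name `RollingScoreMonitor`, the window size, and the threshold are assumptions chosen for illustration; in practice the scores would be read from Langfuse and the alert forwarded to the paging or chat tool, neither of which is shown.

```python
from collections import deque


class RollingScoreMonitor:
    """Tracks the mean of the most recent `window_size` scores."""

    def __init__(self, window_size, threshold):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.threshold = threshold
        # deque with maxlen drops the oldest score automatically
        self._scores = deque(maxlen=window_size)

    def mean(self):
        return sum(self._scores) / len(self._scores)

    def add(self, score):
        """Record one score; return True if an alert should fire.

        No alert fires until the window is full, which avoids noisy
        alerts from the first few samples.
        """
        self._scores.append(score)
        window_full = len(self._scores) == self._scores.maxlen
        return window_full and self.mean() < self.threshold


monitor = RollingScoreMonitor(window_size=3, threshold=0.7)
alerts = [monitor.add(s) for s in [0.9, 0.8, 0.9, 0.2, 0.9, 0.9, 0.9]]
```

A single low score (0.2) keeps the alert active until it leaves the window, which is the trade-off of averaging: fewer false alarms, slower recovery.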
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Trace everything in production with no sampling | Cost explodes at scale | Use level=DEBUG + UI-side sampling (Step 3) |
| Score traces only via UI (no automated path) | Can't catch silent regressions | Automated langfuse.score() per trace (Step 4) |
| Pull production trace inputs without privacy review | PII leakage into eval datasets | Cross-ref qa-test-data/synthetic-pii-generator for fixture sanitization before promotion |
| Skip prompt versioning | Prompt drift breaks attribution | langfuse.get_prompt() with version pin (Step 6) |
| Conflate Langfuse with pre-deploy eval | Tries to be both; wins neither | Pair Langfuse (post-deploy) with Promptfoo/DeepEval/Ragas (pre-deploy) |
- `@observe` decorator + observation patterns
- promptfoo-evaluation, deepeval-evaluation, ragas-evaluation, giskard-llm - pre-deploy eval sister tools
- prompt-eval-reviewer - adversarial reviewer that flags eval suites without an observability feedback loop

npx claudepluginhub testland/qa --plugin qa-llm-evaluation