From langsmith
Use when building evaluation pipelines for LangSmith. Covers creating evaluators (LLM-as-Judge, custom code), defining run functions to capture outputs/trajectories, and running evaluations locally or via LangSmith auto-run.
How this skill is triggered — by the user, by Claude, or both
Slash command
/langsmith:langsmith-evaluatorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Three core components: **(1) Creating Evaluators** - LLM-as-Judge, custom code; **(2) Defining Run Functions** - capture agent outputs/trajectories; **(3) Running Evaluations** - locally with `evaluate()` or auto-run via uploaded evaluators.
Three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - capture agent outputs/trajectories; (3) Running Evaluations - locally with evaluate() or auto-run via uploaded evaluators.
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here
LANGSMITH_PROJECT=your-project-name
OPENAI_API_KEY=your_openai_key # For LLM as Judge
Python: pip install langsmith langchain-openai python-dotenv
CLI: curl -sSL https://raw.githubusercontent.com/langchain-ai/langsmith-cli/main/scripts/install.sh | sh
Before writing ANY evaluator:
(run, example) — compares to expected values(run) — real-time quality checksEach evaluator returns ONE metric only. For multiple metrics, create separate functions.
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI
class Grade(TypedDict):
reasoning: Annotated[str, ..., "Explain your reasoning"]
is_accurate: Annotated[bool, ..., "True if response is accurate"]
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(Grade, method="json_schema", strict=True)
async def accuracy_evaluator(run, example):
run_outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
example_outputs = example.outputs if hasattr(example, "outputs") else example.get("outputs", {}) or {}
grade = await judge.ainvoke([{"role": "user", "content": f"Expected: {example_outputs}\nActual: {run_outputs}\nIs this accurate?"}])
return {"score": 1 if grade["is_accurate"] else 0, "comment": grade["reasoning"]}
def trajectory_evaluator(run, example):
run_outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
example_outputs = example.outputs if hasattr(example, "outputs") else example.get("outputs", {}) or {}
actual = run_outputs.get("YOUR_TRAJECTORY_FIELD", [])
expected = example_outputs.get("YOUR_EXPECTED_FIELD", [])
return {"score": 1 if actual == expected else 0, "comment": f"Expected {expected}, got {actual}"}
def run_agent(inputs: dict) -> dict:
result = your_agent.run(inputs)
print(f"DEBUG - type: {type(result)}, keys: {result.keys() if hasattr(result, 'keys') else 'N/A'}")
return {"output": result}
langsmith evaluator list
langsmith evaluator upload my_evaluators.py \
--name "Trajectory Match" --function trajectory_evaluator \
--dataset "My Dataset" --replace
langsmith evaluator upload my_evaluators.py \
--name "Quality Check" --function quality_check \
--project "Production Agent" --replace
langsmith evaluator delete "Trajectory Match"
Uploaded evaluators auto-run when you run experiments on the dataset.
from langsmith import evaluate
results = evaluate(run_agent, data="My Dataset", evaluators=[my_evaluator], experiment_prefix="eval-v1")
import { evaluate } from "langsmith/evaluation";
const results = await evaluate(runAgent, {
data: "My Dataset",
evaluators: [myEvaluator],
experimentPrefix: "eval-v1",
});
Local evaluate() | Uploaded to LangSmith | |
|---|---|---|
Python run type | RunTree → run.outputs | dict → run["outputs"] |
TypeScript run type | run.outputs?.field | run.outputs?.field |
| Environment | Full packages | Sandboxed, limited imports |
npx claudepluginhub xamuavila/golden-skillsBuilds LangSmith evaluation pipelines: create LLM-as-Judge/custom evaluators, capture agent outputs/trajectories via run functions, run locally with evaluate() or CLI.
Evaluates and improves GenAI agent output quality using MLflow's native APIs for datasets, scorers, and tracing. Covers end-to-end evaluation workflow or individual components.
Runs evaluations on ADK agents: writing eval datasets, analyzing failures, comparing results, and optimizing agents using the Quality Flywheel methodology.