MLflow tracing & evaluation for AI coding agent sessions
npx claudepluginhub fmurray/mlfts
mlfts extends MLflow tracing for coding agents (currently Claude Code) with tools to evaluate, judge, and improve agent quality over time.
Trace: Enriches existing traces with git state, prompt versions, and environment metadata for reproducibility.
Version: Registers your CLAUDE.md instructions as versioned prompts in MLflow's Prompt Registry and links them to each session trace, so you can correlate quality changes with prompt changes.
Evaluate: Runs LLM judges and custom scorers against traced sessions. Build evaluation datasets, define quality metrics, and track agent improvement systematically.
Feedback: Tags and annotates traces directly from Claude Code. Log human assessments, flag issues, and build labeled datasets for evaluation.
  SessionStart hook                     Stop hook
┌──────────────────────┐              ┌─────────────────────────────────┐
│ log_cc_environment.py│              │ skip_skill_traces.py            │
│                      │              │ ├─ filters skill invocations    │
│  • git SHA + dirty   │   sidecar    │ ├─ delegates to MLflow hook     │
│  • CLAUDE.md hash    │──── .json ──▶│ ├─ applies cc_env.* tags        │
│  • skills hash       │              │ └─ creates companion trace      │
│  • register prompt   │              │    (links prompt version)       │
└──────────────────────┘              └─────────────────────────────────┘
Post-session
┌──────────────────────────────────────────────────┐
│ Evaluation & Quality Improvement                 │
│ ├─ LLM judges score session quality              │
│ ├─ Custom scorers for domain-specific metrics    │
│ ├─ Aggregated metrics (latency, tokens, errors)  │
│ └─ Human feedback loop via skills                │
└──────────────────────────────────────────────────┘
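The SessionStart step in the diagram can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual log_cc_environment.py: the function names, the sidecar file location, and the specific tag keys (cc_env.git_sha, cc_env.git_dirty, cc_env.claude_md_sha256) are assumptions made for the example. The source only states that tags use the cc_env.* prefix. The skills hash and prompt registration steps are omitted.

```python
import hashlib
import json
import subprocess
from pathlib import Path
from typing import Dict, List, Optional


def file_sha256(path: Path) -> Optional[str]:
    """Return the hex SHA-256 of a file, or None if the file does not exist."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def git_output(args: List[str], cwd: Path) -> Optional[str]:
    """Run a git command in cwd and return stripped stdout, or None on failure."""
    try:
        result = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or cwd is not a repository with at least one commit.
        return None
    return result.stdout.strip()


def collect_environment(project_dir: Path) -> Dict[str, str]:
    """Gather git state and the CLAUDE.md hash under hypothetical cc_env.* keys."""
    sha = git_output(["rev-parse", "HEAD"], project_dir)
    status = git_output(["status", "--porcelain"], project_dir)
    env = {
        "cc_env.git_sha": sha,
        # Any porcelain output means there are uncommitted changes.
        "cc_env.git_dirty": None if status is None else ("true" if status else "false"),
        "cc_env.claude_md_sha256": file_sha256(project_dir / "CLAUDE.md"),
    }
    # Drop values that could not be determined.
    return {key: value for key, value in env.items() if value is not None}


def write_sidecar(project_dir: Path, sidecar_path: Path) -> Dict[str, str]:
    """Write the collected metadata to a JSON sidecar for a later hook to read."""
    env = collect_environment(project_dir)
    sidecar_path.write_text(json.dumps(env, indent=2))
    return env
```

Under these assumptions, a Stop hook could load the sidecar JSON and apply each key/value pair as a trace tag. Writing to a file decouples the two hooks, which run at different points in the session.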
Key design choices:
/tag and /feedback invocations are filtered out to keep session traces clean.
Add the marketplace source and install the plugin:
/plugin marketplace add <repo-url>
/plugin install mlfts
Then configure your tracking URI and experiment name in .claude/settings.local.json:
{
  "environment": {
    "MLFLOW_TRACKING_URI": "databricks",
    "MLFLOW_EXPERIMENT_NAME": "/Users/<your-email>/my-experiment",
    "DATABRICKS_CONFIG_PROFILE": "your-profile"
  }
}
See QUICKSTART.md for detailed setup instructions.
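If tracing does not appear after setup, a quick check of the settings file can help. The helper below is a hypothetical sketch, not part of the plugin: it assumes the two MLflow keys shown above are the ones that must be present, and treats DATABRICKS_CONFIG_PROFILE as optional.

```python
import json
from pathlib import Path
from typing import List

# Assumption for this sketch: these two keys are required for tracing to work.
REQUIRED_KEYS = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME")


def missing_mlflow_settings(settings_path: Path) -> List[str]:
    """Return the required keys that are absent or empty in the settings file."""
    if not settings_path.is_file():
        return list(REQUIRED_KEYS)
    settings = json.loads(settings_path.read_text())
    environment = settings.get("environment", {})
    return [key for key in REQUIRED_KEYS if not environment.get(key)]


if __name__ == "__main__":
    missing = missing_mlflow_settings(Path(".claude/settings.local.json"))
    print("missing keys:", missing if missing else "none")
```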
uv run pytest tests/ -v
Once installed, tracing happens automatically. Just use Claude Code normally.
/tag key1=value1 key2=value2
/feedback --name quality --value 5 --rationale "Great session"
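The key=value argument format used by /tag above can be parsed along these lines. This is an illustrative sketch only: parse_tags is a hypothetical helper, and the plugin's own parsing rules (quoting, duplicate keys, validation) may differ.

```python
import shlex
from typing import Dict


def parse_tags(argument_string: str) -> Dict[str, str]:
    """Parse 'key1=value1 key2=value2' into a dict, honoring shell-style quotes."""
    tags: Dict[str, str] = {}
    for token in shlex.split(argument_string):
        key, separator, value = token.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        # In this sketch, a repeated key keeps the last value.
        tags[key] = value
    return tags


print(parse_tags('key1=value1 key2=value2 note="needs review"'))
```

Using shlex.split lets a value contain spaces when it is quoted, which plain str.split would break apart.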
Use the agent-evaluation skill to run LLM judges against traced sessions — define scorers, build evaluation datasets, and track quality metrics over time.
SELECT * FROM main.ml_traces.mlflow_traces
ORDER BY start_time_ms DESC
LIMIT 10;
| Skill | Description |
|---|---|
| feedback-trace | Log human feedback/assessments on traces |
| tag-trace | Add key=value tags to traces |
| instrumenting-with-mlflow-tracing | Add tracing to Python/TypeScript code |
| Skill | Description |
|---|---|
| analyze-mlflow-trace | Debug and investigate a single trace |
| analyze-mlflow-chat-session | Analyze multi-turn chat sessions |
| retrieving-mlflow-traces | Search and filter traces |
| querying-mlflow-metrics | Aggregated metrics (latency, tokens, errors) |
| Skill | Description |
|---|---|
| agent-evaluation | Evaluate agent quality with LLM judges and custom scorers |