Marketplace

mlfts

MLflow tracing & evaluation for AI coding agent sessions

npx claudepluginhub fmurray/mlfts

README

1 Plugin

mlfts

v0.1.0

fmurray

Stats

Plugins: 1

Updated: Mar 23, 2026

mlfts — MLflow Tracing & Evaluation for AI Coding Agents

mlfts extends MLflow tracing for coding agents (currently Claude Code) with tools to evaluate, judge, and improve agent quality over time.

What It Does

Trace — Enriches existing traces with git state, prompt versions, and environment metadata for reproducibility.

Version — Your CLAUDE.md instructions are registered as versioned prompts in MLflow's Prompt Registry, linked to each session trace. Correlate quality changes with prompt changes.

Evaluate — Run LLM judges and custom scorers against traced sessions. Build evaluation datasets, define quality metrics, and track agent improvement systematically.

Feedback — Tag and annotate traces directly from Claude Code. Log human assessments, flag issues, and build labeled datasets for evaluation.

Architecture

SessionStart hook                    Stop hook
┌─────────────────────┐              ┌─────────────────────────────────┐
│ log_cc_environment.py│              │ skip_skill_traces.py            │
│                     │              │   ├─ filters skill invocations  │
│ • git SHA + dirty   │   sidecar   │   ├─ delegates to MLflow hook   │
│ • CLAUDE.md hash    │──── .json ──▶│   ├─ applies cc_env.* tags     │
│ • skills hash       │              │   └─ creates companion trace   │
│ • register prompt   │              │       (links prompt version)   │
└─────────────────────┘              └─────────────────────────────────┘

                    Post-session
┌──────────────────────────────────────────────────┐
│ Evaluation & Quality Improvement                 │
│   ├─ LLM judges score session quality            │
│   ├─ Custom scorers for domain-specific metrics  │
│   ├─ Aggregated metrics (latency, tokens, errors)│
│   └─ Human feedback loop via skills              │
└──────────────────────────────────────────────────┘

Key design choices:

Sidecar pattern — SessionStart writes a JSON file keyed by session ID; the Stop hook reads and deletes it (hooks can't share memory)
Fail-safe — all hook functions swallow exceptions so they never break your Claude Code session
Skill trace suppression — /tag and /feedback invocations are filtered out to keep session traces clean
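The sidecar and fail-safe choices above can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's implementation: `SIDECAR_DIR`, `fail_safe`, `write_sidecar`, and `read_and_delete_sidecar` are hypothetical names, and the README does not say where the sidecar file actually lives.

```python
import functools
import json
import tempfile
from pathlib import Path

# Hypothetical location; the real plugin may store sidecars elsewhere.
SIDECAR_DIR = Path(tempfile.gettempdir()) / "mlfts-sidecars"


def fail_safe(func):
    """Swallow any exception so a hook can never break the host session."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return None

    return wrapper


@fail_safe
def write_sidecar(session_id: str, payload: dict) -> Path:
    """SessionStart side: persist metadata in a file keyed by session ID."""
    SIDECAR_DIR.mkdir(parents=True, exist_ok=True)
    path = SIDECAR_DIR / f"{session_id}.json"
    path.write_text(json.dumps(payload))
    return path


@fail_safe
def read_and_delete_sidecar(session_id: str) -> dict:
    """Stop side: load the metadata, then remove the file."""
    path = SIDECAR_DIR / f"{session_id}.json"
    payload = json.loads(path.read_text())
    path.unlink()
    return payload
```

Because each hook runs separately, the file on disk is the only channel between them. If the Stop hook runs without a matching sidecar, the decorated reader returns None instead of raising.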

Quick Start

Prerequisites

Python 3.11+
uv package manager
Databricks workspace with Unity Catalog enabled
Databricks CLI profile configured

Install

Add the marketplace source and install the plugin:

/plugin marketplace add <repo-url>
/plugin install mlfts

Then configure your tracking URI and experiment name in .claude/settings.local.json:

{
  "environment": {
    "MLFLOW_TRACKING_URI": "databricks",
    "MLFLOW_EXPERIMENT_NAME": "/Users/<your-email>/my-experiment",
    "DATABRICKS_CONFIG_PROFILE": "your-profile"
  }
}

See QUICKSTART.md for detailed setup instructions.

Verify

uv run pytest tests/ -v

Usage

Once installed, tracing happens automatically. Just use Claude Code normally.

Tag a trace

/tag key1=value1 key2=value2
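For readers wondering how such an argument string maps to tags, the following is a small sketch of one way to parse it. `parse_tags` is a hypothetical helper, not a function the plugin is known to expose, and the quoting behavior shown is an assumption.

```python
import shlex


def parse_tags(arg_string: str) -> dict[str, str]:
    """Split 'key1=value1 key2=value2' into a dict, honoring shell-style quotes."""
    tags: dict[str, str] = {}
    for token in shlex.split(arg_string):
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        tags[key] = value
    return tags


print(parse_tags("key1=value1 key2=value2"))
# {'key1': 'value1', 'key2': 'value2'}
```

Using `shlex.split` rather than `str.split` lets a value contain spaces when it is quoted, for example `note="needs review"`.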

Log feedback

/feedback --name quality --value 5 --rationale "Great session"

Evaluate sessions

Use the agent-evaluation skill to run LLM judges against traced sessions — define scorers, build evaluation datasets, and track quality metrics over time.

Query traces in SQL

SELECT * FROM main.ml_traces.mlflow_traces
ORDER BY start_time_ms DESC
LIMIT 10;

Skills

Tracing & Annotation

Skill	Description
`feedback-trace`	Log human feedback/assessments on traces
`tag-trace`	Add key=value tags to traces
`instrumenting-with-mlflow-tracing`	Add tracing to Python/TypeScript code

Analysis & Debugging

Skill	Description
`analyze-mlflow-trace`	Debug and investigate a single trace
`analyze-mlflow-chat-session`	Analyze multi-turn chat sessions
`retrieving-mlflow-traces`	Search and filter traces
`querying-mlflow-metrics`	Aggregated metrics (latency, tokens, errors)

Evaluation & Quality

Skill	Description
`agent-evaluation`	Evaluate agent quality with LLM judges and custom scorers
