evalview

Run regression testing for AI agents by capturing golden baselines of agent interactions and auto-detecting behavioral regressions after code, prompt, or model changes. Includes watch mode for live scorecard updates and MCP integration with OpenAI and Anthropic APIs.

The open-source behavior regression gate for AI agents.
Think Playwright, but for tool-calling and multi-turn AI agents.

Your agent can still return 200 and be wrong. A model or provider update can change tool choice, skip a clarification, or degrade output quality without changing your code or breaking a health check. EvalView catches those silent regressions before users do — and gives you the loop to investigate them, grade the confidence, and broadcast the verdict to your team.

You don't need frontier-lab resources to run a serious agent regression loop. EvalView gives solo devs, startups, and small AI teams the same core discipline: snapshot behavior, detect drift, classify changes, and review or heal them safely.

Traditional tests tell you if your agent is up. EvalView tells you if it still behaves correctly. It tracks drift across outputs, tools, model IDs, and runtime fingerprints with graded confidence — not a binary alarm — so you can tell "the provider changed" from "my system regressed."

30-second live demo.

Most eval tools stop at detect and compare. EvalView helps you classify changes, inspect drift, and auto-heal the safe cases.

Catch silent regressions that normal tests miss
Separate provider/model drift from real system regressions
Auto-heal flaky failures with retries, review gates, and audit logs
Replay deterministically — cassettes capture real tool calls once so CI never re-hits live services
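
The cassette idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not EvalView's implementation: the `Cassette` class, its JSON file layout, and the `lookup_order` stub are all hypothetical names made up for the example. It records each tool result once, then serves the recorded result on replay so the live function is never called.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


class Cassette:
    """Hypothetical record/replay helper (an assumption, not EvalView's API)."""

    def __init__(self, path: Path, record: bool) -> None:
        self.path = path
        self.record = record
        # In replay mode, load previously recorded results from disk.
        self.calls: dict[str, Any] = {} if record else json.loads(path.read_text())

    @staticmethod
    def _key(tool: str, args: dict[str, Any]) -> str:
        # Sort keys so identical arguments always map to the same entry.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def call(self, tool: str, args: dict[str, Any], live: Callable[..., Any]) -> Any:
        key = self._key(tool, args)
        if self.record:
            # Record mode: hit the live tool once and remember the result.
            self.calls[key] = live(**args)
            return self.calls[key]
        # Replay mode: never touch the live service.
        if key not in self.calls:
            raise KeyError(f"no recorded call for {key}")
        return self.calls[key]

    def save(self) -> None:
        self.path.write_text(json.dumps(self.calls, indent=2))


def lookup_order(order_id: str) -> dict[str, str]:
    # Stand-in for a real tool that would call an external service.
    return {"order_id": order_id, "status": "shipped"}


def offline(**_: Any) -> Any:
    raise RuntimeError("live service must not be hit during replay")


with tempfile.TemporaryDirectory() as tmp:
    tape = Path(tmp) / "refund-request.json"

    recorder = Cassette(tape, record=True)
    recorded = recorder.call("lookup_order", {"order_id": "A1"}, lookup_order)
    recorder.save()

    player = Cassette(tape, record=False)
    replayed = player.call("lookup_order", {"order_id": "A1"}, offline)
```

Raising on an unrecorded call, rather than falling back to the live tool, is a deliberate choice in this sketch: a CI run that silently reaches a live service is no longer deterministic.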

Built for frontier-lab rigor, startup-team practicality:

targeted behavior runs instead of giant always-on eval suites
deterministic diffs first, LLM judgment where it adds signal
faster loops from change -> eval -> review -> ship

How we run EvalView with this operating model →

  ✓ login-flow           PASSED
  ⚠ refund-request       TOOLS_CHANGED
      - lookup_order → check_policy → process_refund
      + lookup_order → check_policy → process_refund → escalate_to_human
  ✗ billing-dispute      REGRESSION  -30 pts
      Score: 85 → 55  Output similarity: 35%
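
To make the statuses above concrete, here is a rough sketch of how a run could be classified against its baseline. The status names and the sample numbers come from the output above; the `Run` type, the `classify` function, the 10-point threshold, and the use of `difflib` for similarity are assumptions for illustration, not EvalView's actual logic.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Run:
    tools: list[str]   # ordered tool calls made by the agent
    score: float       # evaluator score, 0-100
    output: str        # final agent response


def output_similarity(baseline: Run, current: Run) -> float:
    # Character-level similarity in [0, 1]; a stand-in for a real metric.
    return SequenceMatcher(None, baseline.output, current.output).ratio()


def classify(baseline: Run, current: Run, max_score_drop: float = 10.0) -> str:
    # Assumed rule: a large score drop outranks a tool-path change.
    if baseline.score - current.score > max_score_drop:
        return "REGRESSION"
    if baseline.tools != current.tools:
        return "TOOLS_CHANGED"
    return "PASSED"


path = ["lookup_order", "check_policy", "process_refund"]

refund = classify(
    Run(path, 90.0, "Refund issued."),
    Run(path + ["escalate_to_human"], 90.0, "Refund issued."),
)
billing = classify(
    Run(["lookup_invoice"], 85.0, "The charge was valid."),
    Run(["lookup_invoice"], 55.0, "I cannot help with that."),
)
print(refund, billing)
```

Checking the deterministic signals (tool path, score) before anything fuzzier mirrors the "deterministic diffs first" principle stated earlier.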

The money screen is the one-line verdict that lands under every check — a single ship/don't-ship decision derived from the diff, quarantine state, cost delta, and drift confidence:

─────────────────────────────────────────────
 VERDICT: 🛑 BLOCK RELEASE
─────────────────────────────────────────────

  • 1 regression: billing-dispute
  • 1 test changed behavior: refund-request
  • Cost up 14% vs baseline

Likely cause & next actions:

  1. Rerun statistically to distinguish flake from real drift
     (high severity, high confidence)
     → evalview check --statistical 5

  2. Review tool descriptions for: escalate_to_human
     (high severity, high confidence)
     Tool selection changed — usually a prompt edit nudged the model
     → evalview replay refund-request --trace
     → evalview golden update refund-request   # if the new path is correct

Four tiers: SAFE_TO_SHIP, SHIP_WITH_QUARANTINE, INVESTIGATE, BLOCK_RELEASE. The verdict is part of --json output, the PR comment, and the cloud ship page — CLI, CI, and dashboard all tell the same story.
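
One way to picture how the four tiers could fall out of the inputs is the sketch below. Only the tier names are taken from the text; the `decide` function, its parameters, the precedence of the rules, and the 25% cost limit are assumptions, and drift confidence is left out to keep the example short.

```python
from enum import Enum


class Verdict(Enum):
    SAFE_TO_SHIP = "SAFE_TO_SHIP"
    SHIP_WITH_QUARANTINE = "SHIP_WITH_QUARANTINE"
    INVESTIGATE = "INVESTIGATE"
    BLOCK_RELEASE = "BLOCK_RELEASE"


def decide(
    regressions: int,
    changed: int,
    quarantined_failures: int,
    cost_delta_pct: float,
    max_cost_delta_pct: float = 25.0,  # assumed threshold
) -> Verdict:
    # Most severe condition wins (assumed precedence).
    if regressions > 0:
        return Verdict.BLOCK_RELEASE
    if changed > 0 or cost_delta_pct > max_cost_delta_pct:
        return Verdict.INVESTIGATE
    if quarantined_failures > 0:
        return Verdict.SHIP_WITH_QUARANTINE
    return Verdict.SAFE_TO_SHIP


# The example run above: 1 regression, 1 changed test, cost up 14%.
verdict = decide(regressions=1, changed=1, quarantined_failures=0, cost_delta_pct=14.0)
print(verdict.value)
```

Because the result is a single enum value, the same verdict can be serialized unchanged into JSON output, a PR comment, or a dashboard.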

Quick Start

pip install evalview

evalview init        # Detect agent, auto-configure profile + starter suite
evalview snapshot    # Save current behavior as baseline
evalview check       # Catch regressions after every change

That's it. Three commands to regression-test any AI agent. init auto-detects your agent type (chat, tool-use, multi-step, RAG, coding) and configures the right evaluators, thresholds, and assertions.

After check, the investigative loop:

evalview

