Skill

continuous-eval

Evaluates RAG agents with Ragas for faithfulness, answer relevance, context precision, toxicity, and PII leakage post-deployment and hourly via cron. Manages golden datasets and regression gates blocking 5% drops for autopilot-deploy canary.

Python

Prometheus

ai-ml

testing

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agenticops:continuous-eval

User invocable

Model invocable

Inline context

Default effort

Configuration

Modelclaude-sonnet-4-6

Tool Access

This skill is limited to the following tools:

ReadGrepBashmcp__cloudwatchmcp__prometheus

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- `autopilot-deploy`의 canary-10/canary-50 단계 gate 평가

SKILL.md

196 lines · ~1.7k tokens

Stats

LanguagePython

Parent stars10

Parent forks3

MaintenanceExcellent

Last CommitApr 30, 2026

Actions

View Source View Plugin View on GitHub View README

When to Use

autopilot-deploy의 canary-10/canary-50 단계 gate 평가
매 1시간 cron 주기로 프로덕션 trace 품질 확인
self-improving-loop가 생성한 PR의 머지 전후 회귀 검증
Golden dataset이 업데이트된 직후 전체 agent 재평가

사용 제외:

Baseline 메트릭이 아직 축적되지 않은 신규 agent (최소 100개 golden sample 필요)
오프라인 개발 환경 — 본 skill은 프로덕션 trace에 대한 지속 평가가 목적

Prerequisites

Ragas (latest stable, pip install ragas) Python 런타임.
Golden dataset: .omao/plans/eval/golden/${target}.jsonl (최소 100 sample, 도메인 전문가 검증 완료).
Langfuse v3.x 프로덕션 trace 수집 완료.
LLM judge 모델 엔드포인트 — Ragas가 내부적으로 호출하는 평가용 LLM (Qwen3-7B 또는 Claude Sonnet 4.6 권장).
awslabs.prometheus-mcp-server==0.2.15 — 평가 결과를 Prometheus에 푸시하기 위한 경로 (@latest 금지, PyPI 버전 pin 필수).
Regression gate 임계값 설정 파일 (.omao/plans/eval/thresholds.yaml).

5개 핵심 지표

본 skill은 모든 대상 agent에 대해 동일한 5개 지표를 평가합니다. 각 지표의 정의와 허용 범위는 다음과 같습니다.

지표	정의	Baseline (권장)	Gate 임계값
Faithfulness	응답이 제공된 context와 사실적으로 일치하는 비율	0.85	baseline - 5pp
Answer Relevance	응답이 질문의 의도에 부합하는 정도	0.80	baseline - 5pp
Context Precision	검색된 context의 관련성 순위 품질	0.75	baseline - 5pp
Toxicity	유해·혐오 표현 포함률	0.0	0.0 (tolerance 0)
PII Leakage	개인정보 토큰 노출 비율	0.0	0.0 (tolerance 0)

Toxicity와 PII Leakage는 tolerance 0 정책을 사용합니다. 1건이라도 검출되면 즉시 gate 실패로 처리됩니다.

실행 흐름

1. 평가 모드 결정

3가지 모드 중 하나를 선택합니다.

mode=canary — 배포 gate에서 호출. Golden dataset + 최근 canary trace 샘플 혼합.
mode=hourly — cron 호출. 프로덕션 trace 최근 1시간 샘플만.
mode=full — 머지 직후 호출. Golden dataset 전체 + 프로덕션 trace 24시간.

2. 데이터셋 로드

import json
from datasets import Dataset
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from ragas import evaluate

def load_dataset(target: str, mode: str) -> Dataset:
    samples = []
    if mode in {"canary", "full"}:
        with open(f".omao/plans/eval/golden/{target}.jsonl") as f:
            samples += [json.loads(line) for line in f]
    if mode in {"canary", "hourly", "full"}:
        samples += fetch_langfuse_traces(
            target=target,
            hours=1 if mode == "hourly" else 24,
        )
    return Dataset.from_list(samples)

fetch_langfuse_traces는 Langfuse REST API로 trace를 조회하여 {question, answer, contexts, ground_truth} 스키마로 변환합니다.

3. Ragas 평가 실행

ds = load_dataset(target, mode)
result = evaluate(
    ds,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result.to_pandas().describe())

4. Toxicity / PII Leakage 검증

Ragas metric과는 별도로 response 컬럼에 대해 두 가지 안전 검사를 수행합니다.

from presidio_analyzer import AnalyzerEngine
from detoxify import Detoxify

pii_analyzer = AnalyzerEngine()
toxicity_model = Detoxify("multilingual")

def safety_check(answer: str) -> dict:
    pii = pii_analyzer.analyze(text=answer, language="en")
    tox = toxicity_model.predict(answer)
    return {
        "pii_leakage": 1 if pii else 0,
        "toxicity": 1 if tox["toxicity"] > 0.5 else 0,
    }

5. Regression Gate 판정

Baseline과 비교하여 gate 통과 여부를 결정합니다.

def gate_decision(current: dict, baseline: dict) -> dict:
    fails = []
    for metric in ("faithfulness", "answer_relevancy", "context_precision"):
        if current[metric] < baseline[metric] - 0.05:
            fails.append(f"{metric}: {current[metric]:.3f} < baseline {baseline[metric]:.3f} - 5pp")
    for metric in ("toxicity", "pii_leakage"):
        if current[metric] > 0:
            fails.append(f"{metric}: positive instance detected")
    return {
        "decision": "pass" if not fails else "fail",
        "failures": fails,
    }

6. 결과 푸시

Prometheus remote_write: agenticops_eval_faithfulness, agenticops_eval_toxicity 등 gauge 메트릭.
.omao/plans/eval/results/${target}-${timestamp}.json 저장.
Gate 실패 시: incident-response 호출 (SEV3) + autopilot-deploy의 진행 중 배포 차단.

Golden Dataset 관리

Golden dataset은 본 skill의 기반입니다. 다음 규칙을 따릅니다.

버전 관리 — Dataset 변경은 반드시 PR 리뷰를 거친다. git log .omao/plans/eval/golden/ 로 변경 이력 추적.
소스 다양성 — 도메인 전문가가 직접 작성한 50%, 프로덕션 trace에서 승격한 50% 비율 권장.
PII 스캔 통과 — Presidio로 사전 스캔하여 PII가 포함된 샘플은 거부.
최소 크기 — 100 samples (신뢰구간 확보), 권장 300+.
재평가 주기 — 월 1회 domain expert 재검증, drift가 관찰되면 즉시 업데이트.

Example Inputs/Outputs

Input: /continuous-eval rag-qa-agent:v2.3.1 --mode canary

Output (pass):

[11:12Z] Mode: canary
[11:12Z] Dataset: 100 golden + 50 canary traces = 150 samples
[11:13Z] Ragas evaluation running (judge model: qwen3-7b)...
[11:14Z] faithfulness      = 0.89  (baseline 0.87, Δ +0.02) PASS
[11:14Z] answer_relevancy  = 0.84  (baseline 0.82, Δ +0.02) PASS
[11:14Z] context_precision = 0.78  (baseline 0.75, Δ +0.03) PASS
[11:14Z] toxicity          = 0     PASS
[11:14Z] pii_leakage       = 0     PASS
[11:14Z] Gate decision: PASS
[11:14Z] Pushed to Prometheus. Results at .omao/plans/eval/results/rag-qa-agent-20260421-1114.json

Output (fail):

[11:12Z] Mode: canary
[11:14Z] faithfulness      = 0.78  (baseline 0.87, Δ -0.09) FAIL (threshold -0.05)
[11:14Z] pii_leakage       = 1     FAIL (tolerance 0)
[11:14Z] Gate decision: FAIL
[11:14Z] Triggering incident-response (SEV3)
[11:14Z] Notifying autopilot-deploy: freeze canary progression for rag-qa-agent:v2.3.1

continuous-eval

Popularity

Invocation

Configuration

Tool Access

Context Preview

SKILL.md

continuous-eval

Popularity

Invocation

Configuration

Tool Access

Context Preview

SKILL.md

When to Use

Prerequisites

5개 핵심 지표

실행 흐름

1. 평가 모드 결정

2. 데이터셋 로드

3. Ragas 평가 실행

4. Toxicity / PII Leakage 검증

5. Regression Gate 판정

6. 결과 푸시

Golden Dataset 관리

Example Inputs/Outputs

참고 자료

공식 문서

기술 블로그

관련 문서 (내부)

Similar Skills

When to Use

Prerequisites

5개 핵심 지표

실행 흐름

1. 평가 모드 결정

2. 데이터셋 로드

3. Ragas 평가 실행

4. Toxicity / PII Leakage 검증

5. Regression Gate 판정

6. 결과 푸시

Golden Dataset 관리

Example Inputs/Outputs

참고 자료

공식 문서

기술 블로그

관련 문서 (내부)

Similar Skills