Skill

eval-evaluator

Evaluates code generation quality using ICE Score and Code Judge metrics. Assesses functional correctness, usefulness, and consistency of AI-generated code against requirements.

Python

developer-tools

ai-ml

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/omni:eval-evaluator

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This skill evaluates code generation quality using standardized metrics including ICE Score (Functional Correctness + Usefulness) and Code Judge assessments.

Supporting Files

README.md
evals/evals.json
prompts/code_judge_no_answer_v1.jinja2
prompts/code_judge_no_answer_v2.jinja2
prompts/code_judge_with_answer_v1.jinja2
prompts/code_judge_with_answer_v2.jinja2
prompts/ice_score_functional_correctness_no_answer.jinja2
prompts/ice_score_functional_correctness_with_answer.jinja2
prompts/ice_score_usefulness_no_answer.jinja2
prompts/ice_score_usefulness_with_answer.jinja2
references/config-example.json
scripts/__pycache__/judge_model_metrics_standalone.cpython-311.pyc
scripts/evaluate_code.py
scripts/judge_model_metrics_standalone.py
scripts/prompts/code_judge_no_answer_v1.jinja2
scripts/prompts/code_judge_no_answer_v2.jinja2
scripts/prompts/code_judge_with_answer_v1.jinja2
scripts/prompts/code_judge_with_answer_v2.jinja2
scripts/prompts/ice_score_functional_correctness_no_answer.jinja2
scripts/prompts/ice_score_functional_correctness_with_answer.jinja2

SKILL.md

203 lines · ~1.2k tokens

Stats

Language: Python

Stars: 48

Forks: 2

Maintenance: Good

Last Commit: Jun 2, 2026

Judge Model Evaluator Skill

This skill evaluates code generation quality using standardized metrics including ICE Score (Functional Correctness + Usefulness) and Code Judge assessments.

When to Use

Evaluating AI-generated code against requirements
Assessing code quality in automated workflows
Comparing multiple code implementations
Getting structured feedback on code changes
Validating that code meets specification requirements

Quick Start

Ensure you have API access configured
Provide the requirement description and generated code
Optionally provide reference code for comparison
Get comprehensive evaluation scores

What You Need to Provide

Required Inputs

Requirements/Feature Description - What the code is supposed to do
- Format: List of change descriptions or natural language requirements
Generated Code - The code to be evaluated
- Format: String or structured code blocks with file paths

Optional Inputs

Reference Code - Correct/expected implementation (if available)
- Improves evaluation accuracy
- Format: List of code answers with file paths and snippets
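
The inputs above could be modelled roughly as follows. This is only a sketch: the class and field names (`CodeSnippet`, `EvaluationInput`, `file_path`, and so on) are hypothetical and are not taken from the skill's scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeSnippet:
    """One piece of code together with the file it belongs to (hypothetical shape)."""
    file_path: str
    code: str


@dataclass
class EvaluationInput:
    """Hypothetical container for one evaluation request."""
    requirements: list[str]                             # change descriptions or natural-language requirements
    generated_code: list[CodeSnippet]                   # the code to be evaluated
    reference_code: Optional[list[CodeSnippet]] = None  # optional expected implementation

    def has_reference(self) -> bool:
        """True when reference code was supplied and is non-empty."""
        return bool(self.reference_code)

    def validate(self) -> None:
        """Raise ValueError if a required input is missing."""
        if not self.requirements:
            raise ValueError("requirements must not be empty")
        if not self.generated_code:
            raise ValueError("generated_code must not be empty")


example = EvaluationInput(
    requirements=["Add user authentication with JWT tokens"],
    generated_code=[CodeSnippet("auth.py", "def login(user): ...")],
)
example.validate()  # passes: both required inputs are present
```

Keeping the reference code optional mirrors the split between required and optional inputs described above.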

Evaluation Metrics

ICE Score Components

Functional Correctness (0-1): Does the code correctly implement the requirements?
Usefulness (0-1): Is the code practical and well-structured?

Code Judge Components

Score (0-1): Overall code consistency and quality
Inconsistencies: Detailed list of issues found
- Severity levels: Small, Major, Fatal
Inconsistencies Count: Number of issues found
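
As an illustration of how the inconsistency list relates to the count and the severity levels, here is a small sketch. The shape of each inconsistency entry is not documented here, so the `severity` and `description` keys are assumptions made for this example only.

```python
from collections import Counter

# Severity levels named in the list above.
SEVERITIES = ("Small", "Major", "Fatal")


def summarize_inconsistencies(inconsistencies: list[dict]) -> dict:
    """Count issues overall and per severity level.

    Assumes each entry is a dict with a "severity" key (hypothetical field name).
    """
    counts = Counter(item.get("severity", "Unknown") for item in inconsistencies)
    return {
        "inconsistencies_count": len(inconsistencies),
        "by_severity": {level: counts.get(level, 0) for level in SEVERITIES},
    }


issues = [
    {"severity": "Major", "description": "Token expiry is never checked"},
    {"severity": "Small", "description": "Unused import"},
]
summary = summarize_inconsistencies(issues)
```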

Overall Metric

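Average LLM Judge Metric (avg_llm_judge_metric): A single combined score across the ICE Score and Code Judge metrics. In Example 1 it matches the mean of the three scores: (0.75 + 1.0 + 0.75) / 3 ≈ 0.8333.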
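
The combined score can be sketched as an unweighted mean of the three per-metric scores. That formula is an assumption: it reproduces the sample output in Example 1, but the skill's scripts may weight or round differently. The function name is hypothetical; the dictionary keys follow the Example 1 output.

```python
from statistics import mean


def average_llm_judge_metric(result: dict) -> float:
    """Unweighted mean of the three scores, rounded to four decimals (assumed formula)."""
    scores = [
        result["ice_score"]["functional_correctness"]["score"],
        result["ice_score"]["usefulness"]["score"],
        result["code_judge"]["score"],
    ]
    return round(mean(scores), 4)


sample = {
    "ice_score": {
        "functional_correctness": {"score": 0.75},
        "usefulness": {"score": 1.0},
    },
    "code_judge": {"score": 0.75},
}
print(average_llm_judge_metric(sample))  # prints 0.8333
```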

Setup Instructions

1. Configure API Access

Create a configuration with your LLM API details:

{
  "api": {
    "url": "your-api-endpoint",
    "key": "Bearer your-api-key",
    "model_name": "model-name",
    "timeout": 60,
    "max_tokens": 16384,
    "temperature": 0.1
  }
}
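
A minimal sketch of reading such a configuration in Python follows. The helper names, the choice of which keys are mandatory, and the use of the key as an `Authorization` header are assumptions for illustration; the skill's own scripts may load the file differently.

```python
import json

# Assumed to be mandatory; the remaining settings fall back to the example values.
REQUIRED_KEYS = ("url", "key", "model_name")
DEFAULTS = {"timeout": 60, "max_tokens": 16384, "temperature": 0.1}


def load_api_config(text: str) -> dict:
    """Parse the JSON text and return the "api" section with defaults filled in."""
    cfg = json.loads(text).get("api", {})
    missing = [name for name in REQUIRED_KEYS if not cfg.get(name)]
    if missing:
        raise ValueError(f"missing api settings: {', '.join(missing)}")
    return {**DEFAULTS, **cfg}


def build_headers(cfg: dict) -> dict:
    """Build request headers (assumed usage).

    The example key already carries the "Bearer " prefix, so it is passed through unchanged.
    """
    return {"Authorization": cfg["key"], "Content-Type": "application/json"}


config_text = '{"api": {"url": "your-api-endpoint", "key": "Bearer your-api-key", "model_name": "model-name"}}'
cfg = load_api_config(config_text)
headers = build_headers(cfg)
```

Validating the configuration up front surfaces a missing endpoint or key before any evaluation request is sent.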

2. Install Dependencies

pip install requests jinja2 loguru

Usage Examples

Example 1: Basic Evaluation

Input:

Requirements: "Add user authentication with JWT tokens"
Generated Code: Python implementation of auth system

Output:

{
  "avg_llm_judge_metric": 0.8333,
  "ice_score": {
    "functional_correctness": {"score": 0.75, ...},
    "usefulness": {"score": 1.0, ...}
  },
  "code_judge": {
    "score": 0.75,
    "inconsistencies": [...],
    "inconsistencies_count": 1
  }
}

Example 2: With Reference Code

Input:

Requirements: "Implement sorting algorithm"
Generated Code: Bubble sort implementation
Reference Code: Optimized quicksort

Benefit: More accurate evaluation with direct comparison

Error Handling

The skill handles common errors gracefully:

API connection failures
Invalid input formats
Template rendering errors
JSON parsing issues

All errors are reported in the output with descriptive messages.
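
One way to report a failure in the output rather than raising is sketched below for the JSON parsing case. The function name and the `error` field are hypothetical; the skill's actual error format is not shown here.

```python
import json


def parse_judge_response(raw: str) -> dict:
    """Parse a judge reply as JSON; on failure return an error entry instead of raising."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"score": None, "error": f"JSON parsing failed: {exc.msg}"}
    if not isinstance(data, dict) or "score" not in data:
        return {"score": None, "error": "invalid format: expected an object with a 'score' field"}
    return data


ok = parse_judge_response('{"score": 0.75}')
bad = parse_judge_response("not json")
```

The same pattern could wrap API calls and template rendering so that each failure ends up as a descriptive message in the result.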

Best Practices

Provide Clear Requirements: Be specific about what the code should do
Include Context: Add relevant domain information if needed
Use Reference Code: When reference code is available, supplying it significantly improves evaluation accuracy
Check API Configuration: Ensure proper API access before running
Review Detailed Feedback: Look at individual metric scores and justifications

Template Files

The skill uses optimized prompt templates stored in prompts/:

ICE Score templates for functional correctness and usefulness
Code Judge templates with/without reference answers
Both v1 and v2 variants for different evaluation strategies
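
The template variants follow a regular naming pattern in the supporting files, which the sketch below uses to pick a file. The function itself, its parameter names, and the default of `version=2` are assumptions; how the skill's scripts actually select a template is not documented here.

```python
def template_name(metric: str, has_reference: bool, version: int = 2) -> str:
    """Return the prompt template path for a metric.

    metric: "code_judge", "functional_correctness", or "usefulness".
    version only applies to Code Judge templates, which come in v1 and v2.
    """
    answer = "with_answer" if has_reference else "no_answer"
    if metric == "code_judge":
        if version not in (1, 2):
            raise ValueError("Code Judge templates exist only as v1 and v2")
        return f"prompts/code_judge_{answer}_v{version}.jinja2"
    if metric in ("functional_correctness", "usefulness"):
        return f"prompts/ice_score_{metric}_{answer}.jinja2"
    raise ValueError(f"unknown metric: {metric}")


print(template_name("code_judge", has_reference=True))
print(template_name("usefulness", has_reference=False))
```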

Output Interpretation

Scores close to 1.0: Excellent implementation
Scores around 0.5-0.7: Acceptable with room for improvement
Scores below 0.5: Significant issues detected
Inconsistencies: Detailed feedback for improvement
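
The bands above can be turned into a small helper, sketched here. The guidance does not say where "close to 1.0" begins or how to read scores between 0.7 and that point, so the `excellent_from` cutoff of 0.9 and the wording of the in-between label are assumptions.

```python
def interpret_score(score: float, excellent_from: float = 0.9) -> str:
    """Map a 0-1 score to the interpretation bands listed above.

    excellent_from is an assumed cutoff for "close to 1.0".
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < 0.5:
        return "significant issues detected"
    if score <= 0.7:
        return "acceptable with room for improvement"
    if score >= excellent_from:
        return "excellent implementation"
    # The guidance above does not define this range.
    return "between acceptable and excellent"


for value in (0.0, 0.6, 0.8, 1.0):
    print(value, interpret_score(value))
```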

Output template

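Evaluation Results

The code has been evaluated using the xx model. Detailed results follow:

📊 Overall Scores

Average LLM Judge Metric:

Functional Correctness:
Usefulness:
Code Consistency:

🔍 Detailed Analysis

Functional Correctness (0.5/1.0)

Strengths:
xxx

Main issues:
xxx

Usefulness (0.75/1.0)

Strengths:
xxx

Main issues:
xxx

Code Consistency (0.0/1.0)

Inconsistencies found:
xxx

💡 Suggestions for Improvement

Summary: xxxx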
