From omni
Evaluates code generation quality using ICE Score and Code Judge metrics. Assesses functional correctness, usefulness, and consistency of AI-generated code against requirements.
Slash command
/omni:eval-evaluator
This skill evaluates code generation quality using standardized metrics including ICE Score (Functional Correctness + Usefulness) and Code Judge assessments.
README.md
evals/evals.json
prompts/code_judge_no_answer_v1.jinja2
prompts/code_judge_no_answer_v2.jinja2
prompts/code_judge_with_answer_v1.jinja2
prompts/code_judge_with_answer_v2.jinja2
prompts/ice_score_functional_correctness_no_answer.jinja2
prompts/ice_score_functional_correctness_with_answer.jinja2
prompts/ice_score_usefulness_no_answer.jinja2
prompts/ice_score_usefulness_with_answer.jinja2
references/config-example.json
scripts/__pycache__/judge_model_metrics_standalone.cpython-311.pyc
scripts/evaluate_code.py
scripts/judge_model_metrics_standalone.py
scripts/prompts/code_judge_no_answer_v1.jinja2
scripts/prompts/code_judge_no_answer_v2.jinja2
scripts/prompts/code_judge_with_answer_v1.jinja2
scripts/prompts/code_judge_with_answer_v2.jinja2
scripts/prompts/ice_score_functional_correctness_no_answer.jinja2
scripts/prompts/ice_score_functional_correctness_with_answer.jinja2
Requirements/Feature Description - What the code is supposed to do
Generated Code - The code to be evaluated
Create a configuration with your LLM API details:
{
  "api": {
    "url": "your-api-endpoint",
    "key": "Bearer your-api-key",
    "model_name": "model-name",
    "timeout": 60,
    "max_tokens": 16384,
    "temperature": 0.1
  }
}
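A minimal sketch of loading and checking a configuration shaped like the one above. The helper name `load_api_config`, the choice of which fields are mandatory, and the use of the example values as defaults are all assumptions made for illustration; the skill's own scripts may validate the file differently.

```python
import json

# Assumption: these three fields are treated as mandatory; the skill may differ.
REQUIRED_API_FIELDS = {"url", "key", "model_name"}

# Assumption: fall back to the values shown in the example configuration.
DEFAULTS = {"timeout": 60, "max_tokens": 16384, "temperature": 0.1}


def load_api_config(text: str) -> dict:
    """Parse a JSON config string and return its "api" section (hypothetical helper)."""
    config = json.loads(text)
    api = config.get("api")
    if not isinstance(api, dict):
        raise ValueError('config must contain an "api" object')
    missing = REQUIRED_API_FIELDS - api.keys()
    if missing:
        raise ValueError(f"missing api fields: {sorted(missing)}")
    # Fill optional fields without overwriting values the user supplied.
    return {**DEFAULTS, **api}


example = """
{
  "api": {
    "url": "your-api-endpoint",
    "key": "Bearer your-api-key",
    "model_name": "model-name",
    "temperature": 0.1
  }
}
"""

api = load_api_config(example)
print(api["model_name"], api["timeout"], api["max_tokens"])
```

Checking the file before any request is sent means a missing endpoint or key surfaces as a clear message instead of a failed API call.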
pip install requests jinja2 loguru
Input:
Output:
{
  "avg_llm_judge_metric": 0.8333,
  "ice_score": {
    "functional_correctness": {"score": 0.75, ...},
    "usefulness": {"score": 1.0, ...}
  },
  "code_judge": {
    "score": 0.75,
    "inconsistencies": [...],
    "inconsistencies_count": 1
  }
}
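The sample numbers above are consistent with `avg_llm_judge_metric` being the plain mean of the three scores: (0.75 + 1.0 + 0.75) / 3 ≈ 0.8333. That reading is inferred from the example output, not confirmed by the skill's source, and the function name below is hypothetical. A short sketch under that assumption:

```python
def average_judge_metric(result: dict) -> float:
    """Mean of the three metric scores, rounded to 4 places (assumed formula)."""
    scores = [
        result["ice_score"]["functional_correctness"]["score"],
        result["ice_score"]["usefulness"]["score"],
        result["code_judge"]["score"],
    ]
    return round(sum(scores) / len(scores), 4)


# Scores copied from the sample output; elided fields are left out.
sample = {
    "ice_score": {
        "functional_correctness": {"score": 0.75},
        "usefulness": {"score": 1.0},
    },
    "code_judge": {"score": 0.75},
}

print(average_judge_metric(sample))  # 0.8333
```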
Input:
Benefit: More accurate evaluation with direct comparison
The skill handles common errors gracefully:
All errors are reported in the output with descriptive messages.
The skill uses optimized prompt templates stored in prompts/:
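The file names listed for this skill suggest one template per metric, in a `with_answer` and a `no_answer` variant, with Code Judge additionally versioned as `v1` and `v2`. The sketch below maps an evaluation mode to those file names. The function `select_templates` and the default of `v2` are assumptions for illustration; how the skill's scripts actually choose a template is not shown in the source.

```python
def select_templates(has_reference_answer: bool, code_judge_version: str = "v2") -> dict:
    """Return template file names for each metric (hypothetical helper)."""
    if code_judge_version not in ("v1", "v2"):
        raise ValueError("code_judge_version must be 'v1' or 'v2'")
    # The variant depends on whether a reference answer was supplied.
    variant = "with_answer" if has_reference_answer else "no_answer"
    return {
        "functional_correctness": f"ice_score_functional_correctness_{variant}.jinja2",
        "usefulness": f"ice_score_usefulness_{variant}.jinja2",
        "code_judge": f"code_judge_{variant}_{code_judge_version}.jinja2",
    }


templates = select_templates(has_reference_answer=False)
for metric, filename in templates.items():
    print(f"{metric}: prompts/{filename}")
```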
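Evaluation Results
The code was evaluated with the xx model. Detailed results follow:
📊 Overall Score
Average LLM judge metric:
Functional correctness:
Usefulness:
Code consistency:
🔍 Detailed Analysis
Functional correctness (0.5/1.0)
Strengths:
xxx
Main issues:
xxx
Usefulness (0.75/1.0)
Strengths:
xxx
Main issues:
xxx
Code consistency (0.0/1.0)
Inconsistencies found:
xxx
💡 Suggestions for Improvement
Summary: xxxx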
npx claudepluginhub zte-aicloud/co-omnispec --plugin omni