From ai-ml-eng-pro
Systematic LLM and ML model evaluation — benchmarks, metrics, regression detection, and model comparison. Use when assessing or comparing AI model quality.
Slash command
/ai-ml-eng-pro:model-evaluator
Provides a systematic framework for evaluating LLM and ML model performance. Supports standard benchmarks (MMLU, GSM8K, HumanEval, etc.), custom evaluation tasks, multi-dimensional metrics (accuracy, latency, cost, safety, fairness), regression detection across model versions, and head-to-head model comparison with statistical significance testing.
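As a rough illustration of head-to-head comparison with significance testing, the sketch below applies an exact McNemar test to paired per-example correctness from two models scored on the same evaluation set, then uses it for a simple regression check. The listing does not say which test the skill uses, so the choice of McNemar, the function names `mcnemar_exact` and `is_regression`, and the `alpha=0.05` threshold are all assumptions made for this example.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example correctness.

    Hypothetical helper: both arguments are equal-length sequences of
    booleans, one entry per evaluation example.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same examples")
    # Discordant pairs: exactly one of the two models got the example right.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a gap
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant pair favors either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def is_regression(baseline_correct, candidate_correct, alpha=0.05):
    """Flag the candidate if it is less accurate and the gap is significant.

    The alpha threshold is an assumed default, not taken from the skill.
    """
    if not baseline_correct:
        raise ValueError("evaluation set is empty")
    p_value = mcnemar_exact(baseline_correct, candidate_correct)
    base_acc = sum(baseline_correct) / len(baseline_correct)
    cand_acc = sum(candidate_correct) / len(candidate_correct)
    return cand_acc < base_acc and p_value < alpha
```

A paired test fits this setting because both models answer the same examples, so only the examples where they disagree carry information about which one is better. For continuous metrics such as latency or cost, a paired bootstrap would be a more natural choice than McNemar.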
prompt-engineer: Model evaluation measures prompt quality improvements
dataset-curator: Evaluation datasets require the same curation rigor as training data
evaluating-llms-harness: lm-eval-harness for standard benchmark execution
weights-and-biases: Experiment tracking for evaluation results
npx claudepluginhub haj1t/senior-dev-squad-skills --plugin ai-ml-eng-pro