From model-evaluator
Compares multiple ML models on a shared test dataset across metrics, statistical significance, inference performance, cost, and robustness, then generates a report with tables, rankings, and recommendations.
Slash command
/model-evaluator:compare-models
# /compare-models - Compare ML Models

Compare multiple ML models to select the best performer.

## Steps

1. Ask the user for the models to compare and the evaluation dataset
2. Load all models and verify they accept the same input format
3. Run inference with each model on the identical test dataset
4. Calculate the same metrics for all models for fair comparison
5. Create a side-by-side comparison table with all metrics
6. Perform statistical significance testing between model pairs (McNemar, paired t-test)
7. Compare inference performance: latency, throughput, memory footprint
8. Calcul...
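The metric and significance-testing steps above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual implementation: the function names `accuracy` and `mcnemar_exact` and the toy labels and predictions are all hypothetical, and the exact (binomial) form of McNemar's test is one assumed choice among several variants.

```python
from math import comb


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts examples only model A got
    right and c counts examples only model B got right.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(pa == t and pb != t for t, pa, pb in rows)
    c = sum(pa != t and pb == t for t, pa, pb in rows)
    n = b + c
    if n == 0:
        # The models never disagree on correctness: no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data: one shared test set, two models' predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

acc_a = accuracy(y_true, model_a)
acc_b = accuracy(y_true, model_b)
b, c, p_value = mcnemar_exact(y_true, model_a, model_b)

# Side-by-side comparison table.
print(f"{'model':<10}{'accuracy':>10}")
print(f"{'model_a':<10}{acc_a:>10.2f}")
print(f"{'model_b':<10}{acc_b:>10.2f}")
print(f"McNemar exact: b={b}, c={c}, p={p_value:.3f}")
```

On this toy data the accuracy gap looks large, yet the test does not reject the null hypothesis because only four examples are discordant. That is the reason for pairing a metrics table with a significance test before ranking models.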
npx claudepluginhub rohitg00/awesome-claude-code-toolkit --plugin model-evaluator
/llm-compare - Compares a prompt across OpenAI Codex, Google Gemini, and Ollama LLMs, selecting models/context, verifying availability, and appending test instructions for review fixes.
/benchmark - Runs a FactorMiner benchmark in a specified mode (table1, suite, ablation, or cost pressure) against a validated dataset and folds results into a research note.
/explain-model - Analyzes context to generate AI/ML task code with validation, error handling, performance metrics, insights, artifacts, and documentation.
/eval-model - Runs rigorous model evaluation: cross-validated metrics, confusion matrix, feature importance, and subgroup bias audit. Produces a draft report for data scientist review.