Evaluate and compare LLMs, ML APIs, and fine-tuned models for product fit across quality, latency, cost, compliance, and vendor risk dimensions. Use when selecting an AI model or vendor, comparing foundation model options, or making build-vs-API decisions for a product use case.
Slash command
/pm-ai-product-management:ai-model-evaluation
Help PMs systematically evaluate AI models (LLMs, ML APIs, fine-tuned models) for product fit using a structured framework.
You are helping evaluate AI models or vendors for $ARGUMENTS.
Clarify the use case:
Identify candidate models / vendors:
Score each candidate on the evaluation matrix (see table below):
Assess quality benchmarks:
Model operational requirements:
Cost modelling:
requests/month × avg_tokens × price_per_token
Fine-tuning and customisation:
Data privacy and compliance:
Vendor risk:
Decision framework — build vs. API vs. fine-tune:
Produce evaluation report:
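The cost-modelling formula in the steps above (requests/month × avg_tokens × price_per_token) can be sketched in a few lines of Python. The function name and the input figures below are hypothetical placeholders chosen for illustration, not real vendor prices; substitute your own traffic estimates and the rates from the vendor's price sheet.

```python
def monthly_cost(requests_per_month: int, avg_tokens: int, price_per_token: float) -> float:
    """Estimate monthly spend as requests/month x avg_tokens x price_per_token."""
    if min(requests_per_month, avg_tokens, price_per_token) < 0:
        raise ValueError("inputs must be non-negative")
    return requests_per_month * avg_tokens * price_per_token


# Hypothetical inputs: 1M requests/month, 1,500 tokens per request,
# and an assumed price of 0.000002 currency units per token.
example_cost = monthly_cost(1_000_000, 1_500, 0.000002)
print(f"Estimated monthly cost: {example_cost:,.2f}")
```

If a vendor prices input and output tokens differently, one option is to call the function once per token type and add the results.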
| Dimension | Weight | Candidate A | Candidate B | Candidate C |
|---|---|---|---|---|
| Task alignment / output quality | 25% | /5 | /5 | /5 |
| Latency (p95) | 15% | /5 | /5 | /5 |
| Cost at scale | 15% | /5 | /5 | /5 |
| Context window | 10% | /5 | /5 | /5 |
| Fine-tuning capability | 10% | /5 | /5 | /5 |
| Data privacy / compliance | 15% | /5 | /5 | /5 |
| Vendor lock-in risk | 5% | /5 | /5 | /5 |
| Deprecation / stability risk | 5% | /5 | /5 | /5 |
| Weighted total | 100% | /5 | /5 | /5 |
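As a minimal sketch of how the weighted total row could be computed, the snippet below multiplies each 1-5 score by its weight from the matrix and sums the result. The weights are taken from the table; the function name and the scores for "Candidate A" are made-up illustrations, not evaluation results.

```python
from typing import Mapping

# Weights in percent, copied from the evaluation matrix (they sum to 100).
WEIGHTS = {
    "Task alignment / output quality": 25,
    "Latency (p95)": 15,
    "Cost at scale": 15,
    "Context window": 10,
    "Fine-tuning capability": 10,
    "Data privacy / compliance": 15,
    "Vendor lock-in risk": 5,
    "Deprecation / stability risk": 5,
}


def weighted_total(scores: Mapping[str, float]) -> float:
    """Return the weighted score on the same 0-5 scale as the inputs."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS) / 100


# Hypothetical scores for one candidate, in the same order as the table rows.
candidate_a = dict(zip(WEIGHTS, [4, 3, 5, 4, 2, 4, 3, 3]))
print(f"Candidate A weighted total: {weighted_total(candidate_a):.2f} / 5")
```

Keeping the weights as integer percentages avoids small floating-point drift when checking that they add up to 100.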
Think step by step. Save as markdown.
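The matrix scores latency at p95. One simple way to obtain that figure from measured response times is the nearest-rank percentile, sketched below. The function name and the sample latencies are hypothetical, and other percentile definitions (for example, interpolated ones) can give slightly different values.

```python
def p95_latency(latencies_ms):
    """Nearest-rank 95th percentile: the smallest sample that is greater
    than or equal to at least 95% of all samples."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical measurements in milliseconds: 10, 20, ..., 200 (20 samples).
samples = list(range(10, 210, 10))
print(f"p95 latency: {p95_latency(samples)} ms")
```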
npx claudepluginhub tarunccet/pm-skills --plugin pm-ai-product-management