From togetherai-skills
LLM-as-a-judge evaluation framework on Together AI. Classify, score, and compare model outputs, select judge models, use external-provider judges or targets, poll results and download reports.
Slash command
/togetherai-skills:together-evaluations
Use Together AI evaluations when the user wants a managed LLM-as-a-judge workflow rather than an ad hoc prompt loop.
Core evaluation types: classify, score, and compare.
This skill also covers external providers used as judges or targets when the workflow still runs through Together AI's evaluation system.
together-chat-completions for one-off inference or manual judge prompts
together-batch-inference for bulk offline generation rather than evaluation
together-fine-tuning when the user wants to improve the model instead of just measure it
together-dedicated-endpoints only if the evaluation target itself is a dedicated endpoint
--eval-column, --model-a-column, or --model-b-column in the scripts
--judge-model-source external, --eval-model-source external, or compare-side source flags
--download-results in the scripts when you want the per-row JSONL locally
together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
check=False for eval uploads because local file validation can misclassify eval datasets.
npx claudepluginhub togethercomputer/skills --plugin togetherai-skills
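The column flags mentioned above (--eval-column, --model-a-column, --model-b-column) refer to fields in the evaluation dataset. As a rough sketch only, the snippet below prepares a small dataset for a compare-style run and checks that every row carries the expected columns before upload. It assumes the dataset is JSONL with one JSON object per row; the field names prompt, response_a, and response_b and the helper names write_jsonl and check_columns are hypothetical, chosen for illustration, and are not part of the Together SDK or the skill's scripts.

```python
import json
from pathlib import Path

# Hypothetical column names. Whatever names you pick here are the ones you
# would pass to the --model-a-column / --model-b-column flags.
ROWS = [
    {
        "prompt": "Summarize: The cat sat on the mat.",
        "response_a": "A cat sat on a mat.",
        "response_b": "Some animal was on some surface.",
    },
    {
        "prompt": "Translate to French: good morning",
        "response_a": "bonjour",
        "response_b": "bonne nuit",
    },
]


def write_jsonl(rows, path):
    """Write one JSON object per line and return the number of rows written."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


def check_columns(path, required):
    """Return the 1-based line numbers of rows missing any required column."""
    bad = []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            row = json.loads(line)
            if not all(col in row for col in required):
                bad.append(lineno)
    return bad


if __name__ == "__main__":
    n = write_jsonl(ROWS, "compare_dataset.jsonl")
    missing = check_columns(
        "compare_dataset.jsonl", ["prompt", "response_a", "response_b"]
    )
    print(f"wrote {n} rows; rows with missing columns: {missing}")
```

A lightweight column check like this is independent of the SDK's own file validation, so it can still be useful when uploads are made with check=False.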