From open-science-skills
Guides LLM text classification for survey data: codebook design, zero/few-shot/fine-tuning selection, model choice, human-LLM hybrids, validation, reproducibility.
Slash command
/open-science-skills:text-classification
- Treat codebook design as the most consequential decision in the classification pipeline. LLMs struggle with loose instructions and revert to general-purpose definitions rather than following researcher-specific operationalizations (Halterman & Keith 2025).
none_of_above or uncodeable) for responses that are too vague, too short, or off-topic. Define this category as precisely as the substantive codes (Halterman & Keith 2025).
Follow the decision framework from Chae & Davidson (2025), which maps document characteristics and available resources to the appropriate approach:
Zero-shot prompting: Use when classifying short documents with a large decoder model (GPT-4o, Llama3-70B+) and no labeled training data. Best for rapid prototyping and tasks where constructs are well-defined. GPT-4o achieves the best zero-shot performance across tasks (Chae & Davidson 2025).
Few-shot prompting: Add labeled examples to the prompt. Results are inconsistent — adding examples helps some models but degrades others (Chae & Davidson 2025). Always compare few-shot against zero-shot on a held-out sample before committing. Select diverse examples covering edge cases, not just prototypical instances.
Fine-tuning: Train a model on labeled data. Effective with as few as 100 hand-coded examples for smaller models (Chae & Davidson 2025). Fine-tuned smaller models (Llama3-8B, GPT-3 Davinci) can match GPT-4o zero-shot performance. Prefer this when you have labeled data and need cost-effective classification at scale.
Instruction-tuning: Combine detailed prompting with fine-tuning on paired instruction-output examples. Most powerful regime for complex tasks — instruction-tuned Llama3-70B surpasses GPT-4o zero-shot on stance detection (Chae & Davidson 2025). Requires more technical infrastructure but yields the highest accuracy.
When resources permit, test multiple regimes on the same pilot sample and select based on empirical performance, not assumptions.
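A pilot comparison of regimes can be sketched as below. This is a minimal illustration, not a method from the cited papers: the labels, regime names, and predictions are hypothetical placeholders, and macro-F1 is assumed as the selection metric (substitute whatever metric fits your task). Each regime's predictions would come from running that regime on the same hand-coded pilot sample.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it did not apply
            fn[g] += 1  # missed the true label g
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def rank_regimes(gold, predictions_by_regime):
    """Return (regime, macro-F1) pairs sorted from best to worst."""
    scored = {
        name: macro_f1(gold, pred)
        for name, pred in predictions_by_regime.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical pilot sample: hand-coded labels plus each regime's output.
gold = ["economy", "health", "economy", "uncodeable", "health", "economy"]
pilot_predictions = {
    "zero_shot": ["economy", "health", "health", "uncodeable", "health", "economy"],
    "few_shot": ["economy", "economy", "health", "economy", "health", "economy"],
}

ranking = rank_regimes(gold, pilot_predictions)
for regime, score in ranking:
    print(f"{regime}: macro-F1 = {score:.3f}")
```

Macro-F1 weights rare categories (including the residual code) the same as common ones, which is why it is used here instead of raw accuracy. A real pilot sample should be far larger than six responses.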
gpt-4o-2024-08-06), not the model family name. Commercial models are modified or deprecated without notice; GPT-3 was withdrawn from OpenAI's API entirely (Barrie, Palmer & Spirling 2025; Chae & Davidson 2025).
"Code this response:\n\n{text}").
npx claudepluginhub scdenney/open-science-skills --plugin oss