By JayLBean
Supervised Prompt Producer — produces production-grade classification prompts through human-in-the-loop supervised prompt learning. Walks the user through consultation, baseline labeling, optimization loop with information-isolated subagents, and sacred-test-set finalization.
Executes bash commands
Hook triggers when Bash tool is used
A Claude Code plugin for disciplined, human-in-the-loop supervised prompt learning: it turns a labeled baseline into a production-grade prompt you can defend in code review with evidence, not vibes.
spp handles classification (binary, multi-class, fixed-schema),
multi-field and hierarchical structured output, structured extraction
(variable-cardinality, span-grounded), and prompt decomposition (a managed
linear pipeline) — anything with a labeled ground truth and a mechanical metric.
It is deliberately not a generation, RAG, agentic, or prompt-search tool (see
DESIGN.md §7.1.3). For what shipped, see
CHANGELOG.md; for the roadmap and scope, DESIGN.md §7.1.
Prompt engineering by feel produces prompts that look good and fail in production — tuned against whatever rows the author remembered, shipping with failures clustered where no one looked. Automated optimizers (DSPy, APE) go the other way: they trust a metric and search, which works only if the metric is honest — and a metric computed against one model on one labeled set rewards learning the dataset's quirks and the model's style, both of which look like generalization until you swap a model or a data slice.
spp targets the two failure modes those approaches miss:
- Baseline overfitting: spp defends against it.
- Model overfitting: spp documents and surfaces it.

The methodology comes from a hair-loss-discourse classifier that produced a
Qwen-locked prompt at test F1 = 0.941. Run cross-family, it split:
| Model | F1 |
|---|---|
| Qwen3-14B (optimized target) | 0.941 |
| GPT-4o full | ≈ 0.91 |
| GPT-4o-mini | ≈ 0.76 |
The failures were not random — they clustered, and were length-correlated rather than purely capability-related: the prompt encoded a Qwen-specific length tolerance the GPT family did not share. That is model overfitting, caught and documented rather than shipped unmarked. Baseline overfitting is what a less-disciplined run on the same labels would have produced — caught here by the stratified split and the auditor.
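The cross-family comparison above can be sketched in a few lines of Python. This is an illustration only, not spp's implementation: `f1_score`, `cross_family_gaps`, and `error_rate_by_length` are hypothetical helper names, the length bucket edges are arbitrary assumptions, and the GPT figures are the approximate values from the table.

```python
def f1_score(gold, pred, positive=1):
    """Binary F1 for parallel lists of gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_family_gaps(f1_by_model, target):
    """F1 lost by each non-target model relative to the optimized target."""
    base = f1_by_model[target]
    return {m: round(base - f1, 3) for m, f1 in f1_by_model.items() if m != target}


def error_rate_by_length(rows, edges=(200, 800)):
    """Error rate per character-length bucket; rows are (text, gold, pred).

    The bucket edges are an assumption chosen for illustration.
    """
    buckets = {}
    for text, gold, pred in rows:
        idx = sum(len(text) > e for e in edges)  # 0 = short, 1 = medium, 2 = long
        total, wrong = buckets.get(idx, (0, 0))
        buckets[idx] = (total + 1, wrong + (gold != pred))
    return {idx: wrong / total for idx, (total, wrong) in sorted(buckets.items())}


# Figures from the table above; the GPT values are approximate.
reported = {"Qwen3-14B": 0.941, "GPT-4o": 0.91, "GPT-4o-mini": 0.76}
gaps = cross_family_gaps(reported, "Qwen3-14B")
```

A gap that grows with input length in `error_rate_by_length` is the kind of pattern that points to a model-specific tolerance rather than a general capability difference.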
Phase 1    Label baseline + adversarial label review
Phase 1.5  Stratified split (train / dev / sacred test)
Phase 2    Optimization loop:
             propose edit from discrepancy analysis
             → AUDITOR: categorical or row-specific?
             → run on dev → overfitting guard
             → stop when dev plateaus or regresses
Phase 3    Final test on the sacred set, exactly once
             → REPORT.md + frozen prompt + documented limitations
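As a rough illustration of Phase 1.5, a stratified three-way split can be written with the standard library alone. This is a sketch under stated assumptions, not spp's code: the function name `stratified_split`, the 60/20/20 fractions, and the fixed seed are all hypothetical choices, since the source does not state the ratios spp uses.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split rows into (train, dev, test), preserving label proportions.

    fractions are assumed values; the test share is whatever remains
    after train and dev are taken from each label group.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, dev, test = [], [], []
    for label in sorted(by_label, key=str):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_dev = round(len(group) * fractions[1])
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test


# Toy data: 20 rows, two balanced labels.
rows = [(i, i % 2) for i in range(20)]
train, dev, test = stratified_split(rows, label_of=lambda r: r[1])
```

Splitting per label group keeps rare classes represented in dev and test, which a plain random split does not guarantee on small baselines.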
Two properties are non-negotiable and are what separate spp from an automated
optimizer:
The pipeline is agentic, but the decisions that reshape it are human — the kickoff that configures the run, a mid-loop redesign when an agent flags a structural problem, and any change to the schema, ground truth, or model.
The four phases map to skills/run/phases/spp-{init,baseline,loop,finalize}.md;
the loop's internals are in spp-loop.md §4.
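The Phase 2 stop rule ("stop when dev plateaus or regresses") could look something like the sketch below. It does not reproduce what spp-loop.md specifies: `stop_reason`, `min_gain`, and `patience` are hypothetical names, and the thresholds are assumptions picked for illustration.

```python
def stop_reason(dev_history, min_gain=0.005, patience=2):
    """Return 'regressed', 'plateau', or None for dev scores, oldest first.

    min_gain and patience are assumed values, not spp defaults.
    """
    if len(dev_history) < 2:
        return None
    # Regression: the latest edit made dev worse than the previous iteration.
    if dev_history[-1] < dev_history[-2]:
        return "regressed"
    # Plateau: total gain over the last `patience` iterations is negligible.
    if len(dev_history) > patience:
        recent_gain = dev_history[-1] - dev_history[-1 - patience]
        if recent_gain < min_gain:
            return "plateau"
    return None
```

For example, `stop_reason([0.80, 0.85, 0.84])` returns `"regressed"`, while a run that is still climbing returns `None` and the loop continues.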
| Automated | You stay in the loop |
|---|---|
| Stratified split generation | Metric design |
| Running iterations against dev | Baseline labeling judgment |
| Discrepancy analysis | Decision criteria for ambiguous rows |
| Categorical-vs-row-specific auditing | Model selection |
| REPORT.md generation | Whether an edit is generalized or reverted |
| Sacred-test-set protection | Production ship / no-ship |
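To make "sacred-test-set protection" concrete, here is a minimal sketch of a one-shot guard. It is an assumption about how such protection might be expressed, not how spp enforces it: the class name `SacredTestSet` and its method are hypothetical, and the guard lives only in memory, so a real tool would need to persist the "used" state across runs.

```python
class SacredTestSet:
    """Holds the test rows and allows exactly one evaluation (in-memory sketch)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._used = False

    def evaluate_once(self, score_fn):
        """Run score_fn on the test rows; refuse any second call."""
        if self._used:
            raise RuntimeError("sacred test set already evaluated")
        self._used = True  # mark before scoring so a failed run still counts
        return score_fn(self._rows)
```

Marking the set as used before scoring is a deliberate choice in this sketch: a crashed evaluation still counts as a look at the test data.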
A good fit when most of these hold (match ~three of five and it's worth trying):
npx claudepluginhub jaylbean/supervised-prompt-producer --plugin spp