spp

Name: spp
Author: jaylbean

By JayLBean

Supervised Prompt Producer — produces production-grade classification prompts through human-in-the-loop supervised prompt learning. Walks the user through consultation, baseline labeling, optimization loop with information-isolated subagents, and sacred-test-set finalization.

npx claudepluginhub jaylbean/supervised-prompt-producer --plugin spp

Popularity

Stars: Above avg (Med: 0 · Avg: 285)

Installs: Med: 0 · Avg: 1

What's Inside

Skills: 1

run

/run

Run the spp methodology against a classification task. Use when the user wants to produce a production-grade classification prompt through disciplined supervised prompt learning, has a labeled baseline available (or is willing to label one), and wants human-in-the-loop control over the process. Walks through four phases — consultation, baseline-and-splits, optimization loop, finalization — producing a frozen prompt and a REPORT. The user does not type slash commands per phase; the agent walks the methodology while the user reviews and approves at gates.

Hooks: 1

Event Hooks

Bash

1 hook across 1 event

Stats

Version: 1.0.1
Released: Jun 17, 2026
Language: Python
Stars: 1
Maintenance: Excellent
License: MIT
Last Commit: Jun 17, 2026
Added: Jun 16, 2026


Available In

supervised-prompt-producer1

Safety Signals

Caution

Executes bash commands

Hook triggers when Bash tool is used

README

spp — Supervised Prompt Producer

A Claude Code plugin for disciplined, human-in-the-loop supervised prompt learning: it turns a labeled baseline into a production-grade prompt you can defend in code review with evidence, not vibes.

spp handles classification (binary, multi-class, fixed-schema), multi-field and hierarchical structured output, structured extraction (variable-cardinality, span-grounded), and prompt decomposition (a managed linear pipeline) — anything with a labeled ground truth and a mechanical metric. It is deliberately not a generation, RAG, agentic, or prompt-search tool (see DESIGN.md §7.1.3). For what shipped, see CHANGELOG.md; for the roadmap and scope, DESIGN.md §7.1.

Why

Prompt engineering by feel produces prompts that look good and fail in production — tuned against whatever rows the author remembered, shipping with failures clustered where no one looked. Automated optimizers (DSPy, APE) go the other way: they trust a metric and search, which works only if the metric is honest — and a metric computed against one model on one labeled set rewards learning the dataset's quirks and the model's style, both of which look like generalization until you swap a model or a data slice.

spp targets the two failure modes those approaches miss:

Baseline overfitting — the prompt fits your specific labels, not the class definition. Scores high on what you tuned against, collapses on similar-but-unseen data. A deal-breaker. spp defends against it.
Model overfitting — the prompt fits one model's instruction-following style. Fine if you know it and ship accordingly; dangerous if it ships unmarked. spp documents and surfaces it.

The example that motivates both

The methodology comes from a hair-loss-discourse classifier that produced a Qwen-locked prompt at test F1 = 0.941. Run cross-family, it split:

| Model | F1 |
| --- | --- |
| Qwen3-14B (optimized target) | 0.941 |
| GPT-4o full | ≈ 0.91 |
| GPT-4o-mini | ≈ 0.76 |

The failures were not random — they clustered, and were length-correlated rather than purely capability-related: the prompt encoded a Qwen-specific length tolerance the GPT family did not share. That is model overfitting, caught and documented rather than shipped unmarked. Baseline overfitting is what a less-disciplined run on the same labels would have produced — caught here by the stratified split and the auditor.
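To make the cross-family comparison concrete, here is a minimal sketch that computes each model's F1 drop relative to the optimized target and flags large gaps. The F1 values are the ones reported in the table above (the GPT values are approximate in the source). The function names and the `max_drop` tolerance are assumptions for illustration and are not part of spp.

```python
# F1 values as reported above; the GPT-family values are approximate.
reported_f1 = {
    "Qwen3-14B": 0.941,
    "GPT-4o": 0.91,
    "GPT-4o-mini": 0.76,
}


def f1_gaps(scores, target):
    """Return each non-target model's absolute F1 drop relative to the target."""
    base = scores[target]
    return {m: round(base - s, 3) for m, s in scores.items() if m != target}


def flag_model_overfitting(scores, target, max_drop=0.05):
    """List models whose F1 drop exceeds max_drop (a hypothetical tolerance)."""
    return [m for m, gap in f1_gaps(scores, target).items() if gap > max_drop]


gaps = f1_gaps(reported_f1, "Qwen3-14B")
flagged = flag_model_overfitting(reported_f1, "Qwen3-14B")
```

With the assumed 0.05 tolerance, only GPT-4o-mini is flagged. A check like this only shows that a gap exists; attributing it to length tolerance, as the README does, still requires looking at where the failures cluster.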

How it works

Phase 1   Label baseline + adversarial label review
Phase 1.5 Stratified split (train / dev / sacred test)
Phase 2   Optimization loop:
            propose edit from discrepancy analysis
            → AUDITOR: categorical or row-specific?
            → run on dev → overfitting guard
            → stop when dev plateaus or regresses
Phase 3   Final test on the sacred set, exactly once
            → REPORT.md + frozen prompt + documented limitations
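The stratified split in Phase 1.5 can be illustrated with a small standard-library sketch. This is not spp's implementation: the function name, the `label` key, the 60/20/20 fractions, and the seed are all assumptions chosen for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", fractions=(0.6, 0.2, 0.2), seed=0):
    """Split rows into (train, dev, test) so each label keeps similar proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, dev, test = [], [], []
    for label in sorted(by_label, key=str):  # stable label order
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_dev = int(len(group) * fractions[1])
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]  # remainder becomes the held-out test set
    return train, dev, test


rows = [{"id": i, "label": "pos" if i % 2 else "neg"} for i in range(20)]
train, dev, test = stratified_split(rows)
```

Splitting per label rather than over the whole baseline keeps rare classes represented in dev and in the sacred test set, which is what lets the final score say something about every class.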

Two properties are non-negotiable and are what separate spp from an automated optimizer:

Per-stage information isolation. Each cognitive stage of an iteration runs in an isolated sub-agent with an explicit input allow-list: a discrepancy stage that reads disagreed rows and abstracts them into clusters by ID; a rule-edit stage that proposes the next prompt without ever seeing row content; an auditor that reviews the edits but never sees the new scores. State flows through files, not a shared context. The auditor's one question — is each edit categorical (a class of rows with an articulable property) or row-specific (a patch for one weird row)? — keeps the loop from fitting rows it never saw.
The sacred test set. Read exactly once, at finalization. The loop sees train + dev only.
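As a rough illustration of per-stage isolation through files, the sketch below enforces an input allow-list at read time. The stage names follow the description above, but the file names, the `ALLOW_LIST` mapping, and `read_stage_input` are hypothetical; spp's actual file layout is not shown in this README.

```python
import tempfile
from pathlib import Path

# Hypothetical file names; only the stage boundaries come from the description above.
ALLOW_LIST = {
    "discrepancy": {"disagreed_rows.json"},               # sees row content
    "rule_edit": {"clusters.json", "current_prompt.txt"},  # never sees rows
    "auditor": {"proposed_edits.json"},                    # never sees scores
}


def read_stage_input(stage, workdir, filename):
    """Read a file for a stage only if it is on that stage's allow-list."""
    if filename not in ALLOW_LIST[stage]:
        raise PermissionError(f"stage {stage!r} may not read {filename!r}")
    return (Path(workdir) / filename).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "proposed_edits.json").write_text("[]", encoding="utf-8")
    (Path(workdir) / "dev_scores.json").write_text("{}", encoding="utf-8")

    edits = read_stage_input("auditor", workdir, "proposed_edits.json")
    try:
        read_stage_input("auditor", workdir, "dev_scores.json")
        blocked = False
    except PermissionError:
        blocked = True  # the auditor is denied access to scores
```

Checking the allow-list in one reader function means a stage cannot pick up extra context by accident; anything it needs has to be written to a file and explicitly listed.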

The pipeline is agentic, but the decisions that reshape it are human — the kickoff that configures the run, a mid-loop redesign when an agent flags a structural problem, and any change to the schema, ground truth, or model.

Figure: spp workflow, showing the agentic loop and the human decisions that reshape it.

The four phases map to skills/run/phases/spp-{init,baseline,loop,finalize}.md; the loop's internals are in spp-loop.md §4.
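The stop condition from the phase outline ("stop when dev plateaus or regresses") could look something like the sketch below. It does not reproduce spp-loop.md; the function name and the `min_gain` and `patience` thresholds are assumptions made for the example.

```python
def should_stop(dev_scores, min_gain=0.005, patience=2):
    """Return True when the dev metric has regressed or plateaued.

    dev_scores: per-iteration dev metric, oldest first.
    min_gain, patience: hypothetical thresholds, not spp defaults.
    """
    if len(dev_scores) < 2:
        return False  # nothing to compare against yet
    if dev_scores[-1] < dev_scores[-2]:
        return True  # regression: the latest edit made dev worse
    if len(dev_scores) <= patience:
        return False  # not enough history to call a plateau
    best_before = max(dev_scores[:-patience])
    recent_best = max(dev_scores[-patience:])
    return recent_best - best_before < min_gain  # plateau: no meaningful gain


still_improving = should_stop([0.80, 0.85, 0.90])
plateaued = should_stop([0.80, 0.90, 0.901, 0.902])
regressed = should_stop([0.80, 0.90, 0.88])
```

Stopping on dev alone keeps the sacred test set out of the loop: the test score is read once at finalization and never feeds back into this decision.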

Automated vs. human

| Automated | You stay in the loop |
| --- | --- |
| Stratified split generation | Metric design |
| Running iterations against dev | Baseline labeling judgment |
| Discrepancy analysis | Decision criteria for ambiguous rows |
| Categorical-vs-row-specific auditing | Model selection |
| `REPORT.md` generation | Whether an edit is generalized or reverted |
| Sacred-test-set protection | Production ship / no-ship |

When to use it

A good fit when most of these hold (match ~three of five and it's worth trying):

View full README on GitHub
