From hybrid-loops
Find and design hybrid-loop surfaces in any project — places where an LLM's fuzzy semantic judgment is consequential enough to warrant typed structure (schema'd substrate, deterministic gating, calibration). Triggers on prompts like "build a tool that tracks/analyzes/evaluates/extracts X over time", "make sense of Y across many Zs", "detect patterns in W", "notice when X is happening", "score/rank/compare documents", "monitor my agent for X", "watch this stream and alert when...", "build a regression detector for...", "log and analyze evaluator outputs over time", "suggests when I'm repeating an anti-pattern", "flag for me when...", "build a system that learns from outcomes", or any project where the value is DATA ABOUT content rather than the LLM's reply. Diagnostic-first — most projects have only 0-3 such surfaces. Domain-agnostic. Do NOT trigger for one-shot transforms (translate, summarize, format), pure UIs, CRUD without judgment, code refactors, chatbots, agent harness setup, or generic prompt engineering.
How this skill is triggered — by the user, by Claude, or both
Slash command
/hybrid-loops:hybrid-loopsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
> A *cycle* of alternating LLM-and-code layers that *mutually generate each other's working surface* — not just constraining each other, but producing the very inputs the other half operates over. *LLMs bring fluency. Substrates bring discrimination. Code brings restraint.* The LLM writes typed records (often the schema or notation itself); the deterministic layer aggregates and shapes those re...
A cycle of alternating LLM-and-code layers that mutually generate each other's working surface — not just constraining each other, but producing the very inputs the other half operates over. LLMs bring fluency. Substrates bring discrimination. Code brings restraint. The LLM writes typed records (often the schema or notation itself); the deterministic layer aggregates and shapes those records into the input the next LLM call sees. They don't just gate each other — they manufacture each other.
The point isn't LLM-as-pipeline-stage. It's LLM-as-half-of-a-loop — and at scale, layered loops that wrap around each other. Runtime: one cycle resolving one judgment. Development-time: a critique-patch loop wraps around the runtime, with an LLM-panel reading transcripts of runtime behavior and patching the deterministic layer (or the lens prompts, or the substrate schema) below. The system grows by stacking such loops.
5-phase diagnostic:
- Find candidate surfaces in the project (places where fuzzy judgment is happening or should be)
- Scope each: A (just call an LLM), B (don't use an LLM), or C (hybrid loop)
- Choose shape: substrate-as-record (analytical) or substrate-as-vocabulary (interventional)
- Quick design in 3 questions (input, schema, action) with sane defaults for the rest
- Scaffold to the surface, not the project. Always include a calibration log and an ablation test.
Five roles in a cycle: lens (LLM extracts) → substrate (typed records accumulate) → gate (deterministic policy filters/scores/ranks) → reasoner (LLM consumes substrate) → action (deterministic effect; often loops back as new content). Plus two meta-layers: calibration (predict + verdict log — does the lens actually work?) and metabolism (substrate-wide audit — is the accumulated record drifting?).
Decline when: one-shot transform, chatbot, pure UI, no fuzziness in input, output discarded once, or a deployment shape that imposes a substrate on workers who can't edit it.
Most working programmers carry three mental primitives:
These are three cells in a much larger combinatorial space. Every mix of {LLM | code} as actor × {data | code} as input × {data | code} as output is a valid block, and most useful systems built today are graphs that span many cells rather than pipelines that occupy one. Almost nobody was trained for the combinatorial space. Schooling and working experience produce strong intuitions for the three classical cases and almost none for the multi-block dynamic-graph cases.
The LLM that's actually building the system is in roughly the same position as the programmer asking for it. Without explicit guidance, the LLM also defaults to pipeline thinking — extract once, decide once, return. This skill exists to push back on that default: to put the broader space in front of the LLM as a working option, and to scaffold the multi-layered dynamic graph the project actually wants. What blocks does this surface need? What should they generate for each other? Where does the cycle close? What wraps around it?
Systems that come out of this kind of design tend to feel a little organic — they grow rather than getting authored top-down, they adapt as they run and surprise you, they have metabolic phases (audit, prune, evolve) that aren't part of any single decision but keep the substrate fit over time. That's not poetic. It comes from the cycles being mutually generative: each layer keeps remaking the surface the others act on, and the system as a whole behaves more like an ecology than an engineered artifact.
A design pattern for the specific places in a project where a fuzzy semantic judgment benefits from typed structure. Most projects have 0–3 such places. The skill helps Claude identify them, decide whether each warrants the full pattern, and design what's there. Domain-agnostic — applies in health, education, ops, creative, social, business, engineering.
In one sentence: an LLM does fuzzy judgment, a typed substrate captures the result as data, deterministic code does aggregation/restraint/scoring, another LLM reasons over the substrate, an action lands. LLMs bring fluency. Substrates bring discrimination. Code brings restraint.
Note on naming. "Hybrid loops" is the working name in this repo; the broader field has no settled name (see references/PRIOR_ART.md for adjacent terms — "compound AI systems," "generalization shaping," etc.).
Walk the project. Name each candidate surface in one sentence. Signs:
Zero candidates → skill probably doesn't apply. Say so plainly.
Three buckets:
If all surfaces are A or B, this isn't a hybrid-loops project. Exit.
This is a diagnostic heuristic, not a hard binary. Most real surfaces end up as both — a typed library that also accumulates per-instance records — and the distinction is mainly useful for picking which side to design first. Lead with whichever the user-facing value is closer to (analytical → record-first; interventional → vocabulary-first), then add the other side once the first is working.
Three minimum questions. Everything else gets sane defaults; refine after a draft scaffold is on the table.
notes, model_id, schema_version).Defaults if not specified:
Produce a draft scaffold from the three answers. Iterate from there.
For the full design interview (when the surface is large enough that getting it wrong has material cost), see references/DESIGN_INTERVIEW.md.
The scaffolding is bounded to the surface; the rest of the project is whatever else it is. Minimum:
Implementation language follows the surrounding project. Deployment shape options: embedded in existing codebase, standalone module, MCP server, notebook/script, no-code wiring (Airtable + Zapier + an LLM call). Pick by user fit, not by familiarity.
The roles alternate between fluency (LLM) and discrimination (code), with each role generating the input the next role consumes. The arrows close a cycle:
notes field for graceful failure. Often the LLM also generates the schema itself (one tier up in development time).model_id + schema_version. The substrate is what makes the loop learn — each turn's records become the constraint surface for the next.Plus two meta-layers that close different loops:
The lens may be staged or parallel — treat lens as a role, not a single LLM call. Same for the reasoner.
Stacked loops. Many real systems have multiple cycles wrapping around each other. A common shape: a runtime cycle (engine + player + lens), wrapped by a development-time cycle (LLM-critic reads runtime transcripts, generates a patch plan, the patch plan modifies the lens prompts / substrate schema / gate policy / engine code, and the next runtime turn picks up the change). The development-time loop is itself a hybrid loop. See references/STACKING.md (which also covers a deeper "recursive harness authoring" regime for stacks that go past v0).
The five roles are an opinionated default arrangement of more general primitives. The eight primitive blocks come from {LLM | code} × {reads data | reads code} × {produces data | produces code}; pairs and triples compose into recognizable patterns. The full algebra and worked examples live in references/BUILDING_BLOCKS.md. The catalog of named graphs (RAG, ReAct, codegen-with-verification, the canonical 5-role loop, dev-time critique loop, knowledge-base auditor, teacher's intervention tracker, etc.) lives in references/BLOCK_GRAPHS.md.
The diagnostic in this skill defaults to the five-role shape because it covers most analytical and interventional cases. When your case wants something else — LLM-as-architect, code-as-perceiver, LLM-audits-code, LLM-generates-prompts-for-LLM — name the blocks you need and snap their I/O together. The cycle is the structural invariant; which blocks fill it is project-specific.
The dominant workflow for hybrid-loop projects isn't authoring a formal architecture document upfront and then implementing it. It's iterative:
When this skill fires, it's because the user is at step 1. The role of the skill is to help that brainstorming converge on a shape that's likely to work — naming the recurring blocks, flagging anti-patterns, suggesting where to put the calibration log so step 2 onward goes well.
How does the loop fire? Pick one:
For non-Claude-Code contexts (API apps, web services, mobile, no-code), ask: where in the host project's normal control flow does this surface get called? That's the activation point.
Hybrid loops separate judgment substrate (the typed library / record) from judgment execution (the LLM that picks or produces from it). That separation has political consequences depending on who owns each. This section addresses one axis — substrate-owner-vs-executor power balance — and is named accordingly. Real deployment ethics in any domain (privacy, accountability, accessibility, recourse, ongoing monitoring) is broader than what's here and lives outside the framework's scope; treat this section as a single decision-point at scaffolding time, not as full ethics coverage.
Ask, before scaffolding:
Power-neutral deployments: single user owns and executes; team shares ownership and execution; platform owns substrate but executor can edit.
Concerning: platform owns substrate, gig-worker or low-status executor must apply the imposed taxonomy without editing. That's a deskilling architecture and a misapplication of the pattern. Recommend redesign — give the executor edit authority, or refuse the project shape.
Substrate-for-agents deployments (where the typed library is consumed by automated agents rather than imposed on human workers) avoid this. Platform-owned interventional libraries imposed on workers do not.
Every Bucket-C surface should be able to answer: if I removed the typed substrate and just gave the same LLM raw content, would performance drop?
def test_ablation_substrate_helps():
typed_score = run_with_substrate(test_input)
raw_score = run_without_substrate(test_input)
assert typed_score > raw_score, "Substrate is not earning its keep"
Define "performance" per project. For a recruiter-triage tool: rate at which top-5 recommended candidates pass screen vs. random-5 baseline. For a writer's voice-checker: rate at which annotations are accepted by the writer. For an ambient finding-injection hook: whether downstream model turns produce meaningfully different reasoning when given vs. not given the findings.
Add the test from day one. Without it, the architecture is theater.
Decline the pattern when:
Decline template:
This sounds like a [transform / chatbot / refactor] task — the value is the LLM's natural-language output, not data about content. A simpler approach is [alternative]. The hybrid-loops pattern would add complexity without buying reliability here.
Also decline when the deployment power-balance check (above) flags a deskilling shape that the user won't fix.
Ask the user: Is the value of this surface the data it produces about content, or the natural-language output the LLM gives you?
Data → hybrid loop fits. Natural-language → it doesn't.
Use these terms internally:
When talking to the user, prefer their domain vocabulary. Keep the pattern terms internal.
references/PRIOR_ART.md.What's plausibly new — empirical generalization of the disciplines beyond the engineering surfaces they were sketched on — remains untested.
The main file is enough for project planning and most design conversations. Load references only when needed:
references/BLOCK_GRAPHS.md — when scaffolding by analogy to a recognizable shape; library of canonical hybrid-loop block-graphs each shown as a Mermaid diagram with brief unpackingreferences/EXAMPLES.md — six worked examples with full schema sketches, gate policies, and calibration paths (teacher/parent/advocacy/coach/writer/recruiter); complementary to BLOCK_GRAPHS.md (which gives diagrams; EXAMPLES.md gives implementation-level detail)references/DESIGN_INTERVIEW.md — when Phase 4's quick design isn't sufficientreferences/BUILDING_BLOCKS.md — when the user wants to think about the lower-level algebra (eight primitive block-shapes, pairs, triples, why neither half collapses to the other)references/THE_CASE.md — when the user is skeptical that hybrid loops are anything more than 1945 von Neumann; the algebra-vs-alphabet-vs-disciplines argument lives herereferences/AGENT_FRAMEWORKS.md — when the user thinks this is just DSPy / LangGraph / AutoGen / CrewAI / pydantic with a coat of theoryreferences/PRIOR_ART.md — when defending novelty or citing lineagereferences/STACKING.md — only when project is past v0 and explicitly stacksreferences/PROCEDURE.md — when designing the scaffolding around the substantive layers (standards of proof, standing checks, precedent retrieval, bounded action, recusal); legal-procedural lineage applied to hybrid loopsreferences/ORCHESTRATION_SHAPES.md — when scaffolding around multi-agent or parallel-session orchestration patterns in 2026 practice (Conductor-style multiplexing, Gas-Town/Gas-City-style engineered-resilience autonomy, Wasteland-style federated validation); four-shape map with citation discipline and source-tiering; vocabulary + research notes, not adoption guideProvides behavioral guidelines to reduce common LLM coding mistakes, focusing on simplicity, surgical changes, assumption surfacing, and verifiable success criteria.
Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub justinstimatze/hybrid --plugin hybrid-loops