From hyperframes
Composites cinematic captions into talking-head video with 32 visual identities (anchor, neon, glitch, etc.). Routes by identity, not mode. Requires hyperframes and a single-subject clip.
How this skill is triggered — by the user, by Claude, or both
Slash command
/hyperframes:embedded-captionsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
**One catalog, picked up front** ([CATALOG.md](CATALOG.md) — 17 identities; the three engines behind it are backend detail). **Standard** (default) builds a clean verbatim **rail** (lower-third subtitle carrying most text) + an **embed** climax composited _into_ the scene behind the subject at the peak. **Cinematic** is pure embed — no rail, every caption composited behind the subject (hero typ...
CATALOG.mdassets/brand/CDPR-fankit-terms.txtassets/brand/CyberpunkReplica.ttfassets/brand/cyberpunk-widths.jsonassets/fonts/char-widths.jsonassets/strokefonts/HersheyScript1.svgassets/strokefonts/HersheyScriptMed.svgdna/README.mddna/chrome.jsondna/cream.jsondna/documentary.jsondna/editorial.jsondna/glitch.jsondna/ink.jsondna/keynote.jsondna/loud.jsondna/neon.jsondna/velocity.jsonmodes/cinematic/README.mdmodes/cinematic/_archive/champion/spec.mdOne catalog, picked up front (CATALOG.md — 17 identities; the three engines behind it are backend detail). Standard (default) builds a clean verbatim rail (lower-third subtitle carrying most text) + an embed climax composited into the scene behind the subject at the peak. Cinematic is pure embed — no rail, every caption composited behind the subject (hero typography, accumulation, occlusion as the effect). Theme is a complete themed constitution — body paradigm × hero setpiece × front fx × plate reaction, composed from registries (themes/README.md): ordnance terminal neonsign stardust stomp. Most explainer / voiceover is Standard; embed is the scarce, earned peak — embedding every word is the common mistake; Theme is for VFX-grade asks ("炸", "特效", "像 AE 做的").
The craft prose below is long; the pipeline itself is short — and everything deterministic is computed or compiled, never hand-written:
hyperframes init (skip it if the project dir already exists with the video inside — matte.cjs/transcribe.cjs adopt any video in the dir as source.mp4) → bash scripts/prepare.sh <project> (matte ∥ transcribe ∥ audio-envelope in parallel, then safe-zones v2 with scene palette/optics/lighting — one command, nothing forgotten)safe-zones.json first):
Cinematic → plan.json → fill-timings.cjs → fit-fonts.cjs → make-composition.cjs;
Theme → theme.json → make-theme.cjs (rail/panel/poem/takeover paradigms; anchor is the quiet rail default)node scripts/preview-frames.cjs <project> → faithful composite previews in ~2s/frame (no render). Check § Visual QA before paying for a render.render-and-composite.sh → gates (timing / occlusion+hero / overflow / hand-off) → final.mp4Load-bearing rules people miss:
drop (filler, not shown) / rail (verbatim lower-third subtitle, in front, carries most text) / embed (a peak word composited behind the subject). Standard mode does both, embedding only the peak(s). See § Caption model.Every spoken phrase is one of three things:
| What | How it's shown | |
|---|---|---|
| drop | filler — um/uh, stutters, self-corrections | not shown |
| rail | the default — ordinary spoken content (verbatim) | clean lower-third subtitle, in front, readable. A punch word can get an inline emphasis highlight (accent colour / active-word pop) — it stays on the rail. |
| embed | a promoted peak — the headline beat | one big word composited behind the subject (matte occlusion), designed entrance + exit |
The rail carries most of the text; embed is the scarce, earned peak. Scarcity is per beat/block, not per clip: ≤1 hero per block (thought), never two co-visible, ≥ a beat of air between hero windows (the compiler warns under 0.6s). A short clip → usually 1–2; a long explainer → ~one per section. Among multiple heroes, the largest authored one is the APEX (it alone gets the full lockup embed + width-fit raise); smaller ones are MINOR peaks that ride their column as oversized emphasis lines (fg, damped motion) — not every beat needs the matte showcase, which is exactly what keeps the apex an event. Embedding every word is still the common mistake.
Rail-surface identities build exactly this (rail = rail.html, embed = the climax in index.html). Column-flow identities drop the rail and make everything embed-style — recommend them only for mood-over-verbatim asks, never for explainer / voiceover where the words must read (CATALOG.md encodes this per identity).
One front-end, three engines behind. The user picks an IDENTITY from CATALOG.md (17 entries: 12 classic + 5 themed); the engine, compiler and authoring file are derived by lookup from the catalog row. Never surface "Standard vs Cinematic vs Theme" as a question — those are backend names (a product has one UX even with several engines). The catalog encodes everything routing needs: reading surface, voice, recommend-for, scene needs, adjacency notes for the genuinely-close pairs (loud↔ordnance, neon↔neonsign, cream↔stardust).
Procedure: probe the clip → shortlist 2–3 identities from the catalog → recommend ONE with a one-line why → the user picks → author that identity's file. Identities are engine-locked (no cross combos; opening one is a validation event — see dna/README.md).
Always present your recommendation and let the user pick before you author. Don't silently default.
(The full identity table lives in CATALOG.md — single source of truth for routing. The engine docs below describe each backend's authoring contract.)
Recommendation heuristic: use the "Shortlisting heuristics" in CATALOG.md — they are identity-level (e.g. "炸" shortlists ordnance/stomp/terminal/loud and picks by WHAT should explode), never category-level. Unsure → anchor.
plan.json for a locked template, compiled by make-composition.cjs.theme.json, run scripts/render-theme.sh (compiles + renders + plate reaction → final_fx.mp4).Probe the video and classify the scene before either mode.
ffprobe <video.mp4> # specs
ffmpeg -ss <t> -i <video.mp4> -vframes 1 sample.png # at 20/50/80%
Read the samples. Refuse if:
transcribe.cjs warns when audio is near-silent (Whisper hallucinates words like "Thank you." over silence); heed it and refuse rather than caption fabricated wordsffmpeg -i in.mp4 -vf "fps=1,scale=160:-1,tile=10x5" sheet.png), don't trust 3 spot frames.transcript.json before authoring; if it doesn't parse as language, try WHISPER_MODEL=medium once, else refuse (a verbatim rail of fabricated words is worse than no captions).under 60 → light text reads as-is, 60-180 → add the glyph scrim, 180+ → opaque text + scrim (never bare light text). Cinematic templates are cream+screen and LOCKED — use this probe to pick a fitting identity (bright scenes → ink, or the opaque-rail anchor theme), never to recolour one.anchor (words read, scene safe) — but present a shortlist and let the user choose.1. hyperframes init <project> --non-interactive --video <video.mp4> --skip-skills
2. bash scripts/prepare.sh <project> # matte ∥ transcribe (parallel) → safe-zones. One command.
# → frames_fg/ transcript.json safe-zones.json
3. [AGENT STEP — the only creative step] author a small JSON; see below by mode
Cinematic: author plan.json → node scripts/fill-timings.cjs → fit-fonts.cjs → make-composition.cjs
Theme: author theme.json → bash scripts/render-theme.sh <project> (compiles + renders + plate fx)
4. node scripts/preview-frames.cjs <project> # ~2s/frame composite previews → § Visual QA (BEFORE the render)
5. bash scripts/render-and-composite.sh <project> # gates → final.mp4 + history/ snapshot
(Theme mode: SKIP steps 3b/5 — render-theme.sh already runs compile + render-and-composite
+ _postfx.sh; the deliverable is final_fx.mp4, final.mp4 is pre-plate-reaction)
Step 3 differs by mode:
safe-zones.json first. Narration planes go in zones.hugLeft/hugRight — clean strips ABUTTING the silhouette (text far from the body reads as floating, not embedded; far corners are the fallback, not the default). The hero defaults to heroAnchor/heroBands.best (centered ON the subject, ~30–55% occluded). recommendation:"fg" moves NARRATION in front for legibility; the hero stays embedded whenever heroBands.feasible — hero-fg is the last resort.ink; full pick guidance lives in the catalog, covering all ten incl. neon / glitch / chrome / velocity). State your pick + why; the user decides. The DNA locks type/palette/blend/motion + hero three-act; safe-zones v2 (palette/optics/lighting) parameterizes it to THIS scene automatically.<project>/cinematic.json — "dna": "<name>" + thought-BLOCKS, not raw groups: each block = lines of words (grouped 2–5 at clause boundaries) + the plane it stacks in + per-line css (size/weight/style only — no positions) + at most ONE line marked "hero": true (the promoted word; "text" for display form). Schema: scripts/make-cinematic.cjs header.node scripts/make-cinematic.cjs <project> — lowers blocks → plan.json → index.html. Generated for you: transcript-sequenced timings, accumulate-within-block, page-flip-between-blocks, the hero LOCKUP (a hero block's pre-context, HERO and post-context stack as ONE bonded composition centered on the subject — reading order top→bottom = spoken order by construction; context floats in FRONT while the hero embeds BEHIND = the depth sandwich; a mass rule keeps the hero dominating its context), apex/minor hero split, reading order by construction, fg fallback per safe-zones. Then the gates run as usual. (Hand-authoring plan.json directly remains possible for designs blocks can't express — then run fill-timings.cjs + fit-fonts.cjs + make-composition.cjs yourself.)Read themes/README.md FIRST — paradigm/setpiece registries, linkages, hard rules, and the exact theme.json schema.
themes/<name>.json has voice + when). State your pick + why; the user decides.<project>/theme.json — dna, lines (verbatim, transcript order; 1–5 words each — for takeover each line is one CARD), minors (emphasis words), hero:{match} (the climax word/phrase; leave it OUT of lines for embed setpieces, keep it IN for inline setpieces and panel+redact).bash scripts/render-theme.sh <project> — compiles (verbatim-completeness gate at compile time), renders both layers, composites, applies the plate reaction → final_fx.mp4. Use preview-frames.cjs between compile and render for Visual QA.node scripts/preview-frames.cjs <project> [t…] composites faithful preview frames in ~2s each
(caption layers screenshotted at seek-time + real video frame + matte occlusion + rail overlay = what
the final composite will look like at that moment). Default samples = each group/climax window.
A full render costs minutes — never use it to discover layout problems.
Check the previews (<project>/preview/sheet.png) against this list — these are the failures the
geometric gates cannot catch:
ink).Then the 5 positive checks in references/reference-bar.md (poster test · timid test · one-glance hierarchy · scene handshake · dead-air audit) — the failure list keeps a render from being broken; the positive list is what makes it designed. Ship when both pass.
Fresh-eyes review (recommended for anything user-facing): you have confirmation bias about your own layout. If you can spawn a subagent, give it ONLY the preview sheet + this checklist and ask for PASS/FIX verdicts per frame ("review these caption previews against the 5-point checklist; answer PASS or the specific fix per frame"). Apply fixes in plan.json / theme.json, recompile, re-preview — each loop costs seconds. Render once, when the previews pass.
Both modes draw from dna/ — six art-directed visual languages that parameterize per scene (accent sampled from the footage, contact shadow along the measured light direction, depth-match blur, RMS-coupled hero amplitude):
| DNA | Register | Scene fit | Voice |
|---|---|---|---|
| cream | premium-warm | dark/mid warm scenes | Inter + warm cream + screen; glowing emergence hero (successor of cinematic-cream) |
| ink | premium | bright scenes (luma > 150) | near-black multiply — type printed ON the wall; the bright-scene answer |
| editorial | editorial-luxe | introspective / fashion / poetic | Bodoni Moda, lowercase-italic hero — magazine elegance |
| keynote | tech-premium | product / launch | opaque white Inter 800, dead-center stillness |
| documentary | formal | interview / serious | burn-in reveals, no hero — gravitas IS the style |
| loud | loud | hype / sport / social | Anton + scene-sampled accent, single-unit slam + ripple; body ANNOUNCES in front (bodyLayer: fg) |
| neon | loud-cyber | cyberpunk / nightlife / tech-noir (dark scenes) | electric-cyan signage, ignition flicker, the hero powers ON like a sign |
| glitch | loud-cyber | digital / hacker / AI | RGB-split echoes snap together on landing; machine-percussive timing |
| chrome | loud-luxe | Y2K / fashion-tech / music | liquid-metal gradient hero + one sheen sweep during the hold |
| velocity | loud-sport | sport / auto / fitness | every word arrives along its motion vector (streak+skew), hero passes with speed trails |
Pick by safe-zones.json (heroAnchor.bandLuma, palette.temperature) × content
register — dna/README.md has the decision rule. Authoring:
cinematic.json takes "dna": "<name>".
The engine generates the hero three-act from the DNA (no authoring needed): co-visible captions dim (setup) → per-letter entrance with amplitude ∝ spoken loudness (impact) → breathe + glow until exit (afterglow).
(Legacy: plan.template:"cinematic-cream" maps to dna:"cream" automatically.
The retired 54-template library lives outside the skill at
~/Downloads/embedded-captions-archive/standard-templates-54/; _motion.md remains
in-skill as the motion-verb reference catalog.)
Classify the clip on 3 axes and feed the result into CATALOG.md's shortlisting — this section never picks a mode/engine by itself:
Tone (what feel does the content have?)
Shot (what's the framing?)
Platform (where will it play?)
Cross-reference in references/direction-catalog.md § Classification matrix for direction language — then return to CATALOG.md to shortlist identities (this matrix informs the shortlist; the catalog is the only routing surface).
The full embed-track playbook lives in references/composition-craft.md:
transcript role-annotation, phrase grouping, planes & clean-zone anchoring, zone coherence,
climax pop & readability, edge-breathing, the occlusion 3-step judgement, and
accumulation/persistence. It governs how a promoted phrase sits INTO the scene — read it
before authoring any embed (Cinematic plan.json or Standard index.html). The default rail
track has its own, much simpler spec → references/rail.md.
| Doc | What |
|---|---|
| references/rail.md | The rail track — standard lower-third subtitle spec (the default; carries most text). |
| references/composition-craft.md | The embed-track playbook — grouping, planes, climax pop, occlusion judgement, accumulation/persistence. Read before embedding. |
| dna/README.md | The DNA registry — six scene-parameterized visual languages; how to pick. |
| references/reference-bar.md | The taste bar — per-register world-class references + the 5 positive checks. |
| references/aesthetic-principles.md | The 18 rules. Beat Veed AI on taste. Read first. |
| references/motion-vocabulary.md | 10 named motion primitives + tone→timing lookup |
| references/direction-catalog.md | 10 ship-ready aesthetics + tone×shot×platform matrix |
| references/anti-patterns.md | Bugs already locked out (CoreML, letter-spacing reflow, etc.) |
| references/scene-types.md | When a wall surface is usable (4 conditions) |
| references/layout-heuristics.md | Plane positioning, clean-zone selection, crown 3 conditions, pillarbox math |
| references/typography-presets.md | Font-size × column-width matrix (starting points) |
| references/caption-grouping.md | Word → group rules (pauses, sentence boundaries) |
| references/failure-modes.md | Long tail of dev gotchas |
| references/bespoke-vs-presets.md | Why presets fail sometimes; clone-and-tweak pattern |
Read the aesthetic principles and direction catalog FIRST. Everything else is implementation detail.
Math.random(), no Date.now(), no repeat:-1.apex. climax = per-beat peak, not "the single payoff of the entire clip."remove-background, u2net_human_seg, Apache-2.0). Human segmentation by intent, but not surgically: thin offset furniture (mic boom arms) is usually excluded — captions render over it, behind the person — while large salient objects NEAR the subject (a telescope, a desk rig) can still leak into the matte and occlude captions. Objects HELD by the subject (products, phones) may drop out intermittently, letting captions pass in front. NEVER assume: sample frames_fg/ at 2-3 timestamps before placing the hero, and prefer hero positions clear of any leaked furniture (heroAnchor can be skewed by leaks — cross-check against frames_bg).heroAnchor.centerXPct off the person). Before authoring, extract ONE frame of each band you intend to use; if a prop lives there, measure its bbox and move/shrink the plane. Two real cases shipped clean only because the agent did exactly this. (Auto prop-saliency is a known gap; zones' peakLuma only catches moving bright objects.)check-overflow.cjs as a WARNING (intentional bleed is the only exception — read the warning).check-timing.cjs --strict before rendering (via render-and-composite.sh); THEME mode enforces the same timings at compile time instead (make-theme's sequential transcript matcher + verbatim completeness gate — drift is a compile error). Never pack multiple transcript words into one entry (e.g. "FUTURE OF" or an IT + line-break + ALL stack with one start/end) — the second word inherits the first's timestamp and fires early. Split them into separate word entries with their own timings, even if you want them on the same visual line (use CSS white-space / natural wrap instead of <br>). Creative substitutions where caption text ≠ transcript (e.g. "15%" replacing "fifteen percent") are supported — register them in CREATIVE_SUBS inside check-timing.cjs.group.in ≤ min(word.start) and group.out ≥ max(word.end) for every group. If group.in is later than a word's start, the word is silently delayed until the container mounts (we've shipped 800ms lag bugs from this). The validator enforces this.out ≤ the next group's in so only one is on screen; (c) deliberate layered typography — add "allow_overlap": true on one of the groups to silence the validator. The validator estimates each group's vertical bbox from its CSS and flags collisions. Pick (a) by default — it's what makes cinematic-cream feel like a poem accumulating, not a subtitle track replacing itself.screen and that DNA is locked (the plan can't recolour them) → on a bright backdrop they wash out, so pick ink (letterpress built FOR bright surfaces) or the anchor theme (opaque rail surface) rather than overriding a look.letter-spacing or filter:blur on word entrance — inline-block reflow causes line-jumps.packages/cli/dist/cli.js). Scripts auto-resolve the checkout: HYPERFRAMES_ROOT env → repo root if this skill ships inside hyperframes → ~/Downloads/hyperframes. Build with bun install && bun run build.uvx (no manual installs): transcription runs WhisperX through uvx (word-level timings; falls back per SKILL §transcription), and Theme's drawon setpiece shells python3 scripts/gen-stroke-path.py at compile time. Everything else runs on the toolchain hyperframes already ships: matting via the hyperframes CLI's remove-background (u2net_human_seg; weights auto-download once, ~168 MB, to ~/.cache/hyperframes/), image/alpha math via sharp, layout/occlusion/overflow via puppeteer, plus ffmpeg. The scripts auto-resolve these from the hyperframes checkout — nothing extra to install.uvx (word-level timings + alignment; no manual install — transcribe.cjs drives uvx whisperx). Falls back to an existing word-level transcript.json if present.matte.cjs / transcribe.cjs auto-resolve source.mp4 (or glob the clip / read hyperframes.json), so hyperframes init --video X.mp4 needs no manual rename.matte.cjs extracts at the source's native rate and records matte.fps; render-and-composite.sh uses that so the matte stays frame-aligned.matte.cjs shells the hyperframes CLI's remove-background, which downloads u2net_human_seg (~168 MB, Apache-2.0) once to ~/.cache/hyperframes/background-removal/models/. First prepare on a fresh machine needs network for that one download.If a hard dependency is missing, STOP and ask the user — don't silently skip steps.
npx claudepluginhub heygen-com/hyperframes --plugin hyperframesGenerates CapCut-style animated word-level captions for viral videos using FFmpeg and ASS subtitles from Whisper word timestamps. For TikTok, YouTube Shorts, Instagram Reels.
Creates video compositions, animations, title cards, overlays, captions, voiceovers, audio-reactive visuals, and scene transitions in HyperFrames HTML.
Packages existing talking-head/interview/podcast videos with timed, designed graphic overlay cards (titles, lower-thirds, data callouts, quotes, side panels) synced to transcript. Useful for adding professional graphic packaging to a video.