Skill

podcast

Generates a 60-second two-host podcast video from a URL or free-form topic, with 4 acts of multi-shot dialogue and optional voice cloning. Use when the user asks to make a podcast, review a URL, or create an interview-style clip.

automation

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/pika:podcast <url-or-topic> [bg_img=] [host_a_img=] [host_b_img=] [voice_a=] [voice_b=] [use_avatar] [aspect_ratio=16:9]

User invocable

Model invocable

Inline context

Default effort

Argument hint: <url-or-topic> [bg_img=] [host_a_img=] [host_b_img=] [voice_a=] [voice_b=] [use_avatar] [aspect_ratio=16:9]

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

4 acts × 15s each = 60s. Host A always LEFT, Host B always RIGHT. Accepts a URL **or** a free-form topic / brief.

SKILL.md

208 lines · ~3.2k tokens

Stats

Language: HTML

Stars: 37

Forks: 6

Maintenance: Excellent

Last Commit: Jun 18, 2026

/pika:podcast

4 acts × 15s each = 60s. Host A always LEFT, Host B always RIGHT. Accepts a URL or a free-form topic / brief.

Parameters

| Param | Default | Notes |
|---|---|---|
| `input` | required | URL to review or free-form topic / brief (e.g. "I and Elon Musk talk about Mars") |
| `bg_img` | auto-generated | Podcast studio background |
| `host_a_img` | auto-generated | Host A portrait; see Real-person handling below |
| `host_b_img` | auto-generated | Host B portrait; see Real-person handling below |
| `voice_a` | `876341503281471517` | Kling preset or cloned voice ID for Host A |
| `voice_b` | `829837252279803904` | Kling preset or cloned voice ID for Host B |
| `use_avatar` | off | Clone user's identity voice as Host A via `clone_voice` |
| `aspect_ratio` | `16:9` | Output aspect ratio |

Defaults — fire fast, no mid-flow confirmation

- Use the param-table defaults silently for voices. voice_a defaults to the Kling preset 876341503281471517 and voice_b to 829837252279803904. Do not ask "which voice?" or "should I clone yours?" before firing; only honor explicit overrides (voice_a=, voice_b=, use_avatar).
- Auto-generate any missing host portraits silently (Step 1's archetype prompts). Do not ask "should I generate a host image?"; just generate.
- No "type yes to proceed" gates. Submit → render the 4 acts → return URL. Account credit balance + provider failover are the canonical guardrails. The --yes flag is accepted as a no-op for backward compatibility.
- Topic-mode personas (Step 3): when the user names a real public figure, follow Step 4 (Real-person handling) silently: archetype portrait by default, no auto-generated photographic likeness, no question to the user about likeness rights.

Local images on Claude Desktop

Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as host_a_img / host_b_img, pause Step 1 and kindly send them this — something like:

Heads up: pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your photo:

1. Paste a URL if it's already hosted (Imgur, S3, your site). This is the fastest route.
2. Attach the image file so I can upload it before generation.

When a local file arrives, convert it to a public URL with upload_asset and use the returned public_url as the parameter before Step 1. Already-hosted https://... URLs work as-is and skip this entirely.

If the user names a real public figure without attaching anything, do NOT auto-generate their likeness — Step 4 (Real-person handling) uses an archetype portrait instead.

Steps

0. Resolve input (empty-args menu)

Strip flags (--yes, --no-captions, etc.) and key=value parameters from $ARGUMENTS. If what remains is empty or whitespace-only, print this menu verbatim as your full response, then stop and wait for the user's next message — do NOT call any tool, do NOT proceed to Step 1, do NOT invent a topic or URL. If the stripped input is non-empty (a URL or any prose), skip this step silently and proceed to Step 1.

What would you like a podcast about? I can take any of:

A website URL (product page, docs site, launch page) — e.g. https://pika.art

A GitHub repo — e.g. https://github.com/anthropics/claude-code

A blog post / article URL — e.g. a recent piece you'd like discussed

A free-form topic or brief — e.g. "I and Elon Musk talk about Mars" or "two scientists debate AGI"

Reply with your choice and I'll generate a 1-minute two-host podcast video (4 acts × ~15s).

Tip: you don't need to type /pika:podcast. Just say things like "make a podcast about <topic>", "podcast review of <url>", or "I and <person> talk about <topic>" and I'll fire this skill automatically.

When the user replies, treat their reply as the resolved input (URL or topic) and proceed to Step 1. Do not re-prompt.
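The stripping rule in Step 0 can be sketched as a small tokenizer. This is a minimal illustration, not the skill's actual parser: the function names `resolve_input` and `needs_menu` are hypothetical, and treating only the keys from the argument hint as key=value parameters is an assumption made so that prose or URLs containing `=` are left alone.

```python
# Keys and the bare switch come from the argument hint; everything else is assumed.
PARAM_KEYS = {"bg_img", "host_a_img", "host_b_img", "voice_a", "voice_b", "aspect_ratio"}
BARE_FLAGS = {"use_avatar"}


def resolve_input(arguments: str):
    """Split raw arguments into (remaining input, key=value params, flags)."""
    params, flags, rest = {}, set(), []
    for token in arguments.split():
        key, sep, value = token.partition("=")
        if token.startswith("--"):
            flags.add(token[2:])          # e.g. --yes, --no-captions
        elif token in BARE_FLAGS:
            flags.add(token)
        elif sep and key in PARAM_KEYS:
            params[key] = value
        else:
            rest.append(token)            # part of the URL or topic
    return " ".join(rest), params, flags


def needs_menu(arguments: str) -> bool:
    """True when nothing but flags and parameters was supplied."""
    text, _, _ = resolve_input(arguments)
    return text == ""
```

With this sketch, `needs_menu("--yes aspect_ratio=16:9")` is true, so the menu is printed, while any URL or prose falls through to Step 1.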

1. Generate missing assets (parallel)

Generate only what's not provided. Default archetype prompts:

- bg_img: modern podcast studio, two chairs, warm lighting, no people, 16:9
- host_a_img: enthusiastic host, studio portrait, left-side framing, 1:1
- host_b_img: pragmatic skeptic host, studio portrait, right-side framing, 1:1

If the input mentions specific personas (Step 3), tune the archetype to match the persona vibe — see Real-person handling below.
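"Generate only what's not provided" amounts to a dictionary filter over the three default prompts. The sketch below assumes the parsed parameters arrive as a plain dict; `assets_to_generate` is a hypothetical helper name, and the actual image-generation call is deliberately left out because the source does not name it.

```python
# Default archetype prompts, copied from the list above.
DEFAULT_PROMPTS = {
    "bg_img": "modern podcast studio, two chairs, warm lighting, no people, 16:9",
    "host_a_img": "enthusiastic host, studio portrait, left-side framing, 1:1",
    "host_b_img": "pragmatic skeptic host, studio portrait, right-side framing, 1:1",
}


def assets_to_generate(params: dict) -> dict:
    """Return {asset_name: prompt} for every asset the user did not supply."""
    return {
        name: prompt
        for name, prompt in DEFAULT_PROMPTS.items()
        if not params.get(name)
    }
```

The returned entries are independent of each other, which is what lets this step run in parallel.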

2. Resolve voice IDs (only if `use_avatar` is set)

- Call identity_voice_info → { voice_id, platform, sample_url }
- If sample_url is present, call clone_voice(voice_url=sample_url, voice_name="host_a_voice") and set voice_a to the returned Kling voice ID.
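The voice resolution above can be sketched with the two tools passed in as callables. Several things here are assumptions rather than documented behavior: that `clone_voice` returns the voice ID as a plain string, that `identity_voice_info` returns a dict, and that `use_avatar` takes precedence over an explicit `voice_a=` when both are given (the source does not say which wins).

```python
DEFAULT_VOICE_A = "876341503281471517"
DEFAULT_VOICE_B = "829837252279803904"


def resolve_voices(params, use_avatar, identity_voice_info, clone_voice):
    """Return (voice_a, voice_b) after applying overrides and optional cloning."""
    voice_a = params.get("voice_a", DEFAULT_VOICE_A)
    voice_b = params.get("voice_b", DEFAULT_VOICE_B)
    if use_avatar:
        info = identity_voice_info()
        sample_url = info.get("sample_url")
        if sample_url:
            # Assumed: the tool hands back the new Kling voice ID directly.
            voice_a = clone_voice(voice_url=sample_url, voice_name="host_a_voice")
    return voice_a, voice_b
```

When `use_avatar` is off, neither tool is called and the preset or user-supplied IDs pass straight through.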

3. Parse input mode — URL vs topic

Strip flags (--yes, --no-captions, etc.) and key=value parameters from $ARGUMENTS. Inspect what remains.

URL mode — input contains a https?:// URL:

- Call capture_website on the URL.
- Extract: product name, value prop, 2–3 specific features or facts, pricing, one jokeable detail.
- Use these as the script's factual anchors.

Topic mode — input is free-form prose (no URL):

Treat the whole input as the brief. Parse for:
- Subject — what the conversation is about
- Hosts — explicit if mentioned ("I and Elon Musk", "two scientists", "Joe and Sarah"); otherwise use defaults (enthusiastic host + skeptic host)
- Angle — debate / interview / explainer / casual
- Concrete facts — any specific claims, numbers, dates, quotes the user gave
If no concrete facts are given, use 2–3 clearly framed observations or hypotheses to anchor jokes and the "wait, actually..." pivot. Do not present invented claims as facts; if factual accuracy matters for the topic, ask for a source or URL.
If the user says "I and X" or "me and X", Host A is the user and Host B is X. For Host A, apply the use_avatar flow if it is not already active, or fall back to the default avatar.
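The URL-versus-topic decision and the "I and X" check can be sketched with two regular expressions. The patterns and the function names `detect_mode` and `user_is_host_a` are illustrative assumptions; in particular, the URL pattern is naive and would keep trailing punctuation attached to a URL.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Matches "I and ..." or "me and ..." as whole words, case-insensitively.
USER_HOST_RE = re.compile(r"\b(?:I|me)\s+and\s+\w", re.IGNORECASE)


def detect_mode(text: str):
    """Return ("url", first_url) if the input contains a URL, else ("topic", None)."""
    match = URL_RE.search(text)
    if match:
        return "url", match.group(0)
    return "topic", None


def user_is_host_a(text: str) -> bool:
    """True when the brief casts the user as a host ("I and X" / "me and X")."""
    return bool(USER_HOST_RE.search(text))
```

A brief that mixes prose with a URL resolves to URL mode here, since the first URL found wins.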

4. Real-person handling (topic mode only)

If the parsed input names a specific real public figure as a host (e.g. "Elon Musk", "Taylor Swift", "Joe Rogan"):

- Default behavior: do NOT auto-generate that person's photographic likeness. Generate an archetype portrait matching the persona vibe, e.g. "tech-billionaire-energy CEO at a podcast desk" for an Elon-style host, "pop-star aesthetic" for a Taylor-style host. Clearly inspired-by, not impersonation.
- Override: if the user explicitly provides host_a_img=<url> or host_b_img=<url>, use the provided image as-is. The user takes responsibility for likeness rights.
- Voices: same logic. Default to a generic Kling preset; only use a cloned voice when the user provides one (voice_a= / voice_b=) or invokes use_avatar (which clones the user's own voice for Host A).
- Script tone: the dialogue can riff on the named persona's known public positions or vibe (e.g. Mars enthusiasm for Elon-style); public-record opinions are fair game. Do NOT put specific defamatory, off-character, or fabricated-private-life statements in their mouth.

This guardrail keeps the skill creative ("I want a podcast where I argue with a tech CEO about Mars") without auto-generating deepfakes of named real people.

5. Write script

Write 4 acts × 2 lines (HOST_A / HOST_B). Each act carries ~10s of spoken dialogue in total: every line has to fit the 5s close-up it is spoken in (see Step 6).

Required (Matan rules — apply to both URL and topic modes):

- One specific joke tied to a concrete detail (scraped fact in URL mode; topic-derived claim in topic mode)
- One "wait, actually..." skeptic-flip moment
- At least one mid-sentence interruption
- Natural filler: "okay so", "wait", "right?", "i mean", "honestly"
- Real reactions, not generic praise
- Reference at least one actual feature name, price, claim, or quote
- Natural ending, no forced "bye!"

Acts: Hook → Feature deep-dive → The Turn → Verdict. (In topic mode, the analogue is Hook → Substance → The Pivot → Verdict.)
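A few of the script requirements above are mechanical enough to lint before rendering. The sketch below assumes the script is a list of dicts with `HOST_A` and `HOST_B` string values; that shape and the name `check_script` are hypothetical. It only covers act count, missing lines, the skeptic-flip phrase, and filler; jokes, interruptions, and tone still need a human or model read.

```python
ACT_NAMES = ["Hook", "Feature deep-dive", "The Turn", "Verdict"]
FILLERS = ("okay so", "wait", "right?", "i mean", "honestly")


def check_script(acts):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(acts) != len(ACT_NAMES):
        problems.append(f"expected {len(ACT_NAMES)} acts, got {len(acts)}")
    for number, act in enumerate(acts, start=1):
        for host in ("HOST_A", "HOST_B"):
            if not act.get(host, "").strip():
                problems.append(f"act {number} is missing a {host} line")
    spoken = " ".join(line.lower() for act in acts for line in act.values())
    if "wait, actually" not in spoken:
        problems.append('no "wait, actually..." skeptic-flip moment')
    if not any(filler in spoken for filler in FILLERS):
        problems.append("no natural filler phrases")
    return problems
```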

6. Generate video acts (subagent, sequential)

Delegate to a subagent with all resolved assets and the script. The subagent runs acts 1→2→3→4 sequentially — do NOT parallelize.

Each act is one generate_reference_video call (kling-v3-omni, duration=15, sound=true). Pass reference_images=[bg_img, host_a_img, host_b_img] and voice_ids=[voice_a, voice_b]. Two optional knobs were added by pika-mcp-server BACK-339 (2026-05-10): quality_mode: "pro" gives higher-fidelity Kling output at the cost of longer wall-clock time, so reserve it for high-stakes renders; kling_model pins a specific Kling family member when you need reproducibility across runs. Each act uses three shots:

- Wide 5s: both hosts, no voice token
- MCU-A 5s: <<<voice_1>>> '<HOST_A line>'
- MCU-B 5s: <<<voice_2>>> '<HOST_B line>'

Emotional beats per act:

- Act 1: A excited, B skeptical
- Act 2: A gesturing/explaining, B questioning
- Act 3: A firm, B surprised and reconsidering
- Act 4: A satisfied, B conceding

After act 4, subagent calls edit_concat([act1, act2, act3, act4]) and returns the final video URL.
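The sequential render loop can be sketched as follows, with the two tools injected as callables. The tool names `generate_reference_video` and `edit_concat` come from the text above, but the keyword names `model` and `prompt`, the exact shot-plan wording, and the assumption that each call returns a clip reference are illustrative guesses rather than the documented MCP signature.

```python
def build_shot_prompt(host_a_line: str, host_b_line: str) -> str:
    """Three 5s shots per act; the wide shot carries no voice token."""
    return "\n".join([
        "Wide 5s: both hosts",
        f"MCU-A 5s: <<<voice_1>>> '{host_a_line}'",
        f"MCU-B 5s: <<<voice_2>>> '{host_b_line}'",
    ])


def render_episode(script, assets, voices, generate_reference_video, edit_concat):
    """Render acts one at a time, in order, then concatenate them."""
    clips = []
    for act in script:  # plain loop: each act finishes before the next starts
        clip = generate_reference_video(
            model="kling-v3-omni",       # assumed keyword name
            prompt=build_shot_prompt(act["HOST_A"], act["HOST_B"]),  # assumed keyword name
            duration=15,
            sound=True,
            reference_images=[assets["bg_img"], assets["host_a_img"], assets["host_b_img"]],
            voice_ids=[voices[0], voices[1]],
        )
        clips.append(clip)
    return edit_concat(clips)
```

Keeping the loop strictly ordered mirrors the "do NOT parallelize" rule; the reference image order is identical in every call so the hosts stay on their sides.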

7. Output

Return the final video URL and a one-sentence verdict. Do not call add_captions — Whisper auto-transcription is unreliable on the domain-specific terms typical of podcast dialogue (product names, persona names, technical jargon). Native Kling Omni audio is the deliverable.

Rules:

- voice_ids must be valid Kling voice IDs; never use name-style strings like Calm_Man
- Host A always LEFT (<<<image_2>>>), Host B always RIGHT (<<<image_3>>>), never swapped
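Both rules can be guarded with a cheap pre-flight check. The sketch rests on two assumptions: that Kling voice IDs are purely numeric strings, which is inferred only from the two default IDs in the parameter table, and that image tokens are numbered by position in reference_images, which would make the background `<<<image_1>>>`. The helper names are hypothetical.

```python
REFERENCE_ORDER = ["bg_img", "host_a_img", "host_b_img"]


def looks_like_kling_voice_id(value) -> bool:
    """Heuristic: accept digit-only strings, reject names such as "Calm_Man"."""
    return isinstance(value, str) and value.isascii() and value.isdigit()


def image_token(asset_name: str) -> str:
    """Map an asset to its positional token, assuming 1-based numbering."""
    return f"<<<image_{REFERENCE_ORDER.index(asset_name) + 1}>>>"
```

Under these assumptions `image_token("host_a_img")` yields `<<<image_2>>>` and `image_token("host_b_img")` yields `<<<image_3>>>`, matching the rule above.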

Load-bearing phrases

These anchors keep the podcast output coherent across URL and topic modes:

| Phrase | Where | Why load-bearing |
|---|---|---|
| `Host A always LEFT, Host B always RIGHT` | Layout and shot prompts | Prevents host identity swapping across the four separate act renders. |
| `4 acts × 15s each` | Overall structure | Keeps the concat predictable and avoids uneven act pacing. |
| `Hook → Feature deep-dive → The Turn → Verdict` | Script structure | Gives the episode a conversational arc instead of four disconnected reactions. |
| `wait, actually...` skeptic-flip moment | Script requirements | Creates the pivot that makes the podcast feel like a real exchange. |
| `Do not call add_captions` | Output rule | Avoids low-quality burned captions on fast two-host dialogue with names and jargon. |

Engine choice: Kling v3-omni for native two-host dialogue

Use Kling v3-omni for the four acts because it supports native dialogue with two reference hosts and voice tokens in a single shot plan. The tradeoff is that acts run sequentially for consistency and can take longer than pure edit/composite flows. Do not add a separate caption or music layer by default; the value of this skill is the native spoken exchange.

Runtime expectations

Typical wall-clock is 8-18 minutes:

| Step | Wall clock | Notes |
|---|---|---|
| Missing asset generation | 30-90s | Skipped for provided background/host refs |
| URL/topic parse + script | 1-3 min | URL mode depends on page fetch quality |
| Four Kling acts | 6-14 min | Runs sequentially to reduce host/voice drift |
| Concat + return | 30-90s | Final URL only; captions skipped by default |
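As a sanity check on the table, the per-step ranges can be summed directly. The figures below are the table's own, converted to seconds; the only assumption is that the steps run back to back with no overlap.

```python
# (min_seconds, max_seconds) per step, taken from the table above.
STEP_RANGES = {
    "Missing asset generation": (30, 90),
    "URL/topic parse + script": (60, 180),
    "Four Kling acts": (360, 840),
    "Concat + return": (30, 90),
}

low_minutes = sum(low for low, _ in STEP_RANGES.values()) / 60
high_minutes = sum(high for _, high in STEP_RANGES.values()) / 60

print(f"{low_minutes:.0f}-{high_minutes:.0f} minutes")  # prints "8-20 minutes"
```

The lower bound matches the quoted 8 minutes. The summed upper bound is 20 minutes, slightly above the quoted 18, presumably because a typical run does not hit the worst case on every step at once.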

Examples

URL mode (review a website / repo / blog):

/pika:podcast https://pika.art
/pika:podcast https://github.com/anthropics/claude-code
/pika:podcast https://cursor.com use_avatar

Topic mode (free-form brief):

/pika:podcast Two AI researchers debate whether AGI arrives before 2030
/pika:podcast I and a Mars-obsessed tech CEO talk about colonization timelines
/pika:podcast interview with a seed-stage VC about what kills most startups
/pika:podcast podcast about quantum computing breakthroughs in 2026

Mixed (URL inside a topic prompt — agent prefers URL mode if a valid URL is found):

/pika:podcast podcast about https://pika.art with skeptical investor energy
