From watch
Downloads videos from YouTube, Instagram, X/Twitter, Vimeo, TikTok, or local paths, extracts frames and transcripts (via captions or on-device mlx-whisper), and lets Claude answer questions about the video content.
How this skill is triggered — by the user, by Claude, or both
Slash command
/watch:watch <video-url-or-path> [question]<video-url-or-path> [question]This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You don't have a video input; this skill gives you one. A Python script downloads the video, extracts frames as JPEGs, gets a timestamped transcript (native captions first, then **local mlx-whisper** as fallback — runs on-device, no API and no key), and prints frame paths. You then `Read` each frame path to see the images and combine them with the transcript to answer the user.
You don't have a video input; this skill gives you one. A Python script downloads the video, extracts frames as JPEGs, gets a timestamped transcript (native captions first, then local mlx-whisper as fallback — runs on-device, no API and no key), and prints frame paths. You then Read each frame path to see the images and combine them with the transcript to answer the user.
/watch invocation, silent on success)Python interpreter: every python3 ... command in this skill is for macOS/Linux. On Windows, substitute python — the python3 command on Windows is the Microsoft Store stub and will not run the script.
Before every /watch run, verify that dependencies are in place:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --check
This is a <100ms lookup. On exit 0, the script emits nothing — proceed to Step 1 without comment. Do NOT announce "setup is complete" to the user — they don't need a status message on every turn. The only acceptable user-visible output from Step 0 is when remediation is required.
On non-zero exit, follow the table:
| Exit | Meaning | Action |
|---|---|---|
2 | Missing binaries (ffmpeg / ffprobe / yt-dlp) | Run installer |
3 | No local whisper engine (mlx-whisper / openai-whisper) | Run installer, then tell user the pip3 command it prints |
4 | Both missing | Run installer |
The installer is idempotent — safe to re-run:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py"
On macOS with Homebrew, it auto-installs ffmpeg and yt-dlp. On Linux/Windows, it prints the exact install commands for the user to run. For transcription it checks for a local whisper engine (mlx-whisper preferred on Apple Silicon, openai-whisper as a CPU fallback) and prints the pip3 install command if neither is present. No API key, no config file, no .env — transcription runs entirely on-device.
If no whisper engine is installed: run the installer and relay the exact pip3 install … command it prints (mlx-whisper on Apple Silicon, openai-whisper on Windows/Linux/Intel Macs — do not assume mlx, it only installs on Apple Silicon). If they don't want to install it, proceed with --no-whisper and tell them videos without native captions will come back frames-only.
Structured mode (optional): python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --json emits {status, missing_binaries, whisper_backend, has_whisper, platform} where status is one of ready | needs_install | needs_whisper | needs_install_and_whisper.
Within a single session, you can skip Step 0 on follow-up /watch calls — once --check returned 0, nothing about the environment changes between turns.
.mp4, .mov, .mkv, .webm, etc.) and asks about it./watch <url-or-path> [question].Step 1 — parse the user input. Separate the video source (URL or path) from any question the user asked. Example: /watch https://youtu.be/abc what language is this in? → source = https://youtu.be/abc, question = what language is this in?.
Step 2 — run the watch script. Pass the source verbatim. Do not shell-escape it yourself beyond normal quoting:
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "<source>"
Optional flags:
--start T / --end T — focus on a section. Accepts SS, MM:SS, or HH:MM:SS. When either is set, fps auto-scales denser (see "Focusing on a section" below).--max-frames N — lower the cap for tighter token budget (e.g. --max-frames 40)--resolution W — change frame width in px (default 512; bump to 1024 only if the user needs to read on-screen text)--fps F — override auto-fps (clamped to 2 fps max)--out-dir DIR — keep working files somewhere specific (default: an auto-generated tmp dir)--cookies-from-browser B — read cookies from a local browser (chrome, firefox, safari, edge, brave, …) for login-gated sources--cookies FILE — path to a Netscape-format cookies.txt (alternative to --cookies-from-browser)--whisper mlx|openai-whisper — force a specific local Whisper engine (default: prefer mlx-whisper, fall back to openai-whisper)--no-whisper — disable the local Whisper fallback entirely (frames-only if no captions)Public videos (most of YouTube, Vimeo, TikTok, Loom, etc.) download with no auth. But some sources gate the download behind a login: Instagram, X/Twitter, age-restricted or private/unlisted YouTube, members-only content. Those need the user's own cookies.
Do NOT pass cookies pre-emptively. Always try the plain download first. Only reach for cookies when it fails with a login / private / 403 / "login required" / "rate-limit" error. The user never types the flag themselves — you add it and re-run. When that happens, walk the user through it (these are sub-steps of the main Step 2, not the main flow):
(a) Ask which browser they're logged into. "To grab this Instagram video I need to borrow the cookies from a browser where you're logged into Instagram. Which one are you logged in on — Chrome, Safari, Firefox, Edge, or Brave?" Supported values: chrome, firefox, safari, edge, brave, chromium, opera, vivaldi.
(b) Re-run with that browser (on Windows use python, not python3 — see Step 0):
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "https://www.instagram.com/reel/XXXX/" --cookies-from-browser chrome
(c) Handle the common per-browser snags (tell the user the specific fix, don't just retry):
(d) Manual fallback if browser extraction just won't cooperate (most reliable, works on macOS / Windows / Linux): guide the user to export a cookies.txt and pass it with --cookies:
instagram.com).~/Downloads/cookies.txt).python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "<url>" --cookies ~/Downloads/cookies.txtPrivacy note to reassure the user: cookies are read live from their own machine and piped straight into the yt-dlp subprocess. The skill never copies, stores, logs, or transmits them anywhere. The cookies.txt file (if they used the manual fallback) stays on their disk — they can delete it after.
When the user asks about a specific moment — "what happens at the 2 minute mark?", "zoom into 0:45 to 1:00", "the first 10 seconds" — pass --start and/or --end. The script switches to focused-mode budgets, which are denser than full-video budgets (still capped at 2 fps):
Focused mode is the right call for:
Transcript is auto-filtered to the same range. Frame timestamps are absolute (real video timeline, not offset-from-start).
Examples:
# Last 10 seconds of a 1 minute video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" video.mp4 --start 50 --end 60
# Zoom into 2:15 → 2:45 at 3 fps (90 frames)
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 2:15 --end 2:45 --fps 3
# From 1h12m to the end of the video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 1:12:00
Step 3 — Read every frame path the script lists. The Read tool renders JPEGs directly as images for you. Read all frames in a single message (parallel tool calls) so you see them together. The frames are in chronological order with a t=MM:SS timestamp so you can align them to the transcript.
Step 4 — answer the user. You now have two streams of evidence:
captions = yt-dlp pulled native subs; whisper (mlx) or whisper (openai-whisper) = transcribed locally on-device).If the user asked a specific question, answer it directly citing timestamps. If they didn't ask anything, summarize what happens in the video — structure, key moments, notable visuals, spoken content.
Step 5 — clean up. The script prints a working directory at the end. If the user isn't going to ask follow-ups about this video, delete it with rm -rf <dir>. If they might, leave it in place.
The script gets a timestamped transcript in one of two ways:
ffmpeg -vn -ac 1 -ar 16000 -b:a 64k, ~0.5 MB/min) and transcribes it locally:
mlx-community/whisper-large-v3-turbo. Preferred on Apple Silicon: fast, runs on the GPU/Neural Engine. Same engine ig-scraper uses. Install: pip3 install mlx-whisper.base model on CPU. Cross-platform fallback when mlx isn't available. Install: pip3 install openai-whisper.The audio never leaves the machine. The script prefers mlx-whisper; override with --whisper openai-whisper. Language is auto-detected. Use --no-whisper to skip the fallback entirely.
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" (auto-installs ffmpeg/yt-dlp via brew on macOS; prints exact commands on Windows/Linux). If it reports no whisper engine, relay the exact pip3 install … command it printed (mlx-whisper only on Apple Silicon, openai-whisper elsewhere).--no-whisper set OR transcription failed). Script prints a hint pointing to setup. Proceed frames-only and tell the user.--start/--end rather than a sparse full-video scan.--cookies-from-browser <browser> using a browser the user is logged into (see "Login-gated sources" above). If it's region-locked or genuinely unavailable, tell the user plainly; do not keep retrying.This skill burns tokens primarily on frames. Order of magnitude:
--resolution to 1024 roughly quadruples the image tokens per frame. Only do it when necessary.If you already watched a video this session and the user asks a follow-up, do not re-run the script — you already have the frames and transcript in context. Just answer from what you have.
What this skill does:
yt-dlp locally to download the video and pull native captions when the source supports them (public data; the request goes directly to whatever host the URL points at)ffmpeg / ffprobe locally to extract frames as JPEGs and, when Whisper is needed, a mono 16 kHz audio clip--out-dir if specified) so Claude can Read them~/.cache/huggingfaceWhat this skill does NOT do:
--cookies-from-browser / --cookies for a login-gated source, and only to authenticate the yt-dlp download. Those cookies are read live and never copied, stored, logged, or transmitted by the skill.env, no config file, no secretsBundled scripts: scripts/watch.py (entry point), scripts/download.py (yt-dlp wrapper), scripts/frames.py (ffmpeg frame extraction), scripts/transcribe.py (caption parsing), scripts/whisper.py (local mlx/openai-whisper transcription), scripts/setup.py (preflight + installer)
Review scripts before first use to verify behavior.
npx claudepluginhub mathiaschu/watch --plugin watchExtracts scene-change frames, pacing metrics, and transcript from video URLs or local paths; produces structured report for editorial analysis.
Ingests video/audio from files, URLs, RTSP feeds, or desktop capture; indexes visual/spoken content for search; transcodes, edits timelines, generates assets, and creates real-time alerts.
Analyzes video files or YouTube URLs: extracts frames/audio, detects scenes/motion/silence/transitions via ffmpeg tools with structured workflow.