From learn-from-video
Use when the user provides a video URL (YouTube, Vimeo, Loom, or any yt-dlp-supported site) or local video file path and wants to extract knowledge from it. Triggers: 'learn from this video', 'watch this', 'what does this video teach', any YouTube/Vimeo/Loom URL with a learning intent, local .mp4/.mkv/.webm file paths with learning intent. Also triggers on /learn-from-video slash command. Do NOT use for: playing videos, downloading videos without learning intent, or audio-only files.
Slash command: /learn-from-video:learn-from-video
Extract knowledge from any video by combining transcript analysis with visual keyframe inspection. Supports any yt-dlp-compatible URL or local video file.
Set $PLUGIN_ROOT for all subsequent commands:
PLUGIN_ROOT=$(find ~/.claude/plugins/cache -path "*/learn-from-video/*/scripts/check-deps.sh" 2>/dev/null | sort -V | tail -1 | xargs dirname | xargs dirname)
If $PLUGIN_ROOT is empty, report: "learn-from-video plugin not found. Install with: claude plugin install learn-from-video@etma-tools" and stop.
Key paths:
- Scripts: $PLUGIN_ROOT/scripts/
- Docker files: $PLUGIN_ROOT/docker/
- Script interfaces reference: $PLUGIN_ROOT/references/script-interfaces.md
The user provides: <url-or-path> [intent description] [flags]
Flags:
- --force: re-process a video that was already learned
- --sensitivity low|medium|high: keyframe detection sensitivity (default: medium)
- --type terminal|slides|talking_head|screen_recording: override video type detection
- --skip-verify: skip currency verification of mentioned tools/packages
- --force-whisper: skip captions/subtitles, always use Whisper for transcription (higher quality but slower)
- --whisper-model <model>: override whisper model (default: auto-selected based on available VRAM)
Intent: Free-form text describing what the video contains. If not provided, do a quick transcript scan and ask the user.
IMPORTANT: Use the built-in fallback tiers. Do NOT install Python packages, npm modules, or write inline scripts to work around failures. Each step has a fallback designed for when the previous tier fails. If all tiers fail, report it to the user — don't improvise.
Check if ~/.claude/learned/.config.json exists. If it does, read it and apply saved preferences.
If it does NOT exist (first run), run Step 1 (check-deps) FIRST to detect hardware, then present the user with personalized time estimates.
Detect hardware for estimates:
If nvidia_runtime is true, probe VRAM:
VRAM_MB=$(docker run --rm --gpus all learn-from-video:full -c \
"python3 -c 'import torch; print(torch.cuda.get_device_properties(0).total_memory // 1048576)'" 2>/dev/null || echo "0")
GPU_NAME=$(docker run --rm --gpus all learn-from-video:full -c \
"python3 -c 'import torch; print(torch.cuda.get_device_properties(0).name)'" 2>/dev/null || echo "Unknown")
Then ask the user with their specific estimates:
"This is your first time using learn-from-video. One quick preference:
Transcription method: Whisper (local AI) produces significantly better transcripts than YouTube's auto-captions — proper sentences, correct technical terms, no duplicates.
Your hardware: [GPU_NAME] ([VRAM_MB] MB VRAM), or "No GPU detected (CPU only)"
Whisper model: [selected model]
Estimated transcription times on your hardware:
| Video Length | Whisper | YouTube captions |
|---|---|---|
| 5 min | ~[X] | instant |
| 10 min | ~[X] | instant |
| 30 min | ~[X] | instant |
| 60 min | ~[X] | instant |
Always use Whisper for best quality? (y/n)"
Fill in [X] with calculated estimates based on detected hardware and model.
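One way to fill in the [X] values is to scale a per-10-minute rate linearly with video length. The sketch below reuses the rough rates this document lists for each Whisper model. It assumes the large-v3, medium, and small figures apply to GPU runs, and it takes the midpoint of the stated CPU range. The function names and the rate table are illustrative, not part of the plugin.

```python
# Approximate minutes of processing per 10 minutes of video, taken from the
# model list in this document. "base" has no stated rate, so it is omitted.
# Assumption: the large-v3/medium/small figures describe GPU runs.
RATE_PER_10_MIN = {
    ("gpu", "large-v3"): 3.0,
    ("gpu", "medium"): 2.0,
    ("gpu", "small"): 1.0,
    ("cpu", "medium"): 12.5,  # midpoint of the stated 10-15 min range
}


def estimate_minutes(video_minutes, device, model):
    """Linearly scale the per-10-minute rate; return None if no rate is known."""
    rate = RATE_PER_10_MIN.get((device, model))
    if rate is None:
        return None
    return video_minutes * rate / 10.0


def estimate_row(video_minutes, device, model):
    """Format one line of the estimate shown to the user."""
    est = estimate_minutes(video_minutes, device, model)
    label = "unknown" if est is None else f"~{est:g} min"
    return f"{video_minutes} min video: {label} (Whisper), instant (YouTube captions)"


if __name__ == "__main__":
    for length in (5, 10, 30, 60):
        print(estimate_row(length, "gpu", "large-v3"))
```

Linear scaling is only a first approximation: model load time and hardware differences will shift real numbers, so the estimates should be presented as rough.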
Save their choice to ~/.claude/learned/.config.json:
{
"always_whisper": true,
"gpu_name": "NVIDIA GeForce RTX 3090",
"vram_mb": 24123,
"whisper_model": "large-v3",
"created_at": "2026-03-20T01:00:00Z"
}
If always_whisper is true, treat it the same as --force-whisper for all future runs (user can still override per-run behavior with the flag or by editing the config).
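The precedence between the saved preference and the per-run flag can be sketched as follows. This is a minimal illustration that assumes the config file has the shape shown above. The helper names `load_config` and `use_whisper_only` are hypothetical.

```python
import json
from pathlib import Path


def load_config(path):
    """Return the saved preferences, or an empty dict on first run."""
    p = Path(path).expanduser()
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def use_whisper_only(force_whisper_flag, config):
    """True when tiers 1 and 2 should be skipped in favour of Whisper."""
    return bool(force_whisper_flag or config.get("always_whisper", False))


if __name__ == "__main__":
    cfg = load_config("~/.claude/learned/.config.json")
    print(use_whisper_only(False, cfg))
```

A missing file is treated as "no preference saved", which matches the first-run branch described earlier.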
Run: bash $PLUGIN_ROOT/scripts/check-deps.sh
Parse the JSON output. If exit code is non-zero, report the error and stop.
Note the nvidia_runtime and image availability for later steps.
Generate slug from URL/path:
VIDEO_TITLE=$(docker run --rm learn-from-video:light -c "yt-dlp --get-title '<url>'" 2>/dev/null)
VIDEO_CHANNEL=$(docker run --rm learn-from-video:light -c "yt-dlp --print channel '<url>'" 2>/dev/null)
Store both for use in metadata.json later.
Check if ~/.claude/learned/<slug>/ exists:
- Exists and no --force: inform user "Already learned this video. Use --force to re-process." and stop.
- Exists and --force: proceed (will overwrite).
Create the output directory: mkdir -p ~/.claude/learned/<slug>/keyframes
If --force-whisper is set OR always_whisper is true in ~/.claude/learned/.config.json: skip tiers 1 and 2 entirely, go straight to tier 3. Note: whisper needs the video file. Download it first if not local:
docker run --rm -v "$OUTPUT_DIR:/out" learn-from-video:light -c \
"yt-dlp -f 'bestvideo[height<=720][vcodec!*=av01]+bestaudio/best[height<=720]' -S 'vcodec:h264' --merge-output-format mp4 -o '/out/video.mp4' '<url>'"
Then pass $OUTPUT_DIR/video.mp4 to transcribe.sh. Clean up the video file after transcription completes.
Tier 1 — YouTube Transcript MCP:
- Call mcp__plugin_learn-from-video_youtube-transcript__get_transcript with the URL and lang "en"
- Failure signs: [object Object], an empty array, empty text, or a result with 0 segments. ALL of these count as failures.
- On success, save the result to ~/.claude/learned/<slug>/transcript.json
Tier 2 — Embedded/auto subtitles:
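- Uses the output of the extract-video.sh run in Step 4 Phase D
- Check whether embedded_subs.srt or autosub.en.vtt exists in the output dir
- If either exists, convert it to the unified transcript format:
docker run --rm -v "$OUTPUT_DIR:/data" learn-from-video:light -c "python3 << 'PYEOF'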
import json, re, os

segments = []
source = None
for subfile in ['/data/embedded_subs.srt', '/data/autosub.en.vtt']:
    if not os.path.exists(subfile):
        continue
    with open(subfile, encoding='utf-8') as f:
        text = f.read()
    source = 'embedded-subs' if 'embedded' in subfile else 'yt-dlp-autosub'
    # Parse SRT/VTT timestamps: 00:01:23,456 or 00:01:23.456
    # [^\n]* skips optional VTT cue settings after the end timestamp
    for m in re.finditer(r'(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2})[,.](\d{3})[^\n]*\n(.+?)(?:\n\n|\Z)', text, re.DOTALL):
        start = int(m[1])*3600 + int(m[2])*60 + int(m[3]) + int(m[4])/1000
        end = int(m[5])*3600 + int(m[6])*60 + int(m[7]) + int(m[8])/1000
        # Strip inline markup such as <c> tags and per-word timestamps
        txt = re.sub(r'<[^>]+>', '', m[9]).strip()
        if txt:
            segments.append({'start': round(start, 2), 'end': round(end, 2), 'text': txt})
    # Stop at the first subtitle file that yields segments
    if segments:
        break

if segments:
    with open('/data/transcript.json', 'w', encoding='utf-8') as f:
        json.dump({'source': source, 'language': 'en', 'segments': segments}, f, indent=2)
    print(f'Parsed {len(segments)} segments')
else:
    print('No parseable segments found')
PYEOF"
Tier 3 — Whisper local transcription:
- If the user passed --whisper-model, pass it as --model to the script
- Run: bash $PLUGIN_ROOT/scripts/transcribe.sh <video-path> ~/.claude/learned/<slug>/ --device auto [--model <whisper-model>]
- Note: the user-facing flag is --whisper-model, the script flag is --model
- Model auto-selection based on available VRAM (when no --whisper-model specified):
  - large-v3 (best quality, ~3 min per 10 min video)
  - medium (good quality, ~2 min per 10 min video)
  - small (decent quality, ~1 min per 10 min video)
  - base (basic quality, fastest)
  - medium on CPU (~10-15 min per 10 min video)
- Progress is written to ~/.claude/learned/<slug>/transcribe_progress.json. For long videos (>20 min on CPU), use Bash run_in_background and poll this file periodically.
Unified transcript format (all tiers produce this):
{
"source": "youtube-captions|embedded-subs|yt-dlp-autosub|whisper",
"language": "en",
"segments": [
{"start": 0.0, "end": 3.5, "text": "..."}
]
}
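Because every tier must produce this shape, and Tier 1 can return unusable values such as the string [object Object] or an empty list, a small validity check helps decide when to fall through to the next tier. The sketch below is one possible check, not the plugin's own logic. The function name and the `end >= start` rule are assumptions.

```python
ALLOWED_SOURCES = {"youtube-captions", "embedded-subs", "yt-dlp-autosub", "whisper"}


def is_usable_transcript(obj):
    """Return True only if obj matches the unified format and carries real text."""
    if not isinstance(obj, dict):
        return False  # e.g. the string "[object Object]" or a bare list
    if obj.get("source") not in ALLOWED_SOURCES:
        return False
    segments = obj.get("segments")
    if not isinstance(segments, list) or not segments:
        return False  # missing, wrong type, or 0 segments
    for seg in segments:
        if not isinstance(seg, dict):
            return False
        start, end, text = seg.get("start"), seg.get("end"), seg.get("text")
        if not isinstance(start, (int, float)) or not isinstance(end, (int, float)):
            return False
        if end < start:  # assumption: a segment may not end before it starts
            return False
        if not isinstance(text, str) or not text.strip():
            return False  # empty text counts as a failure
    return True
```

Any result that fails this check would be treated the same as a failed tier, so processing moves on to the next one.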
Phase A — Sample frames for classification:
Run: bash $PLUGIN_ROOT/scripts/extract-video.sh <url-or-path> ~/.claude/learned/<slug>/ --sample-only
This produces 10 evenly-spaced frames in samples/.
Phase B — Classify video type:
If user provided --type, use that. Otherwise, view all 10 sample images and classify each frame as one of:
- terminal: dark background, monospace text, CLI output
- slides: presentation slides, large text, uniform backgrounds
- talking_head: person on camera, webcam-style
- screen_recording: browser, IDE, mixed UI elements
Note transition points (e.g., "frames 1-2 are talking_head, frames 3-10 are terminal"). Determine the dominant type and any transitions.
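Once each sample frame has a label, the dominant type and the transition points follow mechanically. The sketch below assumes the labels are collected as a list of strings in frame order. `summarize_types` is a hypothetical helper name.

```python
from collections import Counter


def summarize_types(labels):
    """Return (dominant_type, runs) for per-frame labels given in frame order.

    runs is a list of (first_frame, last_frame, label) tuples, 1-indexed,
    one tuple per stretch of consecutive frames sharing a label.
    """
    if not labels:
        raise ValueError("need at least one classified frame")
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((start + 1, i, labels[start]))
            start = i
    dominant = Counter(labels).most_common(1)[0][0]
    return dominant, runs


if __name__ == "__main__":
    frames = ["talking_head"] * 2 + ["terminal"] * 8
    print(summarize_types(frames))
```

More than one run in the result signals a mixed-type video.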
Phase C — Select threshold:
Apply sensitivity setting (default: medium) to the threshold matrix:
| Video type | low | medium | high |
|---|---|---|---|
| terminal | 0.15 | 0.08 | 0.04 |
| slides | 0.40 | 0.25 | 0.15 |
| talking_head | 0.50 | 0.35 | 0.25 |
| screen_recording | 0.25 | 0.15 | 0.08 |
For mixed-type videos: use the lowest applicable threshold (catches everything, may produce extra frames — acceptable, dedup handles it).
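The threshold lookup, including the mixed-type rule, can be expressed as a small table and a `min` over the types seen. The values are copied from the matrix above. The function name is illustrative.

```python
THRESHOLDS = {
    "terminal":         {"low": 0.15, "medium": 0.08, "high": 0.04},
    "slides":           {"low": 0.40, "medium": 0.25, "high": 0.15},
    "talking_head":     {"low": 0.50, "medium": 0.35, "high": 0.25},
    "screen_recording": {"low": 0.25, "medium": 0.15, "high": 0.08},
}


def select_threshold(video_types, sensitivity="medium"):
    """Return the lowest threshold across all types seen in the video."""
    if not video_types:
        raise ValueError("at least one video type is required")
    return min(THRESHOLDS[t][sensitivity] for t in set(video_types))


if __name__ == "__main__":
    # A video that starts as talking_head and switches to terminal.
    print(select_threshold(["talking_head", "terminal"]))
```

Taking the minimum means the most sensitive setting wins, which matches the stated trade-off of accepting extra frames and letting deduplication remove them.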
Phase D — Full extraction:
Run: bash $PLUGIN_ROOT/scripts/extract-video.sh <url-or-path> ~/.claude/learned/<slug>/ --threshold <selected>
The script handles scene detection, deduplication (phash, Hamming distance ≤3), and outputs keyframes + keyframes.json.
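To illustrate what a Hamming distance of at most 3 means for deduplication, the sketch below treats each perceptual hash as an integer and drops any frame within 3 bits of a frame already kept. How extract-video.sh actually compares frames (against every kept frame, or only the previous one) is not stated here, so this strategy is an assumption.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def dedup(frames, max_distance=3):
    """Keep a frame only if it differs from every kept frame by more than max_distance bits.

    frames is a list of (name, hash_int) pairs in timestamp order.
    """
    kept = []
    for name, h in frames:
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((name, h))
    return kept


if __name__ == "__main__":
    sample = [("a.png", 0b0000), ("b.png", 0b0111), ("c.png", 0b1111)]
    print([name for name, _ in dedup(sample)])
```

Here `b.png` is exactly 3 bits away from `a.png` and is treated as a duplicate, while `c.png` is 4 bits away and is kept.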
Read keyframes.json and transcript.json. For each keyframe, find the transcript segment with the closest start timestamp. Present the merged view as you analyze:
[00:12] frame_001_00m12s.png + "so first we open settings.json"
[01:45] frame_002_01m45s.png + "and add this hook configuration"
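The nearest-segment pairing can be sketched as below. The layout of keyframes.json is not shown in this document, so the `file` and `timestamp` field names are assumptions. The segment fields follow the unified transcript format.

```python
def format_ts(seconds):
    """Render seconds as [MM:SS]."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def merge_view(keyframes, segments):
    """Pair each keyframe with the segment whose start time is closest."""
    lines = []
    for kf in keyframes:
        if segments:
            nearest = min(segments, key=lambda s: abs(s["start"] - kf["timestamp"]))
            text = nearest["text"]
        else:
            text = ""  # visual-only analysis: no transcript available
        lines.append(f'{format_ts(kf["timestamp"])} {kf["file"]} + "{text}"')
    return lines


if __name__ == "__main__":
    kfs = [{"file": "frame_001_00m12s.png", "timestamp": 12.0}]
    segs = [{"start": 11.2, "end": 14.0, "text": "so first we open settings.json"}]
    print("\n".join(merge_view(kfs, segs)))
```

A linear scan per keyframe is adequate for a few dozen keyframes. For very long transcripts, a binary search over the sorted start times would do the same job faster.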
Read the full transcript text. View ALL keyframe images. With the user's intent in mind:
Verify the currency of mentioned tools/packages (unless --skip-verify).
Then write:
- ~/.claude/learned/<slug>/knowledge.md: structured document with sections for each knowledge area
- ~/.claude/learned/<slug>/metadata.json:
{
"url": "<original-url-or-path>",
"title": "<video-title>",
"channel": "<channel-name>",
"duration_seconds": 600,
"intent": "<user-provided-intent>",
"video_type": "terminal",
"processed_at": "2026-03-19T12:00:00Z",
"transcript_source": "youtube-captions",
"keyframe_count": 25,
"slug": "<slug>"
}
Report what was learned:
Learned and saved to ~/.claude/learned/<slug>/. Here's a summary:
[concise summary of extracted knowledge]
Want me to do anything with this?
- Generate skill(s)
- Add CLAUDE.md rules
- Create agent definition
- Apply config changes
- Nothing, I just needed to understand it
Wait for user response. If they want artifacts:
- Skills: write to ~/.claude/learned/<slug>/artifacts/, using proper YAML frontmatter. Ask "Install to etma-tools?" before copying anywhere.
- CLAUDE.md rules: write to ~/.claude/learned/<slug>/artifacts/claude-md-rules.md. Ask "Append to global or project CLAUDE.md?"
- Agent definition: write to ~/.claude/learned/<slug>/artifacts/. Ask "Install to etma-agents?"
- Config changes: write to ~/.claude/learned/<slug>/artifacts/config-changes.md. Ask "Apply these changes?"
Never auto-install. Always ask before modifying anything outside ~/.claude/learned/.
| Scenario | Action |
|---|---|
| Docker not running | Report error, suggest starting Docker |
| Download fails | Report "Couldn't download from this URL. Check the URL is valid and accessible." |
| No transcript (all tiers fail) | Report "Couldn't extract transcript." Offer visual-only analysis from keyframes (reduced quality). |
| Video requires auth | Report "Video requires authentication. Try providing cookies via yt-dlp." |
| Video >60 min | Warn about processing time, ask to confirm |
| Slug already exists | Report existing content, offer --force |
| Disk space low | Report available space, suggest cleanup |
npx claudepluginhub eriktma/claude-plugins --plugin learn-from-video