From vision-link
Use when the user mentions a video file (.mp4, .mov, .avi, .mkv, .webm), a YouTube URL, asks to watch/analyze/review a video, or references video content in conversation
Slash command: /vision-link:video-perception
You have access to video understanding tools via the vision-link MCP server.
Before using any video tools, the user MUST have completed setup.
Check setup status by reading ~/.0labs-vision/config.json. If backend is "unconfigured" or the file doesn't exist:
🎬 Vision-link needs setup first. Please run:
/vision-link:setup-video-vision
- Quick Setup — One-click with best defaults (recommended)
- Advanced Setup — Configure every option
- Custom Setup — Pick specific settings
Settings can be changed anytime by running setup again.
Only proceed with video tools after setup is verified.
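The setup check above can be sketched as a small helper. This is a minimal illustration, not vision-link's own code: the path comes from the text, but treating `backend` as a top-level JSON key, and treating an unreadable or malformed file as "not set up", are assumptions. The function name `setup_complete` is hypothetical.

```python
import json
from pathlib import Path

# Path taken from the text above.
DEFAULT_CONFIG = Path.home() / ".0labs-vision" / "config.json"


def setup_complete(config_path: Path = DEFAULT_CONFIG) -> bool:
    """Return True only if the config file exists and names a configured backend."""
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON: treat as not set up (assumption).
        return False
    if not isinstance(config, dict):
        return False
    # Assumption: "backend" is a top-level key; a missing key counts as unconfigured.
    return config.get("backend", "unconfigured") != "unconfigured"
```

If the helper returns False, the next step is to show the setup prompt rather than call any video tool.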
- video_analyze: Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.
- video_watch: Extract frames and process audio from a video. Supports variable FPS/resolution per segment.
- video_detail: Drill into specific segments. Separates extraction from viewing: extract many frames, view few at a time.
- video_info: Get video metadata without processing.
- video_configure: Change settings (backend, resolution, enable_index, etc.).
- video_setup: Check/install dependencies.

IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.
Always start with video_info to get duration, resolution, and audio presence.
If the user gives a YouTube URL, pass the URL directly as path.
The MCP server downloads it with yt-dlp, prefers YouTube subtitles/auto-captions
for transcription, and falls back to the configured audio backend only when
captions are missing, empty, or suspiciously incomplete.
REQUIRED for videos > 30s: Call video_analyze BEFORE extracting any frames.
This is NOT optional — it gives you structural data to make smart extraction decisions.
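The ordering rule can be pictured as a tiny planner. This is only a sketch of the decision described above. The tool names and the 30-second threshold come from the text, while the function name `required_first_calls` is hypothetical.

```python
def required_first_calls(duration_seconds: float) -> list[str]:
    """Return the tool names to call, in order, for a video of the given duration."""
    calls = ["video_info"]  # always first: duration, resolution, audio presence
    if duration_seconds > 30:
        # Required before any frame extraction for videos longer than 30 seconds.
        calls.append("video_analyze")
    calls.append("video_watch")
    return calls
```

The duration passed in is the one reported by video_info, so in practice the first entry has already run by the time the rest of the plan is decided.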
Select filters relevant to the user's question:
| User intent | Filters to select |
|---|---|
| "What happens in this video?" | scene_changes, silence, transcription |
| "Find the scene transitions" | scene_changes, black_intervals |
| "Are there frozen/stuck parts?" | freeze, blur |
| "Is this a talking head or action?" | motion |
| "When does the music start?" | silence, loudness |
| "Analyze the lighting" | exposure |
| "Summarize this lecture" | transcription, scene_changes, silence |
| General / unclear intent | scene_changes, silence, transcription |
Always include transcription: true when the video has audio — the transcription
tells you WHERE to look visually.
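The table and the transcription rule can be combined into one lookup. The sketch below makes assumptions: the intent labels (`"lighting"`, `"lecture"`, and so on) are hypothetical shorthand for the table rows, and passing each filter as a boolean flag is inferred from the `transcription: true` wording rather than from a documented schema.

```python
# Fallback row of the table: general or unclear intent.
DEFAULT_FILTERS = ("scene_changes", "silence", "transcription")

# Hypothetical intent labels mapped to the filter names listed in the table.
FILTERS_BY_INTENT = {
    "overview": ("scene_changes", "silence", "transcription"),
    "transitions": ("scene_changes", "black_intervals"),
    "frozen": ("freeze", "blur"),
    "motion": ("motion",),
    "music": ("silence", "loudness"),
    "lighting": ("exposure",),
    "lecture": ("transcription", "scene_changes", "silence"),
}


def analyze_filters(intent: str, has_audio: bool) -> dict[str, bool]:
    """Pick filters for video_analyze; add transcription whenever audio is present."""
    names = list(FILTERS_BY_INTENT.get(intent, DEFAULT_FILTERS))
    if has_audio and "transcription" not in names:
        names.append("transcription")
    # Assumed argument shape: one boolean flag per selected filter.
    return {name: True for name in names}
```

For example, a lighting question about a video with audio would select `exposure` plus `transcription`, so the transcript can still point to where to look.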
Use the analysis results and transcription to plan your frame extraction strategy:
Call video_watch to extract frames:
- Short videos: use fps: "auto" without view_sample. Short videos need full coverage to avoid missing brief moments, and the auto FPS already adapts to duration.
- Longer videos: use segments based on analysis data with variable FPS, and view_sample to limit the initial frame count. You can always drill deeper with video_detail.

Use video_detail to drill into specific moments:
- view_sample: 3 to preview (first, middle, last frame)
- view if you need more detail

When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution. Do not re-request frames you already have in context.
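To see why a view_sample of 3 gives the first, middle, and last frame, it helps to picture evenly spaced index selection. The function below is an illustration of that idea only. It is not the server's actual selection code, and `sample_indices` is a hypothetical name.

```python
def sample_indices(frame_count: int, n: int) -> list[int]:
    """Return n evenly spaced frame indices, always including the first and last frame."""
    if frame_count <= 0 or n <= 0:
        return []
    if n >= frame_count:
        return list(range(frame_count))  # fewer frames than requested: return all
    if n == 1:
        return [0]
    step = (frame_count - 1) / (n - 1)
    return [round(i * step) for i in range(n)]


print(sample_indices(9, 3))  # [0, 4, 8]: first, middle, last
```

Previewing a few frames this way keeps the context small, and the remaining extracted frames stay available for a later, more detailed look.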
fps: "auto" for general overview. Use the video's original fps (from video_info) for frame-by-frame detail. Use 5-10 for analyzing specific short moments. Use 0.1-0.5 for long videos.
resolution: 256-512 for quick scans. 512-768 for normal analysis. 1024+ when reading on-screen text or fine details.
segments: Use when you have analysis data. Each segment can have its own fps and resolution. Overrides global fps/start_time/end_time.
view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.
skip_audio: Set to true when you only need visual analysis.
YouTube URLs: Pass supported YouTube URLs directly as path. Treat
transcription_source: "youtube_subtitles" as stronger than
youtube_auto_captions; auto-captions can still have recognition errors.
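The parameter guidance above might translate into a video_watch argument payload along these lines. Treat this as a sketch under assumptions: the parameter names `path`, `fps`, `resolution`, `segments`, `view_sample`, and `skip_audio` appear in the text, but the per-segment keys `start_time` and `end_time`, the task labels, the specific resolution values picked from the stated ranges, and the file name `lecture.mp4` are all illustrative.

```python
def resolution_for(task: str) -> int:
    """Pick a resolution inside the ranges given above (the exact values are arbitrary)."""
    choices = {"scan": 384, "normal": 640, "text": 1024}
    return choices.get(task, 640)


def watch_args(path, segments=None, view_sample=None, skip_audio=False) -> dict:
    """Build a hypothetical argument dict for a video_watch call."""
    args = {"path": path, "skip_audio": skip_audio}
    if segments:
        # Segments override the global fps/start_time/end_time, so no global fps is set.
        args["segments"] = segments
    else:
        args["fps"] = "auto"
    if view_sample is not None:
        args["view_sample"] = view_sample
    return args


# A long, mostly static stretch at low FPS, then a short moment with on-screen text.
segments = [
    {"start_time": 0, "end_time": 60, "fps": 0.5, "resolution": resolution_for("scan")},
    {"start_time": 60, "end_time": 70, "fps": 5, "resolution": resolution_for("text")},
]
args = watch_args("lecture.mp4", segments=segments, view_sample=3)
```

Keeping the segment plan as plain data makes it easy to derive directly from video_analyze results and to adjust before any frames are extracted.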
You receive:
Combine all sources to form a complete understanding. Use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.
npx claudepluginhub 0labs-in/vision-link --plugin vision-link