Skill

video-perception

Use when the user mentions a video file (.mp4, .mov, .avi, .mkv, .webm), a YouTube URL, asks to watch/analyze/review a video, or references video content in conversation

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/vision-link:video-perception

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You have access to video understanding tools via the vision-link MCP server.

SKILL.md

116 lines · ~1.5k tokens

Stats

LanguageTypeScript

Stars2

MaintenanceExcellent

Last CommitMay 18, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Video Perception

You have access to video understanding tools via the vision-link MCP server.

CRITICAL: Setup Required First

Before using any video tools, the user MUST have completed setup.

Check setup status by reading ~/.0labs-vision/config.json. If backend is "unconfigured" or the file doesn't exist:

🎬 Vision-link needs setup first. Please run:

/vision-link:setup-video-vision

Quick Setup — One-click with best defaults (recommended)

Advanced Setup — Configure every option

Custom Setup — Pick specific settings

Settings can be changed anytime by running setup again.

Only proceed with video tools after setup is verified.

Available Tools

video_analyze — Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.
video_watch — Extract frames + process audio from a video. Supports variable FPS/resolution per segment.
video_detail — Drill into specific segments. Separates extraction from viewing — extract many frames, view few at a time.
video_info — Get video metadata without processing.
video_configure — Change settings (backend, resolution, enable_index, etc.).
video_setup — Check/install dependencies.

Workflow

IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.

Always start with video_info to get duration, resolution, and audio presence. If the user gives a YouTube URL, pass the URL directly as path. The MCP server downloads it with yt-dlp, prefers YouTube subtitles/auto-captions for transcription, and falls back to the configured audio backend only when captions are missing, empty, or suspiciously incomplete.

REQUIRED for videos > 30s: Call video_analyze BEFORE extracting any frames. This is NOT optional — it gives you structural data to make smart extraction decisions. Select filters relevant to the user's question:

User intent	Filters to select
"What happens in this video?"	scene_changes, silence, transcription
"Find the scene transitions"	scene_changes, black_intervals
"Are there frozen/stuck parts?"	freeze, blur
"Is this a talking head or action?"	motion
"When does the music start?"	silence, loudness
"Analyze the lighting"	exposure
"Summarize this lecture"	transcription, scene_changes, silence
General / unclear intent	scene_changes, silence, transcription

Always include transcription: true when the video has audio — the transcription tells you WHERE to look visually.

Use the analysis results and transcription to plan your frame extraction strategy:
- Low FPS (0.1-0.5) for static or predictable segments
- Higher FPS (1-3) only around scene changes, motion peaks, or moments referenced in speech ("look at this", "as you can see", "let me show you")
- Never exceed the minimum FPS needed for the task
- Prefer fewer segments at lower FPS — you can always drill deeper
Call video_watch to extract frames:
- For short videos (< 2 minutes): Use fps: "auto" without view_sample — short videos need full coverage to avoid missing brief moments. The auto FPS already adapts to duration.
- For long videos (> 2 minutes): Use segments based on analysis data with variable FPS, and view_sample to limit initial frame count. You can always drill deeper with video_detail.
Use video_detail to drill into specific moments:
- Start with 3-5 second windows around points of interest
- Use view_sample: 3 to preview (first, middle, last frame)
- Then request specific timestamps with view if you need more detail
- Expand the window only if the initial view is insufficient
- Treat frame viewing like a binary search — narrow down to what matters
- Never view all extracted frames at once
When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution. Do not re-request frames you already have in context.

Parameter Guide

fps: "auto" for general overview. Use the video's original fps (from video_info) for frame-by-frame detail. Use 5-10 for analyzing specific short moments. Use 0.1-0.5 for long videos.

resolution: 256-512 for quick scans. 512-768 for normal analysis. 1024+ when reading on-screen text or fine details.

segments: Use when you have analysis data. Each segment can have its own fps and resolution. Overrides global fps/start_time/end_time.

view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.

skip_audio: Set to true when you only need visual analysis.

YouTube URLs: Pass supported YouTube URLs directly as path. Treat transcription_source: "youtube_subtitles" as stronger than youtube_auto_captions; auto-captions can still have recognition errors.

Working with Results

You receive:

Manifest (when enable_index is on) — index of all cached frames by resolution and timestamp. Use this to avoid redundant requests.
Frames as images — look at them to understand what's happening visually
Audio transcription with timestamps — read the speech content
Audio tags — non-speech events (music, sounds, etc.)
Analysis data — scene changes, silence intervals, motion levels, etc.

Combine all sources to form a complete understanding. Use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.

video-perception

Popularity

Invocation

Context Preview

SKILL.md

video-perception

Popularity

Invocation

Context Preview

SKILL.md

Video Perception

CRITICAL: Setup Required First

Available Tools

Workflow

Parameter Guide

Working with Results

Similar Skills

Video Perception

CRITICAL: Setup Required First

Available Tools

Workflow

Parameter Guide

Working with Results

Similar Skills