claude-video-vision

Enable Claude to analyze local video files or YouTube URLs with full multimodal perception: extract frames and audio via FFmpeg, detect scenes, motion, silence, and transitions, transcribe speech using Whisper or Gemini, generate detailed visual descriptions and summaries, and answer questions interactively.

Claude Code Video Vision

Give Claude the ability to watch and understand videos.

A Claude Code plugin that extracts frames via ffmpeg and processes audio via multiple backends (Gemini API, local Whisper, or OpenAI API). Claude receives frames as images and audio transcription with timestamps — the plugin is a perception layer, not an interpretation layer.

Features

Multimodal perception — Claude sees video frames directly and reads audio transcriptions with timestamps
YouTube URL support — Pass a YouTube URL directly; the MCP server downloads it with yt-dlp, preserving source metadata and captions for context
Flexible backends — Choose between cloud APIs or fully local processing
Adaptive extraction — Claude adjusts fps, time range, and resolution based on your question
Auto-installation — Whisper models download automatically on first use
Interactive setup wizard — /setup-video-vision walks you through configuration

Quick Start

1. Install the plugin

Inside Claude Code, run these commands one at a time:

/plugin marketplace add https://github.com/jordanrendric/claude-video-vision

Then:

/plugin install claude-video-vision

The MCP server will auto-install via npx from npm on first use — no build step required.

Alternative: local development

git clone https://github.com/jordanrendric/claude-video-vision.git
claude --plugin-dir /path/to/claude-video-vision

2. Configure

Inside Claude Code, run the interactive wizard:

/setup-video-vision

The wizard walks you through backend selection, Whisper configuration (if you choose the local backend), frame options, and dependency verification.

Usage

Slash command

/watch-video path/to/video.mp4
/watch-video tutorial.mp4 "what language is used in this tutorial?"
/watch-video https://www.youtube.com/watch?v=... "summarize this video"
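
The command accepts either a local path or a YouTube URL. As an illustration only, the sketch below shows one way the two kinds of input could be told apart. It is not the plugin's actual detection logic: the function name `is_youtube_url` and the host list are assumptions made for this example.

```python
from urllib.parse import urlparse

# Hypothetical host list for this example; the plugin may recognise others.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_url(source: str) -> bool:
    """Return True when `source` looks like an http(s) YouTube link."""
    parsed = urlparse(source)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and host in YOUTUBE_HOSTS


if __name__ == "__main__":
    for candidate in (
        "path/to/video.mp4",
        "https://www.youtube.com/watch?v=abc123",  # placeholder video id
    ):
        kind = "YouTube URL" if is_youtube_url(candidate) else "local file"
        print(f"{candidate} -> {kind}")
```

Anything that is not an http(s) link to a listed host is treated as a local path, so relative paths and `~/...` paths fall through to the file branch.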

Conversational

Just mention a video file or YouTube URL — Claude will detect it:

"analyze this video for me: ~/Downloads/demo.mp4"

"take a look at the first second of ~/videos/bug-report.mov"

"summarize this YouTube Short: https://www.youtube.com/shorts/..."

Claude adapts parameters automatically:

"the first second" → extracts at original fps from 00:00:00 to 00:00:01
"summarize this 1h lecture" → low fps, full duration
"what text is on screen at 1:30?" → high resolution, narrow time window

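To make the mapping concrete, here is a minimal sketch of how such parameters could translate into an ffmpeg invocation. The helper `build_ffmpeg_args` and its parameter names are hypothetical, and the plugin's real command line may differ. Only standard ffmpeg options are used: `-ss`, `-to`, `-i`, and `-vf` with the `fps` and `scale` filters.

```python
from typing import List, Optional


def build_ffmpeg_args(
    video: str,
    out_pattern: str,
    start: Optional[str] = None,   # e.g. "00:00:00"
    end: Optional[str] = None,     # e.g. "00:00:01"
    fps: Optional[float] = None,   # None keeps the source frame rate
    width: Optional[int] = None,   # None keeps the source resolution
) -> List[str]:
    """Build an ffmpeg argument list that writes frames as image files."""
    args = ["ffmpeg", "-hide_banner"]
    # Placed before -i, -ss and -to act as input options that bound the range.
    if start is not None:
        args += ["-ss", start]
    if end is not None:
        args += ["-to", end]
    args += ["-i", video]

    filters = []
    if fps is not None:
        filters.append(f"fps={fps}")
    if width is not None:
        # -2 lets ffmpeg pick an even height that preserves the aspect ratio.
        filters.append(f"scale={width}:-2")
    if filters:
        args += ["-vf", ",".join(filters)]

    args.append(out_pattern)
    return args


if __name__ == "__main__":
    # "the first second": original fps, one-second window
    print(build_ffmpeg_args("demo.mp4", "frames/%05d.jpg",
                            start="00:00:00", end="00:00:01"))
    # "summarize this 1h lecture": one frame every five seconds, small frames
    print(build_ffmpeg_args("lecture.mp4", "frames/%05d.jpg", fps=0.2, width=640))
```

The specific values (0.2 fps, 640 px) are placeholders chosen for the example, not the plugin's defaults.
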
Backends

| Backend | Audio processing | Cost | Setup |
|---|---|---|---|
| Gemini API | Native (speech + non-speech events) | Free tier: 1500 req/day | `GEMINI_API_KEY` env var |
| Local (Whisper) | `whisper.cpp` or Python `openai-whisper` | Free, fully offline | `brew install whisper-cpp` + auto model download |
| OpenAI API | OpenAI Whisper API | Paid per usage | `OPENAI_API_KEY` env var |

All backends extract video frames via ffmpeg — Claude always has direct visual access.
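
The setup wizard handles backend selection, but it can help to see which backends the current environment could support. The sketch below is an assumption-laden illustration: the function name `available_backends` and the backend labels are invented for this example, and only the two environment variable names come from the table above.

```python
import os
from typing import List, Mapping


def available_backends(env: Mapping[str, str]) -> List[str]:
    """List the audio backends whose credentials are present in `env`."""
    backends = []
    if env.get("GEMINI_API_KEY"):
        backends.append("gemini")
    if env.get("OPENAI_API_KEY"):
        backends.append("openai")
    # The local Whisper backend needs no API key, only a local install.
    backends.append("local")
    return backends


if __name__ == "__main__":
    print(available_backends(os.environ))
```

An empty or unset key is treated as missing, so the local backend is always the fallback in this sketch.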

Architecture

┌───────────────────────────────────────────────────────┐
│ Claude Code (your session)                            │
│                                                       │
│  /watch-video  ──→  Skill: video-perception          │
│                        │                              │
│                        ▼                              │
│                  MCP tool: video_watch                │
│                        │                              │
└────────────────────────┼──────────────────────────────┘
                         │
                         ▼
      ┌────────────────────────────────────┐
      │ MCP Server (Node.js)               │
      │                                    │
      │  ┌──────────┐    ┌──────────────┐  │
      │  │ ffmpeg   │    │ Audio backend│  │
      │  │ frames   │ ║  │ (parallel)   │  │
      │  └──────────┘    └──────────────┘  │
      │       │                 │          │
      └───────┼─────────────────┼──────────┘
              ▼                 ▼
        base64 images     transcription
        + timestamps      + audio events
              │                 │
              └────────┬────────┘
                       ▼
              Claude receives both
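
The diagram shows frame extraction and audio processing running side by side before their results are combined. The real server is written in Node.js; the Python sketch below only illustrates that fan-out and merge pattern. The functions `extract_frames`, `transcribe_audio`, and `perceive`, and the shape of the returned data, are hypothetical stand-ins rather than the plugin's API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def extract_frames(video: str) -> List[Dict[str, Any]]:
    """Stand-in for the ffmpeg step; returns placeholder frame records."""
    return [{"timestamp": 0.0, "image_base64": ""}]


def transcribe_audio(video: str) -> List[Dict[str, Any]]:
    """Stand-in for the audio backend; returns placeholder segments."""
    return [{"start": 0.0, "end": 1.0, "text": "(placeholder)"}]


def perceive(video: str) -> Dict[str, List[Dict[str, Any]]]:
    """Run both steps concurrently and return their combined output."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        frames_future = pool.submit(extract_frames, video)
        audio_future = pool.submit(transcribe_audio, video)
        # result() blocks until each step finishes and re-raises its errors.
        return {
            "frames": frames_future.result(),
            "transcript": audio_future.result(),
        }


if __name__ == "__main__":
    result = perceive("demo.mp4")
    print(len(result["frames"]), "frame(s),", len(result["transcript"]), "segment(s)")
```

Because the two steps do not depend on each other, the total wait is roughly that of the slower step rather than the sum of both.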

Requirements

Node.js 20+ (for the MCP server)
ffmpeg (auto-detected, install instructions provided by setup wizard)
yt-dlp (optional, required only for YouTube URLs; brew install yt-dlp on macOS)
Backend-specific:
- Gemini API: free API key from ai.google.dev
- Local: brew install whisper-cpp (macOS) or equivalent
- OpenAI: API key from OpenAI
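
The setup wizard verifies dependencies for you. If you want to check by hand first, a small script along these lines can report what is on your `PATH`. This is a sketch, not the wizard's implementation: the helper names are made up for this example, and it assumes the executables are called `node`, `ffmpeg`, and `yt-dlp`.

```python
import re
import shutil
from typing import Callable, Dict, List, Optional

REQUIRED = ("node", "ffmpeg")
OPTIONAL = ("yt-dlp",)  # only needed for YouTube URLs


def find_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if not found."""
    return {name: which(name) for name in REQUIRED + OPTIONAL}


def missing_required(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the required tools that could not be located."""
    return [name for name in REQUIRED if found.get(name) is None]


def node_major(version_output: str) -> Optional[int]:
    """Parse the major version from `node --version` output such as 'v20.11.0'."""
    match = re.match(r"v(\d+)\.", version_output.strip())
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    found = find_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'not found'}")
    if missing_required(found):
        print("Missing required tools:", ", ".join(missing_required(found)))
```

To check the Node.js requirement, pass the output of `node --version` to `node_major` and compare the result against 20.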

MCP Tools

The plugin exposes 6 MCP tools:
