VLM Run Skills


VLM Run Skills are definitions for visual AI tasks like image understanding, video processing, and document extraction. They are interoperable with Anthropic's Claude Code.

The Skills in this repository follow the standardized Agent Skill format.

How do Skills work?

In practice, skills are self-contained folders that package instructions, scripts, and resources for an AI agent to apply to a specific use case. Each folder includes a SKILL.md file with YAML frontmatter (name and description), followed by the guidance your coding agent follows while the skill is active.
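To make the SKILL.md layout concrete, here is a minimal sketch that splits such a file into its frontmatter fields and its body. It assumes the frontmatter is a flat block of `key: value` lines between two `---` delimiters; the helper name `split_frontmatter` and the example skill content are hypothetical and are not taken from this repository.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a SKILL.md-style document into (frontmatter fields, body).

    Handles only flat `key: value` lines, which is enough for `name` and
    `description`; nested YAML would need a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block present

    fields: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter reached
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return fields, body
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("\"'")
    # Opening delimiter without a closing one: treat as no frontmatter
    return {}, text


# Hypothetical SKILL.md content, for illustration only
EXAMPLE_SKILL_MD = """---
name: example-skill
description: Illustrative placeholder, not a real skill in this repository.
---
Instructions the agent follows while the skill is active.
"""

meta, body = split_frontmatter(EXAMPLE_SKILL_MD)
print(meta["name"])
print(body)
```

The split mirrors how the file is described above: the frontmatter identifies the skill, and everything after the closing delimiter is the guidance itself.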

Features

Image Intelligence

Understanding & Captioning: Describe, analyze, and interpret images with state-of-the-art visual intelligence
Detection & Localization: Detect and locate objects, people, faces, and custom entities with bounding boxes
Segmentation: Segment objects, scenes, and regions with pixel-level precision
Generation & Editing: Generate images from text, edit existing images, apply super-resolution, colorize B&W photos
Tools: Crop, rotate, enhance resolution (4x-8x upscaling), de-oldify (colorization)
Visual Grounding: Point to and extract specific elements using natural language queries
UI Parsing: Extract UI elements, layouts, and hierarchies from screenshots

Video Intelligence

Understanding & Captioning: Describe video content, generate summaries and detailed scene analysis
Transcription: Extract audio transcripts with timestamps
Tools: Trim videos, extract keyframes, sample frames at intervals, detect highlights
Segmentation: Identify and segment objects across video frames
Generation & Editing: Generate videos from text prompts, edit existing videos

Document Intelligence

Layout Understanding: Detect headers, paragraphs, tables, figures, lists, and structural elements
Multi-Page Analysis: Process and analyze PDFs with intelligent page-aware extraction
Markdown Extraction: Convert documents to clean, structured markdown with preserved formatting
Visual Grounding: Locate and extract specific fields, sections, or data points
Data Extraction: Extract key information from invoices, receipts, contracts, forms into structured JSON

Multi-modal Agents

Multi-Modal Reasoning: Execute complex multi-step workflows across images, documents, and videos
Structured Outputs: Get results in validated JSON schemas with automatic retry logic
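The idea of validated JSON with retries can be sketched generically. The code below is not VLM Run's implementation or API; it only illustrates the pattern, assuming a caller-supplied `produce` function that returns a JSON string and a simple field-to-type mapping in place of a full JSON Schema. All names here (`validate`, `call_with_retry`, the invoice fields) are hypothetical.

```python
import json
from typing import Any, Callable, Dict


def validate(payload: Dict[str, Any], required: Dict[str, type]) -> None:
    """Raise ValueError if a required field is missing or has the wrong type."""
    for field, expected in required.items():
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        if not isinstance(payload[field], expected):
            raise ValueError(f"wrong type for field: {field}")


def call_with_retry(
    produce: Callable[[], str],
    required: Dict[str, type],
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Call `produce` until its output parses as JSON and passes validation."""
    last_error: Exception = ValueError("no attempts made")
    for _ in range(max_attempts):
        try:
            payload = json.loads(produce())
            if not isinstance(payload, dict):
                raise ValueError("top-level JSON value is not an object")
            validate(payload, required)
            return payload
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Stand-in for a model call: fails twice, then returns valid output
responses = iter([
    "not json",
    '{"invoice_id": "A-1"}',
    '{"invoice_id": "A-1", "total": 12.5}',
])
result = call_with_retry(lambda: next(responses), {"invoice_id": str, "total": float})
print(result)
```

Bounding the number of attempts keeps a persistently malformed response from looping forever, and chaining the last validation error makes the final failure easier to diagnose.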

See the docs and the technical whitepaper for more information.

Installation

Prerequisites

Get your VLM Run API key from app.vlm.run
Install uv for Python environment management
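If you want to confirm the tooling prerequisite before continuing, a small check like the following may help. It is a sketch using only the Python standard library; the helper name `find_tools` is hypothetical, and it only tests whether an executable named `uv` is on your PATH, not which version is installed.

```python
import shutil
from typing import Dict, Iterable


def find_tools(names: Iterable[str]) -> Dict[str, bool]:
    """Map each executable name to whether it can be found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


status = find_tools(["uv"])
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'not on PATH'}")
```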

Claude Code

First, add this repository as a plugin marketplace:

/plugin marketplace add vlm-run/skills

To install a skill, run:

/plugin install <skill-name>@vlm-run/skills

For example:

/plugin install vlmrun-cli-skill@vlm-run/skills

Configure your API key

Once the skill is installed, configure your API key using the CLI (get your key from app.vlm.run):

vlmrun config init
vlmrun config set --api-key <your-api-key>
vlmrun config show

Verify Installation

After installing, restart Claude Code, then verify the skill is loaded by asking:

What skills are available in the /vlmrun-cli-skill?
