vision-mcp

Automates desktop GUI operations by combining vision-based screenshot analysis with accessibility-tree recognition. Agents can click, type, and navigate real macOS and Windows applications as a human would, with caching for repeated tasks and optional safety-approval workflows.

Vision-MCP

Vision-MCP is a desktop software interaction framework for agents. It combines an MCP server, a reusable agent skill, and native desktop helpers so agents can operate GUI applications with lower token cost, faster execution, and less repeated visual exploration.

The framework uses a hybrid AX/UIA + OCR + vision-model architecture. It lets agents inspect software through the cheapest reliable signal first, then turn successful interaction paths into reusable vision-mcp.yaml maps made of actions and workflows.

Why It Exists

Agents can use screenshots and vision models to operate desktop software, but pure visual exploration is expensive and slow. Vision-MCP gives an agent a structured workflow:

1. Explore a GUI task once with accessibility trees, OCR, screenshots, and visual fallback.
2. Record stable states, locators, actions, postconditions, and workflows in a vision-mcp.yaml map.
3. Reuse those actions and workflows on later runs.
4. Patch the map when the UI shifts instead of rediscovering the whole task.

For repeated desktop workflows, this turns software use from one-off visual search into an increasingly reusable instruction layer.
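
The explore, record, reuse, and patch cycle above can be sketched in a few lines of Python. This is only an illustration of the idea, not Vision-MCP's implementation: `GuiMap`, `run_task`, and `patch_step` are hypothetical names, and the in-memory dictionary stands in for a vision-mcp.yaml file.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class GuiMap:
    """Hypothetical in-memory stand-in for a vision-mcp.yaml map."""
    workflows: Dict[str, List[str]] = field(default_factory=dict)


def run_task(
    task: str,
    gui_map: GuiMap,
    explore: Callable[[str], List[str]],
) -> Tuple[List[str], bool]:
    """Return (steps, reused); explore only when the map lacks the task."""
    if task in gui_map.workflows:
        return gui_map.workflows[task], True
    steps = explore(task)            # expensive path: AX/UIA, OCR, screenshots
    gui_map.workflows[task] = steps  # record the result for later runs
    return steps, False


def patch_step(gui_map: GuiMap, task: str, index: int, new_step: str) -> None:
    """Replace one recorded step after a UI change, keeping the rest."""
    gui_map.workflows[task][index] = new_step
```

The second call for the same task returns the recorded steps without invoking `explore`, which is where the saving on repeated workflows would come from.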

How It Works

Reusable GUI Maps

On the first run, the agent explores the application state, clickable controls, state transitions, and expected results. Vision-MCP stores that knowledge in vision-mcp.yaml as:

- reusable actions, such as clicking a specific control or entering text
- higher-level workflows, composed from multiple actions
- state anchors and postconditions used to verify progress
- patch overlays that keep runtime fixes separate from trusted baseline maps

On later runs, the agent discovers available actions through MCP tools, reuses existing workflows when possible, and only falls back to exploration when the map does not yet cover the requested task.
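
One way to picture the separation between a trusted baseline and a patch overlay is a lookup in which overlay entries take precedence while the baseline is left untouched. The sketch below assumes a hypothetical `Action` record and locator strings; the actual vision-mcp.yaml schema is not shown in this README and may differ.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical action record, not the real vision-mcp.yaml schema."""
    locator: str        # how to find the control
    kind: str           # for example "click" or "type"
    postcondition: str  # state anchor expected after the action


def resolve_action(
    name: str,
    baseline: Dict[str, Action],
    overlay: Dict[str, Action],
) -> Optional[Action]:
    """Prefer the overlay entry; fall back to the baseline; never mutate either."""
    return overlay.get(name, baseline.get(name))


# Illustrative data: the button was renamed, so a runtime patch overrides the locator.
baseline = {
    "open_settings": Action("button[title=Settings]", "click", "window[title=Settings]"),
}
overlay = {
    "open_settings": Action("button[title=Preferences]", "click", "window[title=Settings]"),
}
```

Because the overlay is a separate mapping, discarding it restores the baseline behaviour without any further bookkeeping.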

Hybrid Exploration

Vision-MCP gives the agent multiple ways to understand a GUI:

- native accessibility trees through macOS AX or Windows UIA
- OCR for text regions and verification
- screenshots and visual-model fallback for non-native or visually dense apps
- window capsules for display, geometry, foregrounding, and live view support

Native structure is preferred when it is reliable. OCR and vision are used as fallbacks or verification layers.
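
The "cheapest reliable signal first" ordering can be illustrated as a simple fallback chain. The strategy functions below are stubs with made-up return values, and `locate` is a hypothetical helper rather than part of Vision-MCP; the point is only the ordering and the early exit.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]
Locator = Callable[[str], Optional[Point]]


def locate(
    label: str,
    strategies: List[Tuple[str, Locator]],
) -> Optional[Tuple[str, Point]]:
    """Try each strategy in order; return the first hit and which strategy found it."""
    for name, strategy in strategies:
        point = strategy(label)
        if point is not None:
            return name, point
    return None


# Stub strategies standing in for native helpers (all hypothetical).
def from_accessibility(label: str) -> Optional[Point]:
    return None  # pretend the app exposes no usable AX/UIA node


def from_ocr(label: str) -> Optional[Point]:
    return (120, 48) if label == "Save" else None


def from_vision(label: str) -> Optional[Point]:
    return (0, 0)  # placeholder for a vision-model answer


STRATEGIES = [
    ("accessibility", from_accessibility),
    ("ocr", from_ocr),
    ("vision", from_vision),
]
```

Returning the strategy name alongside the coordinates makes it possible to record which signal proved reliable for a given control.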

Quick Start

Claude Code

Inside Claude Code:

/plugin marketplace add Haruhiyuki/vision-mcp
/plugin install vision-mcp@vision-mcp

The plugin installs the skill, MCP server configuration, examples, and helper bootstrap path.

Other MCP Hosts

For Codex, Cursor, Cline, OpenClaw, Hermes Agent, or any stdio MCP host, add a server like this:

{
  "mcpServers": {
    "vision-mcp": {
      "command": "npx",
      "args": [
        "-y",
        "@vision-mcp/cli@latest",
        "serve",
        "--apps-root",
        "${HOME}/.vision-mcp/apps"
      ]
    }
  }
}

Then run:

npx -y @vision-mcp/cli@latest doctor
npx -y @vision-mcp/cli@latest init-apps
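
If a host does not expand `${HOME}` inside its JSON configuration (this varies by host and is an assumption worth checking), the same server entry can be generated with an absolute path. The sketch below only builds and prints the JSON shown above; `server_config` is a hypothetical helper, and where the output should be saved depends on the host.

```python
import json
from pathlib import Path


def server_config(apps_root: Path) -> dict:
    """Build the MCP server entry from this README with a resolved apps root."""
    return {
        "mcpServers": {
            "vision-mcp": {
                "command": "npx",
                "args": [
                    "-y",
                    "@vision-mcp/cli@latest",
                    "serve",
                    "--apps-root",
                    str(apps_root),
                ],
            }
        }
    }


if __name__ == "__main__":
    # Path.home() resolves the user's home directory on macOS and Windows.
    print(json.dumps(server_config(Path.home() / ".vision-mcp" / "apps"), indent=2))
```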

For host-specific configuration paths, macOS and Windows permissions, upgrade steps, and troubleshooting, see the Chinese install guide: INSTALL.md.

Core Capabilities

Platform Support

Capability	macOS	Windows
Native helper	Swift + ScreenCaptureKit + AX + Vision + IOKit	PowerShell 5.1 + Win32 + UIA + System.Drawing + WinRT
Modern screenshots	`SCScreenshotManager` on macOS 14+	`PrintWindow PW_RENDERFULLCONTENT` on Windows 8.1+
Accessibility tree	AXUIElement + osascript fallback	UIA TreeWalker + MSAA fallback
OCR	Vision framework	Windows.Media.Ocr
Input	NSPasteboard paste + CGEvent	SendInput VK_PACKET with modifier support
Foregrounding	`NSWorkspace.activate`	`SwitchToThisWindow`, AttachThreadInput, and fallbacks
Health checks	`health.snapshot`	`health.snapshot` with GDI/USER resource checks
Self-check	`vision-mcp doctor`	`vision-mcp doctor`

Platform notes:

MCP Tool Surface

Category	Tools
Discovery	`list_apps`, `list_workflows`, `describe`, `describe_workflow`, `describe_action`, `list_actions`
Execution	`run_workflow`, `perform_action`
Low-level actions	`click_at`, `type_text`, `press_key`, `scroll`
Exploration and vision	`snapshot`, `annotated`, OCR text click helpers
AX/UIA	`ax-press` for macOS AXPress and Windows UIA InvokePattern
Continuous correction	`vision-mcp patch`, `patches`
Window management	`displays`, `capsule`, `restore`, `live-view`
Diagnostics	`doctor [--watch sec]`
Repair	`repair_minimal --max-level 3`
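
For orientation, MCP hosts invoke tools such as `run_workflow` through JSON-RPC 2.0 `tools/call` requests. The sketch below builds such a request as a single JSON line. The envelope follows the MCP specification, but the argument names (`app`, `workflow`) are assumptions, since this README does not document the tool schemas; a real session also requires the `initialize` handshake first, which is omitted here.

```python
import json
from itertools import count

_ids = count(1)  # simple monotonically increasing request ids


def tool_call(name: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


if __name__ == "__main__":
    # Hypothetical arguments; check the tool's schema via the host's tool listing.
    print(tool_call("run_workflow", {"app": "example-app", "workflow": "export_pdf"}))
```

In practice the host's MCP client builds these messages, so the sketch is mainly useful for reading logs or debugging a stdio session.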

`vision-mcp.yaml` Map Model

Vision-MCP maps are designed to make GUI knowledge durable:
