Vision-MCP
Vision-MCP is a desktop software interaction framework for agents. It combines
an MCP server, a reusable agent skill, and native desktop helpers so agents can
operate GUI applications with lower token cost, faster execution, and less
repeated visual exploration.
The framework uses a hybrid AX/UIA + OCR + vision-model architecture. It lets
agents inspect software through the cheapest reliable signal first, then turn
successful interaction paths into reusable vision-mcp.yaml maps made of
actions and workflows.
Why It Exists
Agents can use screenshots and vision models to operate desktop software, but
pure visual exploration is expensive and slow. Vision-MCP gives an agent a
structured workflow:
- Explore a GUI task once with accessibility trees, OCR, screenshots, and
visual fallback.
- Record stable states, locators, actions, postconditions, and workflows in a
vision-mcp.yaml map.
- Reuse those actions and workflows on later runs.
- Patch the map when the UI shifts instead of rediscovering the whole task.
For repeated desktop workflows, this replaces one-off visual search with an
instruction layer that becomes more reusable with each run.
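The explore-once, reuse-later loop described above can be sketched in a few lines. This is only an illustration, not Vision-MCP's implementation: `GuiMap`, `run_task`, and the `fake_explore` callback are hypothetical names, and the real map lives in vision-mcp.yaml rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class GuiMap:
    """Hypothetical in-memory stand-in for a vision-mcp.yaml map."""
    workflows: Dict[str, List[str]] = field(default_factory=dict)


def run_task(gui_map: GuiMap, task: str,
             explore: Callable[[str], List[str]]) -> Tuple[str, List[str]]:
    """Return (source, steps): reuse a recorded workflow or explore once."""
    if task in gui_map.workflows:
        # Cheap path: the map already covers this task.
        return "map", gui_map.workflows[task]
    # Expensive path: explore the GUI, then record the result for later runs.
    steps = explore(task)
    gui_map.workflows[task] = steps
    return "explored", steps


def fake_explore(task: str) -> List[str]:
    # Placeholder for accessibility-tree, OCR, and vision exploration.
    return [f"open_menu:{task}", f"confirm:{task}"]


gui_map = GuiMap()
first = run_task(gui_map, "export_pdf", fake_explore)
second = run_task(gui_map, "export_pdf", fake_explore)
print(first[0], second[0])
```

The first call pays for exploration and records the steps; the second call is served from the map, which is the cost saving the list above describes.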
How It Works
Reusable GUI Maps
On the first run, the agent explores the application state, clickable controls,
state transitions, and expected results. Vision-MCP stores that knowledge in
vision-mcp.yaml as:
- reusable actions, such as clicking a specific control or entering text
- higher-level workflows, composed from multiple actions
- state anchors and postconditions used to verify progress
- patch overlays that keep runtime fixes separate from trusted baseline maps
On later runs, the agent discovers available actions through MCP tools, reuses
existing workflows when possible, and only falls back to exploration when the
map does not yet cover the requested task.
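One way to picture the patch overlays mentioned above is a recursive merge in which runtime fixes are layered over the baseline without modifying it. The sketch below assumes a dictionary-shaped map; the action names, locator fields, and the `apply_overlay` helper are all hypothetical and are not taken from the actual vision-mcp.yaml schema.

```python
from copy import deepcopy
from typing import Any, Dict

# Hypothetical baseline map: action names and locator fields are made up.
BASELINE = {
    "actions": {
        "click_save": {"locator": {"role": "button", "title": "Save"}},
        "type_name": {"locator": {"role": "textfield", "title": "Name"}},
    }
}

# Hypothetical runtime patch: the Save button was renamed in a newer build.
PATCH = {
    "actions": {
        "click_save": {"locator": {"title": "Save As"}},
    }
}


def apply_overlay(base: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge patch over base without mutating either input."""
    merged = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge field by field.
            merged[key] = apply_overlay(merged[key], value)
        else:
            # Scalars, lists, and new keys: the patch value wins.
            merged[key] = deepcopy(value)
    return merged


effective = apply_overlay(BASELINE, PATCH)
print(effective["actions"]["click_save"]["locator"])
```

Because the baseline is never mutated, a bad patch can be dropped and the trusted map is still intact.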
Hybrid Exploration
Vision-MCP gives the agent multiple ways to understand a GUI:
- native accessibility trees through macOS AX or Windows UIA
- OCR for text regions and verification
- screenshots and visual-model fallback for non-native or visually dense apps
- window capsules for display, geometry, foregrounding, and live view support
Native structure is preferred when it is reliable. OCR and vision are used as
fallbacks or verification layers.
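The "cheapest reliable signal first" ordering can be sketched as a chain of probes tried in sequence. The probe functions, locator strings, and the `locate` helper below are hypothetical stand-ins; real probes would call AX/UIA, OCR, and a vision model.

```python
from typing import Callable, List, Optional, Tuple

# A probe returns a locator string if it finds the target, else None.
Probe = Callable[[str], Optional[str]]


def locate(target: str, probes: List[Tuple[str, Probe]]) -> Tuple[str, str]:
    """Try probes in order (cheapest first) and return (probe_name, locator)."""
    for name, probe in probes:
        result = probe(target)
        if result is not None:
            return name, result
    raise LookupError(f"no probe could locate {target!r}")


def ax_probe(target: str) -> Optional[str]:
    # Stand-in for an accessibility-tree lookup.
    known = {"Save": "ax://window/toolbar/button[Save]"}
    return known.get(target)


def ocr_probe(target: str) -> Optional[str]:
    # Stand-in for OCR text matching on a screenshot.
    seen = {"Export": "ocr://412,88"}
    return seen.get(target)


def vision_probe(target: str) -> Optional[str]:
    # Stand-in for a vision-model query; here it always answers.
    return f"vision://{target}"


PROBES = [("accessibility", ax_probe), ("ocr", ocr_probe), ("vision", vision_probe)]

for label in ("Save", "Export", "Canvas icon"):
    print(label, "->", locate(label, PROBES))
```

In this toy setup "Save" resolves through the accessibility probe, "Export" falls through to OCR, and "Canvas icon" reaches the vision fallback.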
Quick Start
Claude Code
Inside Claude Code:
/plugin marketplace add Haruhiyuki/vision-mcp
/plugin install vision-mcp@vision-mcp
The plugin installs the skill, MCP server configuration, examples, and helper
bootstrap path.
Other MCP Hosts
For Codex, Cursor, Cline, OpenClaw, Hermes Agent, or any stdio MCP host, add a
server like this:
{
  "mcpServers": {
    "vision-mcp": {
      "command": "npx",
      "args": [
        "-y",
        "@vision-mcp/cli@latest",
        "serve",
        "--apps-root",
        "${HOME}/.vision-mcp/apps"
      ]
    }
  }
}
Then run:
npx -y @vision-mcp/cli@latest doctor
npx -y @vision-mcp/cli@latest init-apps
For host-specific configuration paths, macOS and Windows permissions, upgrade
steps, and troubleshooting, see the install guide (written in Chinese):
INSTALL.md.
Core Capabilities
Platform Support
| Capability | macOS | Windows |
|---|---|---|
| Native helper | Swift + ScreenCaptureKit + AX + Vision + IOKit | PowerShell 5.1 + Win32 + UIA + System.Drawing + WinRT |
| Modern screenshots | SCScreenshotManager on macOS 14+ | PrintWindow PW_RENDERFULLCONTENT on Windows 8.1+ |
| Accessibility tree | AXUIElement + osascript fallback | UIA TreeWalker + MSAA fallback |
| OCR | Vision framework | Windows.Media.Ocr |
| Input | NSPasteboard paste + CGEvent | SendInput VK_PACKET with modifier support |
| Foregrounding | NSWorkspace.activate | SwitchToThisWindow, AttachThreadInput, and fallbacks |
| Health checks | health.snapshot | health.snapshot with GDI/USER resource checks |
| Self-check | vision-mcp doctor | vision-mcp doctor |
Platform notes:
MCP Tool Surface
| Category | Tools |
|---|---|
| Discovery | list_apps, list_workflows, describe, describe_workflow, describe_action, list_actions |
| Execution | run_workflow, perform_action |
| Low-level actions | click_at, type_text, press_key, scroll |
| Exploration and vision | snapshot, annotated, OCR text click helpers |
| AX/UIA | ax-press for macOS AXPress and Windows UIA InvokePattern |
| Continuous correction | vision-mcp patch, patches |
| Window management | displays, capsule, restore, live-view |
| Diagnostics | doctor [--watch sec] |
| Repair | repair_minimal --max-level 3 |
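MCP hosts invoke server tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message for `run_workflow`. The `tool_call` helper is hypothetical, and the argument keys (`app`, `workflow`) and their values are assumptions for illustration; use `describe_workflow` or the server's tool listing for the real input schema.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to carry one.
_ids = count(1)


def tool_call(name: str, arguments: dict) -> str:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# Argument keys below are illustrative assumptions, not the documented schema.
message = tool_call("run_workflow", {"app": "example-app", "workflow": "export_pdf"})
print(message)
```

In practice the MCP host constructs and sends this message itself; the sketch only shows what crosses the stdio transport when an agent picks a tool from the table above.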
vision-mcp.yaml Map Model
Vision-MCP maps are designed to make GUI knowledge durable: