From api-ai-claude-vision
Image understanding and document analysis with Claude's multimodal capabilities -- image input formats, PDF processing, multi-image patterns, structured extraction, and token cost estimation
Slash command: /api-ai-claude-vision:api-ai-claude-vision
> **Quick Guide:** Use `type: "image"` content blocks for images (base64, URL, or file_id) and `type: "document"` content blocks for PDFs. Supported image formats: JPEG, PNG, GIF, WebP. Placing images before text in the content array improves results. Token cost formula: `tokens = (width * height) / 750`. Images are auto-resized if the long edge exceeds 1568px or the image would exceed ~1600 tokens. PDFs use `type: "document"` with `media_type: "application/pdf"`. No OCR library needed -- Claude reads text directly from images and PDFs.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST use `type: "image"` for images and `type: "document"` for PDFs -- they are different content block types)
(You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first)
(You MUST always provide `max_tokens` in every request -- it is required and has no default)
(You MUST iterate over `response.content` blocks -- never assume a single text block in the response)
(You MUST use named constants for `max_tokens`, token budgets, and pixel limits -- no magic numbers)
</critical_requirements>
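The content-iteration rule above can be sketched as follows. This is a minimal illustration under stated assumptions: `ResponseBlock` is a simplified local stand-in for the SDK's response block types, and `collectText` is a hypothetical helper name.

```typescript
// Simplified local stand-ins for response content blocks (assumption: the
// real SDK types carry more fields and more block variants than these two).
type ResponseBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// Hypothetical helper: walk every block and keep only the text ones,
// instead of assuming response.content[0] is a text block.
function collectText(content: ResponseBlock[]): string {
  const parts: string[] = [];
  for (const block of content) {
    if (block.type === "text") {
      parts.push(block.text);
    }
  }
  return parts.join("\n");
}

// Made-up sample response content, for illustration only.
const sampleContent: ResponseBlock[] = [
  { type: "text", text: "The chart shows revenue by quarter." },
  { type: "tool_use", id: "toolu_example", name: "lookup", input: {} },
  { type: "text", text: "Q3 is the peak." },
];

console.log(collectText(sampleContent));
```

Filtering on `block.type` keeps the code safe when the response mixes text with other block kinds.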
Auto-detection: Claude vision, image analysis, image input, base64 image, URL image, type image, type document, media_type image/jpeg, media_type image/png, image/webp, image/gif, application/pdf, PDF processing, document extraction, multimodal, multi-image, image comparison, chart analysis, screenshot analysis, image understanding, visual content, vision API
When to use:
Key patterns covered:
When NOT to use:
Claude's vision capabilities treat images and documents as first-class content blocks alongside text. There is no separate "vision API" -- you add image or document blocks to the same Messages API you already use for text.
Core principles:
- Images and documents are content blocks in the `messages` array, interleaved with text. They are not uploaded separately or referenced by URL-only.
- Images come before text, just as "documents first, query last" improves text prompts. Claude processes visual content better when it sees the image before the question.
- Token cost scales with pixel count: `tokens = (width * height) / 750`. Downsizing images before sending saves tokens without losing meaningful detail for most use cases.
When to use vision:
When NOT to use:
Read a local file, encode it to base64, and send it as a `type: "image"` content block, with the image block before the text block.
```typescript
// Image block first, text prompt second
content: [
  {
    type: "image",
    source: { type: "base64", media_type: "image/png", data: imageData },
  },
  { type: "text", text: "Describe what you see in this image." },
];
```
Why good: Image before text improves results, explicit `media_type`, structured content blocks
```typescript
// BAD: base64 as text string -- Claude cannot interpret raw base64
content: "What's in this image? " + imageData;
```
Why bad: The base64 data is passed as a text string instead of an image content block, so Claude cannot interpret it as an image
See: examples/core.md for full runnable examples with base64, URL, and Files API
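As a rough sketch of the pattern above, the helper below picks `media_type` from a file name and builds the image-then-text content array. The names `mediaTypeFromPath`, `buildImagePrompt`, and `VisionPromptBlock` are hypothetical, the block types are simplified local stand-ins for the SDK's, and the base64 string is assumed to have been produced elsewhere (for example by reading the file and encoding it).

```typescript
// The four image formats the guide lists as supported.
type SupportedImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

// Simplified local block types (assumption: not the SDK's own definitions).
type VisionPromptBlock =
  | {
      type: "image";
      source: { type: "base64"; media_type: SupportedImageMediaType; data: string };
    }
  | { type: "text"; text: string };

const MEDIA_TYPE_BY_EXTENSION: { [extension: string]: SupportedImageMediaType } = {
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  png: "image/png",
  gif: "image/gif",
  webp: "image/webp",
};

// Hypothetical helper: derive media_type from the file extension.
function mediaTypeFromPath(path: string): SupportedImageMediaType {
  const extension = (path.split(".").pop() || "").toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(MEDIA_TYPE_BY_EXTENSION, extension)) {
    throw new Error("Unsupported image extension: " + extension);
  }
  return MEDIA_TYPE_BY_EXTENSION[extension];
}

// Hypothetical helper: image block first, text prompt second.
function buildImagePrompt(path: string, base64Data: string, prompt: string): VisionPromptBlock[] {
  return [
    {
      type: "image",
      source: { type: "base64", media_type: mediaTypeFromPath(path), data: base64Data },
    },
    { type: "text", text: prompt },
  ];
}

const promptBlocks = buildImagePrompt(
  "scans/receipt.JPG",
  "<base64 data>",
  "Describe what you see in this image.",
);
console.log(promptBlocks.map((block) => block.type).join(", ")); // prints "image, text"
```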
Three source types for images. Choose based on where your image lives.
```typescript
// URL source -- simplest, smallest payload
source: { type: "url", url: "https://example.com/chart.png" }

// Base64 source -- local files
source: { type: "base64", media_type: "image/jpeg", data: base64String }

// Files API source (beta) -- upload once, reuse across requests
source: { type: "file", file_id: "file_abc123" }
```
When to use: URL for hosted images, base64 for local files, Files API for multi-turn or repeated use
See: examples/core.md for full examples of each source type
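The choice between the three source types can be sketched as a small mapping function. `ImageLocation` and `toImageSource` are hypothetical names introduced for illustration, and `ImageSourceParam` is a simplified local type rather than the SDK's.

```typescript
// Simplified local version of the three image source shapes shown above.
type ImageSourceParam =
  | { type: "url"; url: string }
  | { type: "base64"; media_type: string; data: string }
  | { type: "file"; file_id: string };

// Hypothetical description of where an image currently lives.
type ImageLocation =
  | { kind: "hosted"; url: string }
  | { kind: "local"; mediaType: string; base64Data: string }
  | { kind: "uploaded"; fileId: string };

// Hosted images use a URL source, local files use base64, and images already
// uploaded through the Files API (beta) are referenced by file_id.
function toImageSource(imageLocation: ImageLocation): ImageSourceParam {
  if (imageLocation.kind === "hosted") {
    return { type: "url", url: imageLocation.url };
  }
  if (imageLocation.kind === "local") {
    return {
      type: "base64",
      media_type: imageLocation.mediaType,
      data: imageLocation.base64Data,
    };
  }
  return { type: "file", file_id: imageLocation.fileId };
}

console.log(toImageSource({ kind: "hosted", url: "https://example.com/chart.png" }));
```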
PDFs use `type: "document"`, which is different from `type: "image"`. This is the most common mistake.
```typescript
// Correct: type "document" for PDFs
{ type: "document", source: { type: "base64", media_type: "application/pdf", data: pdfData } }

// WRONG: type "image" for PDFs -- causes API errors
{ type: "image", source: { type: "base64", media_type: "application/pdf", data: pdfData } }
```
Why good: `type: "document"` enables dual processing (text extraction + page rendering)
Why bad: Using `type: "image"` for PDFs causes API errors. PDFs require `type: "document"`.
See: examples/core.md for base64 and URL PDF examples, examples/extraction.md for PDF caching
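One way to avoid the image/document mix-up is to derive the block type from the media type in a single place. The sketch below assumes base64 input; `toFileBlock` and the local types are hypothetical, not part of the SDK.

```typescript
const PDF_MEDIA_TYPE = "application/pdf";
const IMAGE_MEDIA_TYPES = ["image/jpeg", "image/png", "image/gif", "image/webp"];

type Base64Source = { type: "base64"; media_type: string; data: string };

// Simplified local block types (assumption: not the SDK's own definitions).
type FileBlock =
  | { type: "image"; source: Base64Source }
  | { type: "document"; source: Base64Source };

// Hypothetical helper: PDFs become "document" blocks, supported image formats
// become "image" blocks, and anything else is rejected before the request.
function toFileBlock(mediaType: string, base64Data: string): FileBlock {
  const source: Base64Source = { type: "base64", media_type: mediaType, data: base64Data };
  if (mediaType === PDF_MEDIA_TYPE) {
    return { type: "document", source };
  }
  if (IMAGE_MEDIA_TYPES.indexOf(mediaType) !== -1) {
    return { type: "image", source };
  }
  throw new Error("Unsupported media type: " + mediaType + " (convert to a supported format first)");
}

console.log(toFileBlock("application/pdf", "<base64 data>").type); // prints "document"
console.log(toFileBlock("image/webp", "<base64 data>").type); // prints "image"
```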
Label images with text blocks so Claude can reference them clearly.
```typescript
content: [
  { type: "text", text: "Image 1:" },
  {
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: image1 },
  },
  { type: "text", text: "Image 2:" },
  {
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: image2 },
  },
  {
    type: "text",
    text: "Compare these two images and describe the differences.",
  },
];
```
Why good: Labels let Claude reference specific images unambiguously
Why bad (without labels): Claude may confuse which image is which when no labels are provided
See: examples/core.md for full multi-image example
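For more than two images, the labels can be generated instead of written by hand. A minimal sketch, assuming base64 inputs; `buildLabeledComparison` and the local types are hypothetical names.

```typescript
// Simplified local block types (assumption: not the SDK's own definitions).
type LabeledBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: { type: "base64"; media_type: string; data: string } };

type EncodedImage = { mediaType: string; base64Data: string };

// Hypothetical helper: emit an "Image N:" label before each image block,
// then put the question last.
function buildLabeledComparison(images: EncodedImage[], question: string): LabeledBlock[] {
  const blocks: LabeledBlock[] = [];
  images.forEach((image, index) => {
    blocks.push({ type: "text", text: "Image " + (index + 1) + ":" });
    blocks.push({
      type: "image",
      source: { type: "base64", media_type: image.mediaType, data: image.base64Data },
    });
  });
  blocks.push({ type: "text", text: question });
  return blocks;
}

const comparisonBlocks = buildLabeledComparison(
  [
    { mediaType: "image/jpeg", base64Data: "<first image>" },
    { mediaType: "image/jpeg", base64Data: "<second image>" },
  ],
  "Compare these two images and describe the differences.",
);
console.log(comparisonBlocks.map((block) => block.type).join(","));
// prints "text,image,text,image,text"
```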
Token formula: `tokens = (width * height) / 750`. Auto-resize triggers at 1568px long edge or ~1.15 megapixels.
```typescript
const TOKENS_PER_PIXEL_DIVISOR = 750;
const MAX_LONG_EDGE_PX = 1568;
const MAX_MEGAPIXELS = 1.15;

function estimateImageTokens(width: number, height: number): number {
  let w = width;
  let h = height;
  const longEdge = Math.max(w, h);
  const mp = (w * h) / 1_000_000;

  // Mirror the auto-resize: shrink until both limits are satisfied
  if (longEdge > MAX_LONG_EDGE_PX || mp > MAX_MEGAPIXELS) {
    const scale = Math.min(
      MAX_LONG_EDGE_PX / longEdge,
      Math.sqrt(MAX_MEGAPIXELS / mp),
    );
    w = Math.round(width * scale);
    h = Math.round(height * scale);
  }

  return Math.ceil((w * h) / TOKENS_PER_PIXEL_DIVISOR);
}
// 200x200: 54 tokens | 1000x1000: 1334 | 4000x3000: 1534 (auto-resized to ~1.15 MP)
```
Why good: Named constants, accounts for auto-resize, documents the formula
See: reference.md for the complete size/token/cost table, examples/core.md for countTokens() usage
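Because downsizing before sending is what saves tokens, it can help to compute the target dimensions up front. A minimal sketch, assuming the 1568px and ~1.15 megapixel limits quoted above; `fitWithinLimits` is a hypothetical helper, and the server-side resize may round differently.

```typescript
const RESIZE_MAX_LONG_EDGE_PX = 1568;
const RESIZE_MAX_MEGAPIXELS = 1.15;
const PIXELS_PER_MEGAPIXEL = 1_000_000;

type Dimensions = { width: number; height: number };

// Hypothetical helper: return the largest same-aspect-ratio size that fits
// both limits. Images already within the limits are returned unchanged.
function fitWithinLimits(width: number, height: number): Dimensions {
  const longEdge = Math.max(width, height);
  const megapixels = (width * height) / PIXELS_PER_MEGAPIXEL;
  const scale = Math.min(
    1, // never upscale
    RESIZE_MAX_LONG_EDGE_PX / longEdge,
    Math.sqrt(RESIZE_MAX_MEGAPIXELS / megapixels),
  );
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}

console.log(fitWithinLimits(4000, 3000)); // { width: 1238, height: 929 }
console.log(fitWithinLimits(800, 600)); // { width: 800, height: 600 }
```

For the 4000x3000 case the megapixel limit is the binding one: fitting only the long edge would give 1568x1176, which is still about 1.84 megapixels.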
Combine vision with `messages.parse()` and Zod schemas for typed extraction.
```typescript
import Anthropic from "@anthropic-ai/sdk";
import { zodOutputFormat } from "@anthropic-ai/sdk/helpers/zod";
import { z } from "zod";

const MAX_TOKENS = 1024; // structured extraction budget; JSON output is compact

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Shape of the data to extract from the receipt
const ReceiptData = z.object({
  merchant: z.string(),
  date: z.string(),
  items: z.array(
    z.object({ name: z.string(), quantity: z.number(), price: z.number() }),
  ),
  total: z.number(),
  currency: z.string(),
});

// receiptImage: base64-encoded JPEG data, loaded elsewhere
const response = await client.messages.parse({
  model: "claude-sonnet-4-6",
  max_tokens: MAX_TOKENS,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: receiptImage,
          },
        },
        {
          type: "text",
          text: "Extract all receipt information from this image.",
        },
      ],
    },
  ],
  output_config: { format: zodOutputFormat(ReceiptData) },
});

const receipt = response.parsed_output; // fully typed
```
Why good: Zod schema for type-safe extraction, messages.parse() for auto-validation, image before text
See: examples/extraction.md for chart extraction, form extraction, multi-document extraction, PDF caching
Image resolution vs token cost:
200x200 -> ~54 tokens ($0.00016/image at Sonnet 4.6 pricing)
1000x1000 -> ~1334 tokens ($0.004/image)
1092x1092 -> ~1590 tokens ($0.0048/image) -- max 1:1 without auto-resize
4000x3000 -> roughly 1,500-1,600 tokens (auto-resized; the ~1.15 megapixel limit is stricter here than the 1568px long edge)
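The dollar figures above are plain arithmetic on the token counts. The sketch below assumes an input price of $3 per million tokens, a rate inferred from the table's own numbers rather than taken from a price list; `estimateInputCostUsd` is a hypothetical helper.

```typescript
const TOKENS_PER_MILLION = 1_000_000;
// Assumption: inferred from the table (1334 tokens -> ~$0.004), not from a price list.
const ASSUMED_INPUT_PRICE_USD_PER_MTOK = 3;

// Hypothetical helper: cost = tokens / 1,000,000 * price per million tokens.
function estimateInputCostUsd(
  tokens: number,
  pricePerMTok: number = ASSUMED_INPUT_PRICE_USD_PER_MTOK,
): number {
  return (tokens / TOKENS_PER_MILLION) * pricePerMTok;
}

// Per-image cost for the rows in the table, plus a batch estimate.
const tableTokenCounts = [54, 1334, 1590];
for (const tokens of tableTokenCounts) {
  console.log(tokens + " tokens -> $" + estimateInputCostUsd(tokens).toFixed(6));
}
const BATCH_SIZE = 1000;
console.log("1000 images at 1334 tokens -> $" + (BATCH_SIZE * estimateInputCostUsd(1334)).toFixed(2));
```

At this assumed rate a single image is cheap, but the batch line shows how quickly full-resolution images add up across many requests.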
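- Upload frequently reused images once via the Files API and reference them by `file_id`
- Use `cache_control: { type: "ephemeral" }` when asking multiple questions about the same document
- Use token counting (`client.messages.countTokens()`) before expensive requests to estimate costs
<decision_framework>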
Where is your image?
+-- Local file -> Base64 encode with readFileSync().toString("base64")
+-- Public URL -> Use type: "url" source (simplest, smallest payload)
+-- Already uploaded -> Use type: "file" source with file_id (Files API, beta)
+-- Multiple requests -> Upload once via Files API, reuse file_id
What type of file?
+-- JPEG, PNG, GIF, WebP -> type: "image"
+-- PDF -> type: "document" with media_type: "application/pdf"
+-- Other formats -> Convert to a supported format first
What kind of analysis?
+-- Brief description -> 256-512 max_tokens
+-- Detailed analysis -> 1024-2048 max_tokens
+-- Document summarization -> 2048-4096 max_tokens
+-- Structured extraction -> 1024 max_tokens (JSON output is compact)
</decision_framework>
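The `max_tokens` guidance above fits naturally into named constants. In this sketch the upper bound of each range is used, which is an assumption rather than a recommendation from the source; `AnalysisKind` and `maxTokensFor` are hypothetical names.

```typescript
// Named budgets, one per kind of analysis (assumption: upper bound of each range).
const MAX_TOKENS_BRIEF_DESCRIPTION = 512;
const MAX_TOKENS_DETAILED_ANALYSIS = 2048;
const MAX_TOKENS_DOCUMENT_SUMMARY = 4096;
const MAX_TOKENS_STRUCTURED_EXTRACTION = 1024;

type AnalysisKind = "brief" | "detailed" | "summary" | "extraction";

const MAX_TOKENS_BY_KIND: { [K in AnalysisKind]: number } = {
  brief: MAX_TOKENS_BRIEF_DESCRIPTION,
  detailed: MAX_TOKENS_DETAILED_ANALYSIS,
  summary: MAX_TOKENS_DOCUMENT_SUMMARY,
  extraction: MAX_TOKENS_STRUCTURED_EXTRACTION,
};

// Hypothetical helper: look up the budget instead of scattering magic numbers.
function maxTokensFor(kind: AnalysisKind): number {
  return MAX_TOKENS_BY_KIND[kind];
}

console.log(maxTokensFor("extraction")); // prints 1024
```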
<red_flags>
High Priority Issues:
- Using `type: "image"` for PDFs -- PDFs require `type: "document"` with `media_type: "application/pdf"`
- Omitting `max_tokens` -- required on every request, no default
Medium Priority Issues:
- Not using `cache_control` when asking multiple questions about the same PDF -- each request re-processes the full document
Common Mistakes:
Gotchas & Edge Cases:
- The Files API is in beta and needs the beta flag (`betas: ["files-api-2025-04-14"]`)
</red_flags>
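The `cache_control` issue listed above can be sketched like this: the PDF block carries `cache_control: { type: "ephemeral" }` and stays identical across requests, while only the trailing question changes. `buildCachedPdfQuestion` and the local types are hypothetical, and the base64 string is a placeholder.

```typescript
// Simplified local block types (assumption: not the SDK's own definitions).
type PdfQuestionBlock =
  | {
      type: "document";
      source: { type: "base64"; media_type: "application/pdf"; data: string };
      cache_control: { type: "ephemeral" };
    }
  | { type: "text"; text: string };

// Hypothetical helper: document first (marked cacheable), question last.
function buildCachedPdfQuestion(pdfBase64: string, question: string): PdfQuestionBlock[] {
  return [
    {
      type: "document",
      source: { type: "base64", media_type: "application/pdf", data: pdfBase64 },
      cache_control: { type: "ephemeral" },
    },
    { type: "text", text: question },
  ];
}

const pdfPlaceholder = "<base64 pdf data>";
const summaryRequest = buildCachedPdfQuestion(pdfPlaceholder, "Summarize this document.");
const figuresRequest = buildCachedPdfQuestion(pdfPlaceholder, "List every figure and its caption.");

// The document block is byte-for-byte the same in both requests.
console.log(JSON.stringify(summaryRequest[0]) === JSON.stringify(figuresRequest[0])); // prints true
```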
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST use `type: "image"` for images and `type: "document"` for PDFs -- they are different content block types)
(You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first)
(You MUST always provide `max_tokens` in every request -- it is required and has no default)
(You MUST iterate over `response.content` blocks -- never assume a single text block in the response)
(You MUST use named constants for `max_tokens`, token budgets, and pixel limits -- no magic numbers)
Failure to follow these rules will produce API errors, degraded vision quality, unexpected token costs, or runtime crashes from untyped content blocks.
</critical_reminders>
npx claudepluginhub agents-inc/skills --plugin api-ai-claude-vision