Skill

image-recognition

Use this skill WHENEVER the user attaches, pastes, drops, or otherwise shares one or more images (screenshots, photos, scans of documents/licenses/certificates, UI captures, terminal/log captures, architecture or network diagrams, ER diagrams, charts) and wants them understood, described, transcribed, or used as the basis for any follow-up work. Trigger it even when the user does not say the word "image", e.g. "вот скрин ошибки, помоги" ("here's a screenshot of the error, help"), "разбери эту схему" ("break down this diagram"), "что тут на картинке" ("what's in this picture"), "по этому документу заведи задачу" ("create a task from this document"), or when the user simply drops a picture with a terse instruction. The skill reads every image meticulously so nothing is missed, builds a self-contained HTML report (images embedded as base64, with a detailed recognition write-up under each) for the user to verify, and then keeps the recognized content in context to solve whatever the user dictates next. Do NOT skip this skill just because the image "looks simple": the whole point is to catch the small detail a glance would miss.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/image-tools:image-recognition

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

The user's workflow: they hand you images, you read them with extreme care, you

Supporting Files

evals/evals.json
evals/evals_hard.json
evals/files/_generate.py
evals/files/architecture_diagram.png
evals/files/error_screenshot.png
evals/files/license_certificate.png
evals/files/ui_current_dark.png
evals/files/ui_target_light.png
scripts/build_report.py

SKILL.md

178 lines · ~2.4k tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last Commit: Jun 2, 2026


Image Recognition & Verified Understanding

The user's workflow: they hand you images, you read them with extreme care, you produce a verification report they can eyeball, they confirm (or correct) what you saw, and only then do you act on the content. This skill exists because decisions built on a misread image are silently wrong — a single transposed digit in an error code, a checkbox that was actually unchecked, an arrow pointing the other way in a diagram, or a license number off by one character can derail everything downstream. The verification report is the safety net: it lets the user catch a miss before work is built on top of it.

The workflow

Follow these steps in order. Do not start solving the user's actual task until the recognition report exists and the user has had a chance to verify it.

1. Collect the image paths

Use the file paths the user provided or that the harness attached. If it's ambiguous which files are the images in question, ask. Handle one image or many.
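One way to spot the ambiguous case is to separate the paths that look like images from the ones that do not. The sketch below is only an illustration: the extension set and the `split_image_paths` helper are assumptions made for this example, not something the skill defines.

```python
from pathlib import Path

# Assumption: an illustrative extension set, not a list the skill prescribes.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp"}


def split_image_paths(candidates):
    """Return (images, others): paths with an image-like suffix, and the rest."""
    images, others = [], []
    for raw in candidates:
        path = Path(raw)
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            images.append(path)
        else:
            others.append(path)
    return images, others


images, others = split_image_paths(["shots/error.PNG", "notes.txt", "diagram.webp"])
# Anything left in `others` is a reason to ask the user which files they meant.
```

A suffix check says nothing about the file's actual contents, so it can only help decide when to ask, not replace asking.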

2. Read every image carefully with `Read`

Open each image with the Read tool — it renders the image visually at full resolution. Read the whole image in one pass first. Claude's vision handles dense, high-resolution screenshots, tables, and mockups well, so a single attentive pass usually captures even small text. Read it the way you'd proofread a contract, not the way you'd skim a thumbnail: see every element, understand the composition (how things are laid out and how they relate), and miss nothing.

Don't reflexively slice the image. Cropping or zooming into a region costs real time and tokens, so reach for it only when a specific spot is genuinely too small or blurry to read with confidence after the full-resolution pass — a tiny toggle, a truncated hostname, the fine print inside a stamp. Crop that one spot, not the whole image. Meticulousness lives in your attention and systematic scanning (use the checklist below), not in mechanically chopping every screenshot into pieces — that just burns budget without seeing more.

Transcribe text and numbers exactly as shown: verbatim, including apparent mistakes. Copy them character-for-character: keep typos (`Postgress` with a double s), unusual casing, odd spacing, mixed Cyrillic/Latin (`РОСС RU.0001.11АВ29`), and truncation (keep the `…`). Do not "correct", normalize, translate, or expand them. This matters because the user is often verifying exactly those details; silently fixing a typo or tidying a code hides what is really on screen, which defeats the entire purpose of the report. Paraphrase is where the decisive detail dies: "an error about a license" is useless; the literal `ERROR 4012: license key ZX-7731-XX expired 2026-05-18` is what the user needs.
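Mixed Cyrillic/Latin strings are easy to get wrong because many letters look identical across the two scripts. As a rough aid, a transcribed string can be checked for mixed scripts so it can be called out for the user's attention and left untouched. The helper names below are hypothetical and the check is a sketch, not part of the skill.

```python
import unicodedata


def scripts_in(text):
    """Return which of 'CYRILLIC' and 'LATIN' occur among the letters of text."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC CAPITAL LETTER ER".
            name = unicodedata.name(ch, "")
            for script in ("CYRILLIC", "LATIN"):
                if name.startswith(script):
                    found.add(script)
    return found


def is_mixed_script(text):
    """True when the text contains both Cyrillic and Latin letters."""
    return scripts_in(text) == {"CYRILLIC", "LATIN"}


# Flag the string for the verification report; never rewrite it.
needs_attention = is_mixed_script("РОСС RU.0001.11АВ29")
```

The check only reports; the transcription itself stays exactly as it appears in the image.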

Use the recognition checklist below so you scan systematically instead of fixating on whatever caught your eye first.

3. Write a detailed recognition write-up for each image

For each image, write a thorough description in Russian (the language the user works in). It should be detailed enough that someone who can't see the image could rebuild a faithful mental picture from your words alone. Lead with the overall composition, then enumerate the elements. Keep all transcribed text/numbers verbatim, set in code or quotes so they stand out.

4. Build the HTML verification report

Hand the images and your write-ups to the bundled script — it embeds each image as base64 (a self-contained file the user can open with no external dependencies) and renders your descriptions underneath. See "Building the report" below for the exact spec format and command. The report goes into <project-root>/docs/image-recognition/.
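For intuition about what "embeds each image as base64" means, here is a minimal sketch of turning an image file into a `data:` URI that an `<img>` tag can use directly. It is not the code of `scripts/build_report.py`; the `image_to_data_uri` function is an assumption made for illustration.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(image_path):
    """Encode an image file as a data: URI usable in <img src="...">."""
    path = Path(image_path)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None:
        # Fallback for unknown suffixes; a browser may not render these.
        mime = "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Because the bytes travel inside the HTML itself, the report opens without the original image files. The cost is size: base64 output is about a third larger than the input.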

5. Give the user the path and ask them to verify

Do not auto-open the file. Reply with the clickable absolute path and a short note asking the user to check whether everything was recognized correctly and nothing small was missed. Then wait for their confirmation or corrections.
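A small sketch of composing that reply: resolve the report path to an absolute one and also offer a `file://` form, which some terminals and editors render as a link. The `verification_message` helper and its wording are placeholders assumed for this example.

```python
from pathlib import Path


def verification_message(report_path):
    """Build a short reply containing the absolute path and a file:// link."""
    path = Path(report_path).resolve()  # absolute, even if given relative
    return (
        f"Report: {path}\n"
        f"Link: {path.as_uri()}\n"
        "Please check that everything was recognized correctly "
        "and that nothing small was missed."
    )
```

The function only builds text; it deliberately does not open the file.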

6. Hold the recognized content in context and act on it

Once verified (and corrected, if needed), keep everything you recognized in working memory. When the user dictates a task, operate on that recognized data — don't make them re-explain what's in the images. If the user corrects a detail during verification, update your understanding accordingly before proceeding.

Recognition checklist — what "miss nothing" means

Scan for all of these. Not every category applies to every image; skip the ones that genuinely don't, but don't skip a category just because it's tedious.

Always:

Overall composition and layout — what's where, what's grouped with what, reading order.
Every distinct region/panel/section of the image.
All visible text, transcribed verbatim — including typos, odd casing, and partial/cut-off text (don't fix it; note it's cut off).
All numbers, codes, IDs, dates, sums, versions — verbatim, exactly as printed.
Icons, badges, status indicators, color where it carries meaning (red error, green OK, highlighted row).
Anything emphasized: selection, focus, highlight, underline, arrows, callouts.

Screenshots / UI captures: window or app title; menus and breadcrumbs; buttons and their state (enabled/disabled/active); every form field and its current value; checkbox and radio states (checked vs unchecked — state matters); table headers and cell contents; error/warning/info banners with their full text and any code; the element that's selected or focused.

Terminal / log captures: transcribe the visible output verbatim, preserving the sequence; note the command if shown; call out stack traces, error codes, file paths, timestamps, and exit statuses.

Documents / scans (licenses, certificates, invoices, contracts): transcribe all text and numbers verbatim — license/certificate numbers, registration IDs, dates, sums, counterparty and organization names, product names, validity periods. Note stamps, seals, signatures, logos, letterheads, and any handwriting. Describe table structure and contents. Flag anything illegible.

Diagrams / schemas / ER diagrams / network maps: enumerate every node/box with its exact label; every connection/edge/arrow with its direction and any label on it; groupings, clusters, swimlanes, zones; the legend; and the overall topology or flow in plain words ("requests enter at A, fan out to B and C, both write to D").

Charts / graphs: chart type; axis titles, units, and ranges; each series with its label and color; notable values, peaks, trends; the legend; any annotations.
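For the diagram category above, it can help to note the arrows as explicit (source, target, label) records before writing prose, so that direction is never left implicit. This is a sketch under assumed names (`Edge`, `describe_edges`, `unconnected_nodes`); the skill itself does not prescribe any data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str       # exact label of the box the arrow leaves
    target: str       # exact label of the box the arrow points at
    label: str = ""   # text written on the arrow, if any


def describe_edges(edges):
    """Render each directed edge as one plain line, keeping labels verbatim."""
    lines = []
    for edge in edges:
        suffix = f" (label: {edge.label})" if edge.label else ""
        lines.append(f"{edge.source} -> {edge.target}{suffix}")
    return lines


def unconnected_nodes(nodes, edges):
    """Nodes that appear in no edge; worth a second look at the image."""
    used = {e.source for e in edges} | {e.target for e in edges}
    return [n for n in nodes if n not in used]


# The flow from the example sentence: A fans out to B and C, both write to D.
flow = [Edge("A", "B"), Edge("A", "C"), Edge("B", "D"), Edge("C", "D")]
```

A node returned by `unconnected_nodes` is either truly isolated in the diagram or a sign that an arrow was missed.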

Building the report

Write a spec JSON, then run the script. The script reads each image, base64-embeds it, renders your markdown description beneath it, and writes a single self-contained HTML file.

Spec format (items are rendered in order):

{
  "title": "Распознавание изображений — <короткий контекст>",
  "items": [
    {
      "image": "/absolute/path/to/screenshot.png",
      "title": "Картинка 1 — скриншот ошибки выпуска лицензии",
      "description": "Markdown-описание. Поддерживаются заголовки (#, ##), списки (- ...),\n**жирный**, *курсив*, `код`. Дословный текст с картинки бери в `обратные кавычки`."
    }
  ]
}
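A spec in the shape shown above could be written from Python roughly as follows. The `write_spec` helper is hypothetical; only the `title`, `items`, `image`, and `description` keys come from the format above.

```python
import json
from pathlib import Path


def write_spec(spec_path, title, items):
    """Write a report spec; items is a list of (image, item_title, description)."""
    spec = {
        "title": title,
        "items": [
            {
                "image": str(Path(image).resolve()),  # the spec uses absolute paths
                "title": item_title,
                "description": description,
            }
            for image, item_title, description in items
        ],
    }
    spec_path = Path(spec_path)
    spec_path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Cyrillic readable instead of \uXXXX escapes.
    spec_path.write_text(
        json.dumps(spec, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return spec_path
```

Letting `json.dumps` serialize the description avoids hand-escaping quotes, backticks, and newlines inside the markdown text.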

Command (run from anywhere; paths are absolute):

python3 "$SKILL_DIR/scripts/build_report.py" \
  --spec /path/to/spec.json \
  --output "<project-root>/docs/image-recognition/recognition-$(date +%Y%m%d-%H%M%S).html"

$SKILL_DIR is this skill's directory. The script prints the absolute path of the HTML it wrote — relay that to the user. If --output is omitted, it defaults to ./docs/image-recognition/recognition-<timestamp>.html under the current working directory.
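If the command is driven from Python instead of a shell, a sketch could look like the following. It uses only the two flags documented above. The function names are assumptions, `sys.executable` stands in for `python3`, and treating the script's whole standard output as the report path is also an assumption.

```python
import subprocess
import sys
from pathlib import Path


def build_report_command(skill_dir, spec_path, output_path):
    """Assemble the argv for the bundled script with its --spec/--output flags."""
    script = Path(skill_dir) / "scripts" / "build_report.py"
    return [
        sys.executable, str(script),
        "--spec", str(spec_path),
        "--output", str(output_path),
    ]


def run_report(skill_dir, spec_path, output_path):
    """Run the script and return what it printed (assumed: the report path)."""
    result = subprocess.run(
        build_report_command(skill_dir, spec_path, output_path),
        check=True,            # raise if the script exits with an error
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.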

A convenient place for the spec file is next to the report (e.g. write it to <project-root>/docs/image-recognition/.spec-<timestamp>.json). It's a harmless build artifact; no need to delete it.

Folder convention

Reports live in <project-root>/docs/image-recognition/. Create the folder if it doesn't exist. Each batch of images produces a new timestamped HTML file — don't overwrite earlier reports. These files embed images as base64 and can get large; if you'd rather not track them in git, add docs/image-recognition/ to .gitignore — but leave that version-control decision to the user.
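The convention can be sketched as a helper that creates the folder when needed and never returns a path that already exists. The `next_report_path` name and the numeric suffix used on a same-second collision are this example's own assumptions.

```python
from datetime import datetime
from pathlib import Path


def next_report_path(project_root, now=None):
    """Return a fresh recognition-<timestamp>.html path under docs/image-recognition/."""
    folder = Path(project_root) / "docs" / "image-recognition"
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    candidate = folder / f"recognition-{stamp}.html"
    counter = 1
    # Two batches in the same second would collide; add a suffix rather than overwrite.
    while candidate.exists():
        candidate = folder / f"recognition-{stamp}-{counter}.html"
        counter += 1
    return candidate
```

The timestamp format matches the `date +%Y%m%d-%H%M%S` pattern used in the command above.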
