From rvanbaalen
Extract text from images and scanned PDFs using OCR. Supports 100+ languages, table detection, structured output (markdown/JSON), and batch processing.
How this skill is triggered — by the user, by Claude, or both
Slash command
/rvanbaalen:ocr-document-processorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.
Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.
from scripts.ocr_processor import OCRProcessor
# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)
# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks']) # Text blocks with positions
from scripts.ocr_processor import OCRProcessor
# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()
# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text() # All pages
# Specific pages
text = processor.extract_text(pages=[1, 2, 3])
# Get detailed results
result = processor.extract_structured()
# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language
# Export to Markdown
processor.export_markdown("output.md")
# Export to JSON
processor.export_json("output.json")
# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")
# Export to HTML
processor.export_html("output.html")
# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')
# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')
# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')
| Code | Language | Code | Language |
|---|---|---|---|
| eng | English | fra | French |
| deu | German | spa | Spanish |
| ita | Italian | por | Portuguese |
| rus | Russian | chi_sim | Chinese (Simplified) |
| chi_tra | Chinese (Traditional) | jpn | Japanese |
| kor | Korean | ara | Arabic |
| hin | Hindi | nld | Dutch |
Preprocessing improves OCR accuracy on low-quality images.
# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
deskew=True, # Fix rotation
denoise=True, # Remove noise
threshold=True, # Binarize image
contrast=1.5 # Enhance contrast
)
text = processor.extract_text()
| Option | Description | Default |
|---|---|---|
deskew | Correct skewed/rotated images | False |
denoise | Remove noise and artifacts | False |
threshold | Convert to black/white | False |
threshold_method | 'otsu', 'adaptive', 'simple' | 'otsu' |
contrast | Contrast factor (1.0 = no change) | 1.0 |
sharpen | Sharpen factor (0 = none) | 0 |
scale | Upscale factor for small text | 1.0 |
remove_shadows | Remove shadow artifacts | False |
# Extract tables from document
tables = processor.extract_tables()
# Each table is a list of rows
for table in tables:
for row in table:
print(row)
# Export tables to CSV
processor.export_tables_csv("tables/")
# Export to JSON
processor.export_tables_json("tables.json")
# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()
# Process specific pages
page_3 = processor.extract_text(pages=[3])
# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
print(f"Page {page_num}: {len(text)} characters")
# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")
from scripts.ocr_processor import batch_ocr
# Process directory of images
results = batch_ocr(
input_dir="scans/",
output_dir="extracted/",
output_format="markdown",
lang="eng",
recursive=True
)
print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()
# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount
# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()
# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs
processor = OCRProcessor("document.png")
# Configure OCR settings
processor.config.update({
'psm': 3, # Page segmentation mode
'oem': 3, # OCR engine mode
'dpi': 300, # DPI for processing
'timeout': 30, # Timeout in seconds
'min_confidence': 60, # Minimum word confidence
})
| Mode | Description |
|---|---|
| 0 | Orientation and script detection only |
| 1 | Automatic page segmentation with OSD |
| 3 | Fully automatic page segmentation (default) |
| 4 | Assume single column of text |
| 6 | Assume single uniform block of text |
| 7 | Treat image as single text line |
| 8 | Treat image as single word |
| 11 | Sparse text. Find as much text as possible |
| 12 | Sparse text with OSD |
# Basic extraction
python ocr_processor.py image.png -o output.txt
# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown
# Specify language
python ocr_processor.py german.png --lang deu
# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch
# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise
pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0
npx claudepluginhub rvanbaalen/skills --plugin ocr-document-processorThis skill should be used when the user says "process documents", "extract text from PDF", "OCR this document", "convert PDF to markdown", "extract emails from documents", "parse document", "document conversion", "batch OCR", "extract structured data from PDF", "read PDF", "extract tables from PDF", "convert Word document", "convert docx to markdown", or wants to extract, convert, or process documents and scanned images.
Parses complex documents with PaddleOCR to extract text, tables, formulas, charts, and layout structure. Use for invoices, academic papers, multi-column layouts, or any document needing structured understanding.
Parses PDFs, DOCX, PPTX, HTML, images (20+ formats) to Markdown/HTML/JSON/text with layout/tables/OCR. Chunks for RAG pipelines; batch converts via DocumentConverter.