Skill

document-ocr

Extracts text from PDFs and images using AWS Textract. Especially useful for design PDFs (Illustrator) that have no extractable text. Triggers on "extraer texto" (extract text), "OCR", "leer PDF escaneado" (read scanned PDF), "documento sin texto" (document without text), "Textract".

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/Muno-OS (beta):document-ocr

SKILL.md

112 lines · ~747 tokens

Stats

Language: Python
Stars: 3
Forks: 1
Maintenance: Good
Last Commit: Mar 20, 2026

Document OCR

Extracts text from PDFs and images using AWS Textract. Useful for scanned documents or design PDFs that have no extractable text.

Files in this Skill

document-ocr/
├── SKILL.md

Reference scripts (in /scripts/ocr/):

textract_pdf_analyzer.py - Process PDFs
textract_images_analyzer.py - Process images

Prerequisites

AWS CLI configured with credentials
Permissions for AWS Textract
Region: us-east-1 (default)
Python 3.x with boto3
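
As a quick sanity check before running the scripts, the prerequisites above can be probed from Python. This is a minimal sketch using only the standard library; the function name `check_prerequisites` is hypothetical and not part of the skill, and the check only confirms that the tools are present, not that the credentials or Textract permissions are valid.

```python
import importlib.util
import os
import shutil


def check_prerequisites(default_region="us-east-1"):
    """Report which prerequisites appear to be present on this machine."""
    return {
        # True if an `aws` executable is on PATH (credentials are not validated).
        "aws_cli": shutil.which("aws") is not None,
        # True if boto3 can be imported by this interpreter.
        "boto3": importlib.util.find_spec("boto3") is not None,
        # Region from the environment, falling back to the skill's default.
        "region": os.environ.get("AWS_DEFAULT_REGION", default_region),
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```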

Workflow

For PDFs

python3 scripts/ocr/textract_pdf_analyzer.py "documento.pdf" "resultado"

Output: .txt files with the extracted text, one per page.

For Images

python3 scripts/ocr/textract_images_analyzer.py "carpeta_imagenes/" "resultado"

For Illustrator/Design PDFs

PDFs created in Illustrator or other design tools frequently have no extractable text (the text is stored as curves/paths).

Solution: Convert to images first

# 1. Convert the PDF to images with ImageMagick
magick -density 300 "archivo.pdf" -quality 100 "output/page.png"

# 2. Process the images with Textract
python3 scripts/ocr/textract_images_analyzer.py "output/" "resultado"
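
The two steps above could be chained from a single Python entry point. The sketch below is an assumption about how one might wrap them, not something shipped with the skill: the function names are hypothetical, and it simply shells out to the same `magick` command and to `textract_images_analyzer.py` with the arguments shown above.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir, density=300):
    """Build the ImageMagick command from step 1 as an argument list."""
    return [
        "magick", "-density", str(density), pdf_path,
        "-quality", "100", str(Path(out_dir) / "page.png"),
    ]


def build_ocr_command(images_dir, output_prefix):
    """Build the call to the skill's image analyzer script from step 2."""
    return ["python3", "scripts/ocr/textract_images_analyzer.py",
            images_dir, output_prefix]


def run_design_pdf_flow(pdf_path, out_dir, output_prefix):
    """Run both steps in order, stopping if either command fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(pdf_path, out_dir), check=True)
    subprocess.run(build_ocr_command(out_dir, output_prefix), check=True)
```

Passing argument lists rather than a shell string avoids quoting problems with file names that contain spaces.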

Gotchas

Design PDFs: Claude's Read tool does not extract text from Illustrator PDFs. Use the image-conversion + Textract flow.
Textract limits:
- Maximum 5 MB per document for synchronous calls
- For large documents, use async processing with S3
Image quality matters: For better OCR, use a density of 300 or higher when converting PDFs.
Tables: Textract can detect tables, but the output may need post-processing.
Language: Textract works best with English, but supports Spanish and other languages.
Costs: Textract charges per page. Check pricing before processing large documents.
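
The size limit above suggests a simple pre-flight check before sending a document. The following is a minimal sketch that assumes the 5 MB threshold stated in this skill; `choose_textract_mode` is a hypothetical helper, and the "async" branch only signals that the document should go through S3 and an asynchronous Textract operation (for example `start_document_text_detection`), which is not implemented here.

```python
import os

# 5 MB, the synchronous limit stated in this skill (an assumption taken from the text).
SYNC_LIMIT_BYTES = 5 * 1024 * 1024


def choose_textract_mode(path, sync_limit=SYNC_LIMIT_BYTES):
    """Return "sync" if the file fits under the limit, otherwise "async"."""
    size = os.path.getsize(path)
    return "sync" if size <= sync_limit else "async"
```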

AWS Configuration

Region: us-east-1
Account: [configure in AWS CLI]

Scripts

textract_pdf_analyzer.py

python3 scripts/ocr/textract_pdf_analyzer.py input.pdf output_prefix
# Sends the PDF to Textract, saves the result as .txt and .json
# For PDFs >5MB, uses asynchronous analysis with S3

textract_images_analyzer.py

python3 scripts/ocr/textract_images_analyzer.py input_folder/ output_prefix
# Processes all PNG images in the folder
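
The source of these scripts is not shown here, so the following is only a sketch of what the core of an image analyzer might look like, not the skill's actual implementation. It assumes boto3's Textract client and its `detect_document_text` operation, whose response contains a `Blocks` list where blocks with `BlockType` equal to `LINE` carry a `Text` field. The function names are hypothetical.

```python
def extract_lines(response):
    """Collect the text of every LINE block from a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def ocr_image(path, region="us-east-1"):
    """Send one image to Textract synchronously and return its text."""
    import boto3  # imported lazily so extract_lines works without boto3

    client = boto3.client("textract", region_name=region)
    with open(path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return "\n".join(extract_lines(response))
```

Keeping the response parsing in its own function makes it possible to test the text extraction against a saved .json result without calling AWS.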

Output

Plain text files with the extracted content:

resultado_page_1.txt
resultado_page_2.txt
etc.

Or a consolidated file:

resultado_full.txt
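
If only the per-page files are available, a consolidated file can be assembled from them. This is a minimal sketch that assumes the naming pattern shown above (`<prefix>_page_N.txt` and `<prefix>_full.txt`); `consolidate_pages` is a hypothetical helper, and joining pages with a blank line is an arbitrary choice.

```python
import re
from pathlib import Path


def consolidate_pages(directory, prefix="resultado"):
    """Merge <prefix>_page_N.txt files, in page order, into <prefix>_full.txt."""
    pattern = re.compile(rf"^{re.escape(prefix)}_page_(\d+)\.txt$")
    pages = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            pages.append((int(match.group(1)), path))
    # Sort numerically so page 10 follows page 9 rather than page 1.
    pages.sort(key=lambda item: item[0])
    text = "\n\n".join(p.read_text(encoding="utf-8").strip() for _, p in pages)
    full_path = Path(directory) / f"{prefix}_full.txt"
    full_path.write_text(text + "\n", encoding="utf-8")
    return full_path
```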

Alternatives

If Textract is not available:

Google Cloud Vision API - Similar functionality
Tesseract - Open source, runs locally, less accurate
Adobe Acrobat - For PDFs with hidden text
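
As an illustration of the local Tesseract alternative, the sketch below shells out to the `tesseract` command-line tool, which prints recognized text when `stdout` is given as the output base. It assumes Tesseract and the Spanish language data (`spa`) are installed; the function names are hypothetical and this fallback is not part of the skill.

```python
import subprocess


def build_tesseract_command(image_path, lang="spa"):
    """Build a tesseract call that writes recognized text to standard output."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_with_tesseract(image_path, lang="spa"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```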
