Skill

learn-document

Teach the system to recognize a new supplier invoice format, or finetune an existing one. Receives the PDF path as argument (e.g. /learn-document path/to/file.pdf). If no parser exists, creates one from scratch. If a parser exists but has low confidence or wrong values, enters review/finetune mode to fix the regex patterns. Triggers on: learn document, teach document, new supplier, fix parser, finetune, corrigir parser, afinar parser, aprender documento, ensinar documento.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/supplier-invoice-service:learn-document

User invocable

Model invocable

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Create or finetune a deterministic parser plugin from a PDF file, test it, get user validation, and register it in Supabase.

SKILL.md

242 lines · ~2.4k tokens

Stats

Parent stars: 0

Maintenance: Excellent

Last Commit: Feb 20, 2026

New Parser — Plugin Creator & Finetuner

Create or finetune a deterministic parser plugin from a PDF file, test it, get user validation, and register it in Supabase.

Pre-flight check (REQUIRED)

Before calling ANY MCP tool, verify the server is reachable by checking if parse_invoice exists as an available tool.

If the tool is NOT available: the plugin already configures the MCP server via plugin.json, but the server needs the REQUEST_MCP_TOKEN environment variable to be set in ~/.claude/settings.json.

Run the automatic token setup:

1. Check if the token already exists: read ~/.claude/settings.json and look for env.REQUEST_MCP_TOKEN.
2. If it exists → the issue is something else (server down, plugin not enabled). Tell the user to check /mcp.
3. If it does NOT exist → ask the user using AskUserQuestion: "To use the request MCP server, I need your authentication token. What is the token?"
4. Once the user provides the token, save it to ~/.claude/settings.json:
   - Read the existing file (or start with {} if it doesn't exist)
   - Merge {"env": {"REQUEST_MCP_TOKEN": "<TOKEN>"}} into the existing JSON (preserve all other settings)
   - Write the file back using the Write tool
5. Tell the user: "Token saved! Restart Claude Code to activate it (run claude again in this terminal)."
6. Stop immediately. Do NOT attempt any MCP operations until the user restarts.
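
The settings merge described above can be sketched in Python. This only illustrates the merge semantics (keep every existing key, add or overwrite env.REQUEST_MCP_TOKEN); inside Claude Code the skill performs the same steps with the Read and Write tools rather than by running a script. The function name merge_token is hypothetical.

```python
import json
from pathlib import Path


def merge_token(settings_path: Path, token: str) -> dict:
    """Add REQUEST_MCP_TOKEN under "env" without touching other settings."""
    settings = {}
    if settings_path.exists():
        raw = settings_path.read_text(encoding="utf-8").strip()
        if raw:
            settings = json.loads(raw)
    # setdefault keeps any env entries that are already present.
    settings.setdefault("env", {})["REQUEST_MCP_TOKEN"] = token
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

A shallow replace of the whole "env" object would drop unrelated variables, which is why the sketch updates a single key in place.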

Do NOT attempt to parse, create, update, or perform any operation without the MCP server running.

Auth token

The token lives in ~/.claude/settings.json → env.REQUEST_MCP_TOKEN (used by the plugin MCP server automatically and for HTTP uploads).

Never use curl, $(), or python3 -c inline blobs — all trigger permission prompts in Claude Code. Always use the upload script.

If the token key does NOT exist in settings, run the token setup from the Pre-flight check section above.

Upload script

The plugin includes a standalone upload script at scripts/upload.py (relative to the plugin root). Find it with:

find ~/.claude -path "*/supplier-invoice-service/scripts/upload.py" -print -quit 2>/dev/null

This script reads the token from settings automatically and uploads via Python urllib — no curl, no permission prompts.

How to parse a PDF (2-step flow)

The MCP server runs remotely. PDFs must be uploaded first via HTTP, then parsed via MCP tool.

Step 1: Upload the PDF

python3 /path/to/supplier-invoice-service/scripts/upload.py /path/to/fatura.pdf

Output: fatura.pdf\t<file_id> (tab-separated filename and file_id).
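
As a rough illustration of the output format above, the tab-separated line can be split like this. The helper name parse_upload_output is hypothetical, and the sketch assumes the script prints one filename<TAB>file_id line per uploaded file, as described.

```python
def parse_upload_output(stdout: str) -> tuple[str, str]:
    """Split one 'filename<TAB>file_id' line into its two parts."""
    lines = [ln for ln in stdout.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("upload script produced no output")
    # rpartition splits on the last tab, so a tab inside the filename survives.
    filename, sep, file_id = lines[0].rpartition("\t")
    if not sep or not filename or not file_id.strip():
        raise ValueError(f"unexpected upload output: {lines[0]!r}")
    return filename, file_id.strip()
```

The returned file_id is the value to pass to parse_invoice in the next step.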

Step 2: Parse via MCP tool

parse_invoice(file_id="<file_id>")

NEVER use pdf_path — the server is remote and cannot access local files. Always use the upload → file_id flow.

Usage

/learn-document <path/to/file.pdf>

The ARGUMENTS passed to this skill contain the PDF file path.

Routing: create vs finetune

Upload and parse the PDF first (upload → file_id → parse_invoice), then decide:

| Result | Action |
|---|---|
| No match | Create workflow |
| Confidence < 1.0 | Finetune workflow |
| Confidence = 1.0 but values wrong | Finetune workflow |
| Confidence = 1.0, values correct | Nothing to do |
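
The routing table can be read as a small decision function. This is a sketch under assumptions: the function name route is hypothetical, "no match" is represented here by a confidence of None, and values_correct stands for the user's (or reviewer's) judgement, which the parse result itself cannot provide.

```python
from typing import Optional


def route(confidence: Optional[float], values_correct: bool = True) -> str:
    """Return 'create', 'finetune' or 'done' following the routing table."""
    if confidence is None:  # hypothetical marker for "no parser matched"
        return "create"
    if confidence < 1.0 or not values_correct:
        return "finetune"
    return "done"
```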

MCP Tools

| Operation | MCP Tool |
|---|---|
| Parse a PDF | `parse_invoice(file_id)` (upload first via HTTP) |
| View parser source | `get_parser_source(name)` |
| Create new parser | `create_parser(name, source)` |
| Update existing parser | `update_parser(name, source)` |
| Disable a parser | `disable_parser(name)` |
| Re-enable a parser | `enable_parser(name)` |

Workflow — creating a new parser

Step 1: Parse and get raw text

Upload the PDF (upload script → file_id), then run parse_invoice(file_id=...). If no match, the raw extracted text is returned. Capture it.

Step 2: Analyze the extracted text

Identify: supplier name, NIF/VAT, invoice number pattern, date format, period (if applicable), monetary values, currency, VAT exemption notes.

Step 3: Write the plugin

import re
from .base import InvoiceParser


class SupplierParser(InvoiceParser):
    """Deterministic parser for Supplier invoices."""

    NIF = "123456789"

    def can_parse(self, text: str, filename: str = "") -> bool:
        t = text.lower()
        return "supplier keyword" in t or self.NIF in text.replace(" ", "")

    def parse(self, text: str, filename: str = "") -> dict:
        result = self.empty_result()
        result["fornecedor"] = "Supplier"
        result["nif_fornecedor"] = self.NIF
        result["ficheiro"] = filename
        result["moeda"] = "EUR"
        warnings = []

        # Invoice number
        m = re.search(r"Fatura\s+n[.ºo°]\s*(\S+)", text)
        if m:
            result["numero"] = m.group(1)
        else:
            warnings.append("numero not found")

        # Date
        m = re.search(r"Data[:\s]+(\d{2})[/.-](\d{2})[/.-](\d{4})", text)
        if m:
            result["data_emissao"] = f"{m.group(1)}-{m.group(2)}-{m.group(3)}"
        else:
            warnings.append("data_emissao not found")

        # Subtotal, IVA, Total — adapt regex to supplier layout
        # ...

        # Confidence
        campos_chave = ["numero", "data_emissao", "subtotal", "total"]
        preenchidos = sum(1 for c in campos_chave if result[c] is not None)
        result["confidence"] = round(preenchidos / len(campos_chave), 2)

        result["warnings"] = warnings
        return result

Step 4: Register

Use create_parser(name, source). If the parser already exists, use update_parser(name, source) instead (it auto-archives the previous version).

Step 5: Test

Upload and run parse_invoice(file_id=...) again. Verify JSON output.

Step 6: User validation

MANDATORY: Show results and ask with AskUserQuestion:

"Are the extracted values correct?" → "Yes, everything is correct" / "No, I need to fix something"

If the user rejects → ask what's wrong, fix, update, re-test, ask again.

Workflow — finetuning an existing parser

1. Upload and run parse_invoice(file_id=...), and note wrong/missing fields
2. Get source with get_parser_source(name)
3. If needed, extract raw text with pdftotext <file.pdf> - to debug regex
4. Diagnose and fix regex issues
5. Save with update_parser(name, source)
6. Re-test: upload and run parse_invoice(file_id=...)
7. User validation (same as create workflow)
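
For the diagnosis step, one way to see quickly which patterns miss is to run each regex against the raw text and report what it matched. This is only a local debugging sketch; the helper name diagnose and the "total" pattern are hypothetical, while the first two patterns are the ones from the parser template.

```python
import re
from typing import Dict, Optional


def diagnose(patterns: Dict[str, str], text: str) -> Dict[str, Optional[str]]:
    """Map each field to the text its regex matched, or None if it missed."""
    report = {}
    for field, pattern in patterns.items():
        m = re.search(pattern, text)
        report[field] = m.group(0) if m else None
    return report


patterns = {
    "numero": r"Fatura\s+n[.ºo°]\s*(\S+)",
    "data_emissao": r"Data[:\s]+(\d{2})[/.-](\d{2})[/.-](\d{4})",
    "total": r"Total a Pagar[:\s]+([\d.,]+)",  # hypothetical label
}
raw_text = "Fatura nº FT2025/123\nData: 05/03/2025\n"
for field, matched in diagnose(patterns, raw_text).items():
    print(f"{field}: {matched!r}")
```

Fields reported as None are the ones whose label or layout needs a new or alternative pattern.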

Plugin rules

Regex patterns

Use anchored patterns with explicit labels (e.g. "Total a Pagar", "Invoice Amount")
Handle PT format 1.234,56 and EN format 1,234.56 correctly
Dates: always convert to DD-MM-YYYY
Add alternative label patterns for the same field
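
The rules above might look like this in practice. It is a minimal sketch: the names TOTAL_RE, find_total and to_dd_mm_yyyy are hypothetical, the labels are the two examples from the list, and real suppliers will need their own label set.

```python
import re
from typing import Optional

# Alternative labels for the same field, tried as one alternation.
TOTAL_RE = re.compile(
    r"(?:Total a Pagar|Invoice Amount)[:\s]*(?:EUR|€)?\s*(\d(?:[\d.,]*\d)?)",
    re.IGNORECASE,
)


def find_total(text: str) -> Optional[str]:
    """Return the raw amount string that follows a known total label."""
    m = TOTAL_RE.search(text)
    return m.group(1) if m else None


def to_dd_mm_yyyy(raw: str) -> Optional[str]:
    """Convert YYYY-MM-DD or DD/MM/YYYY style dates to DD-MM-YYYY."""
    m = re.search(r"(\d{4})-(\d{2})-(\d{2})", raw)
    if m:
        return f"{m.group(3)}-{m.group(2)}-{m.group(1)}"
    m = re.search(r"(\d{2})[/.-](\d{2})[/.-](\d{4})", raw)
    if m:
        return f"{m.group(1)}-{m.group(2)}-{m.group(3)}"
    return None
```

The amount is returned as a string on purpose, so the caller can decide whether it is in PT or EN format before converting it.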

Number parsing helpers

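@staticmethod
def _parse_pt(val: str) -> float:
    """Parse PT format: 1.234,56 → float. Returns 0.0 for empty/non-digit input."""
    if not val or not any(c.isdigit() for c in val):
        return 0.0
    # Drop thousands dots, then turn the decimal comma into a dot.
    return float(val.replace(".", "").replace(",", "."))

@staticmethod
def _parse_dot(val: str) -> float:
    """Parse EN format: 1,234.56 → float. Returns 0.0 for empty/non-digit input."""
    if not val or not any(c.isdigit() for c in val):
        return 0.0
    # Drop thousands commas; the dot is already the decimal separator.
    return float(val.replace(",", ""))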

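Mixed format heuristic (OCR): if a dot is followed by at most 2 digits, treat it as an EN decimal point (12.5, 12.50). If it is followed by more than 2 digits, treat it as a PT thousands separator (1.234 → 1234).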
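
A possible implementation of the mixed-format heuristic is sketched below. Only the dot-only branch comes from the rule above; the handling of inputs that contain a comma is an assumption added for completeness, and the name parse_mixed is hypothetical. The input is expected to be an already isolated numeric string (no currency symbol).

```python
def parse_mixed(val: str) -> float:
    """Guess PT vs EN number format for an isolated amount string."""
    s = val.strip()
    if not s or not any(ch.isdigit() for ch in s):
        return 0.0
    if "," in s and "." in s:
        # Assumption: with both separators, the right-most one is the decimal mark.
        if s.rfind(",") > s.rfind("."):
            return float(s.replace(".", "").replace(",", "."))
        return float(s.replace(",", ""))
    if "," in s:
        # Assumption: a lone comma is a PT decimal comma.
        head, tail = s.rsplit(",", 1)
        return float(head.replace(",", "") + "." + tail)
    if "." in s:
        head, tail = s.rsplit(".", 1)
        if len(tail) <= 2:
            # At most 2 digits after the dot: EN decimal point.
            return float(head.replace(".", "") + "." + tail)
        # More than 2 digits after the dot: PT thousands separator.
        return float(s.replace(".", ""))
    return float(s)
```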

OCR safety

Handle garbled text in can_parse — add common OCR variants
Use text.replace(" ", "") when matching NIFs (OCR inserts spaces)
Use [\s\S]*? instead of .*? when label and value span multiple lines
Safety checks in _parse_pt/_parse_dot for empty/non-digit values
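
Two of the OCR rules above, sketched with a placeholder NIF and a hypothetical label. The helper names has_nif and total_after_label are illustrative and not part of the plugin API.

```python
import re
from typing import Optional

NIF = "123456789"  # placeholder NIF, as in the parser template


def has_nif(text: str) -> bool:
    """Match the NIF even when OCR inserted spaces between digits."""
    return NIF in text.replace(" ", "")


def total_after_label(text: str) -> Optional[str]:
    """Find the first amount after the label, even on a later line."""
    # [\s\S] also matches newlines, which a plain "." does not by default.
    m = re.search(r"Total a Pagar[\s\S]*?(\d[\d.,]*)", text)
    return m.group(1) if m else None
```

Because [\s\S]*? can skip over any amount of text, it returns the first number after the label wherever it is; pairing it with a specific label keeps that from picking up unrelated values.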

Special cases

IVA 0% and subtotal missing → set subtotal = total
Avoid generic € patterns on OCR receipts — can match capital social
Prefer tax line calculation (subtotal + iva) over OCR'd total
Validate VAT rates: only accept {6, 13, 23}%
Fuel receipts: total known but IVA/subtotal missing → subtotal = total / 1.23
Detalhe/extrato documents: detect "detalhe" in filename or "VALORES DETALHADOS" in text. Set confidence=0.0 and skip monetary extraction.
Multi-entity documents (Via Verde): sum all "Total pago em ..." sections
pdftotext column interleaving: use -layout flag or match values by pattern
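
Some of the monetary special cases can be expressed as a small post-processing step. This is a sketch under assumptions: the names vat_rate_ok and fill_amounts are hypothetical, the 0.5-point tolerance is an arbitrary choice to absorb rounding, and deriving IVA as total minus subtotal on fuel receipts is an assumption that the list above does not state.

```python
from typing import Optional, Tuple

VALID_VAT_RATES = (6, 13, 23)


def vat_rate_ok(subtotal: float, iva: float, tolerance: float = 0.5) -> bool:
    """True if iva/subtotal is within `tolerance` points of an accepted rate."""
    if subtotal <= 0:
        return False
    rate = iva / subtotal * 100
    return any(abs(rate - valid) <= tolerance for valid in VALID_VAT_RATES)


def fill_amounts(
    total: Optional[float],
    subtotal: Optional[float],
    iva: Optional[float],
    fuel_receipt: bool = False,
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return (subtotal, iva, total) after applying the special cases."""
    # Prefer the tax-line calculation over an OCR'd total.
    if subtotal is not None and iva is not None:
        return subtotal, iva, round(subtotal + iva, 2)
    if total is not None and subtotal is None:
        if iva == 0:
            return total, 0.0, total  # IVA 0%: subtotal = total
        if iva is None and fuel_receipt:
            subtotal = round(total / 1.23, 2)
            # Assumption: IVA is whatever remains of the total.
            return subtotal, round(total - subtotal, 2), total
    return subtotal, iva, total
```

The 0% case is handled separately in fill_amounts, so vat_rate_ok deliberately rejects it along with any other rate outside the accepted set.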

can_parse specificity

Never use single short substrings as sole identifier
Combine keywords or use full name to avoid false positives
Amazon Business: third-party sellers need special handling — layout varies by country
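
To illustrate the specificity rule, the sketch below accepts a document only on a NIF match or when several keywords appear together. The supplier name, keywords and NIF are all made-up placeholders.

```python
NIF = "123456789"  # placeholder NIF
KEYWORDS = ("acme", "energia")  # hypothetical: all must appear together


def can_parse(text: str, filename: str = "") -> bool:
    """Accept on an exact NIF match, or when every keyword is present."""
    if NIF in text.replace(" ", ""):
        return True
    lowered = text.lower()
    # A single short substring such as "acme" alone is not enough.
    return all(keyword in lowered for keyword in KEYWORDS)
```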

Fuel/thermal receipts

Invoice number = sequential part of ATCUD (after -). Ex: JUB5974F-000012335 → 000012335
ATCUD regex: A.?T?\s*CUD[;:\s]+([A-Za-z0-9]+)\s*-\s*(\d+)
Date fallback from filename: _date_from_filename(filename) extracts from xxx-DDMMYYYY.pdf

Confidence calculation

campos_chave = ["numero", "data_emissao", "subtotal", "total"]
preenchidos = sum(1 for c in campos_chave if result[c] is not None)
result["confidence"] = round(preenchidos / len(campos_chave), 2)
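
As a worked example of the formula, here it is as a standalone function applied to a result with three of the four key fields filled. The function name compute_confidence is hypothetical; result.get is used so a missing key counts as unfilled instead of raising.

```python
CAMPOS_CHAVE = ["numero", "data_emissao", "subtotal", "total"]


def compute_confidence(result: dict) -> float:
    """Fraction of key fields that are not None, rounded to 2 decimals."""
    preenchidos = sum(1 for c in CAMPOS_CHAVE if result.get(c) is not None)
    return round(preenchidos / len(CAMPOS_CHAVE), 2)


partial = {"numero": "FT 2025/1", "data_emissao": "05-03-2025", "subtotal": None, "total": 12.3}
print(compute_confidence(partial))  # 3 of 4 fields filled -> 0.75
```

Note that a value of 0.0 counts as filled because it is not None, so an amount that fell back to 0.0 in a parsing helper still raises the confidence.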
