Generates a structured extraction plan and spreadsheet template for extracting tabular data from PDFs, specifying column headers, data types, and common pitfalls.
Slash command
/autopunk-media-skills:pdf-table-extractor-brief
Produces a structured extraction plan and clean spreadsheet template for pulling tabular data out of a PDF document — identifying the table structure, defining column headers, flagging extraction pitfalls, and providing a ready-to-use template that ensures the data lands in a consistent, analysable format.
Required: A description of the PDF and its table structure — the document title, what the table contains, how many columns and rows (approximate is fine), and what the data represents. If possible, paste or describe the first few rows of the table so the assistant can see the structure. Alternatively, provide the actual PDF content if you have extracted the raw text.
Optional: The extraction tool you plan to use (Tabula, Camelot, manual copy-paste, Adobe Export, or "not sure — recommend one"); any known problems with the PDF (merged cells, multi-line row entries, footnotes embedded in the table, inconsistent formatting across pages); the analysis you plan to run on the extracted data (this helps the assistant optimize the template for your downstream use case).
Analyzes the table structure. From the description or pasted content, identifies: number of columns, data types per column (text, integers, currency, percentages, dates), header rows vs. data rows, whether the table spans multiple pages, and whether any cells are merged or contain multi-line entries. This structural map is the foundation of the extraction plan.
Defines clean column headers. Creates standardized, machine-readable column headers for the spreadsheet template — converting the PDF's often-verbose or ambiguous headers into clear, consistent labels. Specifies the expected data type and format for each column (e.g., "Budget_Amount: numeric, no currency symbols, no commas, two decimal places").
Flags extraction pitfalls. Identifies the specific problems most likely to occur with this table: merged cells that will break row alignment, footnote markers that will pollute numeric fields, subtotal rows that should be excluded from analysis, header rows that repeat on each page, inconsistent date formats, or columns that contain mixed data types.
Writes the extraction instructions. Provides step-by-step instructions for the chosen extraction method — or recommends the best tool for this specific table type. Instructions are written for a non-technical user: which tool to use, what settings to select, what to check after the first pass, and how to handle the flagged pitfalls.
Provides the spreadsheet template. Delivers a ready-to-use template with column headers, data type specifications, and a notes column for flagging uncertain values. Includes one example row filled in from the provided data so the user can see exactly what a correctly extracted row looks like.
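As a rough illustration of the "Defines clean column headers" step, the sketch below turns verbose PDF labels into machine-readable headers. The helper name `clean_header` and the rule of spelling out `%` as `pct` are assumptions made for this example, not part of the skill itself.

```python
import re


def clean_header(label: str) -> str:
    """Convert a PDF column label into a snake_case header (hypothetical helper)."""
    text = label.strip().lower()
    # Assumption: spell out the percent sign so it survives as a word.
    text = text.replace("%", " pct ")
    # Collapse every run of non-alphanumeric characters into one underscore.
    text = re.sub(r"[^a-z0-9]+", "_", text)
    return text.strip("_")


for label in ["Department Name", "FY2024 Actual", "% Change"]:
    print(f"{label!r} -> {clean_header(label)}")
```

Generated headers are a starting point; a label that is ambiguous in the PDF still needs a human-chosen name.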
A structured extraction brief of 300-500 words. Opens with a "Table Overview" summarizing the structure. Then "Column Definitions" (a table of headers with data types). Then "Extraction Pitfalls" (numbered list of specific risks). Then "Recommended Tool & Instructions" (step-by-step instructions for the chosen or recommended tool). Closes with "Template" (the spreadsheet template in markdown table format with one example row). Tone is instructional and precise, written for a journalist or research assistant rather than a data engineer.
**Table Overview**
[Structure summary: rows, columns, data types, spanning behavior]
**Column Definitions**
| Column Header | Source Label | Data Type | Format Notes |
|---|---|---|---|
| [clean_header] | [original PDF label] | [type] | [formatting rules] |
**Extraction Pitfalls**
1. [Specific risk and how to handle it]
2. [Specific risk]
**Recommended Tool & Instructions**
[Step-by-step extraction process]
**Template**
| [Header 1] | [Header 2] | [Header 3] | Notes |
|---|---|---|---|
| [example data] | [example data] | [example data] | [example note] |
Document: Annual municipal budget summary for fiscal year 2025, published by a mid-sized city's finance department. 48-page PDF.
Table to extract: "Department Expenditure Summary" on pages 12-14. Contains approximately 35 rows (one per department) and 6 columns: Department Name, FY2023 Actual, FY2024 Budgeted, FY2024 Actual, FY2025 Proposed, and Percent Change. The table spans three pages with the header row repeating on each page. There are subtotal rows for each division (Public Safety, Public Works, Administration, Community Services) and a grand total row at the bottom.
Planned use: Compare proposed 2025 spending to actual 2024 spending by department to identify which departments are seeing the largest increases and decreases.
Extraction tool: Not sure — recommend one.
Table Overview
A 35-row, 6-column expenditure table spanning pages 12-14. Rows represent individual departments grouped under four division subtotals. All financial columns are in U.S. dollars (likely formatted with commas and dollar signs in the PDF). The "Percent Change" column compares FY2025 Proposed to FY2024 Actual. Header row repeats on each page. Four subtotal rows and one grand total row are present and should be excluded from department-level analysis.
Column Definitions
| Column Header | Source Label | Data Type | Format Notes |
|---|---|---|---|
| department_name | Department Name | text | Trim whitespace; use exact name from PDF |
| division | [inferred from subtotals] | text | Public Safety / Public Works / Administration / Community Services — assign based on position under subtotal row |
| fy2023_actual | FY2023 Actual | numeric | Remove $ and commas; two decimal places |
| fy2024_budgeted | FY2024 Budgeted | numeric | Remove $ and commas; two decimal places |
| fy2024_actual | FY2024 Actual | numeric | Remove $ and commas; two decimal places |
| fy2025_proposed | FY2025 Proposed | numeric | Remove $ and commas; two decimal places |
| pct_change | Percent Change | numeric | Remove % sign; store in percentage points, not as a fraction (e.g., 12.5, not 0.125) |
| notes | [not in source] | text | Flag any value that required judgment during extraction |
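The format notes in the table above can be applied with a couple of small helpers. This is a minimal sketch using only the Python standard library; the function names `parse_amount` and `parse_pct` are hypothetical, and it assumes the raw cells contain nothing beyond digits, `$`, `,`, `%`, a decimal point, and whitespace.

```python
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Remove $ and commas, then fix the value at two decimal places."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned).quantize(Decimal("0.01"))


def parse_pct(raw: str) -> Decimal:
    """Remove the % sign and keep the value in percentage points."""
    return Decimal(raw.replace("%", "").strip())


print(parse_amount("$14,523,800"))  # amount taken from the example row
print(parse_pct("8.4%"))
```

A cell that still contains stray characters (a footnote marker, for example) makes `Decimal` raise `decimal.InvalidOperation`, which is a convenient signal to record the value in the notes column instead of guessing.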
Extraction Pitfalls
Recommended Tool & Instructions
Use Tabula (free, open-source, available at tabula.technology). It handles multi-page tables with repeating headers well and exports directly to CSV.
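Once the CSV has been exported, a short clean-up pass can drop the repeated header rows and the total rows, and fill in the division column. The sketch below makes several assumptions that should be checked against the real file: subtotal rows are labelled with the division name plus "Subtotal" or "Total", each subtotal row sits above its departments, and the last row is labelled "Grand Total". The function name `tidy_rows` and the sample text are illustrative only; subtotal amounts are left blank rather than invented.

```python
import csv
import io
import re

# Division names come from the example input; the row labels are assumptions.
DIVISIONS = {"Public Safety", "Public Works", "Administration", "Community Services"}

SAMPLE = """Department Name,FY2024 Actual,FY2025 Proposed
Public Safety Subtotal,,
Fire and Rescue Services,"$14,987,350","$16,250,000"
Department Name,FY2024 Actual,FY2025 Proposed
Community Services Subtotal,,
Parks and Recreation,"$3,756,000","$4,125,000"
Grand Total,,
"""


def tidy_rows(csv_text: str) -> list[dict[str, str]]:
    """Drop repeated headers and total rows; tag each department with its division."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [cell.strip() for cell in next(reader)]
    division = ""
    kept = []
    for cells in reader:
        cells = [cell.strip() for cell in cells]
        if not any(cells):
            continue  # blank line, e.g. left over from a page break
        if cells == header:
            continue  # header row repeated at the top of a new page
        name = cells[0]
        if name.lower() == "grand total":
            continue
        # "Public Safety Subtotal" or "Public Safety Total" -> "Public Safety"
        label = re.sub(r"\s*(sub)?total$", "", name, flags=re.IGNORECASE)
        if label in DIVISIONS:
            division = label  # assumed to sit above its departments
            continue
        row = dict(zip(header, cells))
        row["division"] = division
        kept.append(row)
    return kept


for row in tidy_rows(SAMPLE):
    print(row["Department Name"], "|", row["division"])
```

If the subtotal rows in the PDF come after their departments instead, the same loop can be run over the rows in reverse order.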
Template
| department_name | division | fy2023_actual | fy2024_budgeted | fy2024_actual | fy2025_proposed | pct_change | notes |
|---|---|---|---|---|---|---|---|
| Fire and Rescue Services | Public Safety | 14523800.00 | 15100000.00 | 14987350.00 | 16250000.00 | 8.4 | |
| Parks and Recreation | Community Services | 3842100.00 | 4000000.00 | 3756000.00 | 4125000.00 | 9.8 | FY2024 actual lower than budgeted — verify if mid-year cut |
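A quick arithmetic check can confirm that the extracted `pct_change` agrees with the amounts in the same row. The sketch below recomputes the change from FY2024 Actual to FY2025 Proposed for the two example rows above. Rounding half up to one decimal place is an assumption, since the PDF's own rounding rule is not known, and `pct_change` here is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def pct_change(previous: Decimal, proposed: Decimal) -> Decimal:
    """Percent change from previous to proposed, in percentage points."""
    change = (proposed - previous) / previous * 100
    # Assumption: the source rounds half up to one decimal place.
    return change.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# (department_name, fy2024_actual, fy2025_proposed, reported pct_change)
rows = [
    ("Fire and Rescue Services", Decimal("14987350.00"), Decimal("16250000.00"), Decimal("8.4")),
    ("Parks and Recreation", Decimal("3756000.00"), Decimal("4125000.00"), Decimal("9.8")),
]

for name, actual, proposed, reported in rows:
    computed = pct_change(actual, proposed)
    status = "ok" if computed == reported else "check"
    print(f"{name}: computed {computed}, reported {reported} -> {status}")
```

A mismatch usually points to a digit lost or shifted during extraction, so it is worth flagging in the notes column. A department with a zero FY2024 actual would need separate handling, because the division is undefined.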
npx claudepluginhub ur-grue/autopunk-media-skills --plugin autopunk-media-skills