From mixedbread-skills
Parse documents, extract structured content, and run OCR using the Mixedbread Parsing API. Use when parsing PDFs, Word documents, PowerPoint slides, or images, extracting tables or form fields, running OCR on scanned documents, converting documents to markdown or HTML, or extracting structured chunks with element-level bounding boxes and confidence scores.
Slash command
/mixedbread-skills:mixedbread-parsing
Parse documents, extract structured content, and run OCR using the Parsing API. Supports PDFs, Word documents, PowerPoint presentations, and images.
Docs: https://www.mixedbread.com/docs/parsing/overview.md
Agent-readable docs: https://www.mixedbread.com/docs/llms.txt
Latest docs search: https://www.mixedbread.com/question?q=parsing&section=docs
pip install mixedbread # Python
npm install @mixedbread/sdk # TypeScript
export MXBAI_API_KEY=your_api_key
Python:
from mixedbread import Mixedbread

# The client picks up the MXBAI_API_KEY exported above
mxbai = Mixedbread()

# Upload and parse a document (waits for completion)
with open("report.pdf", "rb") as f:
    job = mxbai.parsing.jobs.upload_and_poll(
        file=f,
        return_format="markdown",
    )

for chunk in job.result.chunks:
    print(chunk.content)
TypeScript:
import Mixedbread from '@mixedbread/sdk';
import fs from 'fs';

const mxbai = new Mixedbread();

// Upload and parse a document (waits for completion)
const job = await mxbai.parsing.jobs.uploadAndPoll(
  fs.createReadStream('report.pdf'),
  { return_format: 'markdown' },
);

for (const chunk of job.result.chunks) {
  console.log(chunk.content);
}
- upload_and_poll() (uploads + creates job + polls)
- create_and_poll() (creates job + polls)
- upload() or create() then poll() separately
- fast mode. Fastest, lowest cost. Extracts text, structure, and layout.
- high_quality mode. Uses OCR. Extracts text with confidence scores, handles rotated/skewed pages, multi-column layouts.
- Set element_types to reduce processing time

Supported formats: PDF (.pdf), Word (.doc, .docx, .dotx, .docm, .dotm, .odt, .rtf), Slides (.ppt, .pptx, .ppsx, .pptm, .potm, .ppsm, .odp), Images (.jpeg, .png, .webp, .avif).
Element types: text, title, section-header, header, footer, page-number, list-item, figure, picture, table, form, footnote, caption, formula.
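Because unsupported file types are listed as a common cause of failed jobs, it can help to check a file locally before uploading it. The following is a minimal sketch, not part of the SDK: `is_supported` and `unknown_element_types` are hypothetical helper names, and the two sets are copied from the lists above rather than fetched from the API.

```python
from pathlib import Path

# Extensions copied from the supported-formats list above.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".doc", ".docx", ".dotx", ".docm", ".dotm", ".odt", ".rtf",
    ".ppt", ".pptx", ".ppsx", ".pptm", ".potm", ".ppsm", ".odp",
    ".jpeg", ".png", ".webp", ".avif",
}

# Element types copied from the list above.
ELEMENT_TYPES = {
    "text", "title", "section-header", "header", "footer", "page-number",
    "list-item", "figure", "picture", "table", "form", "footnote",
    "caption", "formula",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the supported list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def unknown_element_types(requested: list[str]) -> list[str]:
    """Return the requested element types that are not in the documented list."""
    return [t for t in requested if t not in ELEMENT_TYPES]
```

The check is only as current as the copied lists. For example, `.jpg` is not in the list as written, so this sketch rejects it.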
Filter for table elements to pull structured data from reports.
Python:
with open("financial-report.pdf", "rb") as f:
    job = mxbai.parsing.jobs.upload_and_poll(
        file=f,
        element_types=["table"],
        return_format="html",
        mode="high_quality",
    )

# Each chunk carries its elements; keep only the tables
for chunk in job.result.chunks:
    for element in chunk.elements:
        if element.type == "table":
            print(f"Page {element.page}, confidence {element.confidence:.2f}")
            print(element.content)
TypeScript:
const job = await mxbai.parsing.jobs.uploadAndPoll(
  fs.createReadStream('financial-report.pdf'),
  { element_types: ['table'], return_format: 'html', mode: 'high_quality' },
);

// Each chunk carries its elements; keep only the tables
for (const chunk of job.result.chunks) {
  for (const element of chunk.elements) {
    if (element.type === 'table') {
      console.log(`Page ${element.page}, confidence ${element.confidence.toFixed(2)}`);
      console.log(element.content);
    }
  }
}
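With return_format set to html, the table content printed above still needs to be turned into rows before it can be used as structured data. The sketch below does that with the standard library only. It assumes that `element.content` holds a plain HTML `<table>` string, which the examples suggest but do not confirm. `TableRows` and `table_to_rows` are hypothetical names, not SDK functions.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags, captions) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_rows(html: str) -> list[list[str]]:
    """Parse one HTML table into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

In the loop above this would be called as `table_to_rows(element.content)`. Nested tables and merged cells (rowspan, colspan) are not handled; a dedicated HTML library may be a better fit for those.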
Upload multiple files asynchronously, then poll all jobs:
Python:
import os

jobs = []
for filename in os.listdir("./documents"):
    if filename.endswith(".pdf"):
        # upload() creates the job without polling for completion
        with open(os.path.join("./documents", filename), "rb") as f:
            job = mxbai.parsing.jobs.upload(
                file=f,
                return_format="markdown",
            )
        jobs.append(job)

# Poll all jobs
for job in jobs:
    completed = mxbai.parsing.jobs.poll(job_id=job.id)
    print(f"{completed.filename}: {len(completed.result.chunks)} chunks")
TypeScript:
import { readdirSync, createReadStream } from 'fs';
import path from 'path';

const files = readdirSync('./documents').filter(f => f.endsWith('.pdf'));

// Start all uploads at once; each resolves when its job is created
const jobs = await Promise.all(
  files.map(f => mxbai.parsing.jobs.upload(
    createReadStream(path.join('./documents', f)),
    { return_format: 'markdown' },
  )),
);

// Poll all jobs
for (const job of jobs) {
  const completed = await mxbai.parsing.jobs.poll(job.id);
  console.log(`${completed.filename}: ${completed.result.chunks.length} chunks`);
}
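The polling loops above assume every job succeeds. In a larger batch, a single failed or timed-out job may otherwise stop the loop, so one option is to collect failures separately. The sketch below is one way to do that under stated assumptions: `collect_results` is a hypothetical helper, the polling call is passed in as a plain callable so the logic can be tried without the SDK, and the status string "failed" and the `error` attribute are taken from the troubleshooting notes on this page rather than from a verified response schema.

```python
from types import SimpleNamespace  # used only by the stand-in data below


def collect_results(jobs, poll):
    """Poll each job and split the outcomes into completed and failed.

    `poll` is any callable that takes a job id and returns the finished job,
    for example: lambda job_id: mxbai.parsing.jobs.poll(job_id=job_id)
    """
    completed, failed = [], []
    for job in jobs:
        try:
            done = poll(job.id)
        except Exception as exc:  # for example, a timeout raised while polling
            failed.append((job.id, str(exc)))
            continue
        # Assumption: a failed job reports status "failed" and an `error` field.
        if getattr(done, "status", None) == "failed":
            failed.append((job.id, str(getattr(done, "error", None))))
        else:
            completed.append(done)
    return completed, failed


# Stand-in jobs and poll function, so the sketch runs without network access.
def fake_poll(job_id):
    if job_id == "b":
        return SimpleNamespace(id="b", status="failed", error="unsupported file type")
    if job_id == "c":
        raise TimeoutError("poll timed out")
    return SimpleNamespace(id=job_id, status="completed", error=None)


demo_jobs = [SimpleNamespace(id="a"), SimpleNamespace(id="b"), SimpleNamespace(id="c")]
demo_completed, demo_failed = collect_results(demo_jobs, fake_poll)
```

Keeping the failure reasons next to the job ids makes it easier to decide which files are worth retrying.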
- Files in stores with parsing_strategy: "high_quality" automatically get OCR text (images), summaries (images), and transcriptions (audio & video) extracted. These are available as fields on search result chunks. There is no benefit to also running the Parsing API on the same file. Use the Parsing API only for standalone document extraction outside of stores.
- Use upload_and_poll() / create_and_poll() instead of manual polling loops. These methods handle backoff automatically. Manual while loops with retrieve() are fragile and waste API calls.
- Set element_types when you only need certain elements. Requesting all types increases processing time and response size. If you only need tables, set element_types to table only.
- Use fast mode for born-digital PDFs. The high_quality mode adds OCR overhead that provides no benefit when text is already selectable.
- Check confidence scores on OCR output. Low-confidence elements (< 0.5) may contain errors. Filter or flag them.
- Check job.error before retrying failed jobs. Common causes: unsupported file type, corrupt file, file too large. Blindly retrying wastes quota.
- Use content_to_embed for embedding pipelines. Each chunk provides both content (full text) and content_to_embed (optimized for embedding). Use the latter when feeding into vector stores outside Mixedbread.

| Symptom | Cause | Fix |
|---|---|---|
| Job stuck in pending | Queue is busy | Use poll() with a longer poll_timeout_ms. Check job status with retrieve(). |
| Job status failed | Unsupported file type, corrupt file, or file too large | Check job.error for details. Verify file format is supported. |
| Empty chunks in result | File has no extractable content (blank pages) | Verify the file has content. Try high_quality mode for scanned documents. |
| Low confidence scores | Scanned or low-resolution source | Use high_quality mode for better OCR accuracy. |
| Missing tables or figures | Element types not requested | Set element_types to include table and figure explicitly. |
| upload_and_poll() timeout | Very large document or slow processing | Increase poll_timeout_ms, or use upload() + poll() separately for more control. |
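The advice on confidence scores and content_to_embed can be sketched as two small post-processing helpers. This is an illustration rather than SDK code: `split_by_confidence` and `embedding_texts` are hypothetical names, the 0.5 threshold is the one mentioned on this page, and the attribute names (`elements`, `confidence`, `content`, `content_to_embed`) follow the examples here. The stand-in objects at the end exist only so the sketch runs without a parsed job.

```python
from types import SimpleNamespace  # used only by the stand-in data below


def split_by_confidence(chunks, threshold=0.5):
    """Separate elements into trusted and flagged lists by confidence score."""
    trusted, flagged = [], []
    for chunk in chunks:
        for element in chunk.elements:
            confidence = getattr(element, "confidence", None)
            # Elements without a score are kept as trusted rather than dropped.
            if confidence is not None and confidence < threshold:
                flagged.append(element)
            else:
                trusted.append(element)
    return trusted, flagged


def embedding_texts(chunks):
    """Prefer content_to_embed for each chunk, falling back to content."""
    return [getattr(c, "content_to_embed", None) or c.content for c in chunks]


# Stand-in chunks shaped like the parsed results used in the examples above.
demo_chunks = [
    SimpleNamespace(
        content="Total revenue: 10",
        content_to_embed="Revenue table. Total revenue: 10",
        elements=[
            SimpleNamespace(type="table", confidence=0.93),
            SimpleNamespace(type="text", confidence=0.31),
        ],
    ),
    SimpleNamespace(
        content="Appendix",
        content_to_embed=None,
        elements=[SimpleNamespace(type="title", confidence=0.88)],
    ),
]
demo_trusted, demo_flagged = split_by_confidence(demo_chunks)
demo_texts = embedding_texts(demo_chunks)
```

With a real job, the same helpers would be called on `job.result.chunks`. Whether flagged elements are dropped or sent for review depends on the pipeline.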
npx claudepluginhub mixedbread-ai/skills --plugin mixedbread-search