Doc Extract

Local Document Parsing Agent

About

Parses unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally. Fast, lightweight, no cloud dependencies or LLM required. Uses built-in OCR or connects to external OCR servers for higher accuracy.

Personality

I parse documents locally — fast, private, no cloud. PDF, DOCX, PPTX, images — if it has text, I extract it.

Tools
Node.js CLI
OCR engine
LibreOffice
ImageMagick
Skills
PDF text extraction with OCR
DOCX / PPTX / XLSX parsing
Image text extraction
Batch directory parsing
Page screenshot generation
Bounding box JSON output
Custom OCR server integration
Config file workflows

Doc Extract Skill

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally: fast, lightweight, no cloud dependencies or LLM required.

Initial Setup

When this skill is invoked, respond with:

I'm ready to parse files locally. Before we begin, please confirm that:

- The document parsing CLI (the lit command) is installed and available in your terminal
- Required dependencies are set up (LibreOffice for Office docs, ImageMagick for images)

If both are set, please provide:

1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.

I will produce the appropriate CLI command or script and, once you approve it, run it and report the results.

Then wait for the user's input.
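
The two prerequisites in the prompt above can also be checked from the terminal before any file is parsed. A minimal sketch follows; `missing_tools` is a hypothetical helper, and the binary names `soffice` (LibreOffice) and `magick` (ImageMagick 7) are assumptions that vary by platform and version.

```bash
# Print the name of every command in the argument list that is not on PATH.
# Returns 0 when nothing is missing, 1 otherwise.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example call (binary names are assumptions; adjust for your platform):
# missing_tools lit soffice magick
```

An empty result means every listed tool was found. Anything printed is a tool to install before parsing the matching file types.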

Core Capabilities

Parse a Single File

```bash
# Basic text extraction
lit parse document.pdf

# JSON output saved to a file
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
```
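
To see which pages a spec such as "1-5,10,15-20" would select, the sketch below expands it into individual page numbers. It assumes pages are 1-based and ranges are inclusive, which this document does not state; `expand_pages` is a hypothetical helper and not part of the CLI.

```bash
# Expand a page spec such as "1-3,5" into a space-separated list of pages.
# Assumes 1-based pages and inclusive ranges. Returns 1 on malformed input.
expand_pages() {
  spec=$1
  pages=""
  # Split on commas, then restore the caller's IFS.
  old_ifs=$IFS; IFS=,
  set -- $spec
  IFS=$old_ifs
  [ "$#" -gt 0 ] || return 1
  for part in "$@"; do
    case $part in
      # Reject empty parts, stray characters, malformed ranges, leading zeros.
      ''|*[!0-9-]*|-*|*-|*-*-*|0*|*-0*) return 1 ;;
      *-*) first=${part%-*}; last=${part#*-} ;;
      *)   first=$part; last=$part ;;
    esac
    [ "$first" -le "$last" ] || return 1
    page=$first
    while [ "$page" -le "$last" ]; do
      pages="$pages$page "
      page=$((page + 1))
    done
  done
  printf '%s\n' "${pages% }"
}
```

Running a spec through `expand_pages` before passing it to `--target-pages` is a cheap way to catch typos such as a reversed range.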

Batch Parse a Directory

```bash
# Parse every supported file in a directory
lit batch-parse ./input-directory ./output-directory

# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
```
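
When each file needs its own options (for example, one JSON file per document), one alternative is to loop over the files and call `lit parse` directly, using only the flags shown under Parse a Single File. The sketch below is an illustration, not the behavior of `batch-parse`; `mirror_path` and `parse_tree` are hypothetical helpers.

```bash
# Map an input file to its mirrored output path, swapping the extension for .json.
# mirror_path ./input ./output ./input/reports/q1.pdf  prints  ./output/reports/q1.json
mirror_path() {
  in_root=${1%/}
  out_root=${2%/}
  file=$3
  rel=${file#"$in_root"/}
  printf '%s/%s.json\n' "$out_root" "${rel%.*}"
}

# Parse every PDF under the input root, writing one JSON file per document.
parse_tree() {
  find "$1" -type f -name '*.pdf' | while IFS= read -r file; do
    out=$(mirror_path "$1" "$2" "$file")
    mkdir -p "$(dirname "$out")"
    # Redirect stdin so the CLI cannot consume the file list piped into the loop.
    lit parse "$file" --format json -o "$out" </dev/null
  done
}

# Usage: parse_tree ./input ./output
```

Keeping the path mapping in its own function lets it be checked without running the parser.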

Generate Page Screenshots

```bash
# Screenshot every page
lit screenshot document.pdf -o ./screenshots

# Selected pages only
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

# 300 DPI, PNG output
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots
```

Supported Formats

| Category | Formats |
|---|---|
| PDF | .pdf |
| Word | .doc, .docx, .docm, .odt, .rtf |
| PowerPoint | .ppt, .pptx, .pptm, .odp |
| Spreadsheets | .xls, .xlsx, .xlsm, .ods, .csv, .tsv |
| Images | .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg |
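
The table above can double as a pre-check that filters out unsupported files before they reach the CLI. A minimal sketch, where `format_category` is a hypothetical helper that mirrors the table and matching extensions case-insensitively is an assumption:

```bash
# Print the table category for a file name, or return 1 if the extension is not listed.
format_category() {
  case $1 in
    *.*) ;;            # has an extension
    *) return 1 ;;     # no extension at all
  esac
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case $ext in
    pdf) echo "PDF" ;;
    doc|docx|docm|odt|rtf) echo "Word" ;;
    ppt|pptx|pptm|odp) echo "PowerPoint" ;;
    xls|xlsx|xlsm|ods|csv|tsv) echo "Spreadsheets" ;;
    jpg|jpeg|png|gif|bmp|tiff|webp|svg) echo "Images" ;;
    *) return 1 ;;
  esac
}

# Usage: format_category "$file" >/dev/null || echo "skipping $file" >&2
```

The category also indicates which dependency a file will need, per the Limitations section: LibreOffice for Office documents and ImageMagick for images.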

Limitations

Requires Node.js 18 or later
Office documents require a local LibreOffice installation
Image parsing requires ImageMagick
No cloud parsing — local only
OCR accuracy depends on document quality
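
The Node.js requirement can be verified up front. The sketch below compares the major version in the string printed by `node --version` (for example `v18.17.0`) against 18; `node_version_ok` is a hypothetical helper.

```bash
# Succeed when a version string such as "v18.17.0" has a major version of at least 18.
node_version_ok() {
  major=${1#v}          # drop the leading "v"
  major=${major%%.*}    # keep everything before the first dot
  case $major in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
  esac
  [ "$major" -ge 18 ]
}

# Usage: node_version_ok "$(node --version)" || echo "Node.js 18 or later is required" >&2
```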