Doc Extract

Local Document Parsing Agent

About

Parses unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally. It is fast and lightweight, with no cloud dependencies and no LLM required. It uses built-in OCR or connects to an external OCR server for higher accuracy.

Personality

“I parse documents locally — fast, private, no cloud. PDF, DOCX, PPTX, images — if it has text, I extract it.”

Tools

Node.js CLI, OCR engine, LibreOffice, ImageMagick

Skills

■PDF text extraction with OCR

■DOCX / PPTX / XLSX parsing

■Image text extraction

■Batch directory parsing

■Page screenshot generation

■Bounding box JSON output

■Custom OCR server integration

■Config file workflows

Agent files

Doc Extract Skill

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally: fast, lightweight, no cloud dependencies or LLM required.

Initial Setup

When this skill is invoked, respond with:

I'm ready to parse files locally. Before we begin, please confirm that:

- The document parsing CLI is installed and available in your terminal
- Required dependencies are set up (LibreOffice for Office docs, ImageMagick for images)

If both are set, please provide:

1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.

I will produce the appropriate CLI command or script, and once approved, report the results.

Then wait for the user's input.
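
The two prerequisites in the setup message can be verified from the terminal before any parsing command is proposed. The following is a minimal sketch, not part of the CLI: `missing_deps` is a hypothetical helper, and `soffice` is an assumed binary name for LibreOffice that may differ on your platform.

```bash
# Print each required command that is not on PATH, one per line.
# Returns 0 when everything is present, 1 otherwise.
# (missing_deps is a hypothetical helper, not part of the parsing CLI.)
missing_deps() {
  rc=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      rc=1
    fi
  done
  return "$rc"
}

# Example (soffice is an assumed name for the LibreOffice binary):
#   missing_deps lit soffice || echo "install the tools listed above" >&2
```

For image inputs, add `magick` (ImageMagick 7) or `convert` (ImageMagick 6 and earlier) to the argument list, depending on which one your installation provides.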

Core Capabilities

Parse a Single File

```bash
# Basic text extraction
lit parse document.pdf

# JSON output saved to a file
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
```

Batch Parse a Directory

```bash
lit batch-parse ./input-directory ./output-directory

# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
```
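
When each file needs its own `lit parse` options (for example JSON output per document), a plain loop is one alternative to `batch-parse`. This is a sketch that uses only the `lit parse` flags shown earlier. `json_path_for` and `parse_dir_to_json` are hypothetical helper names, and the one-JSON-file-per-input layout is an assumption about what you want, not something the CLI prescribes.

```bash
# Derive an output path: ./in/report.pdf + ./out -> ./out/report.json
json_path_for() {
  name=$(basename "$1")
  printf '%s/%s.json\n' "$2" "${name%.*}"
}

# Parse every PDF directly inside in_dir to JSON, one output file per input.
parse_dir_to_json() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.pdf; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    lit parse "$f" --format json -o "$(json_path_for "$f" "$out_dir")"
  done
}

# Example:
#   parse_dir_to_json ./input ./output
```

The loop is not recursive and handles only `.pdf` files; for whole directory trees the `batch-parse` command above is likely the simpler choice.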

Generate Page Screenshots

```bash
lit screenshot document.pdf -o ./screenshots
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots
```

Supported Formats

| Category | Formats |
| --- | --- |
| PDF | .pdf |
| Word | .doc, .docx, .docm, .odt, .rtf |
| PowerPoint | .ppt, .pptx, .pptm, .odp |
| Spreadsheets | .xls, .xlsx, .xlsm, .ods, .csv, .tsv |
| Images | .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg |
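
The format list can be turned into a lookup that a script consults before calling the CLI, for example to decide whether LibreOffice or ImageMagick will be needed. This is a sketch: `format_category` is a hypothetical helper, and the names it prints are simply the category labels listed above.

```bash
# Map a file name to its category from the supported-formats list.
# Prints "unsupported" and returns 1 for anything not listed.
format_category() {
  # Lowercase the extension so REPORT.PDF and report.pdf match alike.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf) echo "PDF" ;;
    doc|docx|docm|odt|rtf) echo "Word" ;;
    ppt|pptx|pptm|odp) echo "PowerPoint" ;;
    xls|xlsx|xlsm|ods|csv|tsv) echo "Spreadsheets" ;;
    jpg|jpeg|png|gif|bmp|tiff|webp|svg) echo "Images" ;;
    *) echo "unsupported"; return 1 ;;
  esac
}

# Example:
#   format_category slides.pptx
```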

Limitations

■Requires Node 18+

■Office documents require LibreOffice installed

■Image parsing requires ImageMagick

■No cloud parsing — local only

■OCR accuracy depends on document quality
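
The Node version requirement can be checked up front. A minimal sketch, assuming the version string has the `v18.17.0` shape that `node --version` prints; `node_major` and `node_ok` are hypothetical helper names.

```bash
# Extract the major version from a string such as "v18.17.0".
node_major() {
  v=${1#v}
  printf '%s\n' "${v%%.*}"
}

# Succeed when the given version string is 18 or newer.
node_ok() {
  [ "$(node_major "$1")" -ge 18 ]
}

# Example:
#   node_ok "$(node --version)" || echo "Node 18+ is required" >&2
```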
