Doc Extract Skill
Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally. Parsing is fast and lightweight, with no cloud dependencies and no LLM required.
Initial Setup
When this skill is invoked, respond with:
I'm ready to parse files locally. Before we begin, please confirm that:
- The document parsing CLI is installed and available in your terminal
- Required dependencies are set up (LibreOffice for Office docs, ImageMagick for images)
If both are in place, please provide:
1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.
I will then propose the appropriate CLI command or script and, once you approve it, run it and report the results.
Then wait for the user's input.
Core Capabilities
Parse a Single File
bash
# Basic text extraction
lit parse document.pdf
# JSON output saved to a file
lit parse document.pdf --format json -o output.json
# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"
# Disable OCR (faster; for PDFs that already contain a text layer)
lit parse document.pdf --no-ocr
# Higher DPI for better quality
lit parse document.pdf --dpi 300
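The options above are shown one at a time. The following is a minimal sketch of combining them, assuming `--format`, `-o`, and `--target-pages` can be used together in a single call (the examples above do not confirm this). The function name `parse_to_json` and the `DRY_RUN` variable are hypothetical helpers for illustration, not part of the CLI.

```bash
# Hypothetical wrapper: parse one file to JSON next to the input,
# optionally limited to a page range.
# Assumption: --format, -o and --target-pages may be combined.
# Assumption: the input file name has an extension (e.g. report.pdf).
parse_to_json() {
  input=$1
  pages=${2:-}
  output="${input%.*}.json"        # report.pdf -> report.json

  # Build the argument list in the positional parameters.
  set -- lit parse "$input" --format json -o "$output"
  if [ -n "$pages" ]; then
    set -- "$@" --target-pages "$pages"
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"             # print the command instead of running it
  else
    "$@"
  fi
}
```

Setting `DRY_RUN=1` before calling `parse_to_json report.pdf "1-5,10"` prints the command for review without running it, which suits the approve-then-run flow described under Initial Setup.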
Batch Parse a Directory
bash
# Parse the files in ./input-directory and write results to ./output-directory
lit batch-parse ./input-directory ./output-directory
# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
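The example above filters on a single extension. To cover several file types, one option is to run one pass per extension. This is only a sketch: it assumes `--extension` takes one value per invocation (the source does not say whether multiple values are accepted), and `batch_by_extension` and `DRY_RUN` are hypothetical names.

```bash
# Hypothetical helper: run `lit batch-parse` once per extension,
# writing each pass to its own subdirectory (e.g. ./output/pdf).
# Assumption: --extension accepts a single extension per call.
batch_by_extension() {
  in_dir=$1
  out_dir=$2
  shift 2
  for ext in "$@"; do
    target="$out_dir/${ext#.}"     # ".pdf" -> "<out_dir>/pdf"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf '%s\n' "lit batch-parse $in_dir $target --extension $ext --recursive"
    else
      mkdir -p "$target"
      lit batch-parse "$in_dir" "$target" --extension "$ext" --recursive
    fi
  done
}
```

For example, `batch_by_extension ./input ./output .pdf .docx` would run two passes, one writing to `./output/pdf` and one to `./output/docx`.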
Generate Page Screenshots
bash
# Save page screenshots to ./screenshots
lit screenshot document.pdf -o ./screenshots
# Only pages 1, 3, and 5
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots
# PNG output at 300 DPI
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots
Supported Formats
Category      Formats
PDF           .pdf
Word          .doc, .docx, .docm, .odt, .rtf
PowerPoint    .ppt, .pptx, .pptm, .odp
Spreadsheets  .xls, .xlsx, .xlsm, .ods, .csv, .tsv
Images        .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg
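Before building a command, it can help to check which category a file falls into and which external dependency it is likely to need. The sketch below mirrors the table above. The function name `doc_category` is hypothetical, and the dependency mapping is an assumption: it treats every Word, PowerPoint, and Spreadsheets entry (including .csv and .tsv) as an Office document that needs LibreOffice, which the source does not state explicitly.

```bash
# Hypothetical helper: print "<category> <dependency>" for a file name.
# Returns non-zero for extensions that are not in the table.
doc_category() {
  case "$1" in
    *.*) ;;                                   # has an extension, continue
    *) echo "unsupported none"; return 1 ;;
  esac
  # Lowercase the extension so REPORT.PDF and report.pdf match alike.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf)                                echo "pdf none" ;;
    doc|docx|docm|odt|rtf)              echo "word libreoffice" ;;
    ppt|pptx|pptm|odp)                  echo "powerpoint libreoffice" ;;
    xls|xlsx|xlsm|ods|csv|tsv)          echo "spreadsheet libreoffice" ;;
    jpg|jpeg|png|gif|bmp|tiff|webp|svg) echo "image imagemagick" ;;
    *) echo "unsupported none"; return 1 ;;
  esac
}
```

A caller could use the second word of the output to decide which prerequisite to confirm with the user before parsing.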
Limitations
- Requires Node.js 18+
- Office documents require LibreOffice to be installed
- Image parsing requires ImageMagick
- No cloud parsing; everything runs locally
- OCR accuracy depends on document quality
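The requirements listed above can be checked from the terminal before parsing. This is a rough sketch, not an official check: the binary names are assumptions (LibreOffice commonly installs `soffice` or `libreoffice`, ImageMagick installs `magick` in version 7 or `convert` in version 6), and `have_any` and `check_prereqs` are hypothetical helper names.

```bash
# Succeeds if at least one of the given commands is on PATH.
have_any() {
  for bin in "$@"; do
    if command -v "$bin" >/dev/null 2>&1; then
      return 0
    fi
  done
  return 1
}

# Prints one line per missing prerequisite; returns 1 if any is missing.
# Binary names below are assumptions and may differ per platform.
check_prereqs() {
  missing=0
  have_any lit || { echo "missing: lit CLI"; missing=1; }
  have_any soffice libreoffice || { echo "missing: LibreOffice (Office docs)"; missing=1; }
  have_any magick convert || { echo "missing: ImageMagick (images)"; missing=1; }
  if have_any node; then
    # `node --version` prints e.g. "v18.17.0"; keep the major version only.
    major=$(node --version | sed 's/^v//' | cut -d. -f1)
    [ "$major" -ge 18 ] || { echo "Node.js $major found, 18+ required"; missing=1; }
  else
    echo "missing: node"
    missing=1
  fi
  return "$missing"
}
```

A typical use would be `check_prereqs && lit parse document.pdf`, so that parsing only starts when nothing is reported missing.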