MinerU Guide: Parse PDFs, Office Files, and Images into RAG-Ready Markdown/JSON

MinerU turns complex documents into Markdown and JSON that LLM apps can consume reliably—built for RAG, knowledge bases, and agent data prep.

At a Glance

opendatalab/MinerU is a document parsing tool for LLM data preparation. It converts PDFs, images, DOCX, PPTX, XLSX, and similar inputs into Markdown, JSON, and intermediate structured results for RAG, information extraction, knowledge bases, or agent workflows.

Project and docs:

Problems It Fits Best

MinerU shines when you need to:

Parse papers, reports, contracts, and manuals into Markdown.
Prepare cleaner chunking input for RAG knowledge bases.
Extract text, tables, and formulas from scanned PDFs or images.
Normalize DOCX, PPTX, and XLSX into structured data.
Batch-process documents locally or in private environments.
Feed LangChain, LlamaIndex, Dify, RAGFlow, FastGPT, and similar stacks.

For simple, text-native PDFs, plain extractors may suffice. MinerU pays off with complex layouts, tables, formulas, multi-format inputs, and production-scale document pipelines.

Core Capabilities

Per the README, MinerU accepts PDF, images, DOCX, PPTX, and XLSX and outputs Markdown, reading-order JSON, and visualizations for quality checks.

Highlights:

Strip headers, footers, footnotes, page numbers, and noise.
Emit text in human reading order—single column, multi-column, complex layouts.
Preserve headings, paragraphs, lists, and document structure.
Extract figures, captions, tables, table titles, and footnotes.
Recognize formulas and emit LaTeX.
Recognize tables and emit HTML.
Auto-detect scanned and garbled PDFs and enable OCR.
OCR across 109 languages.
CLI, FastAPI, Gradio WebUI, and mineru-router.

Version 3.1.0 (April 2026) added native PPTX/XLSX parsing and upgraded the main VLM to MinerU2.5-Pro-2604-1.2B. Version 3.2.3 (June 4, 2026) added sub/superscript detection and post-OCR fallback.

Installation

For local trials, the docs recommend uv plus the full package:

pip install --upgrade pip
pip install uv
uv pip install -U "mineru[all]"

From source:

git clone https://github.com/opendatalab/MinerU.git
cd MinerU
uv pip install -e .[all]

mineru[all] ships core features on Windows, Linux, and macOS. Parsing is sensitive to GPU, inference runtimes, Python versions, and OS details—validate on small samples before production.

First Parse

Basic usage—input and output paths:

mineru -p <input_path> -o <output_path>

Without GPU acceleration, use the pipeline backend on CPU:

mineru -p <input_path> -o <output_path> -b pipeline

<input_path> can be a file or directory. Start with a small sample set:

mineru -p ./samples -o ./output -b pipeline

Review quality, latency, memory, and output layout before scaling to the full corpus.

Using the Output

RAG

Feed Markdown into chunking and embedding so headings, paragraphs, lists, tables, and formulas keep semantic structure. Structured Markdown chunks and cites better than flat OCR blobs.

Information extraction

JSON and intermediate artifacts suit scripts that pull tables, formulas, captions, or sections—more stable than raw text for report, paper, or contract field automation.

Human review

Layout and span visualizations help spot missing content, wrong order, or broken tables. Sample these before large batches.

Backend Options

Main routes in the docs:

pipeline: broad compatibility on CPU or GPU—good for trials and routine batch jobs.
vlm-engine: higher accuracy, heavier hardware—complex documents and quality-first parsing.
hybrid-engine: native text extraction plus high-accuracy parsing—less hallucination on hard layouts.
*-http-client: OpenAI-compatible APIs for local or remote inference.

Start with pipeline to validate. Move to VLM or hybrid after you know document types, quality bars, and volume. For internal docs, match backend choice to data residency rules.

Deployment Modes

CLI, local API, Gradio WebUI, Docker, and mineru-router.

Solo trial: CLI.
Non-technical users: Gradio WebUI.
System integration: FastAPI / REST.
Multi-service, multi-GPU, high concurrency: mineru-router.
Lower env friction on Linux/WSL2: Docker.

Docker targets Linux and WSL2 Windows; macOS users often prefer pip/uv installs.

vs Plain OCR

Plain OCR asks “what characters are in the image?” RAG also needs paragraph order, heading hierarchy, table structure, formula notation, figure context, and traceability.

MinerU is document-understanding preprocessing—layout analysis, reading order, table HTML, formula LaTeX, multi-format input, structured output—not just character recognition.

Light OCR or text extraction may win on simple invoices, single-page images, or text PDFs. MinerU fits when layout complexity already hurts downstream quality.

vs PaddleOCR, Marker, Unstructured

Overlap exists; entry points differ:

PaddleOCR: OCR building blocks for custom pipelines.
Marker: PDF → Markdown for readable conversion.
Unstructured: enterprise document ETL into search/retrieval flows.
MinerU: LLM/RAG/agent data prep—complex layouts, tables, formulas, multi-format input, VLM + OCR dual engines, private deployment.

If your corpus is papers, reports, textbooks, decks, and spreadsheets heading into LLM apps, MinerU deserves a dedicated trial.

Batch Processing Tips

Before full-scale runs:

Pick 10–20 representative docs—scans, hard tables, multi-column papers, PPT, Excel.
Parse with pipeline; log time, memory, output size, failures.
Spot-check Markdown, JSON, and visuals—order, tables, formulas, captions.
Retry weak samples with VLM or hybrid backends.
Lock output schema, then wire RAG chunking, embedding, and citation.

Don’t dump the entire library on day one. Failures are specific—scan type, table style, font, language direction, cross-page content. Find boundaries first, then scale.

Privacy and Compliance

For internal docs, customer data, contracts, financials, or unpublished research, map data flows before processing.

Check:

Whether content leaves the network to external model services.
Local vs remote vs OpenAI-compatible inference.
Whether temp files hold full text, images, tables, and sensitive fields.
Whether Markdown/JSON lands in logs, object storage, or shared folders.
Whether failed batches get uploaded to issues, forums, or third-party debug tools.

Private/offline deployment is supported—not every config is offline by default. Draw the path from input → temp dirs → inference → output → logs.

When to Skip It

Consider skipping MinerU when:

Simple PDF text extraction is enough.
You need a one-off read of a few pages, not structured output.
Hardware can’t justify parse cost.
OCR quality requires heavy manual cleanup.
Private data can’t enter the current inference path.
There’s no clear RAG, extraction, or knowledge-base consumer yet.

Parsing should serve a downstream workflow—not “parse for parsing’s sake.” Align sample output with consumers before bulk investment.

Summary

MinerU converts complex documents into Markdown and JSON for LLM applications—PDFs, images, Office files, tables, formulas, OCR, multilingual recognition, and local deployment, especially for RAG, knowledge bases, and agent pipelines.

A safe path: try online or small local samples, run pipeline end to end, then upgrade to VLM, hybrid, API, or multi-service deployment based on accuracy and throughput needs.