TIME WAIT BLOG.
#RAG June 7, 2026 11 MIN READ

MinerU Guide: Parse PDFs, Office Files, and Images into RAG-Ready Markdown/JSON

opendatalab/MinerU converts PDFs, Office files, and images to Markdown/JSON with tables, formulas, OCR, and on-prem deployment for RAG pipelines.

MinerU turns complex documents into Markdown and JSON that LLM apps can consume reliably—built for RAG, knowledge bases, and agent data prep.

At a Glance

opendatalab/MinerU is a document parsing tool for LLM data preparation. It converts PDFs, images, DOCX, PPTX, XLSX, and similar inputs into Markdown, JSON, and intermediate structured results for RAG, information extraction, knowledge bases, or agent workflows.

Project and docs: https://github.com/opendatalab/MinerU (the repository README links to the documentation).

Problems It Fits Best

MinerU shines when you need to:

For simple, text-native PDFs, plain extractors may suffice. MinerU pays off with complex layouts, tables, formulas, multi-format inputs, and production-scale document pipelines.

Core Capabilities

Per the README, MinerU accepts PDF, images, DOCX, PPTX, and XLSX and outputs Markdown, reading-order JSON, and visualizations for quality checks.

Highlights:

Version 3.1.0 (April 2026) added native PPTX/XLSX parsing and upgraded the main VLM to MinerU2.5-Pro-2604-1.2B. Version 3.2.3 (June 4, 2026) added sub/superscript detection and post-OCR fallback.

Installation

For local trials, the docs recommend uv plus the full package:

pip install --upgrade pip
pip install uv
uv pip install -U "mineru[all]"

From source:

git clone https://github.com/opendatalab/MinerU.git
cd MinerU
uv pip install -e ".[all]"

mineru[all] ships core features on Windows, Linux, and macOS. Parsing is sensitive to GPU, inference runtimes, Python versions, and OS details—validate on small samples before production.
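Because results depend on the environment, it helps to record a few facts next to every trial run. The following is a minimal sketch using only the Python standard library; the function name `environment_snapshot` is made up for illustration, and it only checks whether a `mineru` executable is on PATH, not whether a GPU or inference runtime is usable.

```python
import platform
import shutil
from typing import Dict, Optional


def environment_snapshot(cli_name: str = "mineru") -> Dict[str, Optional[str]]:
    """Collect environment facts worth storing alongside a test run."""
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "machine": platform.machine(),
        # shutil.which returns None when the executable is not on PATH.
        "cli_path": shutil.which(cli_name),
    }


# Usage: print the snapshot before a trial and keep it with the results.
# for key, value in environment_snapshot().items():
#     print(f"{key}: {value}")
```

Keeping this snapshot with each batch of sample output makes it easier to explain later why two machines produced different results.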

First Parse

Basic usage—input and output paths:

mineru -p <input_path> -o <output_path>

Without GPU acceleration, use the pipeline backend on CPU:

mineru -p <input_path> -o <output_path> -b pipeline

<input_path> can be a file or directory. Start with a small sample set:

mineru -p ./samples -o ./output -b pipeline

Review quality, latency, memory, and output layout before scaling to the full corpus.
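If you prefer to drive the CLI from a script, the commands above can be wrapped with `subprocess`. This is a sketch, not an official MinerU Python API: it only uses the `-p`, `-o`, and `-b` flags shown in this guide, the helper names `build_command` and `parse` are hypothetical, and whether a non-zero exit code reliably signals a failed parse is an assumption you should verify on your own samples.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(input_path: str, output_path: str,
                  backend: Optional[str] = None) -> List[str]:
    """Build the argument list for the CLI call shown in this guide."""
    cmd = ["mineru", "-p", input_path, "-o", output_path]
    if backend:
        cmd += ["-b", backend]
    return cmd


def parse(input_path: str, output_path: str,
          backend: Optional[str] = "pipeline") -> int:
    """Run one parse and return the process exit code."""
    # Creating the output directory up front is harmless if the CLI also does it.
    Path(output_path).mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(build_command(input_path, output_path, backend))
    return completed.returncode


# Usage (requires the mineru CLI to be installed):
# exit_code = parse("./samples", "./output")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.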

Using the Output

RAG

Feed Markdown into chunking and embedding so headings, paragraphs, lists, tables, and formulas keep semantic structure. Structured Markdown is easier to chunk and cite than flat OCR output.
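One way to keep that structure is to split on headings and carry the heading trail with each chunk. The sketch below is a generic Markdown splitter, not part of MinerU; it assumes ATX-style headings (`#` to `######`) and ignores heading-like lines inside fenced code blocks. Real pipelines usually add a size limit per chunk on top of this.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> List[Dict[str, object]]:
    """Split Markdown into one chunk per heading, keeping the heading path."""
    chunks: List[Dict[str, object]] = []
    stack: List[Tuple[int, str]] = []   # (level, title) of the open headings
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in stack], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous chunk under the old heading path
            level = len(match.group(1))
            # Pop headings at the same or deeper level before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the heading path as metadata lets a retriever show where a passage came from, which is what makes citations traceable.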

Information extraction

JSON and intermediate artifacts suit scripts that pull tables, formulas, captions, or sections. They are more stable than raw text when automating field extraction from reports, papers, or contracts.
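As an illustration, a script might group parsed blocks by kind before extracting fields. The schema here is an assumption made for the example, not MinerU's documented format: it treats the JSON as an array of objects that each carry a `type` field. Inspect your own output files and adjust the key names before relying on this.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


def load_blocks(json_path: str) -> List[Dict[str, Any]]:
    """Load a JSON file that is assumed (hypothetically) to be an array of blocks."""
    with open(json_path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of blocks; adjust for your schema")
    return data


def group_blocks_by_type(blocks: List[Dict[str, Any]],
                         type_key: str = "type") -> Dict[str, List[Dict[str, Any]]]:
    """Group blocks by their kind; `type_key` is an assumed field name."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for block in blocks:
        grouped[block.get(type_key, "unknown")].append(block)
    return dict(grouped)
```

Blocks without the expected key land under `unknown`, which is a quick way to notice that the assumed schema does not match the real output.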

Human review

Layout and span visualizations help spot missing content, wrong order, or broken tables. Sample these before large batches.
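To make spot checks repeatable, the review sample can be drawn with a fixed seed so that two reviewers look at the same files. A small sketch follows; the function name `pick_review_sample` and the example paths are illustrative only.

```python
import random
from typing import List


def pick_review_sample(paths: List[str], k: int, seed: int = 0) -> List[str]:
    """Return up to k paths, chosen reproducibly and sorted for a stable order."""
    ordered = sorted(paths)          # sort first so input order does not matter
    if k >= len(ordered):
        return ordered
    rng = random.Random(seed)        # local generator, leaves global state alone
    return sorted(rng.sample(ordered, k))


# Usage: review_set = pick_review_sample(all_output_files, k=10, seed=42)
```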

Backend Options

Main routes in the docs:

Start with pipeline to validate. Move to VLM or hybrid after you know document types, quality bars, and volume. For internal docs, match backend choice to data residency rules.

Deployment Modes

The options are CLI, local API, Gradio WebUI, Docker, and mineru-router.

Docker targets Linux and Windows with WSL2; macOS users often prefer pip/uv installs.

vs Plain OCR

Plain OCR asks “what characters are in the image?” RAG also needs paragraph order, heading hierarchy, table structure, formula notation, figure context, and traceability.

MinerU is document-understanding preprocessing—layout analysis, reading order, table HTML, formula LaTeX, multi-format input, structured output—not just character recognition.
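Because tables arrive as HTML rather than as loose text, downstream code can recover rows and cells directly. The sketch below uses Python's standard `html.parser` to pull every `<table>` out of a Markdown string; it is a generic helper rather than a MinerU function, and it ignores `rowspan`/`colspan` and nested tables.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collect cell text from every <table> in a document, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[List[List[str]]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(markdown: str) -> List[List[List[str]]]:
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(markdown)
    parser.close()
    return parser.tables
```

Plain OCR output offers no equivalent: once cell boundaries are lost, they have to be guessed from whitespace.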

Light OCR or text extraction may win on simple invoices, single-page images, or text PDFs. MinerU fits when layout complexity already hurts downstream quality.

vs PaddleOCR, Marker, Unstructured

Overlap exists; entry points differ:

If your corpus is papers, reports, textbooks, decks, and spreadsheets heading into LLM apps, MinerU deserves a dedicated trial.

Batch Processing Tips

Before full-scale runs:

  1. Pick 10–20 representative docs—scans, hard tables, multi-column papers, PPT, Excel.
  2. Parse with pipeline; log time, memory, output size, failures.
  3. Spot-check Markdown, JSON, and visuals—order, tables, formulas, captions.
  4. Retry weak samples with VLM or hybrid backends.
  5. Lock output schema, then wire RAG chunking, embedding, and citation.
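Steps 2 and 4 can be sketched as a small trial harness that runs each document, records timing, exit code, and output size, and lists the samples worth retrying. The assumptions: only the `-p`, `-o`, and `-b` flags from this guide are used, a non-zero exit code or empty output is treated as a failure, memory is not measured, and names such as `trial_run` and `weak_samples` are hypothetical. The `runner` parameter exists so the logic can be exercised without the CLI installed.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, List

Runner = Callable[[List[str]], int]


def default_runner(cmd: List[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(cmd).returncode


def output_size(directory: Path) -> int:
    """Total size in bytes of all files under a directory (0 if missing)."""
    if not directory.exists():
        return 0
    return sum(f.stat().st_size for f in directory.rglob("*") if f.is_file())


def trial_run(docs: Iterable[str], output_root: str, backend: str = "pipeline",
              runner: Runner = default_runner) -> List[Dict[str, object]]:
    """Parse each document separately and log time, exit code, and output size."""
    results: List[Dict[str, object]] = []
    for doc in docs:
        out_dir = Path(output_root) / Path(doc).stem
        cmd = ["mineru", "-p", str(doc), "-o", str(out_dir), "-b", backend]
        started = time.perf_counter()
        try:
            code = runner(cmd)
        except OSError:
            code = -1  # for example, the CLI is not installed
        results.append({
            "doc": str(doc),
            "backend": backend,
            "exit_code": code,
            "seconds": round(time.perf_counter() - started, 2),
            "output_bytes": output_size(out_dir),
        })
    return results


def weak_samples(results: List[Dict[str, object]]) -> List[str]:
    """Documents that failed or produced no output, as candidates for a retry."""
    return [str(r["doc"]) for r in results
            if r["exit_code"] != 0 or r["output_bytes"] == 0]
```

Calling `trial_run` again on `weak_samples(results)` with a different `backend` value gives a side-by-side comparison for the hard cases; quality still needs a human spot check, since a successful exit says nothing about table or reading-order accuracy.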

Don’t dump the entire library on day one. Failures are specific—scan type, table style, font, language direction, cross-page content. Find boundaries first, then scale.

Privacy and Compliance

For internal docs, customer data, contracts, financials, or unpublished research, map data flows before processing.

Check:

Private and offline deployment is supported, but not every configuration is offline by default. Draw the path from input → temp dirs → inference → output → logs.

When to Skip It

Consider skipping MinerU when:

Parsing should serve a downstream workflow—not “parse for parsing’s sake.” Align sample output with consumers before bulk investment.

Summary

MinerU converts complex documents (PDFs, images, and Office files) into Markdown and JSON for LLM applications. It covers tables, formulas, OCR, multilingual recognition, and local deployment, and it is aimed especially at RAG, knowledge bases, and agent pipelines.

A safe path: try online or small local samples, run pipeline end to end, then upgrade to VLM, hybrid, API, or multi-service deployment based on accuracy and throughput needs.
