Extract Any PDF with MinerU 2.5 (Easy Tutorial)

Table of Contents
- Introduction
- What is MinerU?
- Table Overview
- System Setup Used
- Core Capabilities at a Glance
- Components Initialized at Runtime
- Observed Language Quality (From Testing)
- Key Features
- How it works
- How to use
- Prerequisites
- Installation (vLLM integration)
- Launch the Web UI
- Alternative Interfaces
- Performance Notes and Tips
- Practical results from testing
- FAQs
- Can I run MinerU without a GPU?
- How fast is MinerU 2.5 with vLLM?
- What output formats are supported?
- Does MinerU handle scanned PDFs?
- Can I keep headers and footers?
- What languages work best?
- Does MinerU support complex layouts?
- Can I use MinerU via CLI or programmatically?
- What are the model components involved?
- How much VRAM do I need?
- Conclusion
Introduction
MinerU has been one of my go-to tools for turning PDFs into machine-readable data. I also keep an eye on alternatives like Docling and vision-language models from Qwen, but MinerU stands out for its balance of structure preservation, conversion quality, and practicality for dataset creation.
MinerU is open-source and converts PDFs into formats such as Markdown and JSON. I rely on it for building custom datasets because it allows targeted extraction and easy reuse. Over time it has matured into a solid option for technical, legal, and business documents.
MinerU did have one drawback: performance. In heavier production scenarios, rendering and serving were slow and a bit complex due to sglang. The recent integration of vLLM changes that. vLLM is a fast inference engine, and MinerU 2.5 with vLLM significantly improves speed while keeping the same extraction quality.
In this guide, I’ll install and run MinerU 2.5 with vLLM locally, outline what it does well, and share practical notes from testing across languages and document types. I used Ubuntu with a single NVIDIA RTX A6000 (48 GB VRAM), but you can run it on Windows, Linux, or macOS, with or without a GPU.
What is MinerU?
MinerU is an open-source PDF-to-structured-data system. It focuses on preserving layout and structure while producing clean outputs such as Markdown, JSON, HTML for tables, and LaTeX for formulas.
It is built for real-world documents where structure matters: multi-column layouts, tables, images, equations, figures, captions, and scanned pages. It removes noise such as headers, footers, footnotes, and page numbers to keep the text coherent, while still supporting options to keep them if needed.
With vLLM integration, MinerU 2.5 gets a major performance boost. The tool can run through multi-page PDFs quickly, maintain page-level structure, and export useful, ready-to-use outputs.
Table Overview
System Setup Used
Component | Configuration |
---|---|
OS | Ubuntu (Linux) |
GPU | NVIDIA RTX A6000, 48 GB VRAM |
CPU | Supported; slower than GPU |
MinerU | Version 2.5 with vLLM integration |
Python Env | Virtual environment recommended |
Install Mode | Editable install for core + VLM |
Core Capabilities at a Glance
Capability | Details |
---|---|
Output Formats | Markdown, JSON, plain text; HTML for tables; LaTeX for formulas |
Layout Handling | Single-column, multi-column, complex layouts |
Structure Cleanup | Removes headers, footers, footnotes, and page numbers (configurable) |
Objects | Text, images, tables, equations, captions |
OCR | Scanned or noisy PDFs; 84 languages |
Interfaces | Web UI, CLI, SDK; hosted option via client |
Engine | vLLM for fast inference; integrates layout/vision components |
Components Initialized at Runtime
Component | Role |
---|---|
Layout/Detection (e.g., YOLO) | Region detection for text, tables, images |
OCR (e.g., Paddle-based) | Text recognition for scanned or low-quality pages |
VLM engine | Fast reasoning, segmentation, structure-aware extraction |
Language Model hints (e.g., Qwen) | Language-aware text tasks and structure understanding |
Note: The project initializes multiple models at first run. Model downloads happen automatically on launch.
Observed Language Quality (From Testing)
Language | Result Summary |
---|---|
English | Accurate structure, fast processing, strong table and image handling |
Chinese | Strong performance; consistent structure retention |
German | Good output; correct tabular data conversion |
Swedish | Good overall match with originals |
Indonesian | Mixed; script is Latin, some fields correct, minor inconsistencies |
Arabic | Weak in Markdown/text extraction; region marking OK |
Hindi | Not extracted; region marking OK |
Urdu | Not extracted; rendered as image only |
Key Features
- Structured conversion
  - Markdown and JSON outputs for clean downstream processing
  - HTML for tables and LaTeX for formulas
- Layout-aware processing
  - Handles single- and multi-column layouts and complex page structures
  - Extracts tables, figures, images, captions, and text blocks with preserved order
- Noise removal
  - Optional removal of headers, footers, page numbers, and footnotes for coherent text flow
- OCR support
  - Works on scanned PDFs and noisy text in 84 languages
- Speed with vLLM
  - Significantly faster inference compared with prior sglang-based runs
- Flexible interfaces
  - Web UI for interactive use
  - CLI for batch processing
  - SDK for programmatic integration
  - Hosted option (connect via client) if you prefer not to run locally
- Local-first privacy option
  - Full local execution and data isolation
How it works
MinerU 2.5 orchestrates several components to extract structure and text from PDFs:
- Layout detection
  - Detects blocks such as paragraphs, tables, images, and formulas
  - Models like YOLO contribute to robust region detection
- OCR and text recognition
  - Applies OCR on scanned or garbled pages
  - Helps recover text in noisy or low-resolution documents
- Structure analysis
  - Groups related elements, resolves multi-column flow, and removes repeated noise (headers/footers) if enabled
- Table and formula conversion
  - Converts tables into HTML with cell structure
  - Converts formulas into LaTeX for reproducibility and downstream typesetting
- vLLM-powered reasoning
  - Speeds up model inference for segmentation, reading order, and content labeling
  - Improves throughput on large or multi-page documents
The result is a set of clean, structured artifacts suitable for downstream data work, content reuse, or search and indexing.
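To make those artifacts concrete, here is a minimal sketch that walks a finished run's output folder and lists what was produced. The folder name is a placeholder, and the JSON schema varies by MinerU version, so only the top-level shape is inspected.

```python
import json
from pathlib import Path

# Placeholder: wherever MinerU wrote its results for a given document
output_dir = Path("output")

# Markdown artifacts hold the readable, structure-preserving conversion
for md_file in sorted(output_dir.rglob("*.md")):
    print("Markdown:", md_file)

# JSON artifacts hold the machine-readable structure; the exact schema
# depends on the MinerU version, so only the top-level shape is shown here
for json_file in sorted(output_dir.rglob("*.json")):
    with open(json_file, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        print("JSON:", json_file.name, "keys:", list(data))
    else:
        print("JSON:", json_file.name, "entries:", len(data))
```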
How to use
Prerequisites
- OS: Linux, Windows, or macOS
- Python: Virtual environment recommended
- GPU: Optional but strongly recommended for speed (a quick check sketch follows this list)
- RAM/VRAM:
  - GPU runs are significantly faster; during testing with a 48 GB GPU, VRAM peaked around 25 GB on large, multi-page docs
  - CPU runs work but are slower
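Before installing, a quick environment check can save time. The sketch below uses only the Python standard library and the nvidia-smi tool that ships with NVIDIA drivers; nothing in it is MinerU-specific.

```python
import shutil
import subprocess
import sys

# Confirm the Python interpreter that will back the virtual environment
print("Python:", sys.version.split()[0])

# Check whether an NVIDIA GPU is visible; CPU-only runs still work, just slower
if shutil.which("nvidia-smi"):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print("GPU:", result.stdout.strip() or "query failed")
else:
    print("No nvidia-smi found; expect slower CPU-only processing")
```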
Installation (vLLM integration)
Follow these steps to install MinerU 2.5 with vLLM locally:
- Create and activate a virtual environment
  - Linux/macOS: python -m venv .venv && source .venv/bin/activate
  - Windows: python -m venv .venv && .\.venv\Scripts\activate
- Clone the MinerU repository
  - git clone <repo-url>
  - cd <repo-folder>
- Install MinerU with VLM support in editable mode
  - pip install -e ".[core,vlm]"
- Wait for dependencies to install
  - Initial setup can take a few minutes
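To confirm the editable install succeeded, a quick check like the one below helps. The distribution names mineru and magic-pdf are assumptions (current and former package names); adjust them if your checkout uses something else, or simply run pip show in the same environment.

```python
from importlib.metadata import version, PackageNotFoundError

# "mineru" and "magic-pdf" are assumed distribution names; adjust if needed
for name in ("mineru", "magic-pdf"):
    try:
        print(name, version(name))
        break
    except PackageNotFoundError:
        continue
else:
    print("MinerU package not found in this environment")
```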
Launch the Web UI
MinerU includes a demo application for the UI:
- Go to the demo directory
  - cd demo
- Launch the demo app
  - python demo.py
- First run will download required models and initialize the vLLM engine
  - Layout detection, OCR, and VLM components are set up automatically
- Open the local URL in your browser to access the UI
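Once the demo prints its local address, you can confirm the UI is reachable from another terminal. The URL below assumes the common local default of http://127.0.0.1:7860; use whatever address demo.py actually prints.

```python
import urllib.request

# Replace with the address printed by demo.py on startup
url = "http://127.0.0.1:7860"

try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        print("Web UI reachable, HTTP status:", resp.status)
except OSError as exc:
    print("Web UI not reachable yet:", exc)
```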
Alternative Interfaces
- CLI
  - Batch-run documents from the terminal, ideal for pipelines (see the sketch after this list)
- SDK
  - Integrate MinerU into your Python applications for custom workflows
- Hosted
  - Connect via the provided client if you prefer managed infrastructure
- Local/private
  - All steps above keep the entire workflow on your machine
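As a concrete example of the CLI route, here is a rough batch-processing sketch that shells out to the mineru command. The -p (input path) and -o (output directory) flags are assumptions based on the project's documentation; verify them with mineru --help for your installed version.

```python
import subprocess
from pathlib import Path

input_dir = Path("pdfs")        # folder of PDFs to convert (placeholder)
output_dir = Path("converted")  # where Markdown/JSON results should land
output_dir.mkdir(exist_ok=True)

for pdf in sorted(input_dir.glob("*.pdf")):
    # -p (input path) and -o (output dir) are assumed from the project docs;
    # confirm the exact flags with `mineru --help` for your installed version
    subprocess.run(["mineru", "-p", str(pdf), "-o", str(output_dir)], check=True)
    print("Processed:", pdf.name)
```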
Performance Notes and Tips
- GPU memory
  - Complex, multi-page PDFs can require substantial VRAM; plan for peaks in the 20–25 GB range for larger runs (a simple monitoring sketch follows this list)
- CPU runs
  - Fully supported but slower; useful for small jobs or environments without GPUs
- Model downloads
  - Allow the first run to complete all downloads before testing large documents
- Noise removal
  - Default behavior removes headers/footers; you can configure output to keep them if your use case requires
- Exports
  - Favor Markdown or JSON for downstream processing
  - Use HTML tables and LaTeX formulas when precision is required
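If you want to see where those 20–25 GB peaks land on your own documents, a simple poller like the one below records GPU memory use while MinerU runs in another terminal. It relies only on nvidia-smi, not on MinerU itself; press Ctrl+C when the run finishes to print the observed peak.

```python
import subprocess
import time

# Poll GPU memory once per second while MinerU processes a document elsewhere
peak_mb = 0
try:
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        readings = [int(line) for line in out.splitlines() if line.strip().isdigit()]
        used_mb = max(readings, default=0)
        peak_mb = max(peak_mb, used_mb)
        print(f"current: {used_mb} MiB | peak: {peak_mb} MiB", end="\r")
        time.sleep(1)
except KeyboardInterrupt:
    print(f"\nPeak GPU memory observed: {peak_mb} MiB")
```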
Practical results from testing
Once installed and launched with vLLM, MinerU 2.5 processed documents quickly and produced high-quality, structure-preserving outputs. Below are condensed observations from a range of documents and languages, in the same order I tested them:
- English technical/business documents
  - Accurate Markdown and text exports with correct tables and images
  - Multi-page documents processed in seconds
  - VRAM usage peaked a little over 25 GB on the GPU during intensive runs
- Structured spec sheets
  - Reliable detection of tables and images
  - Clean Markdown plus HTML tables, ready for reuse
- Image inputs
  - Images embedded in PDFs extracted correctly
  - Object and layout detection performed well
- Arabic
  - Region marking worked, but Markdown/text extraction was not usable
- Chinese
  - Strong output quality with clean structure
  - Headers and footers removed by default; configuration options exist to keep them
  - Table and formula recognition worked as expected
- German
  - Good extraction; tabular data cleanly converted
- Hindi
  - Region marking worked, but text extraction did not produce usable output
- Indonesian
  - Mixed outcomes; some content extracted correctly, some inconsistencies remained
- Swedish
  - Visual match with original documents was strong
  - Layout and flow preserved
- Urdu
  - Rendered as image with no usable text extraction
- Formulas and math-heavy sections
  - LaTeX outputs looked correct and reproducible
  - Complex formula regions were detected and converted cleanly
- Charts and academic papers
  - Images and figures extracted properly
  - Overall document structure and labeling showed noticeable improvement over prior runs
Overall, European languages performed well, with English and Chinese particularly strong. Arabic, Hindi, and Urdu were not reliable for text extraction in these tests. Indonesian was mixed. Speed and consistency were notably better with vLLM than earlier sglang-based setups.
FAQs
Can I run MinerU without a GPU?
Yes. CPU runs work, though they’re slower. For heavier workloads or large multi-page PDFs, a GPU is recommended.
How fast is MinerU 2.5 with vLLM?
It’s significantly faster than prior configurations using sglang. In testing, many multi-page documents were processed in seconds. Actual speed depends on document complexity, hardware, and batch settings.
What output formats are supported?
- Markdown and JSON for general content
- HTML for tables
- LaTeX for formulas
- Plain text when you need quick extraction
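Because tables are exported as HTML blocks inside the Markdown, they are easy to pull out for downstream use. This is a rough sketch using only the standard library; the input file name is a placeholder for whichever Markdown file MinerU produced.

```python
import re
from pathlib import Path

# Placeholder path: point this at the Markdown file MinerU produced
md_text = Path("document.md").read_text(encoding="utf-8")

# Tables are exported as HTML blocks inside the Markdown output
tables = re.findall(r"<table\b.*?</table>", md_text, flags=re.DOTALL | re.IGNORECASE)

print(f"Found {len(tables)} table(s)")
for i, table_html in enumerate(tables, start=1):
    Path(f"table_{i}.html").write_text(table_html, encoding="utf-8")
```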
Does MinerU handle scanned PDFs?
Yes. MinerU includes OCR across 84 languages for scanned or garbled PDFs. Quality depends on scan resolution and language.
Can I keep headers and footers?
By default, MinerU removes repeated noise (headers, footers, page numbers, footnotes) to keep flows coherent. You can configure the system to keep them if needed.
What languages work best?
From testing:
- Strong: English, Chinese
- Good: German, Swedish
- Mixed: Indonesian
- Weak/Not extracted: Arabic, Hindi, Urdu
Does MinerU support complex layouts?
Yes. It handles single- and multi-column documents, figures, tables, captions, and multi-page flows while preserving structure.
Can I use MinerU via CLI or programmatically?
Yes. MinerU supports a CLI for batch workflows and an SDK for Python integration. There is also a hosted option via client, as well as full local-only operation.
What are the model components involved?
MinerU initializes layout detection, OCR, and a vision-language model. At runtime you may see components like YOLO for region detection, Paddle-based OCR, and a vLLM-initialized model. There are hints of Qwen in the stack, and improvements are associated with work around the Intern family of models.
How much VRAM do I need?
It varies by document size and complexity. In testing with large, multi-page documents, VRAM peaks were slightly above 25 GB on a 48 GB GPU. Smaller cases need much less. CPU runs avoid VRAM constraints but are slower.
Conclusion
MinerU 2.5 with vLLM is a strong choice for converting PDFs into structured, reusable data. It preserves document layout, produces clean Markdown/JSON, and includes accurate table and formula conversions (HTML and LaTeX). With OCR across 84 languages, it can recover text from scanned or noisy pages.
The vLLM integration brings a clear speed boost compared with earlier setups, reducing friction for production workflows. In testing, English and Chinese were particularly strong, with European languages generally solid. Arabic, Hindi, and Urdu did not yield usable text; Indonesian was mixed. For many technical, business, and academic documents, MinerU’s structure retention and export fidelity are exactly what’s needed for downstream processing and dataset creation.
You can run MinerU on Linux, Windows, or macOS, on CPU or GPU. The project offers a web UI, CLI, SDK, and a hosted option. If you want local-only processing, the fully private workflow is straightforward. With vLLM in place, MinerU 2.5 is a practical tool for fast, reliable, structure-aware PDF extraction.