Extract Any PDF with MinerU 2.5 (Easy Tutorial)

Introduction

MinerU has been one of my go-to tools for turning PDFs into machine-readable data. I also keep an eye on alternatives like Docling and vision-language models from Qwen, but MinerU stands out for its balance of structure preservation, conversion quality, and practicality for dataset creation.

MinerU is open-source and converts PDFs into formats such as Markdown and JSON. I rely on it for building custom datasets because it allows targeted extraction and easy reuse. Over time it has matured into a solid option for technical, legal, and business documents.

MinerU did have one drawback: performance. In heavier production scenarios, rendering and serving were slow and a bit complex due to sglang. The recent integration of vLLM changes that. vLLM is a fast inference engine, and MinerU 2.5 with vLLM significantly improves speed while keeping the same extraction quality.

In this guide, I’ll install and run MinerU 2.5 with vLLM locally, outline what it does well, and share practical notes from testing across languages and document types. I used Ubuntu with a single NVIDIA RTX A6000 (48 GB VRAM), but you can run it on Windows, Linux, or macOS, with or without a GPU.

What is MinerU?

MinerU is an open-source PDF-to-structured-data system. It focuses on preserving layout and structure while producing clean outputs such as Markdown, JSON, HTML for tables, and LaTeX for formulas.

It is built for real-world documents where structure matters: multi-column layouts, tables, images, equations, figures, captions, and scanned pages. It removes noise such as headers, footers, footnotes, and page numbers to keep the text coherent, while still supporting options to keep them if needed.

With vLLM integration, MinerU 2.5 gets a major performance boost. The tool can run through multi-page PDFs quickly, maintain page-level structure, and export useful, ready-to-use outputs.

Table Overview

System Setup Used

| Component | Configuration |
|---|---|
| OS | Ubuntu (Linux) |
| GPU | NVIDIA RTX A6000, 48 GB VRAM |
| CPU | Supported; slower than GPU |
| MinerU | Version 2.5 with vLLM integration |
| Python Env | Virtual environment recommended |
| Install Mode | Editable install for core + VLM |

Core Capabilities at a Glance

| Capability | Details |
|---|---|
| Output Formats | Markdown, JSON, plain text; HTML for tables; LaTeX for formulas |
| Layout Handling | Single-column, multi-column, complex layouts |
| Structure Cleanup | Removes headers, footers, footnotes, and page numbers (configurable) |
| Objects | Text, images, tables, equations, captions |
| OCR | Scanned or noisy PDFs; 84 languages |
| Interfaces | Web UI, CLI, SDK; hosted option via client |
| Engine | vLLM for fast inference; integrates layout/vision components |

Components Initialized at Runtime

| Component | Role |
|---|---|
| Layout/Detection (e.g., YOLO) | Region detection for text, tables, images |
| OCR (e.g., Paddle-based) | Text recognition for scanned or low-quality pages |
| VLM engine | Fast reasoning, segmentation, structure-aware extraction |
| Language Model hints (e.g., Qwen) | Language-aware text tasks and structure understanding |

Note: The project initializes multiple models at first run. Model downloads happen automatically on launch.

Observed Language Quality (From Testing)

| Language | Result Summary |
|---|---|
| English | Accurate structure, fast processing, strong table and image handling |
| Chinese | Strong performance; consistent structure retention |
| German | Good output; correct tabular data conversion |
| Swedish | Good overall match with originals |
| Indonesian | Mixed; script is Latin, some fields correct, minor inconsistencies |
| Arabic | Weak in Markdown/text extraction; region marking OK |
| Hindi | Not extracted; region marking OK |
| Urdu | Not extracted; rendered as image only |

Key Features

  • Structured conversion
    • Markdown and JSON outputs for clean downstream processing
    • HTML for tables and LaTeX for formulas
  • Layout-aware processing
    • Handles single- and multi-column layouts and complex page structures
    • Extracts tables, figures, images, captions, and text blocks with preserved order
  • Noise removal
    • Optional removal of headers, footers, page numbers, and footnotes for coherent text flow
  • OCR support
    • Works on scanned PDFs and noisy text in 84 languages
  • Speed with vLLM
    • Significantly faster inference compared with prior sglang-based runs
  • Flexible interfaces
    • Web UI for interactive use
    • CLI for batch processing
    • SDK for programmatic integration
    • Hosted option (connect via client) if you prefer not to run locally
  • Local-first privacy option
    • Full local execution and data isolation

How it works

MinerU 2.5 orchestrates several components to extract structure and text from PDFs:

  • Layout detection
    • Detects blocks such as paragraphs, tables, images, and formulas
    • Models like YOLO contribute to robust region detection
  • OCR and text recognition
    • Applies OCR on scanned or garbled pages
    • Helps recover text in noisy or low-resolution documents
  • Structure analysis
    • Groups related elements, resolves multi-column flow, and removes repeated noise (headers/footers) if enabled
  • Table and formula conversion
    • Converts tables into HTML with cell structure
    • Converts formulas into LaTeX for reproducibility and downstream typesetting
  • vLLM-powered reasoning
    • Speeds up model inference for segmentation, reading order, and content labeling
    • Improves throughput on large or multi-page documents

The result is a set of clean, structured artifacts suitable for downstream data work, content reuse, or search and indexing.
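To make the "resolves multi-column flow" step more concrete, here is a minimal sketch of one common way to order detected blocks on a multi-column page: assign each block to a column by its horizontal center, then read each column top to bottom. This is an illustration only, not MinerU's actual algorithm; the `Block` class, the `reading_order` function, and the fixed column count are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A detected region; coordinates use a top-left origin (y grows downward)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge


def reading_order(blocks, page_width, n_columns=2):
    """Sort blocks column by column, top to bottom inside each column."""
    col_width = page_width / n_columns

    def key(block):
        center = (block.x0 + block.x1) / 2
        # Clamp so a block touching the right margin stays in the last column.
        col = min(int(center // col_width), n_columns - 1)
        return (col, block.y0)

    return sorted(blocks, key=key)


# Four blocks on a 600-unit-wide, two-column page, given in scrambled order.
blocks = [
    Block("right-top", 320, 50, 580),
    Block("left-bottom", 20, 400, 280),
    Block("left-top", 20, 50, 280),
    Block("right-bottom", 320, 400, 580),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
print(ordered)
```

A real pipeline also has to cope with full-width headings, figures that span columns, and pages whose column count changes, which is part of why a learned layout model is used instead of a fixed rule like this.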

How to use

Prerequisites

  • OS: Linux, Windows, or macOS
  • Python: Virtual environment recommended
  • GPU: Optional but strongly recommended for speed
  • RAM/VRAM:
    • GPU runs are significantly faster; during testing with a 48 GB GPU, VRAM peaked around 25 GB on large, multi-page docs
    • CPU runs work but are slower
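Before a large run it can help to check how much GPU memory is actually free. The sketch below shells out to `nvidia-smi` (NVIDIA GPUs only) and compares free memory against the roughly 25 GB peak I saw in testing. The helper names and the threshold constant are my own, and the 25 GB figure is an observation from my runs, not an official requirement.

```python
import shutil
import subprocess

# Peak observed in this article's tests on large, multi-page documents.
OBSERVED_PEAK_MIB = 25 * 1024


def parse_gpu_memory(csv_text):
    """Parse lines of 'total, used' (MiB) into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        try:
            total, used = (int(part.strip()) for part in line.split(","))
        except ValueError:
            continue  # skip lines nvidia-smi could not fill in, e.g. "[N/A]"
        gpus.append({"total_mib": total, "used_mib": used, "free_mib": total - used})
    return gpus


def query_gpu_memory():
    """Return per-GPU memory stats, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


gpus = query_gpu_memory()
if not gpus:
    print("No NVIDIA GPU detected; expect a slower CPU run.")
for index, gpu in enumerate(gpus):
    enough = gpu["free_mib"] >= OBSERVED_PEAK_MIB
    print(f"GPU {index}: {gpu['free_mib']} MiB free, enough for large runs: {enough}")
```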

Installation (vLLM integration)

Follow these steps to install MinerU 2.5 with vLLM locally:

  1. Create and activate a virtual environment

    • python -m venv .venv && source .venv/bin/activate (Linux/macOS)
    • python -m venv .venv && .\.venv\Scripts\activate (Windows)
  2. Clone the MinerU repository

    • git clone <repo-url>
    • cd <repo-folder>
  3. Install MinerU with VLM support in editable mode

    • pip install -e ".[core,vlm]"
  4. Wait for dependencies to install

    • Initial setup can take a few minutes
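Once the install finishes, a quick sanity check can confirm that you are inside the virtual environment and that the package is importable. This is a small sketch using only the standard library; it assumes the import name is `mineru`, which you should adjust if your install exposes a different module name.

```python
import importlib.util
import sys


def in_virtualenv():
    """True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


def is_installed(package_name):
    """True when a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


print("virtual environment active:", in_virtualenv())
# Assumption: the editable install exposes a top-level module named "mineru".
print("mineru importable:", is_installed("mineru"))
```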

Launch the Web UI

MinerU includes a demo application for the UI:

  1. Go to the demo directory

    • cd demo
  2. Launch the demo app

    • python demo.py
  3. First run will download required models and initialize the vLLM engine

    • Layout detection, OCR, and VLM components are set up automatically
  4. Open the local URL in your browser to access the UI

Alternative Interfaces

  • CLI
    • Batch-run documents from the terminal, ideal for pipelines
  • SDK
    • Integrate MinerU into your Python applications for custom workflows
  • Hosted
    • Connect via the provided client if you prefer managed infrastructure
  • Local/private
    • All steps above keep the entire workflow on your machine
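For the CLI route, a batch run over a folder of PDFs might look like the sketch below. It only builds and prints the commands by default. The command name `mineru` and the `-p` (input) and `-o` (output directory) flags are assumptions on my part, so check them against the CLI help for your installed version before switching off the dry run.

```python
import subprocess
import tempfile
from pathlib import Path

# Assumption: the CLI entry point is `mineru` and it accepts -p (input path)
# and -o (output directory). Verify with the CLI's help output first.
BASE_CMD = ("mineru",)


def build_commands(input_dir, output_dir, base_cmd=BASE_CMD):
    """Return one command (as an argument list) per PDF found in input_dir."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [[*base_cmd, "-p", str(pdf), "-o", str(output_dir)] for pdf in pdfs]


def run_batch(commands, dry_run=True):
    """Print each command, or execute them one at a time when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Dry-run demo on a temporary folder holding two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.pdf"):
        (Path(tmp) / name).touch()
    commands = build_commands(tmp, Path(tmp) / "out")
    run_batch(commands, dry_run=True)
```

Running documents one at a time keeps a single failure from taking down the whole batch; `check=True` stops at the first error so you can see which file caused it.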

Performance Notes and Tips

  • GPU memory
    • Complex, multi-page PDFs can require substantial VRAM; plan for peaks of about 25 GB, or slightly more, on larger runs
  • CPU runs
    • Fully supported but slower; useful for small jobs or environments without GPUs
  • Model downloads
    • Allow the first run to complete all downloads before testing large documents
  • Noise removal
    • Default behavior removes headers/footers; you can configure output to keep them if your use case requires
  • Exports
    • Favor Markdown or JSON for downstream processing
    • Use HTML tables and LaTeX formulas when precision is required
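If you export tables as HTML, you will usually want them back as rows for downstream work. Here is a minimal sketch using Python's built-in `html.parser`; the `TableParser` class and the sample table are illustrative and not part of MinerU. It assumes simple tables and ignores `rowspan`, `colspan`, and nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table><tr><th>Language</th><th>Result</th></tr>"
    "<tr><td>German</td><td>Good</td></tr></table>"
)
rows = table_rows(sample)
print(rows)
```

From here the rows can go straight into `csv.writer` or a dataframe, treating the first row as the header when the table has one.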

Practical results from testing

Once installed and launched with vLLM, MinerU 2.5 processed documents quickly and produced high-quality, structure-preserving outputs. Below are condensed observations from a range of documents and languages, in the order I tested them:

  • English technical/business documents

    • Accurate Markdown and text exports with correct tables and images
    • Multi-page documents processed in seconds
    • VRAM usage peaked a little over 25 GB on the GPU during intensive runs
  • Structured spec sheets

    • Reliable detection of tables and images
    • Clean Markdown plus HTML tables, ready for reuse
  • Image inputs

    • Images embedded in PDFs extracted correctly
    • Object and layout detection performed well
  • Arabic

    • Region marking worked, but Markdown/text extraction was not usable
  • Chinese

    • Strong output quality with clean structure
    • Headers and footers removed by default; configuration options exist to keep them
    • Table and formula recognition worked as expected
  • German

    • Good extraction; tabular data cleanly converted
  • Hindi

    • Region marking worked, but text extraction did not produce usable output
  • Indonesian

    • Mixed outcomes; some content extracted correctly, some inconsistencies remained
  • Swedish

    • Visual match with original documents was strong
    • Layout and flow preserved
  • Urdu

    • Rendered as image with no usable text extraction
  • Formulas and math-heavy sections

    • LaTeX outputs looked correct and reproducible
    • Complex formula regions were detected and converted cleanly
  • Charts and academic papers

    • Images and figures extracted properly
    • Overall document structure and labeling showed noticeable improvement over prior runs

Overall, English and Chinese were particularly strong, and the other European languages I tested (German and Swedish) performed well. Arabic, Hindi, and Urdu were not reliable for text extraction in these tests. Indonesian was mixed. Speed and consistency were notably better with vLLM than with earlier sglang-based setups.

FAQs

Can I run MinerU without a GPU?

Yes. CPU runs work, though they’re slower. For heavier workloads or large multi-page PDFs, a GPU is recommended.

How fast is MinerU 2.5 with vLLM?

It’s significantly faster than prior configurations using sglang. Multi-page documents processed in seconds in many cases during testing. Actual speed depends on document complexity, hardware, and batch settings.

What output formats are supported?

  • Markdown and JSON for general content
  • HTML for tables
  • LaTeX for formulas
  • Plain text when you need quick extraction

Does MinerU handle scanned PDFs?

Yes. MinerU includes OCR across 84 languages for scanned or garbled PDFs. Quality depends on scan resolution and language.

Can I keep headers and footers?

By default, MinerU removes repeated noise (headers, footers, page numbers, footnotes) to keep flows coherent. You can configure the system to keep them if needed.
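To show what "repeated noise" means in practice, here is a simple sketch of one way to detect headers and footers: look at the first and last line of each page and flag any line that recurs on most pages. This is a generic heuristic for illustration, not how MinerU implements it, and the function names and the 0.6 threshold are hypothetical.

```python
from collections import Counter


def repeated_lines(pages, min_fraction=0.6):
    """Return first/last lines that appear on at least min_fraction of pages."""
    counts = Counter()
    for lines in pages:
        if not lines:
            continue
        # A set, so a one-line page does not count its only line twice.
        counts.update({lines[0].strip(), lines[-1].strip()})
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if line and n >= threshold}


def strip_repeated(pages, min_fraction=0.6):
    """Drop a page's first or last line when it was flagged as repeated noise."""
    noise = repeated_lines(pages, min_fraction)
    cleaned = []
    for lines in pages:
        last = len(lines) - 1
        cleaned.append([
            line for i, line in enumerate(lines)
            if not (i in (0, last) and line.strip() in noise)
        ])
    return cleaned


pages = [
    ["ACME Report", "Intro text", "Confidential"],
    ["ACME Report", "More text", "Confidential"],
    ["ACME Report", "Final text", "Confidential"],
]
cleaned = strip_repeated(pages)
print(cleaned)
```

Exact matching like this misses page numbers, since "Page 1" and "Page 2" differ; handling those would need a pattern or normalization step before counting.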

What languages work best?

From testing:

  • Strong: English, Chinese
  • Good: German, Swedish
  • Mixed: Indonesian
  • Weak/Not extracted: Arabic, Hindi, Urdu

Does MinerU support complex layouts?

Yes. It handles single- and multi-column documents, figures, tables, captions, and multi-page flows while preserving structure.

Can I use MinerU via CLI or programmatically?

Yes. MinerU supports a CLI for batch workflows and an SDK for Python integration. There is also a hosted option via client, as well as full local-only operation.

What are the model components involved?

MinerU initializes layout detection, OCR, and a vision-language model. At runtime you may see components like YOLO for region detection, Paddle-based OCR, and a model initialized through vLLM. The startup output also hints at Qwen in the stack, and the recent improvements appear to be tied to work on the Intern family of models.

How much VRAM do I need?

It varies by document size and complexity. In testing with large, multi-page documents, VRAM peaks were slightly above 25 GB on a 48 GB GPU. Smaller cases need much less. CPU runs avoid VRAM constraints but are slower.

Conclusion

MinerU 2.5 with vLLM is a strong choice for converting PDFs into structured, reusable data. It preserves document layout, produces clean Markdown/JSON, and includes accurate table and formula conversions (HTML and LaTeX). With OCR across 84 languages, it can recover text from scanned or noisy pages.

The vLLM integration brings a clear speed boost compared with earlier setups, reducing friction for production workflows. In testing, English and Chinese were particularly strong, with European languages generally solid. Arabic, Hindi, and Urdu did not yield usable text; Indonesian was mixed. For many technical, business, and academic documents, MinerU’s structure retention and export fidelity are exactly what’s needed for downstream processing and dataset creation.

You can run MinerU on Linux, Windows, or macOS, on CPU or GPU. The project offers a web UI, CLI, SDK, and a hosted option. If you want local-only processing, the fully private workflow is straightforward. With vLLM in place, MinerU 2.5 is a practical tool for fast, reliable, structure-aware PDF extraction.
