
DeepSeek-OCR (VL2): How to Run Locally for Complex Documents


DeepSeek has released its first OCR model, referred to in some write-ups as DeepSeek-OCR or DeepSeek-VL2. It is a vision-language model aimed at complex document understanding, not just character extraction.

In this guide, I install the model locally, walk through the demo, and test it across varied document types. I also outline its architecture and share observations on accuracy, speed, memory use, and multilingual support.

The model preserves layout, reads structured documents, interprets charts and tables, and can answer visual questions where text extraction matters. It targets practical workflows across forms, invoices, diagrams, and mixed-layout pages.

What Is DeepSeek-OCR?

The topic is running DeepSeek-OCR (DeepSeek-VL2) locally and evaluating it on complex OCR tasks. I cover setup, model variants, special modes, and architecture choices intended to balance speed, detail, and context. I also assess where it performs well and where it struggles.

DeepSeek-OCR Overview

Item | Details
Model | DeepSeek-OCR (DeepSeek-VL2)
Type | Vision-language OCR with layout, tables, charts, and text-centered VQA
Input | Images of documents, forms, charts, tables, handwritten notes, scans
Output | Plain text, Markdown (including tables), free OCR text
Model Variants | Tiny, Small, Base, Gundam (dynamic-resolution mode)
File Size (download) | Single tensor, just under ~7 GB (on first run)
Typical VRAM Observed | ~15 GB when fully loaded on GPU (in my tests)
Multilingual | Strong on many European languages; support for Chinese, Japanese, Korean noted; weak on Arabic and Hindi in my tests
Layout Preservation | Yes (segment-based reasoning and Markdown export)
Chart/Table Reasoning | Yes (chart-to-table conversion worked well in testing)

Key Features of DeepSeek-OCR

  • Handles complex documents: forms, invoices, structured layouts, and mixed visual contexts.
  • Preserves layout and exports to Markdown, including tables.
  • Converts charts into structured tables with correct axis values.
  • Multilingual capability across many languages, though coverage is uneven.
  • Offers a dynamic-resolution mode (“Gundam”) for combining fine detail with broader page context.
  • Efficient inference design with a mixture-of-experts language core and memory-saving attention.

DeepSeek-OCR Local Installation and Setup

I installed and ran the model locally on Ubuntu with a single NVIDIA GPU. The process was straightforward and included a one-time model download when the demo first launched.

System I Used

  • OS: Ubuntu
  • GPU: NVIDIA RTX A6000 (48 GB VRAM)
  • Runtime: Model fully loaded on GPU during tests
  • Observed VRAM usage: around 15 GB with the chosen variant

Step-by-Step Setup of DeepSeek-OCR

  1. Clone the repository.
  2. Install the Python requirements.
  3. Run the provided script to launch the Gradio demo.
  4. On first launch, the script downloads the model (a single tensor just under ~7 GB).
  5. Open the local URL in your browser and begin testing.
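If you prefer to prefetch the weights instead of waiting on step 4's first-launch download, a quick option is to pull them with huggingface_hub ahead of time. The repo id below is an assumption on my part; check the model card if yours differs.

```python
from huggingface_hub import snapshot_download

# Prefetch the weights so the demo's first launch does not block on the download.
# "deepseek-ai/DeepSeek-OCR" is the assumed Hugging Face repo id.
local_dir = snapshot_download("deepseek-ai/DeepSeek-OCR")
print("Model cached at:", local_dir)
```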

Running the Demo and First Tests of DeepSeek-OCR

Once the model finished downloading, the demo came up in the browser. I selected a sample image, chose the model size, and picked Markdown as the output.

Ad Image to Markdown with Gundam Mode

I started with an ad image and selected the Gundam variant. I asked the model to convert to Markdown rather than plain text.

  • The output captured text and structure correctly.
  • Content was segmented into boxes internally, but the final Markdown preserved the layout well.
  • I did not see errors in the extracted text or structure on this sample.

LaTeX Page OCR and VRAM Usage

Next, I tested a LaTeX-heavy image.

  • I monitored VRAM during processing; it hovered a touch over 15 GB with the chosen setup.
  • The output listed computed points, function references, and symbols cleanly.
  • The formatting remained consistent, and the extracted math-related text looked accurate on quick review.

What Is Gundam Mode?

The “Gundam” option refers to a dynamic-resolution setup described in the paper. In practice, it assembles multiple tiles at a moderate resolution and combines them with one higher-resolution tile.

  • Multiple image tiles are processed at around 640 × 640 resolution.
  • A single higher-resolution tile at around 1024 × 1024 is included to capture fine details.
  • The hybrid approach aims to balance readability of small text with page-level context.

This method is meant to retain small characters and intricate features while maintaining a coherent sense of the full page.
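To make the tiling idea concrete, here is a small illustrative sketch of splitting a page into 640 × 640 local tiles plus one 1024 × 1024 global view. The two sizes come from the description above; everything else (the grid logic, the plain resizing without aspect-ratio padding) is a simplification for illustration, not the model's actual preprocessing.

```python
import math
from PIL import Image

def gundam_style_tiles(image_path, local_size=640, global_size=1024):
    """Illustrative dynamic-resolution tiling: a grid of local tiles for fine
    detail plus one downscaled global view for page-level context."""
    page = Image.open(image_path)
    w, h = page.size

    # Global view: the whole page squeezed into a single high-resolution tile
    # (a real pipeline would likely pad to preserve aspect ratio).
    global_view = page.resize((global_size, global_size))

    # Local views: cover the page with a grid of local_size x local_size crops.
    cols = math.ceil(w / local_size)
    rows = math.ceil(h / local_size)
    local_tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * local_size, r * local_size,
                   min((c + 1) * local_size, w), min((r + 1) * local_size, h))
            local_tiles.append(page.crop(box).resize((local_size, local_size)))

    return global_view, local_tiles

# Example: a 2480 x 3508 page (A4 at 300 DPI) yields a 4 x 6 grid of local tiles
# plus one 1024 x 1024 global view.
```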

More OCR Trials

I moved on to a series of document types to gauge reliability across handwriting, invoices, multilingual samples, charts, and degraded scans.

Handwriting Extraction

A simple handwritten image was correctly transcribed.

  • Struck-out text was appropriately ignored.
  • The output text matched the visible content without odd insertions.

Invoice to Markdown Table

I fed in a typical invoice and requested Markdown output.

  • The model produced a well-formed Markdown table.
  • Numeric values and item lines looked correct on a quick scan.
  • An “NL” token appeared in one spot without clear purpose, but the rest of the table was intact.
  • Notes and additional fields were captured.

You can run the model with Hugging Face Transformers or serve it with vLLM, depending on your environment.
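For the Transformers route, here is a minimal sketch based on the usage pattern published with the model. The model id, prompt strings, and the custom infer() call and its arguments follow the repo's example as I understand it and may change, so treat them as assumptions to verify against the current README.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed repo id and custom infer() API; trust_remote_code pulls in the
# model's own implementation, so check the README for the current signature.
MODEL_ID = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# Layout-aware Markdown prompt; a plain "Free OCR." style prompt gives raw text instead.
prompt = "<image>\n<|grounding|>Convert the document to markdown."

result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="invoice.png",   # hypothetical input path
    output_path="./ocr_out",
    base_size=1024,             # global view size
    image_size=640,             # local tile size
    crop_mode=True,             # enable dynamic tiling ("Gundam"-style)
    save_results=True,
)
print(result)
```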

Multilingual Sample

I tested an image containing multiple languages, including Arabic, Hindi, and several European languages.

  • The model handled European languages fairly well.
  • It did not do well on Arabic or Hindi in this test.
  • Based on this, multilingual support exists but coverage is uneven.

Charts to Tables

I provided a chart to see if the model could convert it into a structured table.

  • It returned a clean table with correct axis values.
  • Numeric entries matched the chart’s markers and labels.
  • This worked consistently and saved manual data entry effort.
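Since both the invoice and chart runs return Markdown tables, a short helper to turn that output into structured rows is handy for downstream use. This is generic post-processing I wrote for illustration, not part of the model.

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Convert a Markdown table (as returned by the model) into a list of row dicts."""
    lines = [ln.strip() for ln in md.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    split = lambda ln: [c.strip() for c in ln.strip("|").split("|")]
    header = split(lines[0])
    rows = []
    for ln in lines[2:]:  # skip the |---|---| separator line
        cells = split(ln)
        if len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows

# Example with a chart converted to a table:
table = parse_markdown_table("""
| Year | Revenue |
|------|---------|
| 2022 | 14.2    |
| 2023 | 18.9    |
""")
print(table)  # [{'Year': '2022', 'Revenue': '14.2'}, {'Year': '2023', 'Revenue': '18.9'}]
```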

Old Newspaper Scan

I tried an extremely degraded newspaper scan that was hard to read even by eye.

  • The model extracted some text from the top.
  • In areas where it could not extract text, it produced placeholder markers rather than fabricating content.
  • It avoided confident guesswork in unreadable regions, which is preferable for noisy sources.

Blueprint Extraction and Architecture Notes

I then tested a technical blueprint. While it processed, I reviewed the architecture details that inform how the model runs.

Mixture-of-Experts Language Core

DeepSeek-OCR uses a mixture-of-experts approach on the language side.

  • Only a subset of experts activates per token.
  • This keeps inference efficient with fewer active parameters.
  • The intent is to retain strong capability without a large burden on every step.
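As a rough illustration of how top-k routing keeps the active parameter count small, here is a schematic mixture-of-experts layer in PyTorch. The expert count, hidden size, and top-k value are made up for illustration and are not DeepSeek's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Schematic MoE layer: a router picks top-k experts per token, so only a
    fraction of the layer's parameters run on each forward step."""
    def __init__(self, d_model=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # only the selected experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 256)
print(TinyMoELayer()(tokens).shape)  # torch.Size([4, 256])
```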

Dynamic Vision Tiling

The vision encoder applies dynamic tiling to accommodate high-resolution pages and unusual aspect ratios.

  • Inputs are split into tiles to capture fine elements without overloading compute.
  • The mechanism adapts to large formats and irregular document layouts.
  • It supports both detail capture and contextual understanding across the full page.

Latent Attention KV Compression

The model employs multi-head latent attention to compress the key-value cache into latent vectors.

  • This reduces memory overhead during inference.
  • It helps maintain throughput and scalability for longer or more detailed documents.
  • The approach is suited for processing images at scale where cache size can otherwise become a bottleneck.
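To show the general idea of caching compressed latents instead of full keys and values, here is a schematic sketch. The dimensions and the down/up projection scheme are simplified assumptions for illustration, not the model's actual multi-head latent attention implementation.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Schematic KV compression: store a small latent per token and re-expand
    it into keys and values only when attention needs them."""
    def __init__(self, d_model=1024, d_latent=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)   # compress hidden state -> latent
        self.up_k = nn.Linear(d_latent, d_model)   # latent -> key
        self.up_v = nn.Linear(d_latent, d_model)   # latent -> value
        self.cache = []

    def append(self, hidden):                      # hidden: (batch, d_model)
        self.cache.append(self.down(hidden))       # only the latent is cached

    def keys_values(self):
        latents = torch.stack(self.cache, dim=1)   # (batch, seq, d_latent)
        return self.up_k(latents), self.up_v(latents)

cache = LatentKVCache()
for _ in range(5):                                 # simulate 5 decode steps
    cache.append(torch.randn(1, 1024))
k, v = cache.keys_values()
print(k.shape, v.shape)  # cached memory scales with d_latent=128, not d_model=1024
```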

Blueprint Result

The blueprint run took longer than other samples. In this test, the output returned as a blank image.

  • This suggests a failure case on the chosen sample and settings.
  • Other structured document types performed better in comparison.

ID Document OCR (Free OCR Mode)

I tested a synthetic driving license using free OCR mode.

  • The output text was coherent and matched the visible fields.
  • The sample appeared to be German, and on quick inspection the fields aligned.

Observations on Performance

The model aims to understand context while preserving layout. On many document-centric tasks, it performed well. On certain difficult images, it struggled to recover content cleanly.

  • Layout reconstruction to Markdown is one of its strengths.
  • Chart-to-table conversion was consistently accurate in my runs.
  • LaTeX-heavy material was parsed reliably with correct structure.

Memory use was predictable, and the downloadable model size stayed manageable for a local setup with a modern GPU.

Practical Usage Notes

  • For structured documents (invoices, forms), choose Markdown output to retain layout and tables.
  • For charts, expect accurate tabular conversion that saves manual data entry.
  • For noisy scans or historical materials, expect partial recovery and cautious handling of unreadable areas.
  • Gundam mode can help when you need both small-text clarity and page context.

Quick Reference: Settings and Outputs

  • Modes: Markdown extraction, free OCR text.
  • Model variants: Tiny, Small, Base, Gundam (dynamic resolution).
  • Expected VRAM (varies by system): around 15 GB in my tests.
  • First-run download: single model tensor just under ~7 GB.

Strengths and Limitations

Strengths

  • Solid OCR with layout awareness and Markdown export.
  • Robust table extraction and reliable chart-to-table conversion.
  • Clear handling of mathematical and structured text on LaTeX pages.
  • Sensible behavior on unreadable regions without inventing content.
  • Efficient inference design with mixture-of-experts and memory-aware attention.

Limitations

  • Struggles on very difficult images such as heavily degraded scans and some technical drawings.
  • Multilingual coverage appears limited for certain scripts, with weak results on Arabic and Hindi in my tests.
  • Occasional stray tokens (e.g., “NL”) may appear in outputs.

Step-by-Step: My End-to-End Workflow

  1. Prepare the system

    • Ensure a compatible GPU and drivers.
    • Have Python and required dependencies ready.
  2. Install the repository

    • Clone the repo.
    • Install the requirements with your package manager or pip.
  3. Launch the demo

    • Run the provided script.
    • Allow the initial model download to complete.
  4. Open the interface

    • Access the local URL in your browser.
    • Select a model variant (I used Gundam for detailed pages).
  5. Choose output format

    • Pick Markdown for layout-aware exports.
    • Use free OCR for plain text.
  6. Test varied inputs

    • Try ads, invoices, charts, LaTeX pages, handwritten notes, and scans.
    • Compare outputs to source for structure and accuracy.
  7. Monitor resources

    • Keep an eye on VRAM and performance during heavy tasks.
    • Adjust model variant as needed for your hardware.
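For step 7, here is a small helper using standard PyTorch CUDA calls to report memory use during a run. Note it only sees PyTorch's own allocations, so nvidia-smi will report somewhat higher totals.

```python
import torch

def report_vram(tag=""):
    """Print current and peak GPU memory for device 0 (PyTorch allocations only)."""
    gib = 1024 ** 3
    current = torch.cuda.memory_allocated(0) / gib
    peak = torch.cuda.max_memory_allocated(0) / gib
    total = torch.cuda.get_device_properties(0).total_memory / gib
    print(f"[{tag}] VRAM: {current:.1f} GiB in use, {peak:.1f} GiB peak, {total:.1f} GiB total")

# Call before and after an inference to see what the chosen variant needs,
# e.g. report_vram("after load") and then report_vram("after first page").
```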

Tips for Better Outcomes

  • Use high-quality images with good lighting and resolution when possible.
  • For complex documents, select Markdown to retain structure.
  • If a sample fails, try a different variant or reduce image size to see if tiling improves stability.
  • For multilingual tasks, validate outputs carefully on scripts that appear less supported.

Final Thoughts

DeepSeek-OCR targets real document workflows by going beyond plain text extraction. It reads structure, preserves layout, and translates charts into tables. The architecture choices—mixture-of-experts on the language side, dynamic tiling for vision, and latent attention cache compression—aim to balance capability with compute efficiency.

In my testing, its strengths are clear on invoices, charts, and formatted pages. It also performs well on ordinary printed text and simple handwriting. It struggles on very hard images and shows limited coverage for certain language scripts. Within those bounds, it is a capable local OCR solution with practical export options and predictable resource use.

If you need layout-aware OCR, chart-to-table conversion, and solid Markdown outputs, it is worth setting up and testing on your own documents to see how it handles your specific formats and languages.
