PaddleOCR-VL: How to Run Ultra-Compact Multilingual Document OCR

Table of Contents
- What Is PaddleOCR-VL?
- PaddleOCR-VL Overview
- Table: PaddleOCR-VL at a Glance
- Key Features of PaddleOCR-VL
- Setup and Installation of PaddleOCR-VL
- Environment and Requirements
- Install Prerequisites
- Launch Jupyter Notebook
- Loading the Model
- Tasks I Tested
- Multilingual OCR: Initial Test
- Test Setup
- Observations
- Arabic OCR
- Test Setup
- Memory Footprint and Speed
- Quality
- French Document OCR
- Test Setup
- Speed and Accuracy
- Hindi OCR
- Test Setup
- Quality
- Persian OCR
- Test Setup
- Quality
- Swahili OCR
- Test Setup
- Quality
- Urdu OCR
- Test Setup
- Quality
- Indonesian OCR
- Test Setup
- Quality
- Table Recognition: Invoice
- Test Setup
- Observations
- Handwriting Recognition
- Test Setup
- Quality
- Formula Extraction
- Test Setup
- Observations
- Chart Understanding
- Test Setup
- Results
- Performance and Resource Use
- Speed
- Memory
- Practical Notes on Usage
- Prompts and Tasks
- Post-Processing
- File Handling
- Strengths and Gaps
- What Works Well
- What Needs Improvement
- Step-by-Step: End-to-End Inference
- Summary and Verdict
PaddleOCR-VL is a compact, document-focused vision-language model designed to read full pages, extract text, and understand layouts, tables, charts, and formulas. It pairs a NaViT-style dynamic-resolution vision encoder with the ERNIE-4.5-0.3B language model, keeping the overall model ultra-compact at around 0.9B parameters. It supports 109 languages, which is the main focus of my evaluation here.
In this article, I walk through installation, the tasks it supports, and a series of practical tests. I cover multilingual OCR quality, table recognition, handwriting, formula extraction, and chart understanding. I also note performance and memory footprint so you can gauge local deployment.
My goal is to show what it can do right now, highlight strengths and gaps, and provide a concise setup guide for running it locally on CPU or GPU.
What Is PaddleOCR-VL?
PaddleOCR-VL is a multilingual document OCR model that integrates page-level vision with a compact language model to interpret complex document elements. It reads text across diverse scripts, understands document structure, and supports specialized tasks such as formula and chart extraction. The emphasis is on running locally with a small footprint, fast inference, and wide language coverage.
PaddleOCR-VL Overview
PaddleOCR-VL focuses on five core tasks:
- OCR for printed text across 109 languages
- Table recognition for extracting structured content
- Chart understanding for text and numeric values
- Formula extraction, including LaTeX-like expressions
- Handwriting transcription for simple handwritten notes
It aims to balance speed and size with acceptable accuracy in many languages and document types. The model can be loaded with standard Python tools and tested via Jupyter notebooks for interactive workflows.
Table: PaddleOCR-VL at a Glance
| Attribute | Details |
|---|---|
| Model Type | Multilingual vision-language model for documents |
| Size | ~0.9B parameters |
| Vision Encoder | NaViT-style dynamic-resolution encoder |
| Language Component | ERNIE-4.5-0.3B (~0.3B parameters) |
| Supported Languages | 109 |
| Primary Tasks | OCR, table recognition, chart understanding, formula extraction, handwriting |
| Hardware | CPU or GPU; runs comfortably with low VRAM |
| Observed VRAM (GPU) | ~2.5 GB during tests |
| Typical Use Cases | Multilingual document OCR, data extraction, basic layout understanding |
| Deployment | Local, via Python and Transformers |
Note: The table reflects observations during local testing and the information provided in the model description.
Key Features of PaddleOCR-VL
- Multilingual OCR across 109 languages, spanning Latin and non-Latin scripts
- Page-level understanding with support for tables, charts, and formulas
- Compact size for local deployment on modest hardware
- Fast inference relative to model size
- Low memory footprint, suitable for CPU and GPU use
- Task-specific prompts for OCR, table parsing, chart reading, and formula extraction
Setup and Installation of PaddleOCR-VL
Environment and Requirements
I ran the tests on Ubuntu with an NVIDIA RTX A6000 (48 GB VRAM). The model is small enough that a GPU is not required; CPU-only inference is feasible. The main dependencies are PyTorch, Transformers, and a few standard Python libraries.
Install Prerequisites
- Python 3.9+ recommended
- pip or conda environment
Step-by-step installation:
- Create and activate a virtual environment (optional but recommended).
- Install PyTorch and Transformers.
- Install Jupyter for interactive testing.
Example commands:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers pillow jupyter numpy
```
For CPU-only environments, install the CPU build of PyTorch instead:
```bash
pip install torch torchvision torchaudio
```
Launch Jupyter Notebook
- Start Jupyter and create a new Python notebook:
```bash
jupyter notebook
```
- Within the notebook, import the required libraries and load the model. Use the appropriate model class for vision-to-text with image inputs.
Loading the Model
Below is a general loading pattern with Transformers. Replace the placeholders with the actual PaddleOCR-VL model identifiers documented by the maintainers.
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "paddleocr-vl-model-id"  # replace with the actual model repository

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)
```
For task-specific instructions, you can control prompts or task flags during preprocessing or generation. The tests below show the tasks in the order I evaluated them.
Tasks I Tested
- OCR for printed text in multiple languages
- Table recognition in invoices and structured documents
- Handwriting transcription
- Formula extraction
- Chart analysis
In each case, I used a local image, encoded it, specified a task (e.g., “ocr”, “table”, “formula”, “chart”), and then generated the output for comparison.
Multilingual OCR: Initial Test
Test Setup
I began with a single image containing a multilingual paragraph spanning languages from Europe and Asia. The task was set to OCR. The idea was to see general recognition quality across scripts in one pass.
Example inference:
image = Image.open("multilingual_paragraph.jpg").convert("RGB")
inputs = processor(images=image, text="Task: ocr", return_tensors="pt").to(device)
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=1024)
result = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(result)Observations
- English and several European languages (including German) were accurate.
- Chinese text appeared correctly recognized on visual inspection.
- Russian showed noticeable errors.
- Arabic and Hindi content in the composite image was either missing or not recognized well enough to judge.
This motivated targeted tests with individual scripts to assess quality more clearly.
Arabic OCR
Test Setup
I used an Arabic-only image with both text and some embedded visuals. The goal was to see isolated script performance and measure resource usage.
Memory Footprint and Speed
- VRAM hovered around ~2.5 GB on GPU during inference (a quick way to reproduce this reading is sketched below).
- Performance was responsive for a compact model; CPU inference is also feasible.
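To reproduce the VRAM reading, here is a minimal sketch using PyTorch's CUDA memory counters; it assumes the model is already loaded and an inference pass has just run on a CUDA device.
```python
import torch

# Peak GPU memory allocated by PyTorch tensors since process start (or the
# last counter reset), in GB. Run this right after an inference pass on CUDA.
if torch.cuda.is_available():
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak VRAM: {peak_gb:.2f} GB")
```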
Quality
- Visual inspection suggested solid character recognition.
- Diacritic fidelity was mixed (e.g., some double dots appeared as single).
- Overall, the Arabic output was promising for a 0.9B model, with room for refinement in finer script details.
French Document OCR
Test Setup
I tested a French driving license (synthetic). The task remained OCR.
Speed and Accuracy
- Inference speed was good.
- Main fields were captured well.
- Small, fine-print elements were not fully captured.
- Overall, acceptable results for general text; very small fonts were a challenge.
Hindi OCR
Test Setup
I ran OCR on a Hindi text image to evaluate recognition for an Indic script.
Quality
- Inference took slightly longer relative to Latin text.
- Visual character matching suggested reasonable recognition.
- A detailed linguistic verification would be helpful for precise error rates, but the general structure appeared intact (a quick similarity check is sketched below).
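If you have a reference transcription, a rough character-level similarity score is easy to compute with the standard library. This is a quick sanity check, not a proper CER metric; `ground_truth_text` is a hypothetical reference string you would supply yourself.
```python
import difflib

# Rough character-level similarity between a reference transcription and
# the model output. A screening metric only, not a true character error rate.
def char_similarity(reference: str, hypothesis: str) -> float:
    return difflib.SequenceMatcher(None, reference, hypothesis).ratio()

ground_truth_text = "..."  # hypothetical reference transcription
print(char_similarity(ground_truth_text, result))
```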
Persian OCR
Test Setup
I evaluated a Persian text image to compare performance across related scripts.
Quality
- Visual inspection suggested text was captured, but a detailed native-level review would be required for accuracy assessment.
- Baseline recognition appeared plausible without clear indicators of systematic failure.
Swahili OCR
Test Setup
Swahili text was included as part of the Latin-script evaluation.
Quality
- Recognition quality was strong, consistent with Latin-script performance noted earlier.
- Text output aligned well with the source image.
Urdu OCR
Test Setup
I ran OCR on an Urdu text image to probe recognition for a complex cursive script.
Quality
- Output quality was weaker than for Arabic.
- Issues included unjoined words and missing ligatures in multiple places.
- This indicates meaningful gaps for Urdu in its current form.
Indonesian OCR
Test Setup
I tested Indonesian (Bahasa Indonesia) text to further check Latin-script performance.
Quality
- Output was accurate and consistent with earlier Latin-script results.
- Straightforward printed text in Latin script appears to be a strong suit for the model.
Table Recognition: Invoice
Test Setup
I used an invoice image and set the task to table recognition. The goal was to extract structured content and assess how well the model identifies key fields.
Example invocation:
image = Image.open("invoice.jpg").convert("RGB")
inputs = processor(images=image, text="Task: table", return_tensors="pt").to(device)
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=1024)
table_output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(table_output)Observations
- The model detected the presence of a logo area (“logo goes here”).
- It missed a watermark.
- Address blocks and header fields were recognized acceptably.
- Extracted table text was usable, though not fully reconstructed into a neat table format.
You can post-process the extracted content into structured formats (e.g., Markdown, CSV, or JSON) using standard Python tooling. As a compact VLM, the model provides the raw text with reasonable structure cues, but it may not output a perfectly formatted table in one step.
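As one illustration, here is a minimal sketch that converts pipe-delimited rows into CSV. It assumes the raw `table_output` uses "|" between cells and newlines between rows, which may not match the actual output shape; adjust the splitting logic accordingly.
```python
import csv
import io

# Minimal sketch: turn pipe-delimited lines in the raw table output into CSV.
# Assumes "|" separates cells and each table row sits on its own line.
rows = [
    [cell.strip() for cell in line.strip("|").split("|")]
    for line in table_output.splitlines()
    if "|" in line
]
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```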
Handwriting Recognition
Test Setup
I evaluated a simple handwritten note image with the OCR task.
Quality
- The model produced clear transcriptions for straightforward handwriting.
- For neat, legible handwriting, results were good for basic notes; irregular or stylized handwriting was not part of this test.
Formula Extraction
Test Setup
I tested a formula-heavy image first with OCR, then with a task prompt focused on formulas. The goal was to see if the model extracts math expressions robustly and preserves components that could be rendered later.
Example prompt variant:
image = Image.open("formula.jpg").convert("RGB")
inputs = processor(images=image, text="Task: formula", return_tensors="pt").to(device)
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=1024)
formula_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(formula_text)Observations
- The model extracted formula text components effectively.
- While it may not emit fully formatted LaTeX for every element, the key symbols and structure were present.
- With minor post-processing, the output can be rendered or converted to LaTeX-like syntax; a small cleanup sketch follows below.
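As a hedged example of that cleanup, the sketch below normalizes a few common Unicode math symbols to LaTeX commands and wraps the result for display. The replacement table is illustrative, not exhaustive, and assumes `formula_text` holds the raw output from the formula task above.
```python
import re

# Illustrative cleanup: map a handful of Unicode math symbols to LaTeX
# commands, collapse whitespace, and wrap the result in display-math markers.
REPLACEMENTS = {"×": r"\times", "÷": r"\div", "±": r"\pm", "≤": r"\leq", "≥": r"\geq"}

def to_display_latex(raw: str) -> str:
    for symbol, command in REPLACEMENTS.items():
        raw = raw.replace(symbol, f" {command} ")
    return "$$" + re.sub(r"\s+", " ", raw).strip() + "$$"

print(to_display_latex(formula_text))
```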
Chart Understanding
Test Setup
I tested a chart image to evaluate label recognition and numeric value extraction.
Results
- The model identified chart items and extracted numerical values well.
- X-axis and Y-axis labels were read accurately.
- A small label in the image was missed, indicating sensitivity to tiny text or atypical placement.
- Overall, chart value extraction was a strong point in this test.
Performance and Resource Use
Speed
- Inference was consistently fast for a 0.9B model across tasks like OCR, tables, formulas, and charts.
- Latin-script documents and simpler tasks ran quickest, while complex scripts and specialized tasks could take slightly longer.
Memory
- On GPU, VRAM usage around ~2.5 GB was typical during tests.
- CPU inference is feasible and practical for many use cases, given the compact size.
Practical Notes on Usage
Prompts and Tasks
- Explicitly specifying the task in the prompt (e.g., “Task: ocr”, “Task: table”, “Task: formula”, “Task: chart”) helps guide the output format.
- For mixed-content documents, run multiple passes with different task prompts, or segment the page into regions and process each region with the appropriate task (a minimal multi-pass loop is sketched below).
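Here is a minimal multi-pass sketch that reuses the `run_task` helper defined in the end-to-end section below; the file name is a placeholder.
```python
# Multi-pass sketch: query the same page once per task prompt and collect
# the outputs. Relies on the run_task helper from the end-to-end section.
page = "mixed_page.jpg"  # placeholder path
outputs = {task: run_task(page, task) for task in ["ocr", "table", "formula", "chart"]}
for task, text in outputs.items():
    print(f"--- {task} ---\n{text}\n")
```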
Post-Processing
- For tables, consider converting the extracted text into structured rows/columns using regex or layout heuristics.
- For formulas, apply light cleanup to convert to display-ready LaTeX.
- For charts, map extracted numbers to series/axes by matching detected labels and positions if available (see the parsing sketch below).
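For the chart case, here is a rough parsing sketch that pulls "label: value" pairs out of the raw output. The regex is a guess about the output shape, and `chart_text` stands for the raw text returned by the chart task; both are assumptions to adapt.
```python
import re

# Rough sketch: extract "label: 123" or "label - 1,234.5" pairs from the raw
# chart output. The pattern assumes values appear next to their labels.
pairs = re.findall(r"([A-Za-z][\w %]*?)\s*[:\-]\s*(\d[\d.,]*)", chart_text)
series = {label.strip(): float(value.replace(",", "")) for label, value in pairs}
print(series)
```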
File Handling
- Preprocess images for clarity: deskew, denoise, and increase contrast where appropriate.
- For multi-page PDFs, render pages to images at adequate DPI before inference (see the sketch below).
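As a sketch of that pipeline, pdf2image (a wrapper around Poppler, which must be installed separately) can render PDF pages to PIL images at a chosen DPI, and Pillow's ImageEnhance can lift contrast before OCR. The DPI and contrast factor below are illustrative starting points, not tuned values.
```python
from pdf2image import convert_from_path  # requires the Poppler utilities
from PIL import ImageEnhance

# Render each PDF page to an image at 300 DPI, then boost contrast slightly
# before running OCR. Both values are illustrative starting points.
pages = convert_from_path("document.pdf", dpi=300)
for i, page in enumerate(pages):
    page = ImageEnhance.Contrast(page).enhance(1.5)
    page.save(f"page_{i}.png")
```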
Strengths and Gaps
What Works Well
- Latin-script OCR (English, German, French, Indonesian) was consistently strong.
- Chinese recognition appeared accurate in the composite test.
- Chart value extraction and axis labels were a highlight.
- Handwriting transcription for simple notes was good.
- Formula text extraction contained all essential parts for later formatting.
What Needs Improvement
- Russian had visible errors in the composite test.
- Urdu output indicated joining and ligature issues.
- Tiny text and fine print can be missed, as seen in certain document fields and watermark detection.
- Arabic diacritic fidelity could be improved.
Step-by-Step: End-to-End Inference
Below is a simple flow you can adapt:
- Install dependencies: torch, transformers, pillow, jupyter.
- Load the model and processor:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "paddleocr-vl-model-id"  # replace with the actual model repo
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)
```
- Choose a task and run inference:
```python
def run_task(image_path, task, max_tokens=1024):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=f"Task: {task}", return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Examples:
print(run_task("multilingual.jpg", "ocr"))
print(run_task("invoice.jpg", "table"))
print(run_task("formula.jpg", "formula"))
print(run_task("chart.jpg", "chart"))
```
- Post-process as needed:
  - Convert table text to CSV/Markdown.
  - Clean up formulas for LaTeX rendering.
  - Map chart values to series and axes for reports.
Summary and Verdict
PaddleOCR-VL delivers a compact document OCR solution with multilingual support, page-level understanding, and specialized tasks for tables, charts, formulas, and handwriting. It runs locally with modest resources and showed fast inference in my tests. The language coverage is broad, and several scripts performed well, particularly Latin-based languages and Chinese. Arabic recognition was positive, though small diacritic details need refinement. Urdu output requires improvement in joining and ligature handling. Russian showed errors in the initial composite.
For tables and invoices, text extraction is usable, but you may need post-processing to achieve clean, structured outputs. Chart understanding was notably effective in reading values and axes, missing only small labels in some cases. Formula extraction captured essential content suitable for later formatting.
Given the 0.9B size, the model’s performance is commendable. The OCR space has progressed quickly, and expectations for perfect formatting out of the box are rising. Even so, PaddleOCR-VL offers a capable, local-first option with a strong balance of size, speed, and functionality, with clear paths for improvement in certain scripts and small-text fidelity.