PaddleOCR‑VL: How to Run Ultra‑Compact Multilingual Document OCR


PaddleOCR-VL is a compact, document-focused vision-language model designed to read full pages, extract text, and understand layouts, tables, charts, and formulas. It pairs a NaViT-style dynamic-resolution vision encoder with an ERNIE-4.5 language component of roughly 0.3B parameters, keeping the overall model ultra-compact at around 0.9B parameters. It supports 109 languages, which is the main focus of my evaluation here.

In this article, I walk through installation, the tasks it supports, and a series of practical tests. I cover multilingual OCR quality, table recognition, handwriting, formula extraction, and chart understanding. I also note performance and memory footprint so you can gauge local deployment.

My goal is to show what it can do right now, highlight strengths and gaps, and provide a concise setup guide for running it locally on CPU or GPU.

What Is PaddleOCR-VL?

PaddleOCR-VL is a multilingual document OCR model that integrates page-level vision with a compact language model to interpret complex document elements. It reads text across diverse scripts, understands document structure, and supports specialized tasks such as formula and chart extraction. The emphasis is on running locally with a small footprint, fast inference, and wide language coverage.

PaddleOCR-VL Overview

PaddleOCR-VL focuses on five core tasks:

  • OCR for printed text across 109 languages
  • Table recognition for extracting structured content
  • Chart understanding for text and numeric values
  • Formula extraction, including LaTeX-like expressions
  • Handwriting transcription for simple handwritten notes

It aims to balance speed and size with acceptable accuracy in many languages and document types. The model can be loaded with standard Python tools and tested via Jupyter notebooks for interactive workflows.

Table: PaddleOCR-VL at a Glance

Attribute              Details
Model Type             Multilingual vision-language model for documents
Size                   ~0.9B parameters
Vision Encoder         NaViT-style dynamic-resolution encoder
Language Component     ERNIE-4.5 (~0.3B parameters)
Supported Languages    109
Primary Tasks          OCR, table recognition, chart understanding, formula extraction, handwriting
Hardware               CPU or GPU; runs comfortably with low VRAM
Observed VRAM (GPU)    ~2.5 GB during tests
Typical Use Cases      Multilingual document OCR, data extraction, basic layout understanding
Deployment             Local, via Python and Transformers

Note: The table reflects observations during local testing and the information provided in the model description.

Key Features of PaddleOCR-VL

  • Multilingual OCR across 109 languages, spanning Latin and non-Latin scripts
  • Page-level understanding with support for tables, charts, and formulas
  • Compact size for local deployment on modest hardware
  • Fast inference relative to model size
  • Low memory footprint, suitable for CPU and GPU use
  • Task-specific prompts for OCR, table parsing, chart reading, and formula extraction

Setup and Installation of PaddleOCR-VL

Environment and Requirements

I ran the tests on Ubuntu with an NVIDIA RTX A6000 (48 GB VRAM). The model is small enough that a GPU is not required; CPU-only inference is feasible. The main dependencies are PyTorch, Transformers, and a few standard Python libraries.

Install Prerequisites

  • Python 3.9+ recommended
  • pip or conda environment

Step-by-step installation:

  1. Create and activate a virtual environment (optional but recommended).
  2. Install PyTorch and Transformers.
  3. Install Jupyter for interactive testing.

Example commands:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers pillow jupyter numpy

For CPU-only environments, install the CPU build of PyTorch instead:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Launch Jupyter Notebook

  • Start Jupyter and create a new Python notebook:
jupyter notebook
  • Within the notebook, import the required libraries and load the model. Use the appropriate model class for vision-to-text with image inputs.

Loading the Model

Below is a general loading pattern with Transformers. Replace the placeholder with the actual PaddleOCR-VL model identifier documented by the maintainers; depending on how the repository is packaged, you may also need trust_remote_code=True or a different Auto class, so check the model card for the recommended loading code.

from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "paddleocr-vl-model-id"  # replace with actual model repository
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)

For task-specific instructions, you can control prompts or task flags during preprocessing or generation. The tests below show the tasks in the order I evaluated them.

Tasks I Tested

  • OCR for printed text in multiple languages
  • Table recognition in invoices and structured documents
  • Handwriting transcription
  • Formula extraction
  • Chart analysis

In each case, I used a local image, encoded it, specified a task (e.g., “ocr”, “table”, “formula”, “chart”), and then generated the output for comparison.

Multilingual OCR: Initial Test

Test Setup

I began with a single image containing a multilingual paragraph spanning languages from Europe and Asia. The task was set to OCR. The idea was to see general recognition quality across scripts in one pass.

Example inference:

image = Image.open("multilingual_paragraph.jpg").convert("RGB")
inputs = processor(images=image, text="Task: ocr", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)
result = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(result)

Observations

  • English and several European languages (including German) were accurate.
  • Chinese text appeared correctly recognized on visual inspection.
  • Russian showed noticeable errors.
  • Arabic and Hindi content in the composite image was either missing or not recognized well enough to judge.

This motivated targeted tests with individual scripts to assess quality more clearly.

Arabic OCR

Test Setup

I used an Arabic-only image with both text and some embedded visuals. The goal was to see isolated script performance and measure resource usage.

Memory Footprint and Speed

  • VRAM hovered around 2.5 GB on GPU during inference.
  • Performance was responsive for a compact model; CPU inference is also feasible.

Quality

  • Visual inspection suggested solid character recognition.
  • Diacritic fidelity was mixed (e.g., some double dots appeared as single).
  • Overall, the Arabic output was promising for a 0.9B model, with room for refinement in finer script details.

French Document OCR

Test Setup

I tested a French driving license (synthetic). The task remained OCR.

Speed and Accuracy

  • Inference speed was good.
  • Main fields were captured well.
  • Small, fine-print elements were not fully captured.
  • Overall, acceptable results for general text; very small fonts were a challenge.

Hindi OCR

Test Setup

I ran OCR on a Hindi text image to evaluate recognition for an Indic script.

Quality

  • Inference took slightly longer than for Latin-script text.
  • Visual character matching suggested reasonable recognition.
  • A detailed linguistic verification would be helpful for precise error rates, but the general structure appeared intact.

Persian OCR

Test Setup

I evaluated a Persian text image to compare performance across related scripts.

Quality

  • Visual inspection suggested text was captured, but a detailed native-level review would be required for accuracy assessment.
  • Baseline recognition appeared plausible without clear indicators of systematic failure.

Swahili OCR

Test Setup

Swahili text was included as part of Latin-script evaluation.

Quality

  • Recognition quality was strong, consistent with Latin-script performance noted earlier.
  • Text output aligned well with the source image.

Urdu OCR

Test Setup

I ran OCR on an Urdu text image to probe recognition for a complex cursive script.

Quality

  • Output quality was weaker than for Arabic.
  • Issues included unjoined words and missing ligatures in multiple places.
  • This indicates meaningful gaps for Urdu in its current form.

Indonesian OCR

Test Setup

I tested Indonesian (Bahasa) text to further check Latin-script performance.

Quality

  • Output was accurate and consistent with earlier Latin-script results.
  • Straightforward printed text in Latin script appears to be a strong suit for the model.

Table Recognition: Invoice

Test Setup

I used an invoice image and set the task to table recognition. The goal was to extract structured content and assess how well the model identifies key fields.

Example invocation:

image = Image.open("invoice.jpg").convert("RGB")
inputs = processor(images=image, text="Task: table", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)
table_output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(table_output)

Observations

  • The model detected the presence of a logo area (“logo goes here”).
  • It missed a watermark.
  • Address blocks and header fields were recognized acceptably.
  • Extracted table text was usable, though not fully reconstructed into a neat table format.

You can post-process the extracted content into structured formats (e.g., Markdown, CSV, or JSON) using standard Python tooling. As a compact VLM, the model provides the raw text with reasonable structure cues, but it may not output a perfectly formatted table in one step.
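
As a rough illustration, here is a minimal post-processing sketch that turns decoded table text into CSV with the standard library. It assumes one row per line with cells separated by pipes or wide whitespace gaps; the actual delimiters depend on the model's output, so adjust the split logic to what you observe (the table_output variable comes from the table-task example above).

import csv
import re

def table_text_to_csv(table_text, csv_path):
    """Split decoded table text into rows and cells, then write a CSV file.

    Assumes one table row per line, with cells separated by '|' or by runs
    of two or more spaces; adapt the delimiter logic to the raw output.
    """
    rows = []
    for line in table_text.splitlines():
        line = line.strip().strip("|").strip()
        if not line:
            continue
        if "|" in line:
            cells = [c.strip() for c in line.split("|")]
        else:
            cells = re.split(r"\s{2,}", line)
        rows.append(cells)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return rows

# Example: rows = table_text_to_csv(table_output, "invoice.csv")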

Handwriting Recognition

Test Setup

I evaluated a simple handwritten note image with the OCR task.

Quality

  • The model produced clear transcriptions for straightforward handwriting.
  • Results were good for neat, legible notes; irregular or stylized handwriting was not part of this test.

Formula Extraction

Test Setup

I tested a formula-heavy image first with OCR, then with a task prompt focused on formulas. The goal was to see if the model extracts math expressions robustly and preserves components that could be rendered later.

Example prompt variant:

image = Image.open("formula.jpg").convert("RGB")
inputs = processor(images=image, text="Task: formula", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)
formula_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(formula_text)

Observations

  • The model extracted formula text components effectively.
  • While it may not emit fully formatted LaTeX for every element, the key symbols and structure were present.
  • With minor post-processing, the output can be rendered or converted to LaTeX-like syntax.
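
For example, a light cleanup pass could be as simple as the sketch below, which normalizes whitespace in the extracted formula text and wraps it in display-math delimiters. It assumes the output is already close to LaTeX; extend it with replacements for any model-specific tokens you run into.

import re

def clean_formula(raw):
    """Collapse whitespace, drop stray '$' delimiters, and wrap for display."""
    text = re.sub(r"\s+", " ", raw).strip().strip("$").strip()
    return f"$${text}$$"

# Example: print(clean_formula(formula_text))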

Chart Understanding

Test Setup

I tested a chart image to evaluate label recognition and numeric value extraction.

Results

  • The model identified chart items and extracted numerical values well.
  • X-axis and Y-axis labels were read accurately.
  • A small label in the image was missed, indicating sensitivity to tiny text or atypical placement.
  • Overall, chart value extraction was a strong point in this test.

Performance and Resource Use

Speed

  • Inference was consistently fast for a 0.9B model across tasks like OCR, tables, formulas, and charts.
  • Latin-script documents and simpler tasks ran quickest, while complex scripts and specialized tasks could take slightly longer.

Memory

  • On GPU, VRAM usage of roughly 2.5 GB was typical during tests.
  • CPU inference is feasible and practical for many use cases, given the compact size.

Practical Notes on Usage

Prompts and Tasks

  • Explicitly specifying the task in the prompt (e.g., “Task: ocr”, “Task: table”, “Task: formula”, “Task: chart”) helps guide the output format.
  • For mixed-content documents, run multiple passes with different task prompts or segment the page into regions and process each region with the appropriate task.
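
Here is a minimal sketch of the region-based approach, reusing the imports, processor, model, and device set up earlier. The crop boxes and file name are placeholders for illustration; in practice the regions would come from a layout-detection step or manual annotation.

# Hypothetical crop boxes (left, top, right, bottom) and the task for each region.
regions = [
    ((0, 0, 1200, 400), "ocr"),
    ((0, 400, 1200, 900), "table"),
    ((0, 900, 1200, 1600), "chart"),
]

page = Image.open("mixed_page.jpg").convert("RGB")
for box, task in regions:
    crop = page.crop(box)
    inputs = processor(images=crop, text=f"Task: {task}", return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=1024)
    print(task, processor.batch_decode(output_ids, skip_special_tokens=True)[0])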

Post-Processing

  • For tables, consider converting the extracted text into structured rows/columns using regex or layout heuristics.
  • For formulas, apply light cleanup to convert to display-ready LaTeX.
  • For charts, map extracted numbers to series/axes by matching detected labels and positions if available.
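
As an example of the last point, the sketch below pulls label/value pairs out of decoded chart text with a regular expression. It assumes the model lists each label next to its number (for instance "Q1: 41.5"); real output may need a different pattern or positional matching.

import re

def pair_chart_values(chart_text):
    """Match 'label: number' or 'label = number' pairs in decoded chart output.

    Assumes one label/value pair per match; adapt the pattern if the values
    come back in a table-like layout instead.
    """
    pairs = {}
    for match in re.finditer(r"([A-Za-z][\w %/().-]*?)\s*[:=]\s*(-?\d+(?:\.\d+)?)", chart_text):
        pairs[match.group(1).strip()] = float(match.group(2))
    return pairs

# Example: print(pair_chart_values(chart_output))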

File Handling

  • Preprocess images for clarity: deskew, denoise, and increase contrast where appropriate.
  • For multi-page PDFs, render pages to images at adequate DPI before inference.
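
A minimal sketch of both steps, assuming the pdf2image package (which requires poppler) for page rendering and Pillow for a simple contrast boost; deskewing and denoising would need additional tooling such as OpenCV.

from pdf2image import convert_from_path  # assumed dependency; needs poppler installed
from PIL import ImageEnhance, ImageOps

def render_pdf_pages(pdf_path, dpi=300):
    """Render each page of a PDF to a PIL image at the given DPI."""
    return convert_from_path(pdf_path, dpi=dpi)

def boost_contrast(image):
    """Light cleanup: autocontrast plus a mild extra contrast boost."""
    return ImageEnhance.Contrast(ImageOps.autocontrast(image)).enhance(1.3)

# Example: pages = [boost_contrast(p) for p in render_pdf_pages("report.pdf")]
# Each page image can then be passed to the inference call shown earlier.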

Strengths and Gaps

What Works Well

  • Latin-script OCR (English, German, French, Indonesian) was consistently strong.
  • Chinese recognition appeared accurate in the composite test.
  • Chart value extraction and axis labels were a highlight.
  • Handwriting transcription for simple notes was good.
  • Formula text extraction contained all essential parts for later formatting.

What Needs Improvement

  • Russian had visible errors in the composite test.
  • Urdu output indicated joining and ligature issues.
  • Tiny text and fine print can be missed, as seen in certain document fields and watermark detection.
  • Arabic diacritic fidelity could be improved.

Step-by-Step: End-to-End Inference

Below is a simple flow you can adapt:

  1. Install dependencies:

    • torch, transformers, pillow, jupyter
  2. Load the model and processor:

from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "paddleocr-vl-model-id"  # replace with actual model repo
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)
  3. Choose a task and run inference:
def run_task(image_path, task, max_tokens=1024):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=f"Task: {task}", return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Examples:
print(run_task("multilingual.jpg", "ocr"))
print(run_task("invoice.jpg", "table"))
print(run_task("formula.jpg", "formula"))
print(run_task("chart.jpg", "chart"))
  4. Post-process as needed:
    • Convert table text to CSV/Markdown.
    • Clean up formulas for LaTeX rendering.
    • Map chart values to series and axes for reports.

Summary and Verdict

PaddleOCR-VL delivers a compact document OCR solution with multilingual support, page-level understanding, and specialized tasks for tables, charts, formulas, and handwriting. It runs locally with modest resources and showed fast inference in my tests. The language coverage is broad, and several scripts performed well, particularly Latin-based languages and Chinese. Arabic recognition was positive, though small diacritic details need refinement. Urdu output requires improvement in joining and ligature handling. Russian showed errors in the initial composite.

For tables and invoices, text extraction is usable, but you may need post-processing to achieve clean, structured outputs. Chart understanding was notably effective in reading values and axes, missing only small labels in some cases. Formula extraction captured essential content suitable for later formatting.

Given the 0.9B size, the model’s performance is commendable. The OCR space has progressed quickly, and expectations for perfect formatting out of the box are rising. Even so, PaddleOCR-VL offers a capable, local-first option with a strong balance of size, speed, and functionality, with clear paths for improvement in certain scripts and small-text fidelity.
