
Nemotron Nano v2 by NVIDIA Locally


Introduction

NVIDIA’s new vision-language model can read documents, analyze images, understand videos, and answer questions about them, all while running locally on a single GPU. In this guide, I install Nemotron Nano v2 12B VL, run it on my own system, and test it across a range of visual tasks to see how it performs in practice.

I’ll show where to get the model, how to serve it with vLLM, how much VRAM it consumes, and how to run both image and video inference. Along the way, I’ll highlight prompt tips and accuracy observations.

What is Nemotron Nano v2?

Nemotron Nano v2 12B VL is NVIDIA’s vision-language model designed to run locally. It can process up to four high-resolution images at once, analyze video content, and handle long context lengths for complex instructions and multi-turn interactions.

Under the hood, it pairs a dedicated vision encoder with a Nemotron 5.5H 12B language backbone, for roughly 12.66B parameters in total. It uses a multi-tile image approach to handle diverse aspect ratios and large images.

Why it stands out

  • Runs locally on a single GPU.
  • Accepts high-resolution images (up to four at a time) and video frames.
  • Supports a 128k token context window for long prompts and transcripts.
  • Multi-tile image handling for documents, panoramas, and varied layouts.

Nemotron Nano v2 Overview

Item | Details
Model name | Nemotron Nano v2 12B VL
Type | Vision-language (image + video understanding)
Parameters | 12.66B
Context window | 128k tokens
Image inputs | Up to 4 high-resolution images at once
Image tiling | 512 x 512 tiles, up to 12 tiles per image
Panoramas | Handles wide layouts (e.g., ~3072 x 1024)
Vision encoder | Dedicated vision encoder paired with the language backbone
Language backbone | Nemotron 5.5H, 12B
Language support (observed) | English for OCR at the moment
Download sources | Hugging Face, NVIDIA Cloud
Disk size (approx.) | ~24.5 GB
Serving stack | vLLM
Test GPU | NVIDIA H100 80 GB
Observed VRAM usage | ~75 GB during inference

Key Features of Nemotron Nano v2 12B VL

  • Multi-image context: Analyze up to four high-resolution images together.
  • Video understanding: Extract frames and produce structured answers about actions, objects, and scenes.
  • Long prompts: 128k tokens for instructions, transcripts, and multi-step queries.
  • Multi-tile vision: Breaks images into 512 x 512 tiles (up to 12 per image) for documents, charts, and wide scenes. A ~3072 x 1024 panorama, for example, maps onto a 6 x 2 grid of twelve 512 x 512 tiles.
  • Local-first: Serve on your own hardware with vLLM and keep data on-prem.

Nemotron Nano v2 Download and Setup

Nemotron Nano v2 12B VL is available on Hugging Face and NVIDIA Cloud. After download, the model footprint is about 24.5 GB on disk.

I run it on Ubuntu with an NVIDIA H100 (80 GB VRAM), with vLLM as the serving engine. vLLM excels at high-throughput inference and handles scheduling and batching under the hood.

What you’ll need

  • Linux host with an NVIDIA GPU (tested on H100 80 GB VRAM).
  • NVIDIA drivers and CUDA installed.
  • Python environment for vLLM and client scripts.
  • Network access to download model weights.

Step-by-Step Nemotron Nano v2 Install

1) Install vLLM

  • Create and activate a Python environment.
  • Install vLLM and any required extras.
pip install vllm
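
Put together, a minimal sketch of this step might look like the following (the environment name is arbitrary):

python3 -m venv nemotron-env
source nemotron-env/bin/activate
pip install --upgrade pip
pip install vllm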

2) Download the model

  • Download from Hugging Face or NVIDIA Cloud to a local path, such as /models/nemotron-nano-v2-12b-vl.
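
For example, with the Hugging Face CLI (the repository id below is a placeholder; use the id listed on the model card):

huggingface-cli download <repo-id-from-model-card> \
  --local-dir /models/nemotron-nano-v2-12b-vl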

3) Start the server

  • Serve the model locally on port 8000 with vLLM’s OpenAI-compatible server (this is what Open WebUI and the client snippets later in this guide talk to).
python -m vllm.entrypoints.openai.api_server \
  --model /models/nemotron-nano-v2-12b-vl \
  --served-model-name nemotron-nano-v2-12b-vl \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 131072
  • Adjust max model length and other flags to match your GPU and workload.

4) Confirm it’s running

  • The log will show the model loading and memory allocation.
  • The server will bind to localhost:8000 and accept requests.
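
A quick way to confirm the OpenAI-compatible endpoint is up is to list the served models:

curl http://localhost:8000/v1/models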

Hardware and VRAM Notes

On the H100 80 GB, I observed about 75 GB of VRAM used during inference when loading the vision-language model and running tasks. Your usage will vary based on precision settings, batch sizes, context length, and concurrent requests.

If you need tighter memory control, lower the context window, reduce batch size, or use memory-related server flags to cap device usage.
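
As a sketch, a more memory-conservative launch might shorten the context window and cap GPU memory use (the values here are illustrative, not tuned):

python -m vllm.entrypoints.openai.api_server \
  --model /models/nemotron-nano-v2-12b-vl \
  --served-model-name nemotron-nano-v2-12b-vl \
  --port 8000 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85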

Inference Setup

For images, I used Open WebUI connected to the vLLM endpoint. For videos, I used a simple Python script that reads the video, extracts frames, and sends them to the API with relevant prompts.

This split works well: the UI gives fast iteration for image OCR and question answering, while the script gives fine control for video frame sampling and multi-query evaluation.
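
If you run Open WebUI in Docker, one way to point it at the vLLM endpoint is through its OpenAI-compatible connection settings. A sketch, with the image tag and environment variables per the Open WebUI documentation and the host address assuming vLLM runs on the same machine:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=not-needed \
  ghcr.io/open-webui/open-webui:main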

Nemotron Nano v2 Image Tests and Results

OCR on typed text (English)

I submitted a clean, typed image and asked the model to perform OCR in English. The model provided a brief description of the image and then transcribed the text. The results were accurate and consistent with the content.

Observed behavior:

  • One-line scene description followed by extracted text.
  • Clean OCR with correct punctuation and wording.
  • English support worked reliably for printed text.

OCR on handwritten text

I tested a handwritten note with strikethroughs and varied spacing. At the same time, I monitored VRAM usage with nvtop, which showed the model consuming over 75 GB when fully loaded.

Results:

  • The model captured most of the handwritten content correctly, including punctuation and many subtle marks.
  • It made a small error on a word choice but preserved much of the structure and intent.
  • With prompt tweaks (e.g., “extract all text exactly as written; indicate crossed-out words”), accuracy improves.

Invoice extraction

I uploaded a standard invoice and asked direct questions such as the total amount due. The model returned the exact amount.

Prompting tip:

  • Keep queries targeted (“What is the total amount due?” “What is the invoice date?”).
  • Avoid vague instructions; specific fields yield the best results.

Visual Question Answering: Traffic scenario

I presented a scene with a car and a traffic light and asked if the car was violating any rules. The model reasoned about the light’s position and the car’s apparent motion, concluding that the vehicle had likely passed the light before it turned red. This matched the visible situation.

Satellite image: Terrain and setting

I provided a satellite image showing a water body and surrounding land. When asked about legal permissions for building, the model correctly noted it lacked that information. When asked targeted questions about terrain characteristics, it identified features such as arid or dry regions, sandy plains, river banks, and variable topography. Focused prompts produced useful, relevant descriptions.

Chart analysis

I submitted a chart labeled with EV-related categories and asked for analysis. The model recognized the topic, read the labels, and summarized the chart structure. OCR on the chart elements was accurate, and the model produced a coherent overview.

Prompting Tips for Images

  • Be specific: Ask for exact fields, values, or structured outputs.
  • State language expectations: If you need English OCR, say so.
  • For OCR, request layout-aware extraction: Headings, bullet points, tables.
  • For reasoning tasks, set the scope: “Describe terrain types in this image” vs. broad asks.

Targeted prompts reduced mistakes and produced sharper, more relevant outputs.

Video Inference: Setup and Method

I used a Python script to:

  • Load a local video file.
  • Extract frames at a chosen interval.
  • Send frames to the vLLM endpoint with a sequence of prompts.
  • Aggregate model responses into a summary.

This approach gives control over frame sampling and lets you ask multiple questions about the same clip without reprocessing the entire video each time.
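
Below is a minimal frame-extraction helper, assuming OpenCV (opencv-python) is installed; the function name and signature match the pseudocode shown later in the client snippets.

import cv2

def extract_frames(video_path, every_n_frames=10):
    """Return every Nth frame of the video as JPEG-encoded bytes."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            ok_jpeg, buffer = cv2.imencode(".jpg", frame)
            if ok_jpeg:
                frames.append(buffer.tobytes())
        index += 1
    cap.release()
    return frames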

Video Test 1: Two kangaroos in a boxing ring

I ran an 8-second clip showing two kangaroos in a ring. The script extracted frames and sent them to the model.

Results:

  • “What is happening?” The model described two kangaroos facing each other with boxing gloves, in a staged setting.
  • “What animal is shown?” It identified two kangaroos and counted them correctly.
  • “Describe the environment.” It noted a spotlight, dim background, and spectators.
  • Summary: It produced a coherent overview of actions, setting, and objects.

The model handled object identification, counting, and scene description well.

Video Test 2: Couple walking in a forest

I used a clip of a couple walking through an autumn forest in traditional clothing.

Initial results:

  • The model described the walk, outfits, and fall leaves accurately.
  • It incorrectly mentioned a scarf being released. This was not present in the clip.

Prompt correction:

  • When I changed the question to a targeted query (“Is there any animal in the video?”), it answered correctly (“No, there is no animal visible”).
  • Follow-up descriptions were more aligned with the actual content when prompts were focused.

Takeaway:

  • Targeted prompts reduce hallucinations and keep the model grounded in the visible content.
  • Clear, scoped questions outperform open-ended asks when precision matters.

Practical Guidance for Running Locally

Serving tips

  • Keep the model path local and fast (NVMe if possible) to speed up loading.
  • Set a context length appropriate for your task. For OCR and multi-turn analysis, 128k is helpful; for quick tasks, smaller is fine.
  • If you hit memory ceilings, lower max tokens per request, reduce batch size, or limit concurrent clients.

Image workflow

  • For OCR: Provide clean images and specify “extract exact text.” For forms, ask for a structured output (e.g., key-value pairs).
  • For charts: Request titles, axes, legends, and a short summary.
  • For visual QA: Avoid ambiguous phrasing. Reference objects by position or color if needed.

Video workflow

  • Pick a reasonable frame sampling rate to balance coverage and speed.
  • Ask layered questions: actions, objects, count, environment, then summary.
  • If the model drifts, reset with a specific, verifiable question.

Troubleshooting and Observations

  • Language support: English OCR worked well in these tests. If you need other languages, verify support first.
  • Memory usage: Expect high VRAM consumption on large vision-language models, especially with long contexts and multi-image inputs.
  • Accuracy: OCR on typed text is strong; handwritten OCR is competent with minor errors. Charts and documents are handled well. For complex scenes, tighter prompts yield better results.
  • Hallucinations: If the question implies something that is not present, the model can follow the lead. Ask neutral, verifiable questions and avoid embedding assumptions.

Example Client Snippets

Below are minimal examples to query the running vLLM server. Adapt paths and endpoints to your setup.

Image OCR request (OpenAI-compatible chat completion)

{
  "model": "nemotron-nano-v2-12b-vl",
  "messages": [
    {"role": "user", "content": [
      {"type": "text", "text": "Extract all English text from this image. Keep punctuation."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_IMAGE>"}}
    ]}
  ],
  "max_tokens": 512
}
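
To send it, save the body to a file (here hypothetically ocr_request.json, with the image base64-encoded by your client) and POST it to the chat completions route:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @ocr_request.json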

Video frame sequence (conceptual flow)

# Pseudocode
frames = extract_frames("/path/to/video.mp4", every_n_frames=10)
for q in [
    "What is happening in this video?",
    "Which animals or people are present? Count them.",
    "Describe the environment and setting.",
    "Summarize the video in 3 sentences."
]:
    resp = send_to_vllm(model="nemotron-nano-v2-12b-vl",
                        images=frames, prompt=q, max_tokens=512)
    print(q, resp)
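
A hedged sketch of send_to_vllm, assuming the OpenAI-compatible server from the setup section and the extract_frames helper shown earlier (frames as JPEG bytes); the endpoint and model name follow that configuration:

import base64
import requests

def send_to_vllm(model, images, prompt, max_tokens=512,
                 endpoint="http://localhost:8000/v1/chat/completions"):
    """Send a text prompt plus a list of JPEG-encoded frames to the vLLM server."""
    content = [{"type": "text", "text": prompt}]
    # Note: keep the number of frames per request within what the model accepts
    # (the post mentions up to four high-resolution images at once).
    for jpeg_bytes in images:
        b64 = base64.b64encode(jpeg_bytes).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    resp = requests.post(endpoint, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]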

Measured Results Summary

Images

  • Typed OCR: Accurate transcription and layout cues.
  • Handwritten OCR: Good overall with a few misreads; improved with stricter prompts.
  • Invoice fields: Correct extraction for totals and targeted fields.
  • Traffic scene: Sensible reasoning aligned with the frame.
  • Satellite image: Useful terrain descriptions with focused questions.
  • Chart: Correct title/labels and coherent summary.

Videos

  • Kangaroo clip: Accurate action, object identification, and environment details.
  • Forest couple: Mostly accurate; one incorrect detail fixed by reframing the question.

Recommendations

  • Use targeted prompts. Specify the exact fields, counts, or facts you want.
  • For OCR-heavy tasks, request structured outputs (key-value, bullet lists).
  • Monitor VRAM. Plan batch sizes and context length according to your GPU.
  • For video, define a questioning sequence and avoid leading assumptions.

Conclusion

Nemotron Nano v2 12B VL delivers strong results for local image and video understanding. It read documents, handled charts, managed handwritten OCR, and produced consistent video descriptions. It ran fully on a single GPU, and while it required substantial VRAM in this setup, it stayed responsive under load.

The key to high-quality outputs was precise prompting. With focused instructions, the model produced accurate, useful answers across OCR, visual QA, and video summaries. If you need local, long-context vision-language capabilities, this model is a solid option to deploy and evaluate on your own hardware.
