Molmo 2: Open-Source Vision AI for Video, Pointing & Tracking
Table of Contents
- Installing Molmo 2 Locally
- VRAM Usage With Molmo 2
- Image Tests With Molmo 2
- Describing a Temple Image
- Grounded Pointing on a Photosynthesis Diagram
- OCR and Grounding on Handwritten Text
- Converting a Table to JSON
- Quick Multilingual Check With Arabic
- Video Tests With Molmo 2
- Cheerleaders Routine Analysis and Pointing
- Relationship Reasoning Without Hallucination
- Kangaroos in a Boxing Ring
- The Molmo 2 Architecture in Simple Words
- Performance Overview
- Final Thoughts
When a video analysis model arrives with great benchmarks and amazing quality, Christmas really has come. I installed Molmo 2 locally and tested it on images, videos, and multi-image scenarios. I also describe the architecture in simple words.
Molmo 2 is a family of open-weight vision models from the Allen Institute for AI designed to handle images, multi-image sets, and videos with strong grounding capabilities. Grounding in this context refers to the model's ability to connect its answers directly to specific visual evidence in the input rather than just describing what it sees in general terms. It can point to locations by outputting coordinates in images or frames, provide timestamps for events in videos, assign persistent IDs to track the same object across frames even if it is temporarily hidden, and tie its reasoning or counts to those exact spatial and temporal references.
Installing Molmo 2 Locally
I used an Ubuntu system with a single Nvidia H100 GPU with 80 GB of VRAM. Here is how I set it up:
- Create a virtual environment.
- Install all the prerequisites. Make sure you use the required Transformers version and the Molmo utilities.
- Run the script from their Hugging Face model card. I put a Gradio interface on top of it so I could load the model and play around with images and videos (a minimal sketch of this follows the list).
- On the first run it downloads the model shards; there are eight of them.
- Launch it and open it in a local browser.
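For reference, here is roughly what that loading script plus Gradio wrapper looks like. This is a minimal sketch, not the official script: the model ID is a placeholder, and the exact processor and generation calls are defined by the custom code on the Hugging Face card, so follow that card for the real invocation.

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "allenai/Molmo-2"  # placeholder; use the exact ID from the HF card

# trust_remote_code pulls in Molmo's custom modeling and processing code.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype="auto",  # bfloat16 on an H100
    device_map="auto",   # places the eight shards on the available GPU(s)
)

def answer(image, prompt):
    # Generic Transformers pattern; the real pre/post-processing comes from
    # the model card's script, so treat this call shape as an assumption.
    inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    return processor.tokenizer.decode(out[0], skip_special_tokens=True)

gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Prompt")],
    outputs="text",
).launch()
```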
VRAM Usage With Molmo 2
- Images: around 38.4 GB of VRAM, so image workloads would easily fit on a 48 GB card.
- Videos: around 43 GB of VRAM. Still under 48 GB, so a 48 GB GPU like an A6000 can handle it too (a quick way to measure this yourself is sketched below).
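Those numbers came from watching the GPU during inference. If you want to reproduce the measurement, PyTorch's CUDA allocator statistics are a quick, generic way to do it; this snippet is not part of the Molmo scripts.

```python
import torch

def report_vram(label: str) -> None:
    """Print current and peak allocated VRAM on the first CUDA device."""
    current = torch.cuda.memory_allocated(0) / 1024**3
    peak = torch.cuda.max_memory_allocated(0) / 1024**3
    print(f"{label}: current {current:.1f} GB, peak {peak:.1f} GB")

torch.cuda.reset_peak_memory_stats(0)
# ... run an image or video query here ...
report_vram("after video query")
```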
Image Tests With Molmo 2
Describing a Temple Image
I selected an image of a temple and asked it to describe the image. It responded that the image showcases a magnificent Hindu temple located in Thanjavur, Tamil Nadu. I thought this was the Madurai temple, but if you are from that region, please confirm. It described the architecture really well even though the image was not that clear, calling it an architectural marvel and an important religious site.
Grounded Pointing on a Photosynthesis Diagram
I used a photosynthesis diagram and asked it to point to the cloud. It was fast and returned coordinates with clear Target markers. I then asked it to point to the sun, and it correctly returned the target with coordinates.
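Because the pointing answers come back as text with embedded coordinates, they are easy to post-process. Below is a small parsing sketch; it assumes the XML-style point tags and percentage-of-image coordinates that the original Molmo emitted, so adjust the pattern to whatever Molmo 2 actually returns for your prompts.

```python
import re

# Example pointing output (format assumed from the original Molmo).
reply = '<point x="23.5" y="12.0" alt="cloud">cloud</point>'

POINT_RE = re.compile(r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>')

def extract_points(text: str, width: int, height: int) -> list[tuple[int, int]]:
    """Convert percentage coordinates into pixel positions for a given image size."""
    return [
        (int(float(x) / 100 * width), int(float(y) / 100 * height))
        for x, y in POINT_RE.findall(text)
    ]

print(extract_points(reply, width=1280, height=720))  # [(300, 86)]
```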
This model is open weight, Apache 2.0 licensed, and comes with open datasets; everything about it is open.
OCR and Grounding on Handwritten Text
It is not an OCR model as far as I know, but I tested it on a handwritten, AI-generated image and asked it to extract the text. It extracted the text well and noted that the word "out" appears to be a misspelling of "all". I then asked it to point to the crossed-out words, and it returned multiple target markers in the right places. Very impressive.
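To sanity-check grounding results like these, it helps to draw the returned points back onto the image. A minimal sketch with Pillow, fed by the output of the hypothetical extract_points helper from the earlier snippet:

```python
from PIL import Image, ImageDraw

def draw_markers(image_path: str, points: list[tuple[int, int]], out_path: str) -> None:
    """Overlay a circle at each predicted point and save the annotated image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for x, y in points:
        r = 8
        draw.ellipse((x - r, y - r, x + r, y + r), outline="red", width=3)
    img.save(out_path)

# Points as produced by the extract_points sketch above (pixel coordinates).
draw_markers("handwriting.png", [(300, 86), (512, 240)], "handwriting_marked.png")
```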
Converting a Table to JSON
I asked it to convert a table into JSON. It identified all the columns and rows correctly and produced a clean JSON representation of the table. Wonderfully done.
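For structured-output prompts like this, it is worth validating the reply programmatically. A short sketch, reusing the hypothetical answer() helper from the loading sketch; the fence-stripping is an assumption about how the model wraps its reply.

```python
import json
from PIL import Image

PROMPT = "Convert this table to JSON. Return only valid JSON, no commentary."
reply = answer(Image.open("table.png"), PROMPT)  # hypothetical helper from the loading sketch

# Strip a markdown code fence if the model wraps its answer in one (an assumption).
cleaned = reply.strip().removeprefix("```json").removesuffix("```").strip()

try:
    data = json.loads(cleaned)
    print(f"Parsed {len(data)} rows")
except json.JSONDecodeError as err:
    print(f"Reply was not valid JSON: {err}")
```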
Quick Multilingual Check With Arabic
I tried a quick Arabic OCR test using an old newspaper cutting. It repeated text in places, but it understood that the piece was about diplomatic relations because the leaders from both countries were present. Not bad in terms of understanding.
Video Tests With Molmo 2
Cheerleaders Routine Analysis and Pointing
I used an AI-generated clip of two cheerleaders performing on a football field and asked what was happening in the video. It said the video shows two college cheerleaders performing an energetic routine on a football field, then described their attire and actions. It misread a printed word on the clothing, but the rest of the description was solid. I asked it to point to the pom-poms, and it returned coordinates for the targets.
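For video inputs, the usual pattern is to sample a fixed number of frames and hand them to the processor alongside the prompt. Here is a generic OpenCV sketch; the frame count and the downstream processor call are assumptions, so follow the model card's script for the real preprocessing.

```python
import cv2
from PIL import Image

def sample_frames(video_path: str, num_frames: int = 16) -> list[Image.Image]:
    """Uniformly sample frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames[:num_frames]

frames = sample_frames("cheerleaders.mp4")
```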
Relationship Reasoning Without Hallucination
I used a video of a couple walking through a forest in autumn and asked what the relationship between the two people was. The model admitted that it does not know the specific relationship, noted that walking hand in hand suggests a close and affectionate connection, and said they could be partners, friends, family members, or acquaintances. This was exactly the behavior I wanted to see, as it avoided hallucination.
Kangaroos in a Boxing Ring
I used a video of two kangaroos sparring in a boxing ring and asked which kangaroo I should bet on to win. The model said I should bet on the kangaroo on the left, adding that it has an advantage in terms of boxing gloves and appears more aggressive, while the right kangaroo is on the defensive. I am not sure about that reasoning, but it identified the colors correctly and noted that the left kangaroo attacked first.
The Molmo 2 Architecture in Simple Words
The architecture is fairly simple. It processes video frames or images through a vision encoder that turns them into visual tokens, then uses a lightweight connector to interleave those tokens with timing information and text before feeding them into a powerful language model backbone. This setup enables bidirectional attention across visual tokens from different frames and boosts performance on video and multi-image tasks.
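To make that flow concrete, here is a toy PyTorch sketch of the encoder-connector-LLM pipeline. It is purely illustrative of the description above, not Molmo 2's actual implementation: all module choices, dimensions, and the timestamp embedding are made up for the example.

```python
import torch
import torch.nn as nn

# Toy dimensions, not Molmo 2's real sizes.
D_VIS, D_LLM, N_TOK = 64, 128, 9

vision_encoder = nn.Linear(3 * 32 * 32, D_VIS * N_TOK)  # stand-in for a real ViT
connector = nn.Linear(D_VIS, D_LLM)                     # lightweight projection into LLM space
time_embed = nn.Embedding(1000, D_LLM)                  # stand-in timestamp embedding

frames = torch.randn(4, 3 * 32 * 32)        # four flattened toy "frames"
timestamps = torch.tensor([0, 10, 20, 30])  # e.g. deciseconds into the clip

visual = vision_encoder(frames).view(4, N_TOK, D_VIS)  # per-frame visual tokens
visual = connector(visual)                             # now in the LLM's embedding space

# Interleave one timestamp token before each frame's tokens, in temporal order.
seq = torch.cat(
    [torch.cat([time_embed(t)[None, :], v]) for t, v in zip(timestamps, visual)]
)
print(seq.shape)  # (4 * (1 + N_TOK), D_LLM) = (40, 128)
```

In the real model, this interleaved sequence is prepended to the text embeddings and fed to the language backbone, with the visual positions attending bidirectionally to each other across frames while text generation stays causal.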
Performance Overview
On the shared benchmarks, Molmo 2 leads open-weight models in areas like video tracking, short-video question answering, counting with visual grounding, and image reasoning, while staying competitive with much larger proprietary systems trained on curated high-quality data.
Final Thoughts
Molmo 2 came in with great benchmarks and strong real-world performance in my tests across images, videos, and multi-image setups. The grounding features are practical and fast, pointing to exact coordinates and timestamps and tracking objects over time. It is open weight, Apache 2.0 licensed, and trained on open datasets. A fairly good model to end the year on a high note.