Molmo 2: Open-Source Vision AI for Video, Pointing & Tracking

7 min read
#AI

When a video analysis model shows up with great benchmarks and impressive quality, it feels like Christmas has come early. I installed Molmo 2 locally and tested it on images, videos, and multi-image scenarios. I also describe the architecture in simple words.

Molmo 2 is a family of open-weight vision models from the Allen Institute for AI, designed to handle images, multi-image sets, and videos with strong grounding capabilities. Grounding in this context refers to the model's ability to connect its answers directly to specific visual evidence in the input rather than just describing what it sees in general terms. It can point to locations by outputting coordinates in images or frames, provide timestamps for events in videos, assign persistent IDs to track the same object across frames even if it is temporarily hidden, and tie its reasoning or counts to those exact spatial and temporal references.

Installing Molmo 2 Locally

I used an Ubuntu system with a single NVIDIA H100 GPU with 80 GB of VRAM. Here is how I set it up:

  • Create a virtual environment.
  • Install all the prerequisites. Make sure you use the Transformers version and Molmo utilities specified on the model card.
  • Run the script obtained from their Hugging Face card. I put a Gradio interface on top of it, downloaded the model, and created an interface to play around with images and videos (a minimal sketch follows this list).
  • On the first run it downloads the model shards. There are eight shards.
  • Launch it and access it in the local browser.
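
Here is a minimal sketch of what that setup looks like, assuming a standard Transformers loading path and a simple Gradio wrapper. The model id, the processor call, and the answer helper are placeholders I am using for illustration; copy the exact inference snippet from the Molmo 2 Hugging Face card for the real thing.

```python
# Minimal local-playground sketch. MODEL_ID and the processor/generate calls are
# placeholders; the Molmo 2 Hugging Face card has the authoritative inference code.
import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "allenai/Molmo-2-..."  # placeholder: use the exact id from the model card

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # the 80 GB H100 holds the full model; shards download on first run
)

def answer(image, prompt):
    # Generic image+text call; the card's script may expose a slightly different API.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=512)
    return processor.decode(output[0], skip_special_tokens=True)

# Simple Gradio front end, similar to the interface I used for testing.
demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Prompt")],
    outputs=gr.Textbox(label="Response"),
    title="Molmo 2 playground",
)

if __name__ == "__main__":
    demo.launch()  # open the printed local URL in the browser
```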

VRAM Usage With Molmo 2

  • Images: around 38.4 GB of VRAM, so it easily fits on a 48 GB card for image work.
  • Videos: around 43 GB of VRAM, still under 48 GB, so a 48 GB GPU such as an A6000 can handle it (a quick way to check this yourself is sketched below).
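
If you want to verify these numbers on your own card, PyTorch's peak-memory counters are enough. This is a generic sketch, not part of the Molmo 2 scripts:

```python
# Reset the peak-memory counter, run one query, then read the peak allocation.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one image or video query through the model here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak_gib:.1f} GiB")  # should roughly match the numbers above
```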

Image Tests With Molmo 2

Describing a Temple Image

I selected an image of a temple and asked it to describe the image. It responded that the image showcases a magnificent Hindu temple located in Thanjavur, Tamil Nadu. I thought this was the Madurai temple, but if you are from that region please confirm. It described the architecture really well even though the image was not that clear, calling it an architectural marvel and an important religious site.

Grounded Pointing on a Photosynthesis Diagram

I used a photosynthesis diagram and asked it to point to the cloud. It was fast and returned coordinates with clear Target markers. I then asked it to point to the sun, and it correctly returned the target with coordinates.
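
For reference, here is how I would pull those coordinates out of a pointing answer programmatically. This assumes the output uses Molmo-style <point> tags with x/y values; Molmo 2's exact output format may differ, so adjust the pattern to what the model actually returns:

```python
# Hypothetical parser for pointing answers, assuming Molmo-style tags such as:
#   <point x="52.1" y="17.3" alt="cloud">cloud</point>
import re

POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>(.*?)</point>')

def extract_points(answer: str):
    """Return a list of (x, y, label) tuples from a pointing response."""
    return [(float(x), float(y), label) for x, y, label in POINT_RE.findall(answer)]

print(extract_points('<point x="52.1" y="17.3" alt="cloud">cloud</point>'))
# [(52.1, 17.3, 'cloud')]
```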

This model is open weight, Apache 2.0 licensed, and trained on open datasets; everything is open.

OCR and Grounding on Handwritten Text

It is not an OCR model as far as I know, but I tested it on a handwritten, AI-generated image and asked it to extract the text. It extracted the text well and noted that the word "out" appears to be a misspelling of "all". I then asked it to point to the crossed-out words, and it returned multiple target markers at the right places. Very impressive.
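
To double-check markers like these, you can draw the parsed points back onto the image. A hypothetical helper, reusing extract_points from the earlier sketch and assuming the coordinates are percentages of the image size, as they were in the original Molmo:

```python
# Overlay helper: draw each (x_pct, y_pct, label) point as a red circle with a label.
from PIL import Image, ImageDraw

def draw_points(image: Image.Image, points, radius: int = 8):
    img = image.copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for x_pct, y_pct, label in points:
        # Assumption: x/y are percentages of width/height, as in the original Molmo.
        x, y = x_pct / 100 * w, y_pct / 100 * h
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], outline="red", width=3)
        draw.text((x + radius + 2, y - radius), label, fill="red")
    return img
```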

Converting a Table to JSON

I asked it to convert a table into JSON. It defined all the columns and rows correctly and produced a clean JSON representation of the table. Wonderfully done.
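
Here is a sketch of how I would script this test, reusing the hypothetical answer helper from the setup sketch; the image path and prompt are just examples:

```python
# Table-to-JSON test. "answer" is the hypothetical helper from the setup sketch.
import json
import re
from PIL import Image

table_img = Image.open("table.png")  # placeholder path to the table image
reply = answer(table_img, "Convert this table into JSON. Return only the JSON.")

# The reply may include extra prose or a code fence, so grab the JSON body itself.
match = re.search(r"\[.*\]|\{.*\}", reply, flags=re.DOTALL)
rows = json.loads(match.group(0)) if match else None
print(rows)
```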

Quick Multilingual Check With Arabic

I tried a quick Arabic OCR test using an old newspaper cutting. It repeated text in places, but it understood that the piece was about diplomatic relations because the leaders from both countries were present. Not bad in terms of understanding.

Video Tests With Molmo 2

Cheerleaders Routine Analysis and Pointing

I used an AI-generated clip of two cheerleaders performing on a football field and asked what was happening in the video. It said the video shows two college cheerleaders performing an energetic routine on a football field, then described their attire and actions. It misread a printed word on the clothing, but the rest of the description was solid. I asked it to point to the pom-poms, and it returned coordinates for the targets.
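
For the video tests, the interface needs frames to hand to the model. The Molmo 2 utilities may handle video decoding themselves; if not, a uniform frame sampler like this OpenCV sketch is one way to do it:

```python
# Hypothetical frame sampler: uniformly sample up to num_frames RGB frames from a video.
import cv2
from PIL import Image

def sample_frames(path: str, num_frames: int = 16):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes BGR; convert to RGB before handing frames to the processor.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        if len(frames) == num_frames:
            break
    cap.release()
    return frames

# frames = sample_frames("cheerleaders.mp4")  # then pass the frames to the model
```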

Relationship Reasoning Without Hallucination

I used a video of a couple walking through a forest in autumn and asked what the relationship between the two people is. The model admitted that it does not know the specific relationship, noted that walking hand in hand suggests a close and affectionate connection, and said they could be partners, friends, family members, or acquaintances. This was the behavior I wanted to see, as it avoided hallucination.

Kangaroos in a Boxing Ring

I used a video of two kangaroos sparring in a boxing ring and asked which kangaroo I should bet on to win. The model said I should bet on the kangaroo on the left, adding that the left kangaroo has an advantage in terms of boxing gloves and appears more aggressive, while the right kangaroo is on the defensive. I am not sure about that reasoning, but it identified the colors correctly and noted that the left kangaroo attacked first.

The Molmo 2 Architecture in Simple Words

The architecture is fairly simple. It processes video frames or images through a vision encoder that turns them into visual tokens, then uses a lightweight connector to interleave those tokens with timing information and text before feeding them into a powerful language model backbone. This setup enables bidirectional attention across visual tokens from different frames and boosts performance on video and multi-image tasks.
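
As a rough mental model, the data flow looks like this. This is conceptual pseudocode of the description above, not the actual implementation:

```python
# Conceptual sketch of the Molmo 2 pipeline: encode frames, interleave visual tokens
# with timing info and text, then run the language model backbone.
def molmo2_forward(frames, timestamps, text_tokens,
                   vision_encoder, connector, language_model):
    # 1. Each frame or image becomes a set of visual tokens.
    visual_tokens = [vision_encoder(frame) for frame in frames]

    # 2. A lightweight connector interleaves the visual tokens with per-frame timing
    #    information and the text prompt; visual tokens from different frames can
    #    attend to each other bidirectionally.
    sequence = connector(visual_tokens, timestamps, text_tokens)

    # 3. The LLM backbone consumes the combined sequence and emits answers, points
    #    (coordinates), timestamps, or track IDs as ordinary output tokens.
    return language_model(sequence)
```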

Performance Overview

On the shared benchmarks, Molmo 2 leads open-weight models in areas like video tracking, short-video question answering, counting with visual grounding, and image reasoning, while staying competitive with much larger proprietary systems trained on curated, high-quality data.

Final Thoughts

Molmo 2 came in with great benchmarks and strong real-world performance in my tests across images, videos, and multi-image setups. The grounding features are practical and fast, pointing to exact coordinates and timestamps and tracking objects over time. It is open weight, Apache 2.0 licensed, and trained on open datasets. A fairly good model to end the year on a high note.
