
Meta Releases SAM 3: Run Promptable Video Segmentation Locally


Meta has released SAM 3, a promptable system for interactive video segmentation that can detect, segment, and track objects with text or visual prompts. It works with points, boxes, and masks, and responds quickly to simple instructions. The process feels natural: load a video, specify an object, and propagate the result through the entire clip.

In this guide, I install SAM 3 locally, explain how it works in practice, and walk through real tests. You’ll see how text prompts and propagation behave across different scenes, what kind of hardware it needs, and how well it performs on a range of targets.

What Is SAM 3?

SAM 3 is a unified model for promptable video segmentation. It supports text prompts, visual prompts, and frame-by-frame interaction, producing consistent masks and tracked instances across time. The model can also target an open vocabulary concept from a short phrase or a small set of examples.

It is designed for interactive workflows. You can adjust prompts, review frames, and propagate masks over the sequence. The interface used here is a simple Gradio app, so everything runs locally and is easy to control.

SAM 3 focuses on consistent object masks over time. You can specify one or more targets in a frame and propagate the segmentation to all frames in the video to create a stable output that you can download and review.

SAM 3 Overview:

  • Core tasks: Detect, segment, and track objects in video
  • Prompt types: Text, points, boxes, masks
  • Interaction: Interactive and promptable; per-frame review and propagation
  • Open vocabulary: Targets concepts from short text phrases or example masks
  • Model size: ~3.4 GB (downloaded at first run)
  • VRAM usage: Just under 4 GB during tests
  • CPU support: Can run on CPU; slower but functional
  • Local interface: Gradio; served at http://localhost:7860
  • Output: Consistent masks and tracked IDs across frames; downloadable results
  • Tested OS/GPU: Ubuntu; NVIDIA RTX 6000 with 48 GB VRAM

Key Features of SAM 3

  • Promptable segmentation across video with text or visual inputs.
  • Interactive per-frame control and global propagation.
  • Instance tracking with stable IDs.
  • Open vocabulary support for a wide range of concepts.
  • Lightweight footprint for a video model: ~3.4 GB weights and sub-4 GB VRAM use during tests.
  • Works locally with a simple Gradio interface.

Meta SAM 3 Install and Run Locally

SAM 3 runs on a Linux machine with either a CUDA-capable GPU or the CPU alone. In my tests it ran smoothly on Ubuntu with an NVIDIA RTX 6000, and the sub-4 GB VRAM usage suggests mid-range GPUs can handle it as well. CPU-only execution is possible for smaller jobs or previews.
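
Before installing, you can confirm what your machine offers with a short PyTorch check. This assumes torch is already installed; it is a generic check, not part of the SAM 3 package itself.

```python
import torch

# Report the GPU that PyTorch sees; SAM 3 can fall back to CPU if none is available.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name} ({props.total_memory / 1024**3:.1f} GB VRAM)")
else:
    print("No CUDA GPU detected; expect slower CPU-only inference.")
```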

Prerequisites

  • Python 3.9+ and pip
  • Git
  • CUDA-enabled GPU and drivers (optional but recommended)
  • Enough disk space for the model (~3.4 GB) and your videos

Step-by-Step Installation

  • Set up a Python environment (virtualenv or conda recommended).
  • Install required packages from the model’s Hugging Face page.
  • Add a small Gradio wrapper to expose the model as a local web app.
  • Start the app to trigger the first-time model download.

The first launch downloads the ~3.4 GB weights automatically. On subsequent runs, startup is immediate.
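
For reference, the wrapper can stay very small. The sketch below is illustrative only: the model-loading and inference calls are placeholders rather than the actual SAM 3 API, so substitute whatever the model's Hugging Face page documents. Only the Gradio plumbing is concrete.

```python
import gradio as gr

MODEL = None  # lazily loaded on the first request

def load_sam3():
    # Placeholder: load the SAM 3 checkpoint per the model card.
    # The ~3.4 GB weights download automatically the first time this runs.
    raise NotImplementedError

def segment(video_path, text_prompt):
    global MODEL
    if MODEL is None:
        MODEL = load_sam3()
    # Placeholder: prompt on a reference frame, propagate across the clip,
    # and return a rendered video with masks overlaid.
    return video_path

demo = gr.Interface(
    fn=segment,
    inputs=[gr.Video(label="Input video"), gr.Textbox(label="Text prompt, e.g. 'lion'")],
    outputs=gr.Video(label="Segmented output"),
    title="SAM 3 promptable video segmentation",
)

if __name__ == "__main__":
    demo.launch(server_port=7860)  # http://localhost:7860
```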

Launch the Interface

  • Run the Gradio script.
  • The app serves at http://localhost:7860.
  • Load a video and choose your prompt type (text or visual).

From there, you can select frames, apply prompts, propagate across the timeline, and export the results.

Using the Gradio Interface

The interface keeps the workflow simple, letting you focus on prompts and outputs.

Load a Video

  • Upload a local video file.
  • Use the frame slider to browse through the clip.
  • Select a reference frame to define your prompt.

Inspecting frames before prompting helps ensure you target the right instance in the right context.
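
If you prefer to inspect frames outside the UI, a small OpenCV helper can pull a reference frame for a closer look. The file name and frame index below are just examples.

```python
import cv2

def grab_frame(video_path: str, index: int):
    """Return a single frame so you can check the target before prompting."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"Could not read frame {index} from {video_path}")
    return frame  # BGR array

cv2.imwrite("reference_frame.png", grab_frame("clip.mp4", 120))
```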

Apply Prompts

  • Text prompt: Type a short phrase (for example, “lion,” “human,” “bird,” “banana,” “flame”).
  • Visual prompt: Add points, draw a box, or provide a mask on the reference frame.

SAM 3 responds immediately, highlighting all detected instances that match your prompt on the current frame.
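
Conceptually, the two prompt styles carry the information below. The dictionary layout is purely illustrative and the real interface may name fields differently; the point and box conventions follow earlier SAM releases, where label 1 marks foreground and 0 marks background.

```python
# Illustrative only: how the UI's prompt inputs can be represented.
text_prompt = {"type": "text", "value": "lion"}

point_prompt = {
    "type": "points",
    "points": [(412, 305), (430, 318)],  # (x, y) clicks on the reference frame
    "labels": [1, 1],                    # 1 = foreground, 0 = background
}

box_prompt = {"type": "box", "box": (350, 250, 520, 400)}  # x1, y1, x2, y2
```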

Propagate and Export

  • Click propagate to extend the segmentation through the entire video.
  • Review the mask consistency by scrubbing the timeline.
  • Export the smoothed result as a processed video or masks for further use.

During propagation, you’ll see progress and may observe instance IDs assigned to targets. If you spot issues, adjust prompts and re-propagate.
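
If you export raw per-frame masks instead of the rendered video, a short OpenCV pass can blend them back onto the clip. This assumes the masks arrive as one boolean array per frame, which is an assumption about the export format rather than a documented contract.

```python
import cv2
import numpy as np

def overlay_masks(video_path, masks, out_path="segmented.mp4", alpha=0.5):
    """Blend per-frame boolean masks onto the source clip and write a new video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    for mask in masks:                       # one HxW boolean array per frame
        ok, frame = cap.read()
        if not ok:
            break
        overlay = np.zeros_like(frame)
        overlay[mask] = (0, 255, 0)          # green on segmented pixels
        writer.write(cv2.addWeighted(frame, 1.0, overlay, alpha, 0))
    cap.release()
    writer.release()
    return out_path
```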

Hands-On Results

I ran SAM 3 on several videos to evaluate detection quality, temporal consistency, vocabulary coverage, and resource usage. The tests follow the exact sequence below.

Animals in Motion

I started with a wildlife clip featuring a pursuit scene. After uploading the video, I scanned a few frames with the slider and used the text prompt “lion.” The model isolated the lions cleanly and then propagated the segmentation across the entire clip.

Prompt: “lion”

  • Detection: Correctly segmented the target across varied motion and distance.
  • Propagation: Masks remained stable through occlusions and camera movement.
  • IDs: Consistent instance labeling across frames.

Performance and Resources

  • VRAM usage hovered just under 4 GB during inference.
  • CPU-only mode is feasible; performance naturally slows but remains stable.

This test established a baseline: SAM 3 handles multi-object motion and varying scales well with a single prompt.

People in a Crowd

Next, I tested a clip with coordinated human motion. The video featured a group scene, and the prompt “human” immediately highlighted the people of interest.

  • Detection: Accurate segmentation of individuals across the frame.
  • Propagation: Consistent IDs and masks across the timeline.
  • Export: The app supports downloading the smoothed output after propagation.

Compared with earlier versions of SAM, the promptable behavior across frames feels more robust. The model unifies multimodal prompts and segmentation in one system, and it can exhaustively segment every instance that matches an open vocabulary phrase.

Flock of Birds Over Water

I then moved to a more demanding case: a large flock flying over a river valley. I prompted with “bird” on a representative frame.

  • Detection: Most birds were segmented correctly on the key frames.
  • Propagation: Coverage remained good but not perfect; a few instances were missed, especially smaller or ambiguous shapes in motion.
  • Notes: In some frames, a clearly visible bird was missed, and a few detections were conservative around cluttered regions.

The model still performed strongly relative to prior SAM releases and other segmentation tools I’ve used. On Meta’s SA-Co benchmark, which spans roughly 270K unique concepts (over 50 times more than many existing benchmarks), SAM 3 reports strong results. In practice, this breadth shows up as solid text-prompt coverage across a wide set of natural categories.

Single Object: Banana

For a simpler scene, I tested a close-up clip with a banana. The text prompt “banana” segmented the object accurately on the first try.

  • Detection: Clean mask with correct instance ID.
  • Propagation: Stable mask over the clip; no drift around edges.

This scenario highlights how well the model handles distinct objects with strong visual cues.

Small Target: Candle Flame

Finally, I tested a low-light scene with a person walking through a snowy forest holding a candle. The challenge is the small, bright flame.

  • Prompt: “flame”
  • Detection: The flame was identified accurately.
  • Propagation: The mask tracked the flame through motion and lighting changes.

This is a strong result for a small, dynamic target with high contrast against a dark background.

Model Design Highlights

SAM 3 focuses on practical, prompt-driven video segmentation. A few design aspects stand out during use:

  • Unified approach: One system handles text prompts, visual prompts, and per-frame masks.
  • Promptable segmentation: You can specify a target with a short phrase or simple visual cues.
  • Open vocabulary: The text interface covers a wide set of concepts without manual class lists.
  • Instance consistency: IDs and masks remain stable through time when you propagate.

The interaction model is simple by intent: prompt on a frame, preview, and propagate. That keeps the loop fast and supports iterative refinement.

Performance Notes

Resource usage stayed low through all tests:

  • Model size: ~3.4 GB download on first run.
  • VRAM: Just under 4 GB in typical cases.
  • CPU: Works on CPU, suitable for experimentation and smaller clips.
  • Speed: Propagation runs at interactive speeds on a modern GPU, with responsive feedback when scrubbing frames.
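
To reproduce the VRAM figure on your own hardware, PyTorch's built-in memory counters give a quick readout after a propagation pass:

```python
import torch

# Call torch.cuda.reset_peak_memory_stats() before a run, then check afterwards.
if torch.cuda.is_available():
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak VRAM this session: {peak_gb:.2f} GB")
else:
    print("Running on CPU; no CUDA memory stats to report.")
```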

The Gradio interface also adds a clear workflow: upload a clip, prompt a concept, propagate, and export. It is straightforward to reset and switch between test videos.

Step-by-Step Guide: Running Your First Project

Follow this process to reproduce the results with your own clips.

Set Up the Environment

  • Create a fresh Python environment.
  • Install the model requirements from Hugging Face.
  • Add Gradio to expose a simple UI.
  • Launch the app to download the model weights.

Keep at least 5 GB free to accommodate the model, intermediate caches, and outputs.
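
A quick check with Python's standard library confirms the headroom before the first download:

```python
import shutil

# Check free space on the current drive; the weights alone need ~3.4 GB.
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free space here: {free_gb:.1f} GB")
if free_gb < 5:
    print("Consider freeing space or pointing the model cache at a larger drive.")
```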

Prepare and Load a Video

  • Choose a short video for a first run to validate performance.
  • Upload it in the Gradio UI.
  • Use the frame slider to find a clear reference frame.

Pick a frame where the target is unobstructed for best initial results.
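
If your source footage is long, trimming a short test clip first keeps the initial run quick. The OpenCV sketch below copies the first few seconds; the file names are examples.

```python
import cv2

def trim_clip(src, dst="test_clip.mp4", max_frames=150):
    """Copy the first `max_frames` frames of a longer video into a short test clip."""
    cap = cv2.VideoCapture(src)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    for _ in range(max_frames):
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame)
    cap.release()
    writer.release()
    return dst

trim_clip("long_video.mp4")
```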

Prompt and Propagate

  • Enter a short text phrase or add visual prompts.
  • Confirm that the correct instances are highlighted on the current frame.
  • Click propagate to extend the segmentation across the video.
  • Inspect frames for consistency and export the result.

If a target is missed during propagation, refine the prompt or add a visual cue and rerun. The update loop is quick, so iteration is easy.

Practical Observations From Testing

A few clear behaviors emerged during hands-on use:

  • Short, specific text prompts work best. Single-word phrases often suffice.
  • For crowded scenes, visual prompts can help target the correct instance on the key frame.
  • Propagation handles motion and moderate occlusions well.
  • Small, bright targets (like a flame) can be tracked reliably.
  • Open vocabulary coverage is broad; most everyday objects are handled correctly.

The system’s ability to propagate accurate masks across a full clip with minimal interaction makes it effective for quick segmentation passes and batch processing of longer sequences.

Limitations Noted in Tests

Some constraints appeared in more complex scenes:

  • Dense swarms or flocks may see occasional misses, especially for small or fast-moving subjects in cluttered backgrounds.
  • Conservative thresholds can skip borderline shapes; a mixed text-plus-visual prompt can help.
  • Long clips benefit from periodic spot checks to correct drift or re-affirm the target.

These are typical trade-offs for interactive video segmentation and can be mitigated with prompt refinements.

When to Use Text vs. Visual Prompts

Both prompt types are well supported, and the choice depends on the scene:

  • Text prompts are ideal for clear, common objects.
  • Visual prompts help in crowded scenes or when multiple similar objects appear close together.
  • Masks or boxes give explicit spatial guidance, reducing false positives on the first frame.

A good workflow is to start with text, then add a point or a box if you need to focus on a specific instance.
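
In practice that refinement is just a richer prompt on the same reference frame. The combined payload below is hypothetical in its field names but captures the idea of anchoring a text phrase to one instance with a single positive click.

```python
# Hypothetical combined prompt: a phrase plus one positive click to pin
# the exact instance among several similar objects.
combined_prompt = {
    "text": "human",
    "points": [(640, 360)],  # (x, y) on the reference frame
    "labels": [1],           # 1 = this is the instance I want
}
```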

Exporting and Next Steps

After propagation, export the smoothed output. You can download a processed video with masks or retrieve per-frame mask data for integration in a pipeline. If you need edits, jump back to the reference frame, refine the prompt, and rerun propagation.

For larger projects, consider batching multiple clips and saving prompt presets. The low VRAM footprint makes it practical to process many short videos on a single workstation.
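
A batching pass over a folder of clips can be as simple as the sketch below. Here `run_sam3` stands in for whatever inference entry point your wrapper exposes, and the preset names are made up for illustration.

```python
from pathlib import Path

# Prompt presets keyed by clip name stem; entirely illustrative.
PRESETS = {"wildlife": "lion", "crowd": "human", "kitchen": "banana"}

def run_sam3(video_path: Path, prompt: str) -> Path:
    # Placeholder: call the SAM 3 prompt-and-propagate pipeline here and
    # return the path of the exported, mask-overlaid video.
    return video_path.with_name(video_path.stem + "_segmented.mp4")

for clip in sorted(Path("clips").glob("*.mp4")):
    prompt = PRESETS.get(clip.stem, "object")  # fall back to a generic prompt
    out = run_sam3(clip, prompt)
    print(f"{clip.name}: prompt='{prompt}' -> {out.name}")
```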

Conclusion

SAM 3 brings promptable, interactive video segmentation to local workflows with a simple interface and efficient resource use. It detects, segments, and tracks objects from short text phrases or visual prompts, and it maintains stable masks across time with one-click propagation. The model size is modest for a video system, and typical VRAM use stayed under 4 GB during tests.

Across a range of videos (animals in motion, people in a crowd, a large flock, a single object, and a small flame), SAM 3 delivered consistent, high-quality masks. It handled open vocabulary prompts effectively and improved over earlier releases in both interaction and stability. For anyone who needs quick, controllable video segmentation with minimal setup, running SAM 3 locally is a practical and capable option.
