Run ROSE Locally: Remove Objects from Any Video, Privately
Table of Contents
- Introduction
- System Setup and Requirements
- Installation Steps
- What Is ROSE?
- Table Overview: ROSE
- Key Features of ROSE
- How ROSE Works (Concept)
- Architecture: Under the Hood
- Launching the Local App
- Using ROSE: Step-by-Step
- Notes on First Run
- Example 1: Short Clip (Cow Removal)
- Example 2: Longer Clip (Monitor Removal)
- Example 3: Kangaroo Video
- Performance and Resource Use
- Tips for Better Results
- Troubleshooting
- Privacy and Local Control
- Practical Expectations
- Summary Workflow
- Conclusion
Introduction
If you want to remove objects from videos on your own machine, ROSE makes it practical and private. I’ll show how I set it up locally, explain how it works, and walk through real tests to set clear expectations.
Everything happens on local hardware. No data leaves your system, and you control the entire workflow from installation to output.
I’ll keep the steps straightforward: install, run the app, pick an object, track it, and inpaint the frames. You’ll also see performance notes, GPU memory use, and results on short and longer clips.
System Setup and Requirements
I ran ROSE on Ubuntu with an NVIDIA RTX A6000 (48 GB VRAM). The model runs best with a modern NVIDIA GPU and a recent CUDA stack.
A dedicated GPU is strongly recommended. Object removal uses video tracking and diffusion-based inpainting, which are computationally heavy and will stress both GPU and CPU.
Disk space must accommodate model weights, dependencies, and intermediate files. The first run downloads weights, so plan for additional time and storage.
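If you want a quick number before the first run, a few lines of Python report the free space where you cloned the repo. The exact weight footprint isn't documented here, so treat any threshold as a rough guess:

```python
# Rough free-space check before the first run. Model weights and caches can
# take tens of gigabytes; the exact footprint depends on what gets downloaded.
import shutil

total, used, free = shutil.disk_usage(".")
print(f"Free disk space here: {free / 1024**3:.1f} GB")
```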
Installation Steps
I created a fresh environment and cloned the ROSE repository locally. All installation is done from the repository root.
- Clone the repository.
- Move into the project folder.
- Install the required dependencies with your Python package manager.
- Verify your environment sees the GPU (CUDA and drivers configured); a quick check is sketched after this list.
- Keep a terminal open; you’ll use it to launch the app.
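For the GPU verification step, a minimal PyTorch snippet confirms CUDA is visible and reports the card and VRAM. This assumes a CUDA build of PyTorch, which a diffusion model like ROSE needs anyway:

```python
# Sanity check: confirm PyTorch sees the GPU before launching the app.
import torch

assert torch.cuda.is_available(), "CUDA not available - check drivers/toolkit"
props = torch.cuda.get_device_properties(torch.cuda.current_device())
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
print(f"CUDA version PyTorch was built against: {torch.version.cuda}")
```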
When you install prerequisites at the repo root, the process takes a few minutes. Once finished, you’re ready to start the local app and download the model weights on first run.
What Is ROSE?
ROSE stands for Remove Objects with Side Effects. It focuses on removing not only objects but also the visual traces they leave across frames.
Shadows, reflections, lighting spill, translucency bleed, and mirror images often betray edits. ROSE is designed to address these artifacts across time and space in a video.
Paired real data (with and without objects and their side effects) is rare. The training pipeline synthesizes such pairs using a renderer, enabling robust learning across varied scenes.
Table Overview: ROSE
| Aspect | Summary |
|---|---|
| Name | ROSE (Remove Objects with Side Effects) |
| Purpose | Object removal in videos with side-effect cleanup |
| Runs Locally | Yes |
| Key Method | Diffusion transformer for video inpainting with side-effect prediction |
| Input | Full video for temporal-spatial reasoning |
| Output | Edited video with object and side effects removed |
| Privacy | All processing on your machine |
| Interface | Gradio web app on localhost |
Key Features of ROSE
- Full-video reasoning: Feeds the entire clip so the model can find where the object influences the scene, frame to frame.
- Side-effect handling: Targets shadows, reflections, lighting spill, translucency, and mirror artifacts.
- Integrated tracking and inpainting: Track once, then inpaint across the video with diffusion-based synthesis.
How ROSE Works (Concept)
Traditional frame-by-frame patches leave temporal glitches or visible seams. ROSE instead reasons across the whole clip, detecting where the object and its effects appear and fade.
The pipeline identifies regions to edit, then inpaints them coherently through time. This aims for consistent results without manual frame-by-frame retouching.
You select the object in a reference frame, track it, and let the model inpaint those regions. The output preserves scene structure and motion while removing the target.
Architecture: Under the Hood
ROSE is a diffusion transformer video inpainting model with an extra prediction head for side-effect regions. During training, both the original and edited (object-removed) clips are encoded into a latent space using a variational autoencoder.
The difference between these latents produces a ground-truth difference mask highlighting side-effect areas. A small MLP predicts this mask from features.
This predicted mask is concatenated into the diffusion backbone at several blocks. The diffusion transformer then synthesizes the edited video, guided by standard diffusion objectives.
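To make that training-time idea concrete, here is a hedged PyTorch sketch, not ROSE's actual code; the shapes, threshold, and layer sizes are all assumptions:

```python
# Hedged sketch of the side-effect mask idea (NOT ROSE's actual code):
# encode both clips with a VAE, threshold the latent difference into a
# ground-truth mask, and train a small MLP head to predict that mask.
import torch
import torch.nn as nn

class SideEffectMaskHead(nn.Module):
    """Per-token MLP predicting side-effect regions from latent features."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, 1)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)  # (batch, tokens, 1) mask logits

def ground_truth_mask(z_orig, z_edit, thresh=0.1):
    # Large latent differences mark the object and its side effects.
    diff = (z_orig - z_edit).abs().mean(dim=-1, keepdim=True)
    return (diff > thresh).float()

# Toy shapes: 2 clips, 1024 latent tokens, 128-dim features (all assumed).
z_orig, z_edit = torch.randn(2, 1024, 128), torch.randn(2, 1024, 128)
head = SideEffectMaskHead(feat_dim=128)
loss = nn.functional.binary_cross_entropy_with_logits(
    head(z_orig), ground_truth_mask(z_orig, z_edit))
print(f"mask loss: {loss.item():.3f}")
```

In ROSE, the predicted mask then feeds back into the diffusion backbone as extra conditioning; the sketch stops at the mask prediction itself.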
Launching the Local App
From the repo root, launch the Gradio demo. On first run, the app downloads model weights and supporting components, including a variational autoencoder and text encoders.
This initial download can take time. Keep the terminal open until you see the local URL.
Access the interface at http://localhost:7860. You’ll see controls to load a video, create a mask, track the object, and inpaint the result.
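For orientation, this is the generic Gradio pattern such demos follow; the function and component names below are placeholders, not ROSE's real app code:

```python
# Generic Gradio skeleton (placeholder names, not ROSE's real app code)
# showing how such a demo ends up serving at http://localhost:7860.
import gradio as gr

def get_video_info(video_path):
    # Placeholder: the real app extracts a reference frame here.
    return f"Loaded: {video_path}"

with gr.Blocks() as demo:
    video_in = gr.Video(label="Input video")
    info_btn = gr.Button("Get video info")
    info_out = gr.Textbox(label="Video info")
    info_btn.click(get_video_info, inputs=video_in, outputs=info_out)

demo.launch(server_port=7860)  # 7860 is Gradio's default port
```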
Using ROSE: Step-by-Step
Follow this workflow for each clip:
1. Load a video
   - Import a short test clip to start.
   - Click “Get video info” to extract a reference frame.
2. Create the object mask
   - Click or brush over the object in the reference frame.
   - The app creates a mask for tracking.
3. Track the object
   - Run tracking across frames.
   - Review the mask overlay to confirm coverage.
4. Inpaint
   - Start inpainting to remove the object and side effects.
   - Wait for processing to complete, then preview the output.
5. Export
   - Download or save the edited video.
   - If needed, repeat the steps to remove remaining side effects, such as shadows.
Notes on First Run
- The first inpainting job may load additional weights into memory.
- Keep an eye on VRAM usage if you have other GPU-heavy tasks running.
- Longer clips and higher resolutions increase processing time.
Example 1: Short Clip (Cow Removal)
I tested an 8-second video of a cow walking. After clicking “Get video info,” I selected the cow in the reference frame to generate the mask.
I ran tracking, which completed quickly. VRAM usage during tracking stayed around 12–13 GB on my GPU.
Inpainting took about 5–6 minutes. The cow was removed across the clip with only a faint silhouette in a few frames, which is acceptable for many use cases.
Example 2: Longer Clip (Monitor Removal)
Next, I tried a longer clip of my own recording. I selected a monitor in the frame, tracked it, and started inpainting.
Here, processing time increased sharply. VRAM usage jumped to around 21 GB, with CPU and system memory fully engaged.
After about 15 minutes, only 30% of the inpainting was complete; at that pace, the full ~30-second clip would have taken close to an hour (15 minutes for 30% extrapolates to roughly 50 minutes). I canceled this run to keep the workflow moving.
Example 3: Kangaroo Video
I then tested a short AI-generated kangaroo clip. I selected one group for removal, tracked it, and ran inpainting.
The output removed the target kangaroo across the clip while keeping the other intact. A shadow remained, which could be removed by selecting it and running another pass.
This test shows the model can separate subjects and remove one cleanly while maintaining the rest of the scene.
Performance and Resource Use
Performance varies by clip length, resolution, and object complexity. Tracking is usually quicker than inpainting, and most of the time is spent in diffusion-based synthesis.
Expect higher VRAM usage for longer clips, higher resolutions, and complex scenes with large moving objects. Plan your runs to match your GPU capacity.
The table below summarizes measured usage from my tests.
| Test Case | Duration | VRAM (approx.) | Time to Inpaint (approx.) | Result Summary |
|---|---|---|---|---|
| Cow removal (short clip) | ~8 seconds | 12–13 GB | 5–6 minutes | Object removed; faint silhouette in places |
| Longer clip (monitor) | ~30 seconds | ~21 GB | ~1 hour (estimated) | Canceled after 15 min at ~30% |
| Kangaroo clip | Short | Not recorded | A few minutes | Target removed; shadow remained |
Tips for Better Results
- Mask all influences
  - If shadows or reflections remain, select them and run an additional pass.
  - Include edges and semi-transparent areas in your mask.
- Keep clips short for iteration
  - Test on short sections to confirm quality before processing full-length videos.
  - Batch longer projects into segments if needed.
- Monitor resources
  - Watch VRAM during tracking and inpainting (a quick check is sketched after this list).
  - Close other GPU-intensive apps while processing.
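Here is the quick check mentioned above. torch.cuda.mem_get_info reports device-wide numbers, so it also reflects other apps using the GPU:

```python
# Quick device-wide VRAM check; run in a separate Python shell while
# tracking or inpainting is in progress.
import torch

free_b, total_b = torch.cuda.mem_get_info()
print(f"VRAM in use: {(total_b - free_b) / 1024**3:.1f} "
      f"of {total_b / 1024**3:.1f} GB")
```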
Troubleshooting
- If tracking misses parts of the object:
  - Refine the mask in the reference frame and retrack.
  - Ensure the mask includes all moving parts.
- If faint silhouettes remain:
  - Run a second pass targeting the leftover region.
  - Include shadows or reflections in the mask.
- If VRAM runs out:
  - Reduce resolution or process shorter segments (a splitting helper follows this list).
  - Restart the app to clear memory between large runs.
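And here is one way to do the splitting suggested above. It assumes ffmpeg is installed and on your PATH; stream copy keeps it fast, though cuts snap to keyframes:

```python
# Split a long video into ~10-second chunks with ffmpeg (assumes ffmpeg is
# on your PATH). Stream copy avoids re-encoding, so cuts snap to keyframes.
import subprocess

def split_video(src: str, seconds: int = 10) -> None:
    subprocess.run([
        "ffmpeg", "-i", src,
        "-c", "copy", "-map", "0",
        "-f", "segment", "-segment_time", str(seconds),
        "-reset_timestamps", "1",
        "segment_%03d.mp4",
    ], check=True)

split_video("long_clip.mp4")
```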
Privacy and Local Control
All processing runs locally on your machine. Videos, masks, and outputs never leave your environment.
This setup supports private workflows and sensitive content. Keep your machine’s storage and security in good order, as large assets and caches accumulate over time.
Practical Expectations
Short clips can complete in minutes with good results. Longer segments can take significantly longer, and VRAM demands increase with length and complexity.
Results are strong for many scenes, including moving subjects and dynamic backgrounds. Side effects such as shadows may need a second pass if they are prominent.
Working in iterations gives you faster feedback and helps you refine masks for a cleaner final output.
Summary Workflow
- Install dependencies and launch the local app.
- Load your video and extract the reference frame.
- Mark the object with the mask tool.
- Track across frames to confirm coverage.
- Run inpainting and review the result.
- Remove remaining side effects with a second pass if needed.
- Export the final video.
Conclusion
ROSE brings precise video object removal to a local workflow with a focus on cleaning up side effects. It reasons across full clips, tracks targets, and inpaints regions with temporal consistency.
Setup is straightforward, and the local Gradio app makes the process accessible. Expect higher resource use for longer clips and plan runs accordingly.
If you keep masks accurate and iterate on complex scenes, ROSE can deliver clean removals with strong temporal consistency while keeping your work fully private.