Gemma 4 E2B, Hermes Agent, and vLLM: Free Local Multimodal AI

I installed and integrated Gemma 4 E2B from Google with Hermes Agent on Ubuntu, serving the model locally through vLLM. I used an Nvidia RTX 6000 with 48 GB of VRAM. This setup gives me text, vision, and audio support from a small instruction-tuned model running on a single GPU.

Gemma 4 comes in multiple flavors, and I have already tested the family from local installation to multi-modality and fine-tuning on local data. Here I focus on Hermes Agent and how it works with Gemma 4 E2B across modalities. I have also shown integration with OpenClaw before, and Hermes is positioning itself as a competitor to that tool, so it is worth seeing how both behave in local agent workflows.

If you want a deeper comparison across model families, see our side-by-side analysis of Gemma 4 vs Qwen 3.5 for practical guidance. For a larger variant breakdown, see this head-to-head on Gemma 4 31B vs Qwen 3.5 27B.

Setup for Gemma 4 E2B, Hermes Agent, and vLLM: Free Local Multimodal AI

I used the latest vLLM because Gemma 4 support landed recently, and older builds may not expose every modality. If a modality fails on a fresh release, upgrading vLLM again a few days later usually resolves it. Most features already work, and the model loads comfortably for local experimentation.

vLLM install and version check

Install or upgrade vLLM to a recent version.

Run: pip install -U vllm

Confirm the version after install.

Run: python -c "import vllm; print(vllm.__version__)"
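
If you script the setup, the same check can run from Python. This is a minimal sketch: `MIN_VLLM` is a placeholder, not the real minimum for Gemma 4 support (check the vLLM release notes for that), and the comparison only looks at the leading numeric components of each version string.

```python
from importlib import metadata

# Placeholder threshold, NOT the real minimum for Gemma 4 support.
MIN_VLLM = "0.0.0"

def numeric_prefix(version):
    """Turn '0.10.2.dev3+g1a' into (0, 10, 2); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare two version strings by their leading numeric components."""
    return numeric_prefix(installed) >= numeric_prefix(minimum)

if __name__ == "__main__":
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vLLM is not installed in this environment")
    else:
        status = "ok" if meets_minimum(installed, MIN_VLLM) else "upgrade suggested"
        print(f"vllm {installed}: {status}")
```

Reading the version through `importlib.metadata` avoids importing vLLM itself, which keeps the check fast.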

Download Gemma 4 E2B-it locally

I pulled the model from Hugging Face into a local folder to serve it from disk. If you prefer direct loading from the hub in vLLM, you can skip this step.

Run: pip install -U "huggingface_hub[cli]"

Run: huggingface-cli login

Run: huggingface-cli download google/gemma-4-E2B-it --local-dir ./models/gemma-4-E2B-it --local-dir-use-symlinks False

The model card is here for reference: https://huggingface.co/google/gemma-4-E2B-it
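
The download can also be done from Python with `huggingface_hub.snapshot_download`, which reuses the token saved by the login step. The sketch below assumes the same repository and folder as the command above; `has_model_files` is a hypothetical helper that only checks for `config.json`, so treat it as a quick sanity check rather than proof of a complete download.

```python
from pathlib import Path

REPO_ID = "google/gemma-4-E2B-it"
LOCAL_DIR = Path("./models/gemma-4-E2B-it")

def has_model_files(local_dir):
    """Cheap sanity check: a Hugging Face model snapshot includes config.json."""
    return (Path(local_dir) / "config.json").is_file()

def download_model(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download the repository snapshot into local_dir and return its path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))

# Typical use (needs network access and a logged-in Hugging Face account):
#   if not has_model_files(LOCAL_DIR):
#       download_model()
```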

Serve Gemma 4 E2B-it with vLLM

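I served the local copy on port 8000 with an OpenAI-compatible API. When a model is loaded from a local path, vLLM uses that path as the model name by default, so --served-model-name is set here to let clients request it as google/gemma-4-E2B-it. I kept a larger context window, but you can lower it if VRAM becomes tight.

Run: vllm serve ./models/gemma-4-E2B-it --served-model-name google/gemma-4-E2B-it --port 8000 --max-model-len 32768 --gpu-memory-utilization 0.90 --dtype float16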

The server may print warnings and minor errors given the bleeding-edge support. Loading the weights took under 8 GB of VRAM in my case; total usage is higher because vLLM also reserves KV-cache space within the --gpu-memory-utilization budget.
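
Startup can take a while, so it helps to wait until the server actually answers before pointing anything at it. The sketch below polls the OpenAI-compatible `/v1/models` route with the standard library only. The function names, attempt count, and delay are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.error
import urllib.request

def fetch_model_ids(base_url, timeout=5.0):
    """Return the model IDs listed by an OpenAI-compatible /models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]

def wait_for_server(base_url="http://localhost:8000/v1", attempts=60, delay=5.0,
                    fetch=fetch_model_ids, sleep=time.sleep):
    """Poll until the server lists at least one model; return the IDs or raise."""
    for _ in range(attempts):
        try:
            ids = fetch(base_url)
            if ids:
                return ids
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server is still starting up
        sleep(delay)
    raise TimeoutError(f"No model listed at {base_url} after {attempts} attempts")
```

The returned ID is the exact string clients must send as `model`, which is worth printing once before configuring an agent on top.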

If you are exploring local agent frameworks beyond Hermes, I also covered an OpenClaw setup here: OpenClaw local agent workflow.

Hermes Agent integration with Gemma 4 E2B and vLLM

Hermes Agent has been covered in depth already. It is a capable agent framework with many built-in skills and a terminal backend, and it aims to compete with OpenClaw for local-first orchestration.

Install Hermes Agent

Install Hermes with a single command.

Run: pip install -U hermes-agent

You can verify the installation afterwards.

Run: hermes --version

Point Hermes to the local vLLM endpoint

Hermes lets you add a custom model provider and route to your local OpenAI-compatible server. I configured it to use the model already being served on port 8000.

Run: hermes config

Select custom endpoint and set base URL to http://localhost:8000/v1. Leave API key blank for a local unsecured server.

Hermes can auto-detect the served model. Confirm the suggested model and keep defaults for local-only operation unless you want to wire up external channels.

After configuration, restart your shell so the environment updates take effect. Launch Hermes to confirm it loads and preloads skills.

Run: hermes

Hermes recognized the locally served Gemma 4 model and was ready to accept tasks with tool calls enabled. If you want a broader view of agent patterns and tooling, browse our category on AI agents.

Text, vision, and audio notes for Gemma 4 E2B on vLLM

I validated general knowledge prompts and tool usage through Hermes. For audio, Hermes does not currently pass audio natively end-to-end, so I accessed the vLLM endpoint directly from Python to test speech inputs.

If you only need browser-style automation and web workflows, here is a practical walkthrough on building a browser-capable agent: browser-use agent setup.

Audio transcription via the vLLM endpoint

The model supports more than 100 languages for speech inputs. I used the OpenAI-compatible client against vLLM for transcription-style prompts.

Install the client.

Run: pip install -U openai

Then call the server with audio content.

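import base64
from openai import OpenAI

# The OpenAI client requires a key; a local server without --api-key accepts any value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Use the model name the server exposes (the local path unless
# --served-model-name was set when launching vLLM).
model_id = client.models.list().data[0].id

def encode_audio(file_path):
    """Read an audio file and return its contents as a base64 string."""
    with open(file_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

audio_b64 = encode_audio("audio_sample.wav")

messages = [
    {
        "role": "user",
        "content": [
            # Chat Completions content parts use "text" and "input_audio".
            {"type": "text", "text": "Transcribe this audio and return plain text."},
            {
                "type": "input_audio",
                "input_audio": {"data": audio_b64, "format": "wav"},
            },
        ],
    }
]

resp = client.chat.completions.create(
    model=model_id,
    messages=messages,
    temperature=0.2,
)

print(resp.choices[0].message.content)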

If your audio is mp3 or m4a, set the appropriate format field. Keep audio clips short and clean for best results.
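
One way to keep the format field in sync with the file is to derive it from the extension. This is a sketch using the `input_audio` content-part shape from the Chat Completions API. The `FORMAT_BY_SUFFIX` mapping and `audio_part` helper are hypothetical, and which formats the server can actually decode depends on the audio libraries installed next to vLLM.

```python
import base64
from pathlib import Path

# Assumption: only the extensions mentioned above are mapped; whether the
# server decodes each one depends on its audio backend.
FORMAT_BY_SUFFIX = {".wav": "wav", ".mp3": "mp3", ".m4a": "m4a"}

def audio_part(path):
    """Build an input_audio content part for a Chat Completions message."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in FORMAT_BY_SUFFIX:
        raise ValueError(f"Unsupported audio extension: {suffix or '(none)'}")
    data = base64.b64encode(path.read_bytes()).decode("utf-8")
    return {
        "type": "input_audio",
        "input_audio": {"data": data, "format": FORMAT_BY_SUFFIX[suffix]},
    }
```

The returned dictionary can be appended to a message's `content` list next to the text instruction.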

Image understanding via the vLLM endpoint

I also tested reading text from a newspaper page and summarizing the main story. The model extracted visible headlines and produced a concise summary, with a few transcription mistakes that you can expect from a compact model.

![Screenshot from Gemma 4 E2B, Hermes Agent, and vLLM: Free Local Multimodal AI at 703s](/ai/gemma-hermes-agent/gemma-hermes-agent-703.webp)

import base64
from openai import OpenAI

# The OpenAI client requires a key; a local server without --api-key accepts any value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Use the model name the server exposes (the local path unless
# --served-model-name was set when launching vLLM).
model_id = client.models.list().data[0].id

def encode_image(path):
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

img_b64 = encode_image("newspaper.jpg")

messages = [
    {
        "role": "user",
        "content": [
            # Chat Completions content parts use "text" and "image_url".
            {"type": "text", "text": "This is a newspaper image. Extract all headlines, then summarize the main story in three sentences."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
            },
        ],
    }
]

resp = client.chat.completions.create(
    model=model_id,
    messages=messages,
    temperature=0.2,
    max_tokens=512,
)

print(resp.choices[0].message.content)

Use PNG or JPG files with reasonable resolution. For large pages, crop to the content area for more accurate OCR and summarization.
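
Cropping and downscaling can be done before encoding. The sketch below assumes Pillow is installed (`pip install pillow`). The 1600-pixel limit is an arbitrary starting point rather than a documented requirement of the model, and `prepare_image` is a hypothetical helper name.

```python
import base64
import io

def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

def prepare_image(path, crop_box=None, max_side=1600):
    """Crop and downscale an image, returning a JPEG data URL."""
    from PIL import Image  # lazy import so fit_within works without Pillow
    with Image.open(path) as img:
        img = img.convert("RGB")  # JPEG cannot store alpha or palette modes
        if crop_box is not None:
            img = img.crop(crop_box)  # (left, upper, right, lower) in pixels
        img = img.resize(fit_within(img.width, img.height, max_side))
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=90)
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"
```

The returned string can be used directly as the `url` value of an `image_url` content part.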

Performance, VRAM, and stability tips

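Model initialization used under 8 GB of VRAM for me. Total usage is dominated by the KV cache: I observed numbers in the tens of gigabytes on extended runs, which fits vLLM reserving cache space up to the --gpu-memory-utilization budget (0.90 of a 48 GB card here). Lower that value or --max-model-len if you need the memory for something else.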
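
To get a feel for how much of that is KV cache, a rough estimate for plain full attention is 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The architecture numbers below are placeholders for illustration, not the real Gemma 4 E2B configuration (read those from the model's config.json). The formula also ignores sliding-window layers and other optimizations that reduce the real figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Full-attention KV cache size: keys and values for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem

# Hypothetical architecture values, for illustration only.
example = kv_cache_bytes(num_layers=30, num_kv_heads=2, head_dim=256, num_tokens=32768)
print(f"{example / 2**30:.3f} GiB for one 32768-token sequence")  # prints 1.875 GiB ...
```

The estimate scales linearly with context length and with the number of concurrent sequences, which is why lowering --max-model-len is the quickest lever.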

If you face modality-specific issues, this is normal with fresh support in vLLM for Gemma 4. Upgrade to the latest vLLM and restart the server.

Use cases on a single GPU

Run a local assistant for general knowledge, document QA, and tool-augmented tasks through Hermes. Attach web search, file tools, and structured outputs for routine workflows.

Transcribe multilingual audio snippets for quick notes from field recordings, interviews, or customer calls. Parse images like invoices, receipts, and newspapers, then summarize or extract structured fields for downstream processing.
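
For field extraction, a small model will sometimes wrap its JSON in a Markdown code fence or leave a field empty, so it is worth validating the reply before passing it downstream. This is a minimal sketch: both helper names and the example field names are hypothetical, and it assumes the prompt asked for a single JSON object.

```python
import json

TICKS = "`" * 3  # Markdown code-fence marker

def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(TICKS) and cleaned.endswith(TICKS):
        body = cleaned[len(TICKS):-len(TICKS)]
        first_line, _, rest = body.partition("\n")
        # Drop an optional language tag such as "json" on the opening fence line.
        if first_line.strip() == "" or first_line.strip().isalnum():
            body = rest
        cleaned = body.strip()
    return json.loads(cleaned)

def missing_fields(record, required):
    """Return the required keys that are absent or empty in the parsed record."""
    return [key for key in required if record.get(key) in (None, "")]
```

A non-empty result from `missing_fields` is a reasonable trigger for a retry or for manual review.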

For teams comparing model families before committing, refer to our analyses on Gemma 4 vs Qwen 3.5 and the more detailed size-focused Gemma 4 31B vs Qwen 3.5 27B.

Final thoughts

Gemma 4 E2B runs fully locally for text, vision, and audio on a single GPU and works well with Hermes Agent for autonomous capabilities. Support in vLLM is evolving fast, and minor issues are usually cleared by frequent updates. If you are building local-first agents, also explore how OpenClaw compares in a similar stack: OpenClaw Local AI Agent.