TL;DR
Gemma 4 12B now runs locally on consumer GPUs with audio and vision processing, cutting latency by bypassing separate encoders.
Key points
- 1
Encoder-Free Architecture: Gemma 4 12B eliminates traditional vision and audio encoders by projecting raw 48x48 pixel images and 16 kHz audio directly into the LLM. This reduces latency by skipping multi-stage encoders and uses only 35M parameters for vision (vs. 27 transformer layers in previous models). For audio, it slices 40ms frames into 640 floats and projects them linearly—cutting the need for 12 conformer layers. This means developers can run multimodal tasks locally without separate encoders, saving memory and speeding up inference. For example, a developer using llama.cpp with Gemma 4 12B can process images and audio in real-time on a 16GB VRAM laptop without heavy preprocessing steps.
- 2
Audio Support for Local Use: Gemma 4 12B is the first Gemma model to natively handle audio inputs at a medium scale, unlike earlier edge models that required lightweight architectures like E4B. This allows developers to process speech directly without converting audio to text first. The model slices audio into 40ms frames (640 floats each) and feeds it to the LLM, enabling real-time voice interactions. For instance, a developer building a voice-controlled app using Google AI Edge Eloquent can now handle conversational inputs with Gemma 4 12B on Macs without additional audio processing pipelines, making it ideal for local agentic applications.
- 3
MacOS Desktop Integration: Google is now releasing downloadable macOS apps for Gemma 4 12B, allowing developers to run local spoken and visual interactions on consumer devices. The Google AI Edge Gallery app runs natively on Apple Silicon GPUs with a secure sandboxed Python loop for chart plotting, while Google AI Edge Eloquent supports voice editing via Gemma 12B. This means developers can deploy multimodal apps directly on Macs without cloud dependencies. For example, a developer can use the Edge Gallery app to create a local image processing tool that generates video outputs from images, all running offline on their Mac.
- 4
LiteRT-LM for Local APIs: Developers can run Gemma 4 12B as an OpenAI-compatible API server using the litert-lm serve CLI, which includes stateless prefix caching to eliminate prefill latency. This allows seamless integration with tools like OpenCode and Hermes. The command `litert-lm serve` starts a local server that handles context history in memory, enabling real-time interactions. For instance, a developer using the CLI can build a local agent that processes video frames and audio in under 100ms, which is critical for applications like real-time video analysis without cloud delays.
What changed
Before this update
Multimodal models required separate vision and audio encoders, causing high latency and fragmented memory.
After this update
Gemma 4 12B uses a single decoder-only architecture that processes vision and audio directly into the LLM, reducing latency and memory overhead.
Share this update
This is a summary of an official post from the Google Search Central Blog, provided for quick reading. Google and the Google logo are trademarks of Google LLC; My Tool Studio is not affiliated with Google. Always refer to the original announcement for authoritative guidance.