MeshWorld.

How to Install Gemma 4 Locally with Ollama (2026 Guide)

By Vishnu

Gemma 4 is Google’s latest open-weight language model — a significant leap from Gemma 3 with better reasoning, longer context, and improved coding performance. Unlike cloud APIs, running it locally means zero data leaves your machine. Perfect for proprietary code, air-gapped environments, or just avoiding subscription fees.

Gemma 4 comes in four sizes: E2B and E4B for edge devices (phones, Raspberry Pi, IoT), and 26B MoE plus 31B Dense for workstations. All models are multimodal (vision + audio on edge models), support 140+ languages, and now use the permissive Apache 2.0 license.

:::note[TL;DR]

  • Gemma 4 comes in four sizes: E2B, E4B (edge/mobile), 26B MoE, and 31B Dense (workstation/server)
  • E2B/E4B run on phones, Raspberry Pi, Jetson Nano with 128K context
  • 26B MoE activates only 3.8B params for fast inference; 31B Dense for maximum quality with 256K context
  • All models are multimodal (vision + audio on edge) and support 140+ languages
  • Install Ollama, then ollama pull gemma4:27b — models download automatically on first use
  • Apple Silicon gets GPU acceleration; NVIDIA needs ~24GB+ VRAM for the 31B model
  • Now under Apache 2.0 license (not Google’s custom license) — truly open for commercial use
:::

Prerequisites

Before installing Gemma 4, check your hardware:

Minimum (CPU only):

  • 4 GB RAM for E2B models (edge/IoT)
  • 8 GB RAM for E4B models
  • 16 GB RAM for 26B MoE models
  • 32 GB RAM for 31B Dense models

Edge/Mobile (E2B/E4B):

  • Runs on Raspberry Pi 4/5, NVIDIA Jetson Orin Nano
  • Android phones with 6GB+ RAM
  • iOS devices (via Core ML)
  • 128K context window

Better performance (GPU):

  • Apple Silicon Mac (M1/M2/M3/M4) — Metal acceleration works out of the box
  • NVIDIA GPU with 8+ GB VRAM for E4B models
  • NVIDIA GPU with 16+ GB VRAM for 26B MoE
  • NVIDIA GPU with 24+ GB VRAM for 31B Dense
  • 256K context window for 26B/31B models

Key Features:

  • Multimodal: Vision understanding on all models, plus audio on the E2B/E4B edge models
  • Multilingual: Native support for 140+ languages
  • Agentic: Native function calling and structured JSON output
  • License: Apache 2.0 (fully permissive for commercial use)
  • Context: 128K (E2B/E4B) or 256K (26B/31B) tokens
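
The RAM minimums above can double as a quick model picker. A minimal sketch (the helper names are mine, and RAM detection uses POSIX `sysconf`, so it works on Linux/macOS only):

```python
import os

# CPU-only RAM minimums from the prerequisites list above.
SIZES = [
    (4, "gemma4:2b"),    # E2B
    (8, "gemma4:4b"),    # E4B
    (16, "gemma4:27b"),  # 26B MoE
    (32, "gemma4:31b"),  # 31B Dense
]

def suggest_model(ram_gb: float) -> str:
    """Largest Gemma 4 tag whose CPU-only minimum fits in RAM."""
    fits = [tag for min_gb, tag in SIZES if ram_gb >= min_gb]
    return fits[-1] if fits else "insufficient RAM"

def detect_ram_gb() -> float:
    """Physical RAM in GB via POSIX sysconf (Linux/macOS)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

print(suggest_model(detect_ram_gb()))
```

GPU users should check against the VRAM numbers below instead; unified memory on Apple Silicon counts toward both.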

Install Ollama

If you don’t have Ollama yet, install it first:

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from ollama.com. Runs as a background service.

Verify installation:

ollama --version

Download and Run Gemma 4

Ollama makes this trivial. Models download on first use and are cached for future runs.

# Run the E2B model (edge/IoT, ~2GB, fastest on limited hardware)
ollama run gemma4:2b

# Run the E4B model (edge/IoT, ~3GB, better quality than E2B)
ollama run gemma4:4b

# Run the 26B MoE model (desktop, activates 3.8B params, fast inference)
ollama run gemma4:27b

# Run the 31B Dense model (workstation, maximum quality, 256K context)
ollama run gemma4:31b

The Scenario: You’re deploying an AI assistant on a Raspberry Pi 5 at a remote factory. You pull gemma4:2b, get local vision + audio processing with 128K context, and it all runs offline without internet. The E2B model handles OCR from camera feeds and voice commands natively.

First launch downloads the model weights:

  • E2B: ~2 GB
  • E4B: ~3 GB
  • 26B MoE: ~16 GB (~7 GB with q4_K_M quantization)
  • 31B Dense: ~19 GB (~8 GB with q4_K_M quantization)

Subsequent starts are instant.
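
Downloaded weights are cached on disk, by default under ~/.ollama/models (the OLLAMA_MODELS environment variable overrides the location). A small sketch to see how much space the cache is using — the helper name is mine:

```python
import os

def ollama_cache_gb(root: str = os.path.expanduser("~/.ollama/models")) -> float:
    """Total size of Ollama's on-disk model cache in GB (0 if absent)."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e9

print(f"{ollama_cache_gb():.1f} GB of models cached")
```

`ollama rm <model>` is the supported way to reclaim that space (see Useful Commands below).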

Available Model Variants

Gemma 4 offers quantized variants for different VRAM constraints:

| Variant | Effective Size | VRAM Needed | Best For | Context |
|---|---|---|---|---|
| gemma4:2b (E2B) | ~2 GB | 3-4 GB | Raspberry Pi, IoT, phones | 128K |
| gemma4:4b (E4B) | ~3 GB | 4-6 GB | Edge devices, Jetson Nano | 128K |
| gemma4:27b (26B MoE) | ~16 GB (activates 3.8B) | 12-16 GB | Fast desktop inference | 256K |
| gemma4:31b (31B Dense) | ~19 GB | 24+ GB | Maximum quality, fine-tuning | 256K |
| gemma4:27b-q4_K_M | ~7 GB | 8-10 GB | Mid-range GPUs (26B MoE) | 256K |
| gemma4:31b-q4_K_M | ~8 GB | 10-12 GB | High-end consumer GPUs | 256K |

Key difference: The 26B MoE activates only 3.8 billion parameters during inference — delivering exceptional tokens/second while still having 26B total capacity. The 31B Dense uses all parameters for maximum quality.

Pull a quantized variant:

ollama pull gemma4:31b-q4_K_M

:::tip
The q4_K_M quantization uses 4-bit precision with intelligent mixing. You lose ~2-3% quality but save 30-40% VRAM. Most users won’t notice the difference for everyday coding tasks.
:::
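
The savings follow directly from bits per weight. A back-of-envelope sketch, assuming q4_K_M averages roughly 4.5 bits per weight (it mixes 4- and 6-bit blocks); real downloads include metadata and runtime needs a KV cache on top, so published sizes won't match these numbers exactly:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-storage estimate: parameters x bits / 8 bits-per-byte.
    Ignores the KV cache and runtime overhead, so real VRAM use is higher."""
    return params_billion * bits_per_weight / 8

print(f"31B fp16:   {weight_size_gb(31, 16):.0f} GB")
print(f"31B q4_K_M: {weight_size_gb(31, 4.5):.0f} GB")
```

The same formula explains why the E2B/E4B models fit comfortably on single-board computers.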

Hardware-Specific Setup

Apple Silicon (M1/M2/M3/M4)

No configuration needed. GPU acceleration works automatically via Metal:

ollama run gemma4:27b

On an M2 Pro with 16GB unified memory, the 26B MoE model runs at ~30 tokens/second. The 31B Dense model also runs on M-series chips with 24GB+ RAM, though you may need to close other apps.

NVIDIA GPUs

Ollama uses CUDA automatically when current NVIDIA drivers are installed; if you run Ollama inside Docker, also install the NVIDIA Container Toolkit. Verify the GPU is being used:

ollama ps  # Shows if GPU is being used

:::warning
If you see “CUDA out of memory” errors, your model is too large for your VRAM. Kill the process with ollama stop gemma4:27b and switch to a smaller variant or quantized version.
:::

CPU-Only Systems

Gemma 4 runs on CPU if you lack a compatible GPU. It’s slower but functional:

# Force CPU mode if needed
export OLLAMA_NO_GPU=1
ollama run gemma4:2b

Expect 2-5 tokens/second on a modern CPU for the E2B model. Usable for simple queries on edge devices.

Edge Devices (Raspberry Pi, Jetson Nano)

The E2B and E4B models are engineered specifically for edge:

# On Raspberry Pi 5 with 8GB RAM
ollama run gemma4:2b

# On NVIDIA Jetson Orin Nano
ollama run gemma4:4b

Features on edge:

  • Vision: Process camera frames locally for OCR, object detection
  • Audio: Native speech recognition and understanding
  • Offline: Works without internet after initial download
  • Low latency: Near-zero response time for real-time applications
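
Using the vision capability goes through the same API as text: Ollama's /api/chat accepts base64-encoded images in a message's "images" field. A minimal payload builder as a sketch (the function name is mine):

```python
import base64
import json

def vision_payload(model: str, prompt: str, image_path: str) -> str:
    """JSON body for Ollama's /api/chat with one attached image.
    Images travel as base64 strings in the message's "images" field."""
    with open(image_path, "rb") as f:
        img = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [img]}],
        "stream": False,
    })

# POST the result to http://localhost:11434/api/chat
```

On a Pi-class device this is how a camera-feed OCR loop would talk to the local model.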

Using the REST API

Ollama serves a REST API at localhost:11434, including an OpenAI-compatible endpoint under /v1:

Basic chat completion

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:31b",
  "messages": [
    { "role": "user", "content": "Explain recursion in Python" }
  ],
  "stream": false
}'

Generate (single prompt)

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:31b",
  "prompt": "Write a Python function to reverse a linked list",
  "stream": false
}'

OpenAI-compatible endpoint

Any library that works with OpenAI can point to Ollama:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but ignored
)

response = client.chat.completions.create(
    model='gemma4:31b',
    messages=[{'role': 'user', 'content': 'Refactor this function'}]
)
print(response.choices[0].message.content)

Python SDK Usage

Install the official Ollama Python library:

pip install ollama

Basic usage:

import ollama

response = ollama.chat(
    model='gemma4:31b',
    messages=[
        {'role': 'user', 'content': 'Write a bash script to find large files'}
    ]
)
print(response['message']['content'])

Streaming for real-time output:

stream = ollama.chat(
    model='gemma4:31b',
    messages=[{'role': 'user', 'content': 'Tell me a joke'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
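
The Key Features list mentions structured JSON output; with the SDK you can request it via format='json'. Models sometimes wrap JSON in a markdown code fence anyway, so a tolerant parser helps. A sketch (the helper name is mine):

```python
import json

def parse_json_reply(text: str) -> dict:
    """Parse a model reply that should be JSON, tolerating a
    surrounding markdown code fence."""
    t = text.strip()
    if t.startswith("`"):
        t = t.strip("`")          # drop the fence backticks
        if t.startswith("json"):  # drop the fence's language tag
            t = t[len("json"):]
    return json.loads(t)

# Usage with the SDK's JSON mode:
# resp = ollama.chat(model='gemma4:31b', format='json',
#                    messages=[{'role': 'user', 'content': 'List 3 fruits as JSON'}])
# data = parse_json_reply(resp['message']['content'])
```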

IDE Integration

Continue.dev (VS Code / JetBrains)

Add to your Continue config:

{
  "models": [
    {
      "title": "Gemma 4 31B (Local)",
      "provider": "ollama",
      "model": "gemma4:31b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Gemma 4 26B MoE Autocomplete",
    "provider": "ollama",
    "model": "gemma4:27b"
  }
}

The Scenario: You’re on a plane with no Wi-Fi. Open VS Code, hit Tab for autocomplete, and Gemma 4 suggests the next line. Local AI doesn’t need the internet.

Cursor

In Cursor settings, add a custom OpenAI-compatible model:

  • Base URL: http://localhost:11434/v1
  • Model: gemma4:31b

Claude Code

The claude CLI talks to Anthropic’s API rather than your local model. For an equivalent one-shot review with local Gemma 4, pipe the file into ollama run:

ollama run gemma4:31b "Review this code for bugs" < src/utils/parser.ts

Useful Commands

ollama list                  # show downloaded models
ollama pull gemma4:31b       # download a specific variant
ollama rm gemma4:27b         # remove a model to free space
ollama show gemma4:31b       # model info and parameters
ollama ps                    # show running models
ollama stop gemma4:31b       # stop a running model
ollama run gemma4:4b "prompt" # one-shot, non-interactive

Performance Comparison

Approximate tokens/second on different hardware:

| Hardware | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| Raspberry Pi 5 (8GB) | 8 t/s | 4 t/s | N/A | N/A |
| M2 Pro (16GB) | 45 t/s | 35 t/s | 30 t/s | 15 t/s |
| RTX 4090 (24GB) | 90 t/s | 75 t/s | 65 t/s | 35 t/s |
| RTX 3060 (12GB) | 30 t/s | 25 t/s | 20 t/s | N/A |
| CPU (i7-12700K) | 5 t/s | 3 t/s | <1 t/s | <1 t/s |

Numbers are approximate — actual speed varies by prompt length and context window usage. The 26B MoE model activates only 3.8B parameters during inference, making it surprisingly fast for its size.

Prompting Tips

Gemma 4 responds well to direct, specific prompts:

For coding:

You are an expert Python developer. Write a clean, documented function that [task]. Include type hints and a docstring.

For explanation:

Explain [topic] as if I'm a senior developer who knows [related tech] but is new to this specific concept. Be concise.

For review:

Review this code for bugs, performance issues, and style violations. Rate each on severity (low/medium/high).
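
These templates are easy to keep as reusable helpers so prompts stay consistent across a project. A minimal sketch built from the coding and review templates above (the helper names are mine):

```python
CODING = ("You are an expert Python developer. Write a clean, documented "
          "function that {task}. Include type hints and a docstring.")
REVIEW = ("Review this code for bugs, performance issues, and style "
          "violations. Rate each on severity (low/medium/high).\n\n{code}")

def coding_prompt(task: str) -> str:
    """Fill the coding template with a concrete task description."""
    return CODING.format(task=task)

def review_prompt(code: str) -> str:
    """Attach the code under review to the review template."""
    return REVIEW.format(code=code)

print(coding_prompt("merges two sorted lists"))
```

Pass the result as the user message in any of the API examples above.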

Troubleshooting

“Error: model not found”

Run ollama pull gemma4:31b (or whichever tag you tried to use) first to download the weights.

Out of memory errors

Switch to a smaller model or quantized variant. Use Activity Monitor (macOS) or nvidia-smi (Linux) to check memory usage.

Slow performance

  • Verify GPU acceleration: ollama ps should show the model
  • Try a smaller model variant
  • Close other memory-heavy applications
  • Check thermal throttling on laptops

API connection refused

Ensure Ollama server is running:

ollama serve  # starts the server
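
A quick pre-flight check saves confusing stack traces: test whether anything is listening on Ollama's default port before making API calls. A sketch (the helper name is mine):

```python
import socket

def ollama_up(host: str = "127.0.0.1", port: int = 11434) -> bool:
    """True if something accepts TCP connections on the Ollama port;
    a cheap pre-flight check before making API calls."""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False

print("server up" if ollama_up() else "run `ollama serve` first")
```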

Summary

  • Gemma 4 runs fully offline via Ollama — no API keys, no data leaks
  • Four sizes: E2B and E4B for edge/mobile (128K context), 26B MoE and 31B Dense for workstations (256K context)
  • 26B MoE activates only 3.8B parameters for fast inference; 31B Dense for maximum quality
  • Quantized variants (q4_K_M) save VRAM with minimal quality loss
  • Apple Silicon gets automatic GPU acceleration; NVIDIA needs sufficient VRAM
  • Multimodal: Vision on all models, audio understanding on the edge models
  • Multilingual: Native support for 140+ languages
  • Apache 2.0 license — fully permissive for commercial use
  • OpenAI-compatible API works with existing tools and libraries

Frequently Asked Questions

What’s the difference between Gemma 3 and Gemma 4?

Gemma 4 improves reasoning, coding performance, and instruction following. The 31B Dense model ranks #3 on the Arena AI open-source leaderboard, outperforming models 20x its size. Key upgrades include:

  • Multimodal support (vision on all models, audio on E2B/E4B)
  • 140+ languages natively
  • 128K context (E2B/E4B) or 256K context (26B/31B)
  • Apache 2.0 license (was Google’s restrictive custom license)
  • Native function calling and agentic workflow support

Can I run Gemma 4 without internet after the initial download?

Yes. Once you ollama pull the model, it runs entirely offline. The weights are stored in ~/.ollama/models/. No cloud connection required for inference. This is ideal for air-gapped environments, privacy-sensitive work, or deployments without reliable internet.

Which Gemma 4 size should I choose?

  • E2B (2B effective): Raspberry Pi, IoT devices, phones, real-time edge processing with vision/audio
  • E4B (4B effective): Jetson Nano, Android devices, better quality than E2B while still edge-friendly
  • 26B MoE (Mixture of Experts): Desktop workstations, fast inference (activates only 3.8B params), coding assistants
  • 31B Dense: High-end GPUs, maximum quality, fine-tuning, complex reasoning tasks

How does the 26B MoE model work?

MoE (Mixture of Experts) means the model has 26 billion total parameters but only activates 3.8 billion during each inference pass. It routes each token to the most relevant “expert” sub-networks. This gives you fast tokens-per-second comparable to a 4B model, with the quality of a much larger model.
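
The routing step can be illustrated with a toy top-k gate: softmax over per-expert scores, keep the k highest, renormalize their weights. This is only an illustration of the idea, not Gemma's actual router:

```python
import math

def route_token(gate_scores, k=2):
    """Toy top-k MoE router: softmax over expert gate scores, keep the
    k highest-probability experts, renormalize their weights to sum to 1."""
    exps = [math.exp(s) for s in gate_scores]
    probs = [e / sum(exps) for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# One token, four experts, only two actually run:
print(route_token([2.0, 0.1, 1.5, -1.0], k=2))
```

Because only the selected experts execute, compute per token scales with k, not with the total expert count — which is why the 26B MoE generates tokens at near-4B-model speed.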

Can I use Gemma 4 for commercial projects?

Yes. Gemma 4 uses the Apache 2.0 license — the same permissive license used by Android, Kubernetes, and TensorFlow. You can use it commercially, modify it, distribute it, and even build proprietary products on top of it. No usage restrictions, no attribution requirements beyond the license text.