MeshWorld

How to Use Gemma 4 with Claude Code via Ollama (April 2026)

By Darsh Jariwala

Ollama added native Anthropic Messages API compatibility in early 2026. Claude Code speaks that same API. Put them together and Claude Code routes requests to a local Gemma 4 model instead of Anthropic’s servers — no API costs, no data leaving your machine, full Claude Code workflow intact (file editing, tool calling, shell execution).

This guide covers the exact setup as of April 2026: correct env vars, right model tags, context window gotcha, and what works vs what doesn’t.

:::note[TL;DR]

  • Ollama now exposes an Anthropic-compatible Messages API at http://localhost:11434 (not the /v1 path — that’s the OpenAI compat layer)
  • Env vars: ANTHROPIC_BASE_URL=http://localhost:11434, ANTHROPIC_AUTH_TOKEN=ollama, ANTHROPIC_API_KEY=""
  • Correct Gemma 4 tags: gemma4:e4b (default, 9.6GB), gemma4:26b (18GB MoE), gemma4:31b (20GB Dense)
  • Claude Code needs at least 64K context — set num_ctx 65536 in a Modelfile or Ollama will default to much less
  • Tool calling (file edits, bash) works; forced tool_choice and prompt caching do not

:::

Prerequisites

  • Claude Code installed: npm install -g @anthropic-ai/claude-code
  • Ollama installed (v0.14.0+ required for Anthropic API compat; v0.14.3+ for stable streaming tool calls)
  • Hardware for Gemma 4:
    • gemma4:e4b — 9.6GB, runs on 16GB RAM Mac or 8GB VRAM GPU
    • gemma4:26b — 18GB, needs M2 Pro/Max or RTX 3090+
    • gemma4:31b — 20GB, needs 24GB+ VRAM or M3 Max/M4 Max

Check your Ollama version:

ollama --version
# Should be 0.14.0 or later
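If you script your setup, a small helper can enforce the version floor automatically. This is a sketch, not part of Ollama's CLI: it relies on `sort -V` (version sort, available in GNU and recent BSD coreutils) and assumes plain dotted `X.Y.Z` version strings.

```bash
# version_ge A B — succeeds if version A >= version B (assumes dotted X.Y.Z strings)
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Pull the version number out of `ollama --version` and check the floor
ver=$(ollama --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1)
version_ge "$ver" "0.14.0" || echo "Ollama ${ver:-<not found>} is too old; 0.14.0+ required" >&2
```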

If you need to update:

# macOS
brew upgrade ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Step 1: Pull Gemma 4

The correct model tags for Gemma 4 in Ollama (as of April 2026):

# Default — E4B edge model, 9.6GB, 128K context, best for most machines
ollama pull gemma4

# E2B — lighter, 7.2GB, 128K context
ollama pull gemma4:e2b

# 26B MoE — activates 4B params, 18GB, 256K context (best coding quality)
ollama pull gemma4:26b

# 31B Dense — 20GB, 256K context, maximum quality
ollama pull gemma4:31b

gemma4 with no tag pulls gemma4:e4b — the 9.6GB edge model. For Claude Code coding tasks, gemma4:26b is the best quality-to-speed tradeoff if your machine can handle the 18GB.

Verify the download:

ollama list
# NAME            ID              SIZE    MODIFIED
# gemma4:26b      abc123def456    18 GB   2 minutes ago

Step 2: Fix the context window before connecting

This is the most common failure point. Ollama defaults to a low context window (2K–4K tokens). Claude Code sends long system prompts plus file contents — it needs at least 64K tokens to function properly. Without this, Claude Code will fail mid-task with truncation errors.

Create a Modelfile that sets the context to 64K:

mkdir -p ~/.ollama/Modelfiles

cat > ~/.ollama/Modelfiles/gemma4-claude <<'EOF'
FROM gemma4:26b

PARAMETER num_ctx 65536
PARAMETER temperature 0.2
PARAMETER top_p 0.9
EOF

Build the custom model:

ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

Verify it’s available:

ollama list
# gemma4-claude    ...    18 GB

Use gemma4-claude (not gemma4:26b) when connecting to Claude Code. The base model still runs with Ollama’s small default context; the custom variant enforces 64K.
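To confirm the parameter actually made it into the build, `ollama show gemma4-claude --modelfile` prints the resolved Modelfile, which you can grep for `num_ctx`. For a setup script, a small gate like the one below works; `check_num_ctx` is an illustrative helper, not an Ollama command, and it assumes the Modelfile path used in this step.

```bash
# check_num_ctx FILE — succeeds if FILE contains "PARAMETER num_ctx N" with N >= 65536
check_num_ctx() {
  local ctx
  ctx=$(awk '$1 == "PARAMETER" && $2 == "num_ctx" { print $3 }' "$1")
  [ -n "$ctx" ] && [ "$ctx" -ge 65536 ]
}

# Usage against the Modelfile from this step:
#   check_num_ctx ~/.ollama/Modelfiles/gemma4-claude && echo "context OK"
```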


Step 3: Set the environment variables

Ollama’s Anthropic-compatible endpoint is at http://localhost:11434, not http://localhost:11434/v1. The /v1 path is Ollama’s OpenAI-compatible layer. Claude Code uses the Anthropic protocol, which maps to the root endpoint.

export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""

The ANTHROPIC_AUTH_TOKEN value can be any non-empty string — Ollama ignores it but Claude Code requires it to be set. Setting ANTHROPIC_API_KEY="" prevents Claude Code from falling back to a real API key if one is set in your environment.
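If you switch between local and cloud often, a tiny shell function (the name `use_local_gemma` is made up for this sketch; the values are exactly the ones above) keeps the three exports together. Put it in your shell profile and call it before launching Claude Code.

```bash
# use_local_gemma — point the current shell's Claude Code at local Ollama
use_local_gemma() {
  export ANTHROPIC_BASE_URL=http://localhost:11434
  export ANTHROPIC_AUTH_TOKEN=ollama   # any non-empty string; Ollama ignores it
  export ANTHROPIC_API_KEY=""          # blocks fallback to a real key in your env
}
```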

Start Claude Code with the model:

claude --model gemma4-claude

Quick test:

claude --model gemma4-claude -p "What model are you?"

Gemma 4 will describe itself. If you see a connection error, confirm Ollama is running:

ollama serve   # start if not running
curl http://localhost:11434/api/tags   # confirm it responds
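If you start Ollama from a script, it takes a moment before the port is listening, so a one-shot curl can race it. A small poll loop avoids that; `wait_for_api` is an illustrative helper (not an Ollama or Claude Code feature), parameterized so it works for any HTTP endpoint.

```bash
# wait_for_api URL [TRIES] — poll URL once per second until it responds or TRIES runs out
wait_for_api() {
  local url=$1 tries=${2:-10} i
  for i in $(seq "$tries"); do
    curl -fsS --max-time 2 "$url" >/dev/null 2>&1 && return 0
    sleep 1
  done
  return 1
}

# Usage:
#   ollama serve &
#   wait_for_api http://localhost:11434/api/tags 15 && claude --model gemma4-claude
```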

Step 4: Per-project config with .claude/settings.json

Setting env vars globally means every Claude Code session on every project routes to Ollama. Usually you want this scoped to specific projects.

In your project root:

mkdir -p .claude
cat > .claude/settings.json <<'EOF'
{
  "model": "gemma4-claude",
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": ""
  }
}
EOF

Claude Code loads .claude/settings.json automatically when run from that directory. Projects without this file continue using the Anthropic API normally.
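A stray comma or missing key means the routing silently doesn’t apply and requests go to the Anthropic API (see Troubleshooting), so it’s worth validating the file right after writing it. This sketch wraps the check in a hypothetical `validate_settings` helper and assumes the file layout shown above.

```bash
# validate_settings PATH — succeeds if PATH is well-formed JSON with the three env keys
validate_settings() {
  python3 - "$1" <<'EOF'
import json, sys

cfg = json.load(open(sys.argv[1]))        # raises on malformed JSON
required = ("ANTHROPIC_BASE_URL", "ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_API_KEY")
missing = [k for k in required if k not in cfg.get("env", {})]
sys.exit(f"missing env keys: {missing}" if missing else 0)
EOF
}

# Usage:
#   validate_settings .claude/settings.json && echo "settings OK"
```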

Add it to .gitignore if you don’t want teammates picking up local model config:

echo ".claude/settings.json" >> .gitignore

Or commit it if local-model-first is a team decision — just ensure everyone has Ollama running.


Step 5: Document the setup in CLAUDE.md

Claude Code reads CLAUDE.md in the project root as persistent context for every session. Add your model routing strategy so it’s clear when to use local vs cloud:

## Model setup

This project defaults to Gemma 4 locally via Ollama (see `.claude/settings.json`).

To switch to Claude for a specific task:
```bash
ANTHROPIC_BASE_URL="" ANTHROPIC_API_KEY="your-key" claude --model claude-sonnet-4-6
```

Use Claude API for: multi-file refactors, complex debugging, architecture decisions. Use Gemma 4 locally for: file summaries, boilerplate generation, single-file edits, tests.


---

## What works and what doesn't

Ollama's Anthropic compatibility layer supports most of what Claude Code needs, but not everything. As of April 2026:

**Works:**
- Messages API (multi-turn conversation)
- Streaming responses
- System prompts
- Tool calling — file reads, file edits, bash execution
- Vision / image input (base64 only, not URL)
- Temperature, top_p, stop sequences

**Does not work:**
- `tool_choice` (forced tool selection) — Claude Code uses this occasionally; it silently falls back to auto
- Prompt caching — no performance benefit from repeated identical prompts
- Extended thinking / budget_tokens — parameter is accepted but not enforced
- URL-referenced images — only base64 works
- Token counting endpoint

In practice, basic Claude Code tasks — file reads, edits, bash commands, test generation — work correctly. The missing `tool_choice` support means Claude Code may occasionally pick the wrong tool on its first attempt, but it self-corrects.

---

## Switching back to Claude API for a single task

```bash
# One-off cloud task without changing settings.json
ANTHROPIC_BASE_URL="" ANTHROPIC_API_KEY="sk-ant-..." claude --model claude-sonnet-4-6 -p "Refactor auth module"
```

Or switch mid-session with the slash command:

/model claude-sonnet-4-6

Note: /model changes the model but not ANTHROPIC_BASE_URL. If the URL is still pointing at Ollama, the model name is passed to Ollama, which will error if it doesn’t have that model. Clear the URL in the env before switching to a cloud model.
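One way to make that switch safe from the shell is to clear the local routing first; `go_cloud` below is a hypothetical helper, and the variable names are the ones set earlier in this guide.

```bash
# go_cloud — clear local Ollama routing so cloud model names resolve at Anthropic
go_cloud() {
  unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN
  # ANTHROPIC_API_KEY must then hold a real key, e.g. from your shell profile
}
```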


Performance on common Claude Code tasks

Approximate response times with gemma4:26b (18GB MoE, 64K context):

| Task | M2 Pro 16GB | M3 Max 36GB | RTX 4090 24GB |
|---|---|---|---|
| Explain a 200-line file | ~6s | ~3s | ~2.5s |
| Write a unit test | ~8s | ~4s | ~3s |
| 50-line code generation | ~10s | ~5s | ~4s |
| Multi-file refactor plan | ~20s | ~10s | ~8s |

gemma4:26b activates only 4B parameters during inference despite its 26B total size — which is why it’s faster than you’d expect. The gemma4:31b Dense model is 2–3× slower but noticeably better on complex reasoning tasks.


Troubleshooting

Error: connect ECONNREFUSED 127.0.0.1:11434

Ollama isn’t running. Start it:

ollama serve

On macOS, check the menubar — Ollama runs as a menubar app after install.

model "gemma4-claude" not found

The Modelfile build didn’t complete. Rebuild:

ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude
ollama list   # confirm it appears

Claude Code truncates mid-response or fails with context errors

The context window isn’t large enough. Edit your Modelfile and increase num_ctx:

PARAMETER num_ctx 131072   # 128K

Then rebuild: ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

Requests still going to Anthropic API despite settings.json

Check the settings.json is in the project root (same directory as CLAUDE.md and where you run claude), and validate the JSON:

cat .claude/settings.json | python3 -m json.tool

Tool calls not working / Claude Code can’t edit files

Streaming tool calls require Ollama v0.14.3+. Check your version and update if needed:

ollama --version
brew upgrade ollama   # macOS

Summary

  • Ollama’s Anthropic API at http://localhost:11434 (not /v1) is what Claude Code connects to
  • Three env vars: ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN=ollama, ANTHROPIC_API_KEY=""
  • Always build a Modelfile with num_ctx 65536 — default context is too small for Claude Code
  • gemma4:26b is the practical choice: 18GB, 256K context, fast MoE inference
  • Tool calling works; tool_choice and prompt caching don’t — expect occasional first-attempt wrong tool, self-corrects
  • Scope config to .claude/settings.json per-project rather than global env vars

FAQ

Does this actually save money compared to the Anthropic API?

For light usage, Claude API costs are low enough that local setup overhead may not be worth it. Where local makes clear sense: privacy-sensitive codebases where data shouldn’t leave your network, sustained heavy usage (thousands of Claude Code invocations per day), or offline work with no connectivity. If you’re doing a few dozen claude -p calls per day, the API cost is negligible.

Can I use gemma4:31b instead of gemma4:26b?

Yes. Create a separate Modelfile pointing to gemma4:31b and build it as gemma4-claude-31b. The 31B Dense model gives noticeably better output on complex multi-step reasoning, at 2–3× the latency and 20GB vs 18GB memory. Worth it on a machine with 32GB+ VRAM; marginal on 24GB.

Will this work on Apple Silicon?

Yes. Ollama uses Metal automatically on Apple Silicon — no configuration needed. The gemma4:26b model runs well on M2 Pro (16GB) at around 20–25 tokens/second. M3 Max (36–48GB) handles gemma4:31b comfortably.