Claude Code connects to either Anthropic’s cloud or a local Ollama model. Swap in Ollama and you get zero API costs, full offline capability, and your code never leaves your machine. This guide walks through the complete setup for 2026 — installation, env vars, model selection, and the gotchas that’ll trip you up if you don’t know them.
- Install Ollama, pull a model, then run ollama launch claude --model <model> (Ollama v0.14.5+)
- Or set env vars manually: ANTHROPIC_BASE_URL=http://localhost:11434, ANTHROPIC_AUTH_TOKEN=ollama, ANTHROPIC_API_KEY=""
- Recommended first model: glm-4.7-flash (9 GB, 128K context, fast) or qwen2.5-coder:7b (4 GB, low-VRAM)
- Claude Code needs a 64K+ context window — create a Modelfile or use ollama launch to auto-configure
- Tool calling, file editing, and bash execution all work locally
Prerequisites
Before you start:
- macOS, Linux, or Windows (WSL recommended on Windows)
- 8 GB RAM minimum (16 GB recommended for larger models)
- 20 GB free disk space
- GPU (optional but strongly recommended): Apple Silicon, NVIDIA, or AMD. GPU inference is 5–20x faster than CPU.
Check your GPU if you have one:
nvidia-smi
You should see your GPU listed with VRAM and CUDA version. If not, Ollama falls back to CPU — still works, just slower.
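On Apple Silicon there's no nvidia-smi; Ollama uses Metal automatically. Once you've installed Ollama and loaded a model (next sections), you can confirm the GPU is actually doing the work with ollama ps. The exact output layout can vary by Ollama version:
ollama ps
# The PROCESSOR column should read "100% GPU" (or a GPU/CPU split).
# "100% CPU" means the model didn't fit in VRAM or the GPU isn't being used.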
How do you install Ollama?
Ollama runs as a background service that manages your models and exposes an HTTP API. Once it’s running, Claude Code talks to it the same way it talks to Anthropic’s servers.
macOS / Windows:
Download the installer from ollama.com. Run it and Ollama starts automatically as a background service.
Linux (single command):
curl -fsSL https://ollama.com/install.sh | sh
Verify the install:
ollama --version
# ollama version 0.14.5 or later
Ollama v0.14.0 or later is required for Claude Code compatibility. Older versions don’t expose the Anthropic Messages API correctly. If you’re on an older version, upgrade with brew upgrade ollama (macOS) or re-run the install script (Linux).
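Ollama listens on port 11434 by default, which is the port Claude Code will talk to. A quick way to confirm the server is up is to hit Ollama's standard HTTP endpoints:
curl http://localhost:11434/api/version
# Should return something like {"version":"0.14.5"}
curl http://localhost:11434/api/tags
# Lists the models you've pulled (empty until the next step)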
Which model should you pull for Claude Code?
The model is your AI brain. Claude Code works with any Ollama model, but different models have different strengths.
Fastest setup — recommended for most people:
ollama pull glm-4.7-flash
glm-4.7-flash is fast (~25 tokens/second on GPU), has a 128K token context window, and handles tool calling well. At ~9 GB it’s the sweet spot for machines with 16 GB RAM or an 8 GB VRAM GPU.
Best for complex refactoring:
ollama pull qwen2.5-coder:32b
qwen2.5-coder:32b is the strongest open coding model. It gives the best results for multi-file refactors and architectural decisions, but it needs ~20 GB VRAM, so it's only practical on an RTX 4090, RTX 3090, or M3 Max and up.
Low-VRAM machines (8 GB VRAM or less):
ollama pull qwen2.5-coder:7b
~4 GB download, ~5 GB VRAM needed. Runs on CPU if no GPU available — slow (2–5 tokens/second) but functional.
Model comparison:
| Model | Size | VRAM | Context | Best for |
|---|---|---|---|---|
| glm-4.7-flash | ~9 GB | 11 GB | 128K | First-time users, fast tasks |
| qwen2.5-coder:7b | ~4 GB | 5 GB | 32K | Low-VRAM, quick tasks |
| qwen2.5-coder:32b | ~18 GB | 20 GB | 32K | Complex refactoring, multi-file |
| glm-4.7:cloud | — | — | 128K | No GPU? Use hybrid |
Test your model after pulling:
ollama run glm-4.7-flash "Hello, what model are you?"
You should get a response in 1–5 seconds. Slow? Check that your GPU is being used (nvidia-smi on NVIDIA, Activity Monitor on macOS) and that nothing else is hogging RAM.
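You can also test over HTTP, the same interface Claude Code will use. This hits Ollama's standard generate endpoint, with streaming turned off so you get a single readable response:
curl http://localhost:11434/api/generate -d '{
  "model": "glm-4.7-flash",
  "prompt": "Reply with one word: ready?",
  "stream": false
}'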
How do you configure Claude Code to use Ollama?
Here are two ways to connect Claude Code to your local model.
Method 1: ollama launch (easiest, Ollama v0.14.5+)
If you have Ollama v0.14.5 or later, the easiest path is:
ollama launch claude --model glm-4.7-flash
This automatically sets the environment variables and opens Claude Code in your current directory. Done.
Method 2: Set environment variables manually
If you want to launch Claude Code directly or set up a permanent config, you’ll need three environment variables:
| Variable | Value | Why |
|---|---|---|
| ANTHROPIC_BASE_URL | http://localhost:11434 | Points Claude Code to your local Ollama server |
| ANTHROPIC_AUTH_TOKEN | ollama | Required — Ollama ignores the value but Claude Code needs it set |
| ANTHROPIC_API_KEY | "" (empty string) | Must be explicitly empty — prevents Claude Code from falling back to a real API key |
Temporary (this session only):
# macOS / Linux / WSL
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
claude
Permanent (all projects):
Create or edit ~/.claude/config.json:
{
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "ollama",
"ANTHROPIC_API_KEY": ""
},
"model": "glm-4.7-flash"
}
Per-project (recommended):
If you only want local models for specific projects, add a .claude/settings.json in that project’s root:
{
"model": "glm-4.7-flash",
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "ollama",
"ANTHROPIC_API_KEY": ""
}
}
Add it to .gitignore if you don’t want teammates picking up local config:
echo ".claude/settings.json" >> .gitignore How do you fix the context window size?
This is the step most guides skip, and it’s the most common reason Claude Code fails mid-task.
Ollama defaults to a 2K–4K token context window. Claude Code reads entire project files, runs tests, edits multiple files — it needs at least 64K tokens to work without truncation.
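You can see the mismatch for yourself with ollama show, which prints a model's details (the exact layout varies by Ollama version). Note the distinction: the model weights may advertise a large maximum context, but Ollama still allocates its small default at runtime unless num_ctx is set.
ollama show glm-4.7-flash
# "context length" under Model = the maximum the weights support.
# A num_ctx line under Parameters = what Ollama will actually allocate.
# No num_ctx listed? You're on the small default; fix it with a Modelfile below.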
Create a Modelfile that enforces a larger context:
mkdir -p ~/.ollama/Modelfiles
cat > ~/.ollama/Modelfiles/claude-code-64k <<'EOF'
FROM glm-4.7-flash
PARAMETER num_ctx 65536
PARAMETER temperature 0.7
EOF
Build and use it:
ollama create claude-code-64k -f ~/.ollama/Modelfiles/claude-code-64k
# Now launch with this model variant:
ollama launch claude --model claude-code-64k
Verify it’s working:
ollama list
# Should show: claude-code-64k ... 9 GB
claude --model claude-code-64k
The scenario: you ask Claude Code to refactor a 500-line module. With the default 4K context, it sees the first few files, loses track of the rest, and starts hallucinating variable names that don’t exist. With 64K context, it holds the entire module in memory and gives you accurate edits. The difference between a useful session and a frustrating one.
How do you launch Claude Code with your local model?
You have options:
Automatic (easiest):
ollama launch claude --model glm-4.7-flash
Ollama sets everything and opens Claude Code in your current directory.
Manual with env vars:
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
claude
One-shot (no REPL):
claude -p "Explain this function" < src/utils/auth.ts On first launch, Claude Code asks permission to access files in the current directory. Type yes to proceed.
How do you configure Claude Code for maximum privacy?
By default, Claude Code sends telemetry. To keep your setup fully offline and private, disable non-essential traffic:
In ~/.claude/config.json or .claude/settings.json:
{
"model": "glm-4.7-flash",
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "ollama",
"ANTHROPIC_API_KEY": "",
"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
}
}
This ensures all inference runs locally and no usage data leaves your machine.
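If you're using shell exports rather than config files, the same switch works as a plain environment variable, set alongside the three connection vars from earlier:
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1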
What works and what doesn’t with Ollama + Claude Code?
Ollama’s Anthropic compatibility layer handles most of what Claude Code needs, but not everything. As of 2026:
Works:
- Multi-turn conversation (Messages API)
- Streaming responses
- System prompts
- Tool calling — file reads, file edits, bash execution
- Temperature, top_p, stop sequences
- Vision (base64 images)
Doesn’t work:
- tool_choice (forced tool selection) — Claude Code falls back to auto when needed
- Prompt caching — no speed benefit from repeated identical prompts
- URL-referenced images — only base64 works
In practice: file editing, bash execution, test generation, code review, refactoring — all work correctly. The missing tool_choice means Claude Code occasionally picks the wrong tool on first try, then self-corrects. Not a blocker.
How do you run multiple models?
Switch models to compare outputs on the same task:
# Task with fast model
ollama launch claude --model glm-4.7-flash
# (run your task, observe)
# Same task with reasoning model
ollama launch claude --model qwen2.5-coder:32b
# (compare results)
Or build model variants with different settings:
cat > ~/.ollama/Modelfiles/coder-lowtemp <<'EOF'
FROM qwen2.5-coder:7b
PARAMETER num_ctx 65536
PARAMETER temperature 0.3
EOF
ollama create coder-lowtemp -f ~/.ollama/Modelfiles/coder-lowtemp
ollama launch claude --model coder-lowtemp
Lower temperature gives more deterministic output — better for code generation, worse for creative exploration.
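To compare models on the same task without re-launching the REPL each time, you can loop over one-shot runs. A sketch, assuming your env vars already point at Ollama and that your claude version accepts --model together with -p:
for m in glm-4.7-flash qwen2.5-coder:32b; do
  echo "=== $m ==="
  claude --model "$m" -p "Suggest a refactor for src/utils/auth.ts"
done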
How do you troubleshoot connection errors?
“Command not found: ollama”
Ollama isn’t in your PATH. Restart your terminal. If that doesn’t work, find the installation and add it to your PATH:
which ollama # should return a path
# On macOS, Ollama installs to /usr/local/bin
# If missing, reinstall from ollama.com
Claude Code says it can’t access models
Check your environment variables:
echo $ANTHROPIC_BASE_URL
# Should print: http://localhost:11434
echo $ANTHROPIC_AUTH_TOKEN
# Should print: ollama
If empty, re-export the variables:
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY="" Model is slow or hangs
- Check if the model finished downloading:
ollama list
If it shows “downloading” instead of a size, wait for it to complete.
- Verify GPU is being used:
ollama ps
Look for the model name — if it’s using CPU instead of GPU, check your drivers.
- Use a smaller model:
ollama pull qwen2.5-coder:7b
ollama launch claude --model qwen2.5-coder:7b
Out of memory errors
Your model is too large for your hardware. Switch to a smaller model or use a quantized variant:
# Pull a quantized version
ollama pull glm-4.7-flash:q4_K_M
# Or use a smaller model entirely
ollama pull qwen2.5-coder:7b
Port 11434 already in use
Something else is using Ollama’s port. Change Ollama’s port:
OLLAMA_HOST=localhost:11435 ollama serve
Then update ANTHROPIC_BASE_URL to http://localhost:11435 so Claude Code points at the new port.
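If you'd rather find out what's holding the port than move Ollama (often it's just a second Ollama instance), a quick check on macOS/Linux:
lsof -i :11434
# or on Linux:
ss -ltnp | grep 11434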
ANTHROPIC_API_KEY error
If Claude Code still tries to authenticate after setting env vars, make sure ANTHROPIC_API_KEY is explicitly set to an empty string — not omitted:
export ANTHROPIC_API_KEY="" In config.json, this must be "ANTHROPIC_API_KEY": "" (empty string), not missing entirely.
What are the real costs and savings?
Cloud API costs add up fast:
- Anthropic Claude API: ~$3 per 1M input tokens
- OpenAI GPT-4: ~$30 per 1M input tokens
- Local Ollama: $0 (hardware only, one-time)
For a developer using Claude Code 8 hours a day:
Cloud (Anthropic): ~$100/month
Local (Ollama): $0/month (after hardware purchase)
Savings: 100%
The hardware costs more upfront, but if you’re using Claude Code heavily, it pays for itself within a few months. And you get offline capability, full data privacy, and the ability to work on planes, in cafes, or air-gapped environments.
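The ~$100/month figure is a rough estimate, not a measurement. Here is one way the arithmetic lands in that range; the daily token volumes and the ~$15/M output rate are assumptions, and only the ~$3/M input rate comes from the list above:
# Back-of-envelope, 22 working days per month:
#   input:  ~1.0M tokens/day x $3/M  x 22 ≈ $66
#   output: ~0.1M tokens/day x $15/M x 22 ≈ $33
#   total:  ≈ $99/month; heavier use scales this up roughly linearly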
Summary
- Install Ollama, pull a model (glm-4.7-flash is the best starting point)
- Launch with ollama launch claude --model <model> (Ollama v0.14.5+) or set env vars manually
- Three env vars: ANTHROPIC_BASE_URL=http://localhost:11434, ANTHROPIC_AUTH_TOKEN=ollama, ANTHROPIC_API_KEY=""
- Always set context to 64K+ via Modelfile — the default is too small for Claude Code
- glm-4.7-flash for speed, qwen2.5-coder:32b for quality, qwen2.5-coder:7b for low-VRAM
- Tool calling, file editing, and bash execution all work; tool_choice and prompt caching don’t
- Scope config to .claude/settings.json per-project, and add it to .gitignore if needed
Frequently Asked Questions
Is this actually private?
Yes. With all three env vars set and telemetry disabled, all inference runs on your local machine. No data is sent to Anthropic, OpenAI, or anyone else. Your code never leaves your machine.
Which model is best for coding?
For most coding tasks on a machine with 16 GB RAM or an 8 GB GPU: glm-4.7-flash. Fast, 128K context, handles tool calling well.
If you have a 24 GB+ GPU and want better reasoning on complex refactors: qwen2.5-coder:32b.
For low-VRAM machines (8 GB or less): qwen2.5-coder:7b.
Does this work on Apple Silicon?
Yes. Ollama uses Metal automatically on M1/M2/M3/M4 — no configuration needed. The glm-4.7-flash model runs at ~25 tokens/second on M2 Pro (16GB). M3 Max (36GB+) handles larger models comfortably.
Can I switch back to Claude API for a specific task?
Yes. Use env vars to override for one session:
ANTHROPIC_BASE_URL=""
ANTHROPIC_API_KEY="sk-ant-..."
claude --model claude-sonnet-4-6 Or use the /model slash command mid-session — just note that /model changes the model name, not the base URL. Clear ANTHROPIC_BASE_URL first if you want to switch to cloud.
What’s the difference between this and the Gemma 4 guide?
The Gemma 4 + Claude Code guide covers Google models specifically. This guide covers all Ollama models and focuses on the setup process (env vars, Modelfiles, troubleshooting) rather than a specific model family. The mechanics are the same regardless of which model you choose.
What to Read Next
- How to Use Gemma 4 with Claude Code via Ollama — Google models, specific model tags, and what works vs what doesn’t
- How to Install Ollama and Run LLMs Locally — deeper Ollama setup, model options, API usage
- Claude Code CLI Cheat Sheet — slash commands, flags, and REPL tricks
- Qwen Coder Cheat Sheet — the top coding-focused model available through Ollama