
Claude Code + Ollama: Free Local AI Coding Setup (2026)

By Darsh Jariwala

Claude Code connects to either Anthropic’s cloud or a local Ollama model. Swap in Ollama and you get zero API costs, full offline capability, and your code never leaves your machine. This guide walks through the complete setup for 2026 — installation, env vars, model selection, and the gotchas that’ll trip you up if you don’t know them.

TL;DR
  • Install Ollama, pull a model, then run ollama launch claude --model <model> (Ollama v0.14.5+)
  • Or set env vars manually: ANTHROPIC_BASE_URL=http://localhost:11434, ANTHROPIC_AUTH_TOKEN=ollama, ANTHROPIC_API_KEY=""
  • Recommended first model: glm-4.7-flash (9GB, 128K context, fast) or qwen2.5-coder:7b (4GB, low-VRAM)
  • Claude Code needs 64K+ context window — create a Modelfile or use ollama launch to auto-configure
  • Tool calling, file editing, bash execution all work locally

Prerequisites

Before you start:

  • macOS, Linux, or Windows (WSL recommended on Windows)
  • 8 GB RAM minimum (16 GB recommended for larger models)
  • 20 GB free disk space
  • GPU (optional but strongly recommended): Apple Silicon, NVIDIA, or AMD. GPU inference is 5–20x faster than CPU.

Check your GPU if you have one (NVIDIA example; on Apple Silicon, Metal is used automatically):

bash
nvidia-smi

You should see your GPU listed with VRAM and CUDA version. If not, Ollama falls back to CPU — still works, just slower.


How do you install Ollama?

Ollama runs as a background service that manages your models and exposes an HTTP API. Once it’s running, Claude Code talks to it the same way it talks to Anthropic’s servers.

macOS / Windows:

Download the installer from ollama.com. Run it and Ollama starts automatically as a background service.

Linux (single command):

bash
curl -fsSL https://ollama.com/install.sh | sh

Verify the install:

bash
ollama --version
# ollama version 0.14.5 or later
Warning: Ollama v0.14.0 or later is required for Claude Code compatibility. Older versions don’t expose the Anthropic Messages API correctly. If you’re on an older version, upgrade with brew upgrade ollama (macOS) or re-run the install script (Linux).


Which model should you pull for Claude Code?

The model is your AI brain. Claude Code works with any Ollama model, but different models have different strengths.

Fastest setup — recommended for most people:

bash
ollama pull glm-4.7-flash

glm-4.7-flash is fast (~25 tokens/second on GPU), has a 128K token context window, and handles tool calling well. At ~9 GB it’s the sweet spot for machines with 16 GB RAM or an 8 GB VRAM GPU.

Best for complex refactoring:

bash
ollama pull qwen2.5-coder:32b

qwen2.5-coder:32b is the strongest open coding model. Best results for multi-file refactors and architectural decisions, but needs ~20 GB VRAM. Only on RTX 4090, 3090, or M3 Max+.

Low-VRAM machines (8 GB VRAM or less):

bash
ollama pull qwen2.5-coder:7b

~4 GB download, ~5 GB VRAM needed. Runs on CPU if no GPU available — slow (2–5 tokens/second) but functional.

Model comparison:

| Model | Size | VRAM | Context | Best for |
|---|---|---|---|---|
| glm-4.7-flash | ~9 GB | 11 GB | 128K | First-time users, fast tasks |
| qwen2.5-coder:7b | ~4 GB | 5 GB | 32K | Low-VRAM, quick tasks |
| qwen2.5-coder:32b | ~18 GB | 20 GB | 32K | Complex refactoring, multi-file |
| glm-4.7:cloud | n/a (cloud) | none | 128K | No GPU? Use hybrid |

Test your model after pulling:

bash
ollama run glm-4.7-flash "Hello, what model are you?"

You should get a response in 1–5 seconds. Slow? Check that your GPU is being used (nvidia-smi on NVIDIA, Activity Monitor on macOS) and that nothing else is hogging RAM.


How do you configure Claude Code to use Ollama?

Here are two ways to connect Claude Code to your local model.

Method 1: ollama launch (easiest, Ollama v0.14.5+)

If you have Ollama v0.14.5 or later, the easiest path is:

bash
ollama launch claude --model glm-4.7-flash

This automatically sets the environment variables and opens Claude Code in your current directory. Done.

Method 2: Set environment variables manually

If you want to launch Claude Code directly or set up a permanent config, you’ll need three environment variables:

| Variable | Value | Why |
|---|---|---|
| ANTHROPIC_BASE_URL | http://localhost:11434 | Points Claude Code to your local Ollama server |
| ANTHROPIC_AUTH_TOKEN | ollama | Required: Ollama ignores the value but Claude Code needs it set |
| ANTHROPIC_API_KEY | "" (empty string) | Must be explicitly empty; prevents Claude Code from falling back to a real API key |

Temporary (this session only):

bash
# macOS / Linux / WSL
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""

claude

Permanent (all projects):

Create or edit ~/.claude/config.json:

json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": ""
  },
  "model": "glm-4.7-flash"
}

Per-project (recommended):

If you only want local models for specific projects, add a .claude/settings.json in that project’s root:

json
{
  "model": "glm-4.7-flash",
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": ""
  }
}

Add it to .gitignore if you don’t want teammates picking up local config:

bash
echo ".claude/settings.json" >> .gitignore

How do you fix the context window size?

This is the step most guides skip, and it’s the most common reason Claude Code fails mid-task.

Ollama defaults to a 2K–4K token context window. Claude Code reads entire project files, runs tests, edits multiple files — it needs at least 64K tokens to work without truncation.

Create a Modelfile that enforces a larger context:

bash
mkdir -p ~/.ollama/Modelfiles

cat > ~/.ollama/Modelfiles/claude-code-64k <<'EOF'
FROM glm-4.7-flash

PARAMETER num_ctx 65536
PARAMETER temperature 0.7
EOF

Build and use it:

bash
ollama create claude-code-64k -f ~/.ollama/Modelfiles/claude-code-64k

# Now launch with this model variant:
ollama launch claude --model claude-code-64k

Verify it’s working:

bash
ollama list
# Should show: claude-code-64k ... 9 GB

claude --model claude-code-64k

The scenario: You ask Claude Code to refactor a 500-line module. With the default 4K context, it sees the first few files, loses track of the rest, and starts hallucinating variable names that don’t exist. With 64K context, it holds the entire module in memory and gives you accurate edits. The difference between a useful session and a frustrating one.


How do you launch Claude Code with your local model?

You have options:

Automatic (easiest):

bash
ollama launch claude --model glm-4.7-flash

Ollama sets everything and opens Claude Code in your current directory.

Manual with env vars:

bash
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""

claude

One-shot (no REPL):

bash
claude -p "Explain this function" < src/utils/auth.ts

On first launch, Claude Code asks permission to access files in the current directory. Type yes to proceed.


How do you configure Claude Code for maximum privacy?

By default, Claude Code sends telemetry. To keep your setup fully offline and private, disable non-essential traffic:

In ~/.claude/config.json or .claude/settings.json:

json
{
  "model": "glm-4.7-flash",
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}

This ensures all inference runs locally and no usage data leaves your machine.


What works and what doesn’t with Ollama + Claude Code?

Ollama’s Anthropic compatibility layer handles most of what Claude Code needs, but not everything. As of 2026:

Works:

  • Multi-turn conversation (Messages API)
  • Streaming responses
  • System prompts
  • Tool calling — file reads, file edits, bash execution
  • Temperature, top_p, stop sequences
  • Vision (base64 images)

Doesn’t work:

  • tool_choice (forced tool selection) — Claude Code falls back to auto when needed
  • Prompt caching — no speed benefit from repeated identical prompts
  • URL-referenced images — only base64 works

In practice: file editing, bash execution, test generation, code review, refactoring — all work correctly. The missing tool_choice means Claude Code occasionally picks the wrong tool on first try, then self-corrects. Not a blocker.


How do you run multiple models?

Switch models to compare outputs on the same task:

bash
# Task with fast model
ollama launch claude --model glm-4.7-flash
# (run your task, observe)

# Same task with reasoning model
ollama launch claude --model qwen2.5-coder:32b
# (compare results)

Or build model variants with different settings:

bash
cat > ~/.ollama/Modelfiles/coder-lowtemp <<'EOF'
FROM qwen2.5-coder:7b

PARAMETER num_ctx 65536
PARAMETER temperature 0.3
EOF

ollama create coder-lowtemp -f ~/.ollama/Modelfiles/coder-lowtemp
ollama launch claude --model coder-lowtemp

Lower temperature gives more deterministic output — better for code generation, worse for creative exploration.


How do you troubleshoot connection errors?

“Command not found: ollama”

Ollama isn’t in your PATH. Restart your terminal. If that doesn’t work, find the installation and add it to your PATH:

bash
which ollama   # should return a path

# On macOS, Ollama installs to /usr/local/bin
# If missing, reinstall from ollama.com

Claude Code says it can’t access models

Check your environment variables:

bash
echo $ANTHROPIC_BASE_URL
# Should print: http://localhost:11434

echo $ANTHROPIC_AUTH_TOKEN
# Should print: ollama

If empty, re-export the variables:

bash
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""

Model is slow or hangs

  1. Check if the model finished downloading:
bash
ollama list

If it shows “downloading” instead of a size, wait for it to complete.

  2. Verify GPU is being used:
bash
ollama ps

Look for the model name — if it’s using CPU instead of GPU, check your drivers.

  3. Use a smaller model:
bash
ollama pull qwen2.5-coder:7b
ollama launch claude --model qwen2.5-coder:7b

Out of memory errors

Your model is too large for your hardware. Switch to a smaller model or use a quantized variant:

bash
# Pull a quantized version
ollama pull glm-4.7-flash:q4_K_M

# Or use a smaller model entirely
ollama pull qwen2.5-coder:7b

Port 11434 already in use

Something else is using Ollama’s port. Change Ollama’s port:

bash
OLLAMA_HOST=localhost:11435 ollama serve

Then update your config to use :11435.

ANTHROPIC_API_KEY error

If Claude Code still tries to authenticate after setting env vars, make sure ANTHROPIC_API_KEY is explicitly set to an empty string — not omitted:

bash
export ANTHROPIC_API_KEY=""

In config.json, this must be "ANTHROPIC_API_KEY": "" (empty string), not missing entirely.


What are the real costs and savings?

Cloud API costs add up fast:

  • Anthropic Claude API: ~$3 per 1M input tokens
  • OpenAI GPT-4: ~$30 per 1M input tokens
  • Local Ollama: $0 (hardware only, one-time)

For a developer using Claude Code 8 hours a day:

plaintext
Cloud (Anthropic): ~$100/month
Local (Ollama): $0/month (after hardware purchase)
Savings: 100%

The hardware costs more upfront, but if you’re using Claude Code heavily, it can pay for itself within months. And you get offline capability, full data privacy, and the ability to work on planes, in cafes, or in air-gapped environments.


Summary

  • Install Ollama, pull a model (glm-4.7-flash is the best starting point)
  • Launch with ollama launch claude --model <model> (Ollama v0.14.5+) or set env vars manually
  • Three env vars: ANTHROPIC_BASE_URL=http://localhost:11434, ANTHROPIC_AUTH_TOKEN=ollama, ANTHROPIC_API_KEY=""
  • Always set context to 64K+ via Modelfile — default is too small for Claude Code
  • glm-4.7-flash for speed, qwen2.5-coder:32b for quality, qwen2.5-coder:7b for low-VRAM
  • Tool calling, file editing, bash execution all work; tool_choice and prompt caching don’t
  • Scope config to .claude/settings.json per-project, add to .gitignore if needed

Frequently Asked Questions

Is this actually private?

Yes. With all three env vars set and telemetry disabled, all inference runs on your local machine. No data is sent to Anthropic, OpenAI, or anyone else. Your code never leaves your machine.

Which model is best for coding?

For most coding tasks on a machine with 16 GB RAM or an 8 GB GPU: glm-4.7-flash. Fast, 128K context, handles tool calling well.

If you have a 24 GB+ GPU and want better reasoning on complex refactors: qwen2.5-coder:32b.

For low-VRAM machines (8 GB or less): qwen2.5-coder:7b.

Does this work on Apple Silicon?

Yes. Ollama uses Metal automatically on M1/M2/M3/M4 — no configuration needed. The glm-4.7-flash model runs at ~25 tokens/second on M2 Pro (16GB). M3 Max (36GB+) handles larger models comfortably.

Can I switch back to Claude API for a specific task?

Yes. Use env vars to override for one session:

bash
ANTHROPIC_BASE_URL="" ANTHROPIC_API_KEY="sk-ant-..." claude --model claude-sonnet-4-6

Or use the /model slash command mid-session — just note that /model changes the model name, not the base URL. Clear ANTHROPIC_BASE_URL first if you want to switch to cloud.

What’s the difference between this and the Gemma 4 guide?

The Gemma 4 + Claude Code guide covers Google models specifically. This guide covers all Ollama models and focuses on the setup process (env vars, Modelfiles, troubleshooting) rather than a specific model family. The mechanics are the same regardless of which model you choose.