
Run Hermes Agent with Ollama: Complete Local LLM Setup

By Jena

Want your Hermes Agent completely private? No API calls. No vendor lock-in. No monthly bills.

Run it locally with Ollama. Free inference, complete data privacy, fast responses.

Why Ollama + Hermes

  • Cost: $0/month (after hardware investment)
  • Privacy: Nothing leaves your machine
  • Speed: No network round-trips; with a GPU, comparable to cloud
  • Control: You own everything

Hermes learns, Ollama runs the model, your data stays yours.

Prerequisites

  • Hermes Agent installed (see Article 2)
  • Ollama installed from ollama.ai
  • Hardware: 4GB RAM minimum (8GB recommended)
    • Optional: GPU with 4GB+ VRAM (much faster)
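
Not sure whether your machine qualifies? A few quick checks (Linux and macOS shown; the last command assumes NVIDIA's driver tools are installed):

# Available RAM (Linux)
free -h

# Available RAM (macOS)
sysctl hw.memsize

# GPU model and VRAM (NVIDIA)
nvidia-smi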

Step 1: Install Ollama

macOS:

brew install ollama

Linux:

curl https://ollama.ai/install.sh | sh

Windows: Download from ollama.ai
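
Whichever route you take, confirm the install before moving on:

ollama --version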

Step 2: Start Ollama Server

ollama serve

Keep this terminal open. Ollama runs on http://localhost:11434.
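
Quick sanity check from another terminal (the root endpoint replies with a short status message):

curl http://localhost:11434
# Expected: "Ollama is running"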

Step 3: Download a Model

In another terminal:

# Hermes-optimized models (recommended)
ollama pull mistral          # 7B, fast, good quality
ollama pull neural-chat      # 7B, conversational
ollama pull orca-mini        # 3B, minimal resources

# Or other popular models
ollama pull llama2           # 7B, general purpose
ollama pull dolphin-mixtral  # Larger, more powerful

Download time: 5-30 minutes depending on model size and internet speed.

Model choice guide:

  • First time? Pick mistral (balance of speed and quality)
  • Want faster? Pick orca-mini (3B, 1.5GB VRAM)
  • Want best quality? Pick dolphin-mixtral (larger, slower)
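
On recent Ollama versions you can also inspect a downloaded model's parameter count, context length, and quantization before committing to it:

ollama show mistral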

Step 4: Verify Ollama Model

curl http://localhost:11434/api/tags

Should return:

{
  "models": [
    {
      "name": "mistral:latest",
      "size": 3800789248
    }
  ]
}

Your model is downloaded and ready.
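
If you want, test inference directly against Ollama before involving Hermes at all. This hits the standard /api/generate endpoint with streaming disabled:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Say hello in five words.",
  "stream": false
}'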

Step 5: Configure Hermes for Ollama

hermes setup

When prompted:

Choose LLM provider: local
Ollama endpoint: http://localhost:11434
Model name: mistral (or your chosen model)

That’s it. Hermes is now connected to your local Ollama.

Step 6: Test It Works

hermes

You should see the CLI prompt. Type:

What is machine learning?

Here's what happens:

  1. Hermes sends your question to Ollama
  2. Ollama runs inference locally
  3. Ollama returns the answer to Hermes
  4. Hermes displays it

Completely local. Completely private.
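
Want extra proof nothing left your machine? Watch the Ollama port while a query runs; the only traffic is Hermes talking to localhost:

# Linux/macOS
lsof -i :11434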

Performance Expectations

Mistral (7B):

  • First token: 3-5 seconds
  • Subsequent tokens: 0.5-1 second each
  • Total response: 10-15 seconds

GPU-accelerated (4GB VRAM):

  • First token: 0.5 seconds
  • Subsequent tokens: 0.1 second each
  • Total response: 3-5 seconds

CPU only:

  • Slower (but usable)
  • Typical: 20-30 seconds per response

GPU makes a huge difference. Consider adding a GPU if budget allows.

Resource Usage

Disk Space:

Mistral 7B:        3.8 GB
Neural-Chat 7B:    4.1 GB
Orca-Mini 3B:      1.8 GB

Memory Usage (RAM + VRAM):

CPU mode:          4-6 GB
GPU mode (VRAM):   3-4 GB (for 7B model)

GPU Options:

  • NVIDIA: CUDA-enabled (recommended)
  • AMD: ROCm-enabled
  • Apple: Metal-accelerated
  • Intel: Limited GPU support
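
To confirm Ollama is actually using the GPU, watch utilization while a response is being generated (NVIDIA example; AMD and Apple have their own tools):

# In another terminal, while a query is running
watch -n 1 nvidia-smi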

Configuring Ollama with Hermes

Your config file (~/.hermes/config.yml):

llm:
  provider: "ollama"
  endpoint: "http://localhost:11434"
  model: "mistral"
  
  # Performance tuning
  streaming: true
  context_window: 4096
  temperature: 0.7
  top_p: 0.9
  
  # Inference settings
  num_predict: 512
  num_threads: 8  # CPU threads to use
  num_gpu: 1      # GPU layers
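
Assuming Hermes forwards these as standard Ollama generation options (Ollama itself spells the thread setting num_thread), you can try the same values directly against the API to find numbers you like before editing the config:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Summarize what a context window is.",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "num_predict": 512,
    "num_thread": 8
  }
}'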

Monitoring Ollama

Check what’s loaded:

curl http://localhost:11434/api/tags

Check memory usage:

# On the machine running Ollama
top    # macOS: Activity Monitor

Monitor response times:

# Simple benchmark
time ollama run mistral "Write a 100 word essay on AI"
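
On recent Ollama versions you can also see which models are loaded in memory right now, and whether they are running on GPU or CPU:

ollama ps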

Switching Models

# Download another model
ollama pull llama2

# Update Hermes config
nano ~/.hermes/config.yml
# Change: model: "llama2"

# Restart Hermes
hermes

Ollama handles model switching. No restart needed for Ollama itself.

Multi-Model Setup

Run multiple models simultaneously:

# ~/.hermes/config.yml
models:
  fast:
    provider: "ollama"
    endpoint: "http://localhost:11434"
    model: "mistral"    # Use for quick tasks
    
  powerful:
    provider: "ollama"
    endpoint: "http://localhost:11434"
    model: "dolphin-mixtral"  # Use for complex tasks

Hermes can automatically pick the right model based on task complexity.

Troubleshooting Ollama

"Connection refused"

# Check if Ollama is running
ps aux | grep ollama

# If not, start it
ollama serve

"Model not found"

# Check downloaded models
ollama list

# Download missing model
ollama pull mistral

"Out of memory”

Solutions:

  1. Use smaller model: orca-mini instead of dolphin-mixtral
  2. Use a quantized model: pull a q4 variant of mistral instead of mistral:latest (exact tags are listed in the Ollama library)
  3. Increase VRAM: Add GPU or reduce other apps

"Very slow responses"

Check:

# Model loaded?
curl http://localhost:11434/api/tags

# GPU accelerated?
# For NVIDIA: Check nvidia-smi
nvidia-smi

# If GPU not used, add to config:
# num_gpu: 1
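
On Linux installs that run Ollama as a systemd service, the server log also reports how many model layers were offloaded to the GPU when a model loads, which is the quickest way to spot CPU-only inference:

# Follow the Ollama service log (Linux systemd install)
journalctl -u ollama -f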

Real-World Scenario: Team Setup

Setup: 10-person team, all using Hermes + Ollama locally.

Architecture:

Each Team Member
  ├─ Hermes Agent (local)
  ├─ Ollama server (local, same machine)
  ├─ Model: Mistral (shared download)
  └─ Memory: Local ~/.hermes/memory/

Results:

  • Cost: $0/month after initial hardware
  • Privacy: 100% (no data leaves the office)
  • Speed: comparable to cloud (with a GPU)

Scaling: If team needs faster inference, upgrade one machine to server hardware, run central Ollama, all machines connect to it.
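
A sketch of that shared-server setup, assuming Ollama's standard OLLAMA_HOST environment variable and the endpoint field from the Hermes config shown earlier (replace <server-ip> with the real address):

# On the shared server: listen on all interfaces instead of localhost only
OLLAMA_HOST=0.0.0.0 ollama serve

# On each team member's machine, in ~/.hermes/config.yml:
#   endpoint: "http://<server-ip>:11434"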

Comparing Local vs. Cloud

Factor         Local Ollama        Cloud (OpenAI)
Cost           $0/month            $10-100/month
Privacy        100% local          Sent to provider
Speed          5-15s/response      3-10s/response
Setup          15 min              Instant
Model choice   Limited (20+)       Many more
Quality        Good (7B models)    Excellent (GPT-4)

For most team use cases, Ollama is worth it.

FAQ

Q: Which model should I use? Start with mistral. It’s fast and good quality.

Q: Do I need a GPU? No, but it helps. 10x faster with GPU (4GB+ VRAM).

Q: Can I switch models mid-conversation? Yes, but Hermes will forget context. Each model is independent.

Q: How much internet bandwidth does it need? None during inference (completely local). The model download is a one-time 3-8 GB.

Q: Can I share one Ollama server across multiple Hermes instances? Yes. Run Ollama on central server, point all Hermes instances to it.

Q: What about quantized models? They’re smaller (faster, less VRAM) but slightly lower quality. q4 is usually best trade-off.


That’s it. Free, private AI inference. No API keys. No monthly bills. Just local power.

Your data, your model, your machine. Completely under your control.