You’ve got Hermes + Ollama running. It works. Now make it fast.
This article covers optimization: model selection, GPU tuning, quantization strategies, and caching so your local AI feels like cloud speed without any API costs.
Part 1: Model Selection Strategy
Not all models are equal. Here's how to pick the right one.
Benchmark: The Model Pyramid
Models ordered from highest quality and power (top) to fastest and lightest on memory (bottom):
| Model | Speed | VRAM |
|---|---|---|
| Dolphin-Mixtral | Very slow | 50 GB ❌ |
| Llama2-70B | Slow | 40 GB |
| Mistral-7B | Medium | 4 GB ✅ (sweet spot) |
| Orca-Mini-3B | Fast | 1.5 GB |
| TinyLlama-1B | Very fast | 0.5 GB |
Real Benchmarks (Tokens/Second)
| Model | CPU | GPU (4GB) | Quality |
|---|---|---|---|
| Mistral 7B | 1 tok/s | 15 tok/s | Excellent |
| Neural-Chat 7B | 0.8 tok/s | 12 tok/s | Good |
| Orca-Mini 3B | 2 tok/s | 20 tok/s | Good |
| TinyLlama 1B | 5 tok/s | 30 tok/s | Okay |
Recommendation:
- Team/production: Mistral 7B (best balance)
- Speed priority: Orca-Mini 3B
- Quality priority: Dolphin-Mixtral (if you have the VRAM; it needs ~50 GB)
- Budget priority: TinyLlama 1B (still decent)
Task-Specific Model Selection
# ~/.hermes/config.yml
models:
  quick_lookup:
    model: "orca-mini:3b"    # ~3 seconds
    use_for: ["simple_qa", "lookup"]
  general:
    model: "mistral:7b"      # ~10 seconds
    use_for: ["general_tasks", "conversations"]
  complex:
    model: "dolphin-mixtral" # ~30 seconds
    use_for: ["reasoning", "coding", "analysis"]
Hermes can auto-select based on complexity.
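The selection idea is easy to sketch in shell. This is a hypothetical illustration, not Hermes' actual heuristic; the word-count threshold and keyword list are arbitrary assumptions, and the model names come from the config above:
prompt="$1"
words=$(printf '%s' "$prompt" | wc -w)
if [ "$words" -lt 15 ]; then
  model="orca-mini:3b"      # short lookups go to the fast model
elif printf '%s' "$prompt" | grep -qiE 'code|debug|analy[sz]e|reason'; then
  model="dolphin-mixtral"   # heavy keywords route to the powerful model
else
  model="mistral:7b"        # everything else takes the balanced default
fi
ollama run "$model" "$prompt"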
Part 2: GPU Optimization
GPU makes a 10-15x difference in speed. Here’s how to maximize it.
NVIDIA GPU (CUDA)
# Check if NVIDIA GPU detected
nvidia-smi
# Install the CUDA toolkit if needed (Linux; Debian/Ubuntu shown)
sudo apt install nvidia-cuda-toolkit
# Other distros: follow NVIDIA's official install docs
# (Modern macOS has no CUDA support; Apple Silicon uses Metal, covered below)
Ollama config:
ollama:
  gpu_layers: 20    # offload the first 20 layers to the GPU
  gpu_memory: 4096  # MB of VRAM to allocate
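To experiment outside Hermes, Ollama's own HTTP API exposes the number of GPU-offloaded layers as the num_gpu option; the layer count here is just an example value:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 20 }
}'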
AMD GPU (ROCm)
# Install ROCm
# AMD: Follow amd.com ROCm docs
# Test detection
rocm-smi
Apple GPU (Metal)
# Metal acceleration (automatic on Apple Silicon)
# Just works, no config needed
# M1/M2/M3 chips get automatic GPU boost
Measuring GPU Impact
# Before GPU tuning
time ollama run mistral "Write 500 words on AI"
# ~30 seconds
# After GPU tuning
# ~5 seconds (6x faster!)
Part 3: Quantization Strategy
Quantization stores model weights in fewer bits, reducing model size dramatically with minimal quality loss.
Quantization Levels
Q2 (Extremely Quantized)
Size: 1.9 GB
Speed: Blazing fast
Quality: Lower
Use: Low-resource devices
Q4 (Balanced) ⭐
Size: 3.8 GB
Speed: Fast
Quality: Very good
Use: Most deployments
Q8 (Minimal Quantization)
Size: 6.5 GB
Speed: Slower
Quality: Excellent
Use: Quality-critical tasks
FP16 (No Quantization)
Size: 13 GB
Speed: Slowest
Quality: Perfect
Use: When VRAM unlimited
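These sizes are just bits-per-weight arithmetic: size ≈ parameter count × bits per weight ÷ 8, plus some overhead because real formats keep a few tensors at higher precision. For a 7B-parameter model: FP16 is 7 × 16 ÷ 8 ≈ 14 GB, Q8 ≈ 7 GB, Q4 ≈ 3.5 GB, and Q2 ≈ 1.75 GB, which lines up with the measured sizes below.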
Real Metrics
| Model | Quantization | Size | Speed | Quality |
|---|---|---|---|---|
| Mistral | FP16 | 13 GB | 1 tok/s | 10/10 |
| Mistral | Q8 | 6.5 GB | 2 tok/s | 9.8/10 |
| Mistral | Q4 | 3.8 GB | 5 tok/s | 9.5/10 |
| Mistral | Q2 | 1.9 GB | 15 tok/s | 8.5/10 |
Recommendation: Q4 is the sweet spot (9.5/10 quality at 5x the speed of FP16).
Using Quantized Models
# Download quantized versions
ollama pull mistral:q4 # ~3.8 GB, Q4 quantization
ollama pull mistral:q2 # ~1.9 GB, Q2 quantization
# (Exact tag names vary by model; check the model's page on ollama.com
# for the quantizations actually published)
# Configure Hermes
hermes setup
# Choose: mistral:q4
Part 4: Caching & Response Optimization
Smart caching saves time and compute; identical requests shouldn't hit the model twice.
Prompt Caching
# ~/.hermes/config.yml
inference:
  cache_prompts: true
  cache_ttl: 86400         # 24 hours
  cache_backend: "redis"   # or "local"
Example:
User 1: "Summarize the Hermes docs"
→ Hermes generates summary (10 seconds)
→ Stores in cache
User 2 (5 minutes later): "Summarize the Hermes docs"
→ Hermes retrieves from cache (instant)
→ Saves 10 seconds, same result
Caching works great for repeated requests.
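For intuition, here is a minimal file-based sketch of the same idea. The cache path is hypothetical and the commands use GNU coreutils syntax (sha256sum, stat -c); Hermes' redis backend does this more robustly:
prompt="Summarize the Hermes docs"
key=$(printf '%s' "$prompt" | sha256sum | cut -d' ' -f1)
cache="$HOME/.hermes/cache/$key"
mkdir -p "$HOME/.hermes/cache"
if [ -f "$cache" ] && [ $(( $(date +%s) - $(stat -c %Y "$cache") )) -lt 86400 ]; then
  cat "$cache"                                 # hit: instant, no inference
else
  ollama run mistral "$prompt" | tee "$cache"  # miss: generate and store
fi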
Response Chunking
Instead of waiting for full response, stream tokens:
inference:
  streaming: true
  chunk_size: 50 # send 50 tokens at a time
User feels instant feedback (first token in 1s) instead of waiting for full response (10s).
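You can watch this behavior directly against Ollama's HTTP API, which streams by default:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write 500 words on AI"
}'
# Prints one small JSON chunk per generated token as it arrives,
# instead of a single response at the end.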
Context Window Optimization
inference:
  context_window: 4096 # balance: 2048 = fast, 8192 = accurate
Larger contexts remember more but run slower; the 4096 default is a good middle ground.
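The matching knob on the Ollama side is the num_ctx option, settable per request:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Summarize our discussion so far",
  "options": { "num_ctx": 4096 }
}'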
Part 5: Batch Processing
Process multiple requests efficiently.
# Slow (separate requests, 30 seconds each)
hermes "Question 1"
hermes "Question 2"
hermes "Question 3"
# Fast (batched, 30 seconds total)
hermes --batch << EOF
Question 1
Question 2
Question 3
EOF
Batching reduces per-request overhead.
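Much of that per-request overhead is model load time. Ollama keeps a model resident for about five minutes after a request and exposes this as the keep_alive parameter, so pinning the model is another easy win for bursty workloads:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "keep_alive": "30m"
}'
# With no prompt, this simply preloads the model and keeps it
# in memory for 30 minutes.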
Part 6: Production Ollama Deployment
Kubernetes Deployment
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
data:
  ollama-config.yaml: |
    models:
      - name: mistral:q4
        preload: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              memory: "4Gi"
              nvidia.com/gpu: "1" # GPU
            limits:
              memory: "8Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
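The claimName above has to exist. A minimal PersistentVolumeClaim sketch; the 50Gi figure is an assumption, so size it to your model set:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi # assumption: room for several quantized models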
Multi-Ollama Load Balancing
Load Balancer (HAProxy)
├─ Ollama Instance 1 (GPU)
├─ Ollama Instance 2 (GPU)
└─ Ollama Instance 3 (GPU)
All Hermes agents → Load Balancer
Distributes requests across multiple Ollama servers.
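A minimal haproxy.cfg sketch of that topology; the backend addresses are placeholders, and a real config also needs a defaults section with timeouts generous enough for long generations:
frontend ollama_front
    bind *:11434
    mode http
    default_backend ollama_back

backend ollama_back
    mode http
    balance leastconn   # long-running generations favor least-connections
    server ollama1 10.0.0.11:11434 check
    server ollama2 10.0.0.12:11434 check
    server ollama3 10.0.0.13:11434 check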
Part 7: Monitoring & Observability
Track performance over time.
Key Metrics
# Response time
hermes metrics latency
# Output: p50=2.3s, p99=8.1s
# Token generation speed
hermes metrics throughput
# Output: 5.2 tokens/second
# Model usage
hermes metrics models
# Output: mistral: 450 uses, dolphin: 50 uses
Prometheus Export
# prometheus.yml
scrape_configs:
  - job_name: 'ollama'
    static_configs:
      - targets: ['localhost:11434']
    metrics_path: '/metrics'
Note: stock Ollama may not expose a /metrics endpoint; if yours doesn't, scrape Hermes' metrics_port (see the production config below) or put an exporter in front. Visualize in a Grafana dashboard.
Cost Calculator
If you're comparing to a hosted API:
Ollama:
- Hardware: $500 one-time
- Electricity: $30/month
- Year 1 total: $500 + 12 × $30 = $860
Hosted API (100 requests/day, e.g. Mistral via OpenRouter at $0.15/request):
- $0.15 × 100 × 365 = $5,475/year
Year 1 savings: $5,475 − $860 = $4,615, and the $500 of hardware pays for itself in under two months.
Part 8: Real-World Production Config
# ~/.hermes/config.yml (Production, Ollama-optimized)
llm:
  provider: "ollama"
  endpoint: "http://ollama-internal.company.com:11434"

  # Model strategy
  models:
    fast:
      name: "orca-mini:q4"
      use_for: ["quick_lookup", "simple_qa"]
    default:
      name: "mistral:q4" # best overall
      use_for: ["general_tasks", "reasoning"]
    powerful:
      name: "dolphin-mixtral:q8"
      use_for: ["complex_analysis", "coding"]

  # Performance tuning
  streaming: true
  batch_mode: true
  chunk_size: 50
  context_window: 8192

  # Caching
  cache_prompts: true
  cache_ttl: 86400

  # GPU tuning (NVIDIA)
  gpu_layers: 20
  gpu_memory: 4096

# Multi-platform
platforms:
  discord:
    learn_from_conversations: true
  slack:
    learn_from_messages: true
  telegram:
    learn_from_chats: true

# Monitoring
monitoring:
  prometheus: true
  metrics_port: 9090
  track_latency: true
Tuning Checklist
- Download an appropriately quantized model (Q4 for balance)
- Enable GPU acceleration (gpu_layers: 20)
- Enable streaming responses (streaming: true)
- Enable prompt caching (cache_prompts: true)
- Set the context window (8192 for complex tasks)
- Enable batch processing (batch_mode: true)
- Monitor latency (hermes metrics latency)
- Measure tokens/second (hermes metrics throughput)
- Load test to simulate real usage (see the sketch below)
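A crude concurrency smoke test against Ollama: fire 8 parallel requests and compare the reported times against your latency targets (the prompt is an arbitrary placeholder):
for i in $(seq 1 8); do
  ( time curl -s http://localhost:11434/api/generate \
      -d '{"model":"mistral","prompt":"Ping","stream":false}' > /dev/null ) &
done
wait   # all 8 finish; check the slowest time against your P99 target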
Performance Targets
After optimization:
- First token: < 2 seconds
- Average speed: 5+ tokens/second
- P99 latency: < 10 seconds
- Concurrency: 4-8 simultaneous requests
If not meeting targets, revisit GPU/quantization settings.
FAQ
Q: What’s the best model for Hermes? Mistral Q4. 3.8GB, fast, high quality, widely available.
Q: Should I always use GPU? Yes, if you have one. 10x speed improvement.
Q: Is Q4 good enough for production? Absolutely. Imperceptible quality loss compared to FP16.
Q: How do I know if GPU is actually being used?
Run nvidia-smi while Ollama is generating. GPU % should be high.
Q: Can I run multiple models simultaneously? Yes, but they compete for VRAM. Run largest on GPU, others on CPU.
Q: What about latency spikes under load?
Reduce max_concurrent or add more instances behind the load balancer.
What to Read Next
- Basic Ollama Setup — If you haven’t set up yet
- Advanced Hermes Config — Multi-platform scaling
- Performance Troubleshooting — If optimization doesn’t help
Optimized Ollama + Hermes feels like cloud speed, costs like self-hosted, and keeps your data completely private.
Many times faster, orders of magnitude cheaper per request. That's the optimization game.