RAG is one of those terms that gets thrown around a lot in AI conversations, often without a clear explanation of what it actually does or when it’s worth the effort.
Here’s a plain explanation.
The problem RAG solves
Language models are trained on data up to a cutoff date. After that, they don’t know anything new. They also don’t know anything that was never public — your company’s internal docs, your product’s knowledge base, your customer’s account history.
If you ask Claude or GPT-4 “what’s our refund policy?”, they can’t answer. Not because they’re not smart enough — because they’ve never seen your refund policy.
You have two options:
- Include the relevant information in the prompt — works fine until you have 500 pages of documentation and can’t fit it all
- Retrieve only the relevant parts and include those — this is RAG
What RAG does
RAG stands for Retrieval-Augmented Generation. The name describes exactly what it does:
- Retrieval — find the relevant pieces of information from your knowledge base
- Augmented Generation — include those pieces in the prompt and let the model answer using them
Instead of asking the model “what’s our refund policy?”, you:
- Search your documentation for chunks related to “refund policy”
- Put those chunks in the prompt: “Here is our documentation: [chunk 1, chunk 2]. Based on this, what is our refund policy?”
- The model answers using the documentation you provided
The model doesn’t need to “know” the answer from training. It reads the documentation you give it and answers from that.
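In code, those steps reduce to simple string assembly before the model call; a minimal sketch (the helper name, chunk text, and prompt wording are illustrative, not a fixed recipe):

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that grounds the model in retrieved chunks."""
    docs = "\n".join(f"- {c}" for c in chunks)
    return (
        f"Here is our documentation:\n{docs}\n\n"
        f"Based on this documentation, answer: {question}"
    )

prompt = build_rag_prompt(
    "What is our refund policy?",
    ["Refunds are issued within 30 days of purchase."],
)
print(prompt)
```

The retrieved text and the question travel together in one prompt; the model never needs the answer in its weights.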
A simple example
Without RAG:

```
User: What's the cancellation policy for annual subscriptions?
AI: I don't have information about your specific company's cancellation policy.
```

With RAG:

```
[System retrieves chunk from docs: "Annual subscriptions can be cancelled within 30 days
for a full refund. After 30 days, the subscription runs until the end of the billing period
with no partial refund."]

User: What's the cancellation policy for annual subscriptions?
AI: Annual subscriptions can be cancelled within 30 days of purchase for a full refund.
After 30 days, the subscription will continue until the end of your billing period
and no partial refund will be issued.
```
Same model. Different result, because of what you put in the prompt.
How the retrieval part works
This is where it gets slightly more technical. You can’t just search documentation with keyword search and hope to get the right chunks — “refund” might appear in 50 different places, and you need the relevant ones.
The standard approach: embeddings + vector search.
Embeddings convert text into a list of numbers (a vector) that represents the meaning of the text. Two pieces of text that mean similar things will have similar vectors, even if they use different words.
Vector search finds the vectors most similar to your query vector. So “what’s your cancellation policy?” will find chunks about “subscription termination” and “refund terms” even if they don’t use the word “cancellation.”
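"Similar vectors" is usually measured with cosine similarity. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the numbers below are invented purely to illustrate the geometry):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from an embedding model
cancellation = np.array([0.9, 0.1, 0.3])
termination = np.array([0.8, 0.2, 0.35])
weather = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(cancellation, termination))  # close to 1
print(cosine_similarity(cancellation, weather))      # much lower
```

Texts about cancellation and termination end up near each other; a text about weather does not, even if it happened to share a word.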
The pipeline:

Preparation (one time):

- Split your documents into chunks (e.g., 500 tokens each)
- Send each chunk to an embedding model
- Store the embedding vectors in a vector database

At query time:

- Take the user’s question
- Convert it to an embedding vector
- Find the top N most similar chunks in your vector database
- Put those chunks in the prompt
- Send to the LLM
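Both phases can be sketched end to end with an in-memory index. Here `toy_embed` is a deliberately crude stand-in for a real embedding model (a bag-of-words vector over a tiny fixed vocabulary); real systems call an embedding API and get dense vectors that capture meaning, not just word overlap:

```python
import numpy as np

# Toy stand-in for an embedding model: counts of words from a tiny
# fixed vocabulary, normalized to unit length.
VOCAB = ["annual", "subscriptions", "cancelled", "cancellation", "refund",
         "monthly", "renew", "enterprise", "contracts", "policy"]

def toy_embed(text: str) -> np.ndarray:
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Preparation (one time): embed every chunk, store the vectors
docs = [
    "Annual subscriptions can be cancelled within 30 days",
    "Monthly subscriptions renew automatically",
    "Enterprise contracts are governed by the signed agreement",
]
index = np.stack([toy_embed(d) for d in docs])  # the "vector database"

# Query time: embed the question, rank chunks by cosine similarity
def top_n(question: str, n: int = 2) -> list[str]:
    scores = index @ toy_embed(question)  # rows are unit length
    return [docs[i] for i in np.argsort(scores)[::-1][:n]]

print(top_n("cancellation policy for annual subscriptions"))
# The chunk about annual subscriptions ranks first
```

Swapping `toy_embed` for a real embedding model and the numpy matrix for a vector database gives you the production shape of the same pipeline.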
When RAG actually helps
Private or internal knowledge: Your knowledge base, internal docs, product manuals, support articles. Anything the model can’t have seen in training.
Up-to-date information: News, recent product changes, anything that happened after the model’s training cutoff.
Large knowledge bases: If you have 1000 pages of documentation, you can’t put it all in a prompt. RAG lets you retrieve only the 3-5 pages relevant to the question.
Reducing hallucinations: When the model has relevant source material to reference, it’s less likely to make things up. You can also ask it to cite sources, making it easier to verify.
When RAG doesn’t help (or isn’t needed)
Small, stable knowledge: If your knowledge base is 10 pages and rarely changes, just put it in the system prompt. No need for the retrieval machinery.
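For that small-and-stable case, "just put it in the prompt" is literally string concatenation; a sketch with placeholder docs (the texts below are invented for illustration):

```python
# When the whole knowledge base fits in context, skip retrieval and
# send everything with every request.
full_docs = "\n\n".join([
    "Refund policy: full refund within 30 days of purchase.",
    "Support hours: 9am-5pm CET, Monday to Friday.",
])

system_prompt = (
    "Answer using only the documentation below.\n\n"
    f"Documentation:\n{full_docs}"
)
```

No chunking, no embeddings, no vector database; one less system to maintain.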
Questions that require reasoning, not retrieval: “Explain the trade-offs between SQL and NoSQL” doesn’t need a knowledge base. The model already knows this.
Real-time data: RAG retrieves from a pre-built knowledge base. If you need live data (stock prices, current weather, live inventory), you need tool use / function calling, not RAG.
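The difference shows up in code: RAG pastes stored text into the prompt, while tool use hands the model a function it can call at answer time. A sketch of a tool definition in the Anthropic tools format (the `get_stock_price` tool itself is invented for illustration):

```python
# A tool definition tells the model what it can call and with what
# arguments; your code executes the call and returns the live result.
stock_price_tool = {
    "name": "get_stock_price",
    "description": "Get the current price for a stock ticker symbol.",
    "input_schema": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}
```

RAG answers from what you indexed yesterday; a tool answers from what is true right now.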
Badly chunked knowledge: Chunk size matters. Too small and you lose context; too large and you include irrelevant noise. Getting chunking right is harder than it looks.
The minimal working example (Python)
```python
from anthropic import Anthropic

client = Anthropic()

# Pretend these are your document chunks
chunks = [
    "Annual subscriptions can be cancelled within 30 days for a full refund.",
    "Monthly subscriptions renew automatically. Cancel anytime before renewal.",
    "Enterprise contracts are governed by the signed agreement.",
]

# In a real system, these would be pre-computed and stored in a vector DB.
# Here we fake retrieval with a simple keyword match for illustration.
def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    # Real implementation: embed the query, search the vector DB.
    # Simplified: score chunks by how many query words they contain.
    query_words = query.lower().split()
    scored = []
    for chunk in chunks:
        score = sum(1 for w in query_words if w in chunk.lower())
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def answer_with_rag(question: str) -> str:
    relevant_chunks = retrieve(question, chunks)
    context = "\n".join(f"- {c}" for c in relevant_chunks)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=(
            "Answer the user's question based only on the provided "
            "documentation. If the answer isn't in the docs, say so.\n\n"
            f"Documentation:\n{context}"
        ),
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(answer_with_rag("Can I cancel my annual subscription?"))
# Annual subscriptions can be cancelled within 30 days for a full refund.
```
The realistic complexity
This example is simple. Production RAG is harder:
- Chunking strategy matters — sentence boundaries, semantic chunking, overlap between chunks
- Embedding model quality — better models mean better retrieval
- Reranking — the top-k by vector similarity isn’t always the most relevant; a reranker can improve precision
- Context window management — you can only fit so many chunks; too many and you dilute the signal
- Evaluation — hard to know if your RAG is actually retrieving the right things without systematic testing
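The first point, chunking with overlap, can be sketched concretely. This splits on words for simplicity; production systems usually count tokens and respect sentence or section boundaries (the function and its parameters are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows. Overlap means a fact
    straddling a boundary still appears whole in at least one chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk shares its last `overlap` words with the start of the next one; tuning `chunk_size` and `overlap` against real queries is a large part of the evaluation work listed above.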
Most teams underestimate how much work good RAG takes to get right. The basic setup is quick. Making it accurate for real questions in production is an ongoing project.
The one-sentence version
RAG is just: find the relevant documents, put them in the prompt, ask the model to answer using them. Everything else is engineering details around how to find the right documents efficiently.
If you remember that, the rest is just implementation.
See also: Claude API & Code Cheat Sheet for embedding and tool use API patterns.