RAG 101: what it is and why it matters
If you’ve spent any time around AI conversations in the last two years, you’ve heard the acronym RAG. It shows up in product pitches, architecture diagrams, and job postings. But if you’ve never built one — or never needed to — the term can feel like insider shorthand for something that should be simpler to explain.
It is simpler to explain. So let me do that.
The problem RAG solves
Large language models like Claude or GPT are trained on vast amounts of public text. They know a lot about the world in general. What they don’t know is your company’s internal documentation, your product specs, your customer support history, or the policy document that was updated last Tuesday.
You could fine-tune a model on your data, but that's expensive and slow, and it has to be repeated every time the data changes. You could paste everything into the prompt, but there are limits to how much text a model can process at once — and costs scale with every token.
RAG takes a different approach. Instead of teaching the model your data, you fetch the relevant pieces at the moment the question is asked and hand them to the model alongside the question.
RAG doesn’t make the model smarter. It gives the model the right information at the right time.
That’s the entire idea. The rest is engineering.
How it works, step by step
The acronym stands for Retrieval-Augmented Generation. Three words, three phases.
Retrieval. When a user asks a question, the system searches a knowledge base — your documents, databases, wikis, whatever you’ve indexed — and pulls back the most relevant pieces. This is a search problem. The system needs to find the right paragraphs, not just the right documents.
Augmentation. The retrieved text gets combined with the user’s question into a single prompt. The model now has context it didn’t have before: specific, current information from your own sources. This is the “augmented” part — the model’s knowledge is augmented with external data at query time.
Generation. The model reads the combined prompt and generates an answer grounded in the retrieved context. Because it has the actual source material in front of it, the answer can be specific, accurate, and traceable back to real documents.
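The three phases above can be sketched in a few lines of code. This is a toy illustration, not a production design: the retriever scores documents by simple word overlap (real systems use vector or hybrid search), the knowledge base is three made-up strings, and the model call is a placeholder rather than a real API.

```python
# Toy sketch of the three RAG phases: retrieve, augment, generate.

KNOWLEDGE_BASE = [
    "Parental leave in Germany is 14 months when shared between parents.",
    "Passwords are reset from the account settings page in the portal.",
    "Invoices are generated on the first business day of each month.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Phase 1: rank documents by word overlap with the question (toy scorer)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def augment(question: str, context: list[str]) -> str:
    """Phase 2: combine the retrieved text and the question into one prompt."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{ctx}\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    """Phase 3: placeholder for a call to an actual model API."""
    return f"[model answers from a prompt of {len(prompt)} characters]"

question = "How long is parental leave in Germany?"
context = retrieve(question, KNOWLEDGE_BASE)
answer = generate(augment(question, context))
```

Even in this toy version, the structure is visible: the model never sees the whole knowledge base, only the pieces the retriever judged relevant to this question.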
Think of it like a researcher who doesn’t memorise every book in the library. Instead, they know how to find the right shelf, pull the right chapter, and synthesise an answer from what they’ve read. The quality of the answer depends on the quality of what they found.
Why not just use a bigger context window?
Modern models can process hundreds of thousands of tokens. So why not dump all your documents into the prompt and skip the retrieval step entirely?
Three reasons.
Cost. Every token you send costs money. Sending your entire knowledge base with every question turns a cheap API call into an expensive one — and it adds up fast at scale.
Latency. More input means slower responses. Users asking simple questions shouldn’t wait while the model reads through thousands of irrelevant pages.
Accuracy. This one is counterintuitive, but more context can actually hurt. When a model has to find the relevant needle in a massive haystack of text, it sometimes latches onto the wrong section or blends contradictory information. Focused retrieval — giving the model just the relevant pieces — tends to produce better answers than giving it everything.
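The cost argument is easy to make concrete with back-of-envelope arithmetic. All the numbers below are hypothetical — the per-token price, the knowledge-base size, and the query volume are illustrative, not real pricing — but the shape of the comparison holds at any realistic values.

```python
# Illustrative comparison: sending the whole knowledge base with every
# question vs. sending only a few retrieved chunks. Numbers are made up.

PRICE_PER_1K_INPUT_TOKENS = 0.003   # hypothetical dollars per 1k input tokens
KB_TOKENS = 500_000                 # entire knowledge base
CHUNK_TOKENS = 2_000                # a handful of retrieved passages
QUESTIONS_PER_DAY = 10_000

def daily_cost(tokens_per_question: int) -> float:
    """Input-token cost per day for a given prompt size."""
    return QUESTIONS_PER_DAY * tokens_per_question / 1000 * PRICE_PER_1K_INPUT_TOKENS

full_dump = daily_cost(KB_TOKENS)      # ~$15,000/day
retrieval = daily_cost(CHUNK_TOKENS)   # ~$60/day
```

Under these assumptions, retrieval is 250× cheaper per day — and the gap widens as the knowledge base grows, because the retrieved-chunk size stays roughly constant while the full dump does not.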
Retrieval is not a workaround for limited context. It’s a better architecture for accuracy, cost, and speed — even when context windows keep growing.
Where RAG shows up in the real world
The pattern is everywhere once you know what to look for.
Customer support. A user asks a question in a chatbot. The system retrieves relevant help articles and the model answers based on actual documentation instead of guessing. When the docs get updated, the answers improve automatically — no retraining needed.
Internal knowledge. An employee searches “what’s our parental leave policy in Germany?” across a fragmented intranet. RAG retrieves the right HR document and returns a direct answer with a source link.
Code assistance. A developer asks how a particular API works. The system retrieves the current API documentation and generates an answer that reflects the latest version — not the version the model was trained on.
Legal and compliance. A question about regulatory requirements pulls from the actual regulatory text and internal compliance guidelines, producing answers that are traceable and auditable.
The common thread: the data already exists, it changes over time, and people need answers from it without becoming search experts.
What makes a RAG system good or bad
Not all RAG implementations are equal. The difference between a demo and a production system usually comes down to retrieval quality.
A good system retrieves the right information for the question — not just text that looks similar, but text that actually answers what was asked. A bad system returns vaguely related chunks, and the model confidently generates an answer from the wrong material. The user sees a fluent, well-structured response and has no reason to doubt it.
The dangerous failure mode isn’t silence. It’s a plausible-sounding answer built on the wrong foundation.
This is why RAG is an engineering discipline, not a plug-and-play feature. How you split documents into chunks, how you index them, how you search, how you rank results — these decisions compound. Get them right and the system feels like it genuinely understands your data. Get them wrong and it’s an expensive autocomplete.
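To make one of those compounding decisions concrete, here is a minimal sketch of the first one: splitting documents into chunks. This fixed-size splitter with overlap is one of the simplest possible strategies — the overlap means a sentence cut at a boundary still appears whole in the neighbouring chunk. Production systems often split on headings, paragraphs, or sentences instead; the point is only that the choice is deliberate, not automatic.

```python
# Minimal fixed-size chunker with overlap (word-based, for illustration).
# size and overlap are measured in words; real systems usually count tokens.

def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = size - overlap  # how far the window advances each time
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```

Every parameter here is a retrieval-quality decision in disguise: chunks too small lose context the model needs, chunks too large dilute the search signal, and too little overlap cuts answers in half at chunk boundaries.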
If you want to go deeper on the retrieval side, Darbit wrote about the specific failure modes and what actually fixes them.
Why this matters now
AI is moving from novelty to infrastructure. The organisations pulling ahead aren’t the ones with the fanciest models — they’re the ones connecting models to their own data effectively. RAG is the most practical, most proven pattern for doing that.
It doesn’t require retraining models. It doesn’t require giving a third party your data. It works with the documents and systems you already have. And when those documents change, the system adapts without intervention.
RAG is how AI stops being a general-purpose oracle and starts being useful with your specific knowledge.
Exploring how RAG could work with your data? Interlusion builds retrieval pipelines that connect AI to what your organisation actually knows — from first prototype to production. Let’s talk.