Retrieval-Augmented Generation, also known as RAG, is a way to provide specific context to a Large Language Model when sending your prompt. I recently put out a video on RAG and in my research I found most RAG content out there to be either:

  • Overly technical
  • Really long

So if you prefer text over video and want a concise, not overly technical overview, this one is for you! We'll dive into what RAG is, how it works, and a few examples with some beautiful flow charts. Plus, we'll go over how to set up a very simple RAG pipeline with just a few lines of (open-source) code.

Hey everyone, how's it going? We're going to try and do a sub-five-minute RAG overview, and I already stumbled a little bit, so let's get right into it.

This is going to be focused on a non-technical audience — you don’t have to be an engineer to understand what’s going on. I’m certainly not. That’s really the main point here. We’ll look at a little bit of code just to give you an understanding, and I’ll share a lot of freebies throughout.

What is RAG?

RAG stands for Retrieval-Augmented Generation. It’s a way to give LLMs additional knowledge — whether that’s up-to-date news or internal docs that aren’t part of the training set — in addition to your prompt.

Once LLMs are trained, they're kind of frozen. They don't get continually updated with new information. That's why we need to give it to them. For example, public companies file quarterly reports (10-Qs) and annual reports (10-Ks). If we wanted to know about Google's last quarter, we'd send a prompt like "How was last quarter?" and pass in the relevant filing data.

RAG lets us give the model additional knowledge by retrieving relevant context from a vector store — rather than just pasting it into the prompt.

How It Works

Let’s say a user asks: “How much revenue did Google have in their cloud business?” That question is translated into an embedding — basically, a numerical vector representation of the text.

Embeddings exist in a vector space, where related words are placed near each other. For instance, the word “dog” will be close to “dogs,” “cat,” etc.

We start with a set of data — like all of Google’s 10-Ks — and chunk it into smaller pieces. Each chunk is embedded and stored in a vector database (like Pinecone).

When the user asks their question, we search the vector database for the most semantically similar chunks and return those. The LLM receives both the retrieved chunks and the original query as part of the final prompt.
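
If it helps to see that in code, here's a minimal sketch of the retrieval step in Python. It assumes the OpenAI Python SDK for the embeddings; the model name, chunk texts, and question are placeholders, not pulled from a real filing.

```python
# A minimal sketch of the retrieve step, assuming the OpenAI Python SDK (>= 1.0)
# and an OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Turn each text into a numerical vector using an embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

# Pretend these are chunks pulled from Google's filings.
chunks = [
    "Google Cloud revenue was $X billion for the quarter.",
    "Operating income for Google Services grew year over year.",
    "The board authorized an additional share repurchase program.",
]
chunk_vecs = embed(chunks)

question = "How much revenue did Google have in their cloud business?"
query_vec = embed([question])[0]

# Cosine similarity: chunks whose vectors point in a similar direction score higher.
scores = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
top_chunk = chunks[int(np.argmax(scores))]
print(top_chunk)  # -> the cloud-revenue chunk, which then goes into the prompt
```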

Why It’s Useful

  • Customer Support: If a conversation gets too long for the context window, RAG can help retrieve only the relevant information.
  • Internal Knowledge Access: Instead of feeding the model every doc your company has, RAG can dynamically fetch only the docs relevant to the question (e.g. just an analytics report, not your HR policies).
  • Real-Time Updates: You can update your vector database in real-time by adding more documents — e.g. new 10-Ks — as they become available.

If you just sent the question to the LLM without RAG, it wouldn’t know about Google’s most recent 10-K — because it wasn’t in the training data. But with RAG, you can retrieve the relevant info and include it as context.

Quick Code Example

We'll use LlamaIndex to stand up a basic RAG pipeline. Example code is below this video, but here's the high-level process (a rough sketch follows the list):

  1. Import modules from LlamaIndex.
  2. Use a reader to load your docs (in this case, PDFs stored in a data/ folder).
  3. Create a vector store index using OpenAI’s embedding model.
  4. Query the index using semantic similarity search to get relevant chunks.
  5. Return a response from the LLM based on those chunks and the user’s query.
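
As a rough sketch of those five steps, assuming llama-index 0.10 or later and some PDFs sitting in a data/ folder (the exact code from the video is in the linked repo):

```python
# A rough sketch of the five steps above, assuming llama-index >= 0.10
# (pip install llama-index) and an OPENAI_API_KEY set, since LlamaIndex
# defaults to OpenAI for embeddings and the LLM.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1-2. Load the PDFs sitting in the data/ folder.
documents = SimpleDirectoryReader("data").load_data()

# 3. Chunk, embed, and index them in an in-memory vector store.
index = VectorStoreIndex.from_documents(documents)

# 4-5. Embed the question, grab the most similar chunks, and let the LLM answer.
query_engine = index.as_query_engine()
response = query_engine.query("What are the key takeaways from these papers?")
print(response)
```

If the response comes back empty or generic, a quick sanity check is to print len(documents) and confirm the reader actually picked up your files.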

Something I ran into: asking the model for “two takeaways from each article” didn’t really work well. The model started hallucinating based on papers from its training data rather than sticking to the PDFs we loaded. That’s an important reminder — your pipeline needs to be aligned with the type of queries you expect.

We’ll do another video soon on whether or not you should use RAG at all. But for now, this is your quick overview. The code and examples are linked below — go try setting up your own pipeline, test it out, and see if it works for you.

See you next time!

Understanding RAG

Retrieval Augmented Generation, or RAG, is a method to optimize the output from an LLM by supplementing its static knowledge with external, up-to-date information.

LLMs are powerful but, once trained, they don’t automatically update with new information. RAG fixes that by retrieving relevant data (think news articles, or company 10-Ks) and feeding it to the model alongside your prompt.

How RAG Works: A step-by-step overview

The graphic looks more complex than it actually is. Let’s break it down.

A diagram of the flow of data when using retrieval-augmented generation

1. Data processing and embedding generation

  • Data Source (blue square): Your process begins with raw documents—this could be PDFs, web pages, or other text data.
  • Chunking: Large documents are split into smaller, manageable chunks. This makes it easier for the system to handle and retrieve specific pieces of information.
    There are many different ways to chunk data, and it can be tricky. For example, if we added some documentation as a data source and half of a code example ended up in chunk 1 and the other half in chunk 2, that would cause issues (see the chunking sketch after this list).
  • Embedding Conversion: Each chunk is transformed into a numerical vector (an embedding) that captures its semantic meaning. What’s an embedding? We’ll touch on this more later, but you can also learn more about it here: A Beginner's Guide on Embeddings and Their Impact on Prompts
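
At its simplest, chunking can just be a sliding window over the text. Here's a toy sketch; the chunk size and overlap values are arbitrary choices, not recommendations:

```python
# A toy fixed-size chunker with overlap, so text near a chunk boundary
# (like the split code example mentioned above) shows up in two chunks.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

document = "Imagine this is the full text of a PDF or web page. " * 50
print(len(chunk_text(document)))  # number of chunks produced
```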

2. Vector database role

Once converted, these embeddings are stored in a vector database—a specialized storage system (purple cylinder). This database lets the system quickly search for and retrieve the chunks that are most relevant to a given query.
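
As a concrete (hedged) sketch of what that storage step can look like, here's the same idea using Chroma, an open-source vector database that runs locally; Pinecone and similar services expose the same kind of add/query operations. The collection name and chunk texts are made up:

```python
# A minimal sketch using Chroma (pip install chromadb) as the vector database.
# By default Chroma embeds the documents for you with a built-in model.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep data
collection = client.create_collection("filings")

# Store chunks: each gets an id, its text, and (optionally) metadata.
collection.add(
    ids=["10k-2023-chunk-1", "10k-2023-chunk-2"],
    documents=[
        "Google Cloud revenue was $X billion for the year.",
        "Total traffic acquisition costs increased year over year.",
    ],
    metadatas=[{"filing": "10-K 2023"}, {"filing": "10-K 2023"}],
)

# Retrieve the chunks most similar to a question.
results = collection.query(
    query_texts=["How much revenue did the cloud business generate?"],
    n_results=2,
)
print(results["documents"][0])
```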

3. User query transformation & retrieval

  • Query transformation: When a user asks a question, that query is also converted into an embedding.
An example of translating a word into a vector embedding
Example: The English word "angry" translated into a vector embedding.

  • Retrieval: The system then searches the vector database for chunks whose embeddings are similar to the query’s embedding—meaning they are close to each other in the vector space. These relevant chunks then provide context for the LLM.

For reference, here is a visualization of the word “dog” in a vector space, showing how semantically related words cluster around it.

The word "dog" being represented in a vector space

Here are the closest points in the vector space, showing the words that appear most similar to “dog.”

Points closest to "dog" in the vector space, in a table
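
You can reproduce this kind of comparison yourself by embedding a few words and checking their cosine similarities. A small sketch, again assuming the OpenAI SDK; the word list and model name are just illustrative:

```python
# Compare a few words to "dog" by cosine similarity of their embeddings.
import numpy as np
from openai import OpenAI

client = OpenAI()
words = ["dog", "dogs", "cat", "banana", "carburetor"]

resp = client.embeddings.create(model="text-embedding-3-small", input=words)
vecs = np.array([item.embedding for item in resp.data])
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize once

for word, vec in zip(words[1:], vecs[1:]):
    print(f"dog vs {word}: {float(vecs[0] @ vec):.3f}")
# Expect "dogs" and "cat" to score noticeably higher than "carburetor".
```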

4. Send it all to the LLM

The retrieved chunks—converted back to text, along with the original query—are then fed into the LLM. This enriched context enables the model to generate responses that are both more accurate and comprehensive.
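
That last step is mostly string assembly. A hedged sketch using the OpenAI chat API, where the model name, instruction wording, and placeholder chunks are assumptions rather than anything prescribed:

```python
# Stuff the retrieved chunks and the user's question into one prompt.
from openai import OpenAI

client = OpenAI()

retrieved_chunks = [
    "Google Cloud revenue was $X billion for the quarter.",
    "Google Cloud operating income improved year over year.",
]
question = "How much revenue did Google have in their cloud business?"

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n\n".join(retrieved_chunks) + "\n\n"
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```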

Simple RAG workflow

Put more succinctly, here is how RAG works.

  1. Data → Chunking → Embeddings: Raw data is transformed into embeddings, which are stored in the vector database.
  2. Query Processing: A user query is converted into an embedding, which is then used to retrieve the most relevant data chunks by comparing the similarity between embeddings.
  3. LLM Integration: The LLM receives both the query and the retrieved context to generate a final output.

Retrieval Augmented Generation example: Up-to-date financial services assistant

Let’s run through a quick example.

Imagine you've built a chatbot or app that acts as a financial assistant. This assistant will need up-to-date information on publicly traded companies, from sources like 10-Qs (quarterly filings) and 10-Ks (annual reports). Here's how you could use RAG to make this efficient.

Flow breakdown

  • Processing the filings: All filings are chunked, converted into embeddings, and stored in a vector database. We keep adding new filings as they are released (see the sketch after this list).
  • Targeted Retrieval: When the user asks a question related to recent financial information about a company, the system retrieves only the relevant chunks from the filings.
  • Enhanced Response: The LLM then provides an answer based on the latest information, which it wouldn't otherwise have access to because of its knowledge cutoff.
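
Keeping the assistant current then comes down to inserting new filings into the existing index as they come out. A rough sketch with LlamaIndex, where the folder and file names are made up:

```python
# Add a newly released filing to an existing LlamaIndex index.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Existing index built from older filings in filings/.
index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("filings").load_data()
)

# A new 10-Q just dropped: load it and insert it without rebuilding everything.
new_docs = SimpleDirectoryReader(input_files=["new_filings/googl-10q-q2.pdf"]).load_data()
for doc in new_docs:
    index.insert(doc)

# The assistant immediately has access to the new quarter.
response = index.as_query_engine().query("How did Google Cloud do last quarter?")
print(response)
```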

Setting up a simple RAG pipeline in 5 minutes

Part of what drew me to RAG was wanting to stand up a quick pipeline to test it out. I was able to do so in just a few lines of code and had it running in less than 20 minutes.

Here is a link to the repository that has the code, and I walk through it in my video here. But here is a general overview of the pipeline, with a small configuration sketch after the steps.

  1. Data Loading: Load your documents using a reader (using LlamaIndex).
  2. Chunking & Embedding: Break down the docs into chunks and convert to embeddings.
  3. Storage: Save these embeddings in a vector database for fast retrieval.
  4. Query Handling: Convert user queries into embeddings and search the vector database.
  5. Response Generation: Pass both the query and retrieved data to the LLM to generate a response.
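
If you want to go one step past the defaults, you can also choose the embedding model and how many chunks get retrieved per query. A hedged sketch, assuming llama-index 0.10+ with its OpenAI embeddings package installed; the model name and top-k value are just examples:

```python
# Configure the embedding model and the number of retrieved chunks.
# Assumes `pip install llama-index llama-index-embeddings-openai`.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# Use a specific OpenAI embedding model for the chunk/query embeddings (steps 2 and 4).
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve the 3 most similar chunks per query before generating the response (step 5).
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("Summarize the main findings in these documents."))
```

Raising or lowering similarity_top_k is one of the easiest knobs to turn when the answers feel like they are missing context or pulling in irrelevant chunks.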

Conclusion

RAG is just one method you can use to optimize LLM performance. I personally am not very bullish on RAG. As context windows get larger, and models become faster and cheaper, I think you'll see more use of Cache-Augmented Generation (CAG), otherwise known as "shove it in the context window". But that analysis is for another time and another blog post (soon)!

Dan Cleary
Founder