Context windows are getting longer and longer. A few weeks ago, Google announced that Gemini 2.5 Pro would support a 2-million-token context window. Just a week later, Meta announced that their new Llama 4 Scout model would support a whopping 10-million-token context window. The trend toward ever-larger context windows is clear.

As context windows expand, what does this mean for Retrieval-Augmented Generation (RAG)? If models can take in millions of tokens at once, can we simply load all of our data into the prompt and use Cache-Augmented Generation (CAG) to cache that context upfront?

Hey everyone, today we’re diving into something big: Google’s new model, Gemini 2.5 Pro, might have just completely killed RAG as we know it — and I’ve got the data to back it up.

So for context, Gemini 2.5 Pro currently has a **1 million token context window**, with plans to increase that to **2 million tokens** soon. For reference, that’s like cramming **30 copies of The Great Gatsby** into the context window — each one about 60K tokens. Just imagine that — 30 entire books worth of content in one shot.

Typically, this is how Retrieval-Augmented Generation (RAG) works: you have a bunch of data — internal docs, financial reports, whatever — and you **chunk** it into smaller parts, turn those into **embeddings**, and store them in a **vector database**. When a user asks a question, that query is also embedded and compared against your vector store. The system then pulls the top matching chunks and feeds them to the model alongside the user query as context.

But with models like Gemini 2.5 Pro, we’re starting to ask — do we even need to do that anymore?

Enter a new approach: **CAG** — **Cache-Augmented Generation**. To simplify: instead of doing all the chunking, embedding, vector DB-ing, and retrieval steps, you just **throw all your data into the prompt** along with the user’s question. Done. You can get smarter about this by filtering — like if a user is asking about Starbucks, don’t send them Walmart data — but generally, you just feed a big blob of relevant context into the model’s window.

Historically, this approach wasn’t feasible because:

  1. It was too **expensive**
  2. It was too **slow**
  3. It simply **wasn’t possible** — older models only had 4K token windows (like GPT-3.5)

But now?

  • **Context windows are bigger**
  • **LLM costs are dropping**
  • **Models are getting faster**
  • And with **prompt caching**, you don’t have to keep reprocessing the same base info

Let’s talk numbers. Recent model pricing data shows a clear trend — Google’s models (like Gemini 1.5 and 2.5) are sitting in the bottom-right quadrant: **cheap and fast**. And in terms of context windows, 2 million tokens is far beyond what most RAG systems were ever built to handle.

A couple of research papers back this up:

  • One titled *"Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks"* showed that models with larger context windows **performed better** than traditional RAG methods in most tests — both in **speed** and **accuracy**.
  • Another from **Databricks** benchmarked large context windows vs. RAG across multiple models. Again, the **long-context** approach either matched or beat RAG in performance, and Google’s Gemini models stood out for how **consistently** they handled large contexts.
  • Some failure cases came up (especially around content policies triggering when the model saw more data), but overall, the long-context strategy held strong.

The big takeaway: **closed-source models like Gemini and Claude** are currently better at using massive context windows than most open-source options, which often degrade in performance as you increase the token count.

So what does this mean? If you’re building with frequently updated, large-scale data (like 10-Ks or customer support logs), RAG still might make sense — especially if the data changes often and you want to update your vector DB in real time. But in most other cases, **just shoving everything into the context window and caching it** is simpler, more efficient, and performs just as well (or better).

Plus, setup time matters. Developers spend way more time building RAG pipelines than it takes to just cache and send a full prompt. That **opportunity cost** is real.

We’ve been testing Gemini 2.5 Pro internally with some “needle-in-a-haystack” experiments — where you bury a small fact inside a massive document — and so far, it’s handled it shockingly well. We’ll share more on that soon.

That’s it for now — let me know what you think. This caching-heavy, full-context prompting approach could really change the way we build apps with LLMs. Especially with more reasoning-capable models, this shift feels inevitable. Talk soon!

The Mechanics of RAG vs. CAG

Let’s quickly level set on RAG and CAG. Also, if you’re looking for a deeper dive on RAG, check out our guide here: Retrieval Augmented Generation for Beginners.

Retrieval-Augmented Generation (RAG)

In a traditional RAG setup, you begin with a collection of documents or files, which are broken down into smaller, manageable chunks. Here’s the typical process (a minimal code sketch follows the list):

  1. Data chunking and embedding: The documents are segmented into smaller chunks. Each chunk is transformed into a vector using an embedding model. These embeddings capture the semantic essence of the text.
  2. Vector database storage: The embeddings are stored in a vector database.
  3. Dynamic retrieval at query time: When a user sends a prompt, the query is also transformed into a vector. The vector database is then searched for the most semantically relevant passages. These passages are retrieved in real time.
  4. Prompt formation: The retrieved snippets are concatenated with the user’s query and sent to the LLM.
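
To make the moving pieces concrete, here’s a minimal, illustrative sketch of that pipeline in Python. The `embed_texts` and `call_llm` functions are hypothetical stand-ins for whatever embedding model and LLM API you use, and the in-memory NumPy array stands in for a real vector database.

```python
import numpy as np

def chunk(text: str, size: int = 500) -> list[str]:
    # 1. Split a document into fixed-size character chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(docs: list[str], embed_texts) -> tuple[list[str], np.ndarray]:
    # 2. Embed every chunk; in production these vectors live in a vector database.
    chunks = [c for doc in docs for c in chunk(doc)]
    vectors = np.array(embed_texts(chunks))  # shape: (num_chunks, dim)
    return chunks, vectors

def retrieve(query: str, chunks: list[str], vectors: np.ndarray,
             embed_texts, k: int = 3) -> list[str]:
    # 3. Embed the query and pull the k most similar chunks (cosine similarity).
    q = np.array(embed_texts([query]))[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str, chunks: list[str], vectors: np.ndarray,
           embed_texts, call_llm) -> str:
    # 4. Concatenate the retrieved snippets with the user's query and send to the LLM.
    context = "\n\n".join(retrieve(query, chunks, vectors, embed_texts))
    return call_llm(f"Context:\n{context}\n\nQuestion: {query}")
```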

Pros and cons of RAG:

  • Pros:
    • Reduces the number of tokens processed in each query, which lowers costs
    • Easy to dynamically update the content in the vector database to handle new data, which is especially useful when the information changes frequently
    • Its modularity makes it easy to test individual pieces
  • Cons:
    • Introduces many extra steps into the system, which brings a number of downsides (more failure points, higher latency, and more setup and maintenance complexity)

RAG workflow diagram

Cache-Augmented Generation (CAG)

CAG takes a different approach by leveraging large context windows and caching mechanisms to supply information to the LLM.

  1. Preloading and pre-computation: All documents are loaded into the model’s extended context window. For open-source models, you can precompute and store these documents in the model’s key-value (KV) cache, effectively ‘freezing’ its internal state.
    On the other hand, when using a proprietary model via an API, you typically don't have direct access to manipulate the KV cache. You'll need to use the provider’s built-in caching mechanisms (see more details on prompt caching here).
  2. Inference without retrieval: When a user sends a prompt, it’s essentially appended to the preloaded context, so the model processes the entire cached context along with the new prompt (a minimal sketch follows below).
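
Here’s what the API-side flavor of this can look like, as a minimal sketch. `call_llm` is a hypothetical stand-in for your provider’s API; in practice you would rely on the provider’s prompt caching so the static document prefix isn’t reprocessed on every request (the open-source, KV-cache flavor works analogously but operates on the model’s internal state directly).

```python
def build_context_prefix(docs: list[str]) -> str:
    # Preloading: concatenate every document into one static block of context.
    # Because this prefix is identical across requests, it's what gets cached.
    return (
        "Answer questions using only the documents below.\n\n"
        + "\n\n---\n\n".join(docs)
    )

def answer(question: str, context_prefix: str, call_llm) -> str:
    # Inference without retrieval: the user's question is simply appended to the
    # preloaded context; no chunking, embedding, or vector search happens here.
    return call_llm(f"{context_prefix}\n\nQuestion: {question}")

# Usage: build the prefix once, reuse it for every query.
# prefix = build_context_prefix(load_all_docs())
# print(answer("What was Q3 revenue?", prefix, call_llm))
```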

CAG workflow

Pros and cons of CAG:

  • Pros:
    • Faster inference times (no retrieval needed)
    • Potentially more robust answers since the model sees the whole context, though this depends on how well the model handles that many tokens in its context window.
    • Simplified system design with no need for a separate retrieval component.
  • Cons:
    • Limited by the context window of whatever model you are using
    • Can be more expensive depending on how often the context updates
    • It’s less dynamic. If the underlying data changes often, you may need to reload the context and recompute the cache, which can be expensive.

Performance Analysis: CAG vs. RAG

Alright, now we’ll dive into some data from the experiments run in this paper: Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks

Experiment setup

The experiments evaluated RAG and CAG approaches on several datasets by gradually increasing the total length of the reference texts. Here are the details:

  • Baseline RAG Implementation: Using the LlamaIndex framework, two retrieval strategies were used:
    • Sparse Retrieval (BM25): Relies on keyword matching.
    • Dense Retrieval (OpenAI Indexes): Uses embedding-based semantic search.
    Both strategies dynamically fetch and concatenate document chunks at query time (a minimal sketch of the BM25 baseline follows this list).
  • CAG Implementation: The entire set of documents is preloaded into the model’s key-value (KV) cache.
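
For a sense of what the sparse-retrieval baseline looks like in practice, here’s a minimal, illustrative sketch using the rank_bm25 package. The corpus and query are made up, and this is not the paper’s exact configuration.

```python
from rank_bm25 import BM25Okapi

# Toy corpus of pre-chunked reference text (illustrative only).
corpus = [
    "Revenue grew 12% year over year in Q3.",
    "The company opened three new distribution centers.",
    "Operating margin declined due to higher logistics costs.",
]

# BM25 works over tokenized text; simple whitespace tokenization for the sketch.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "why did operating margin decline?"
top_chunks = bm25.get_top_n(query.lower().split(), corpus, n=2)
# These top chunks are then concatenated with the query and sent to the model.
print(top_chunks)
```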

Experiment results

A table of experiment results
Results across all experiments and methods tested
  • Overall CAG outperforms RAG in almost every experiment
  • As the size of the reference text increases, the performance gap between CAG and RAG narrows slightly
  • The gap is pretty small in some cases, but the difference between RAG and CAG isn’t just about performance; it’s also about implementation effort, cost, and latency

Three tables comparing generation time
Response Time Comparison on HotPotQA (Seconds)

  • CAG greatly reduces generation time by eliminating retrieval overhead

Token sizes tested listed in a table
The SQuAD and HotPotQA test sets with varying reference text lengths.

One more graph, this time from a different paper: Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach.

The experiment setup is very similar: a few RAG methods are compared against preloading the entire reference corpus, an approach this paper calls "Long Context (LC)" rather than CAG.

A few bar charts side by side showing performance of RAG, CAG, and another method
  • Again, CAG (LC) consistently outperforms RAG
  • CAG is more expensive than RAG on the first request, when the reference corpus is cached, but subsequent requests are less expensive than RAG

Common Failure Modes for RAG

The study also mentioned four common reasons why RAG fails on certain queries.

  • Multi-step Reasoning (A): Queries that require chaining multiple facts.
  • General Queries (B): Broad or vague questions, making it hard for the retriever to pinpoint relevant chunks.
  • Long, Complex Queries (C): Deeply nested or multi-part questions that the retriever struggles to interpret.
  • Implicit Queries (D): Questions requiring a holistic understanding beyond explicit mentions (e.g., inferring causes in narratives).

Conclusion

So which method should you use? It depends on your use case, but I think CAG makes for a good starting point because of its simplicity. One question we didn’t touch on much, though, is how effective models actually are at handling hundreds of thousands, or even millions, of tokens. Subscribe to our newsletter or visit our blog next week to learn more!

Dan Cleary
Founder