
We’ve talked about it before and we’ll talk about it again: small changes in a given prompt can lead to very different outputs (see our article on model sensitivity). Sometimes all it takes is a small piece of irrelevant information to completely throw an LLM off a task. This affects anyone who interacts with LLMs via a chat interface, like ChatGPT, as well as teams building AI features into their products.

Irrelevant content can lead to errors and other unintended consequences.

Two papers set out to investigate this challenge around the impact of irrelevant context on prompts and prompt engineering. The first paper, "System 2 Attention (is something you might need too)", introduced a new prompt engineering method called System 2 Attention (S2A) prompting, which helps the model focus on only the relevant information by regenerating the prompt before processing it.

The second paper, "Large Language Models Can Be Easily Distracted by Irrelevant Context" takes a deeper look at how irrelevant context can directly derail LLM outputs, and which prompt engineering methods can be applied to best avoid this.

Between these two papers, we’ll cover:

  1. What causes LLMs to “get distracted”
  2. What prompt engineering methods you can leverage to avoid distractions

Why models misinterpret prompts

The reason models sometimes fall short when irrelevant context is included in a prompt lies in the attention mechanism used in transformer models. This mechanism enables the LLM to focus on different parts of the prompt, but it doesn’t always correctly discern what is relevant and what is not.

As a result, LLMs can end up overweighting irrelevant context, leading to outputs that aren’t truly aligned with the intentions of the prompt.

System 2 Attention prompting is a prompt engineering method developed to address this flaw at the prompt level, rather than by changing the model itself.

What is System 2 Attention (S2A) prompting

System 2 Attention prompting (S2A) is a prompt engineering method in which the model is first asked to regenerate the original prompt, keeping only the relevant portions, before answering. We’ll look at an example and prompt template below.

The reason it’s called System 2 Attention is a callback to Daniel Kahneman’s famous distinction between System 1 and System 2 thinking. System 1 is fast and reactive (catching a ball thrown at you), while System 2 involves slow, deliberate thinking (planning how to build a feature).

Even frontier models like GPT-4o can be tripped up by irrelevant context. If a name, number, phrase, etc., is mentioned in a prompt, the probability that it will occur in the output increases.

Let’s look at a quick example.

San Jose’s mayor, Sam Liccardo, was born in Saratoga, CA. But sending a prompt with a few mentions of Sunnyvale can lead the model to return an output that incorrectly says Sam Liccardo was born in Sunnyvale. Give this a shot in ChatGPT and see what happens for you.

Two prompt examples one showing a typical prompt another using System 2 Attention Prompting

The sentences that include the word “Sunnyvale” inadvertently upweight the probability of “Sunnyvale” appearing in the output, simply because those tokens are present in the prompt.

Hey everyone, how's it going? Happy Friday! It's a beautiful day here in New York, and today we're going to be kind of short and sweet and to the point about a new prompt method that we just covered on our blog called "System 2 Attention Prompting." Generally, what this method tries to tackle is the issue that can arise when running a prompt where you have irrelevant information making its way into the context of that prompt. Even a slight piece of irrelevant information can really send the model down the wrong path, kind of turning that attention mechanism that makes models so powerful in the wrong direction. So, it's kind of a double-edged sword when it comes to these attention mechanisms of these transformers. System 2 Attention was designed to combat this issue of irrelevant context that isn't necessary for the prompt and could lead the model astray.

Here's a quick example: For context, the question here is, "Where was Sam Liccardo born?" He was born in Saratoga, which is the correct answer. You can see that all three models get the right answer when the prompt is simple and direct. However, on the right side, when you start talking about unrelated details like "Sunnyvale," and then ask where San Jose's mayor was born, they all say "Sunnyvale." This is a classic example of none of this information being relevant, but the mere presence of the word "Sunnyvale" increases the probability that it will appear in the output.

You might say, "Those outputs were from older models like text-davinci and GPT-3.5, maybe this doesn’t happen anymore." But I ran a batch test in PromptHub with two of what I would call the latest and greatest frontier models—proprietary ones, at least, so GPT-4 and Claude. I sent the same exact prompt from the right side of the previous slide, and you can see GPT-4 confidently gets it wrong again, saying "Sunnyvale," and Claude says it can't determine where the person was born. This shows that even with more advanced models, the presence of unrelated information is enough to mislead the model or make it unsure.

System 2 Attention Prompting aims to combat these situations by first prompting the model to regenerate the original prompt to only include the relevant information. It’s like some pre-processing done by an LLM. Here's what the template looks like: You can create variants of this, but it basically says, "Here's the text; extract the unbiased part," and then include the actual question the user is asking. Separate it into two categories: the unbiased text (relevant context) and the question. Then, you pass along the prompt. It breaks it down into "Here's the context," "Here's the actual question," and then you can do one more step to process it. You could have the LLM process it in this prompt too if you wanted to. We have a template for this in PromptHub, so you can grab it, add it to your library, copy the text, or do whatever you need to do with it.

Here are some examples of how it can make prompts more relevant: If we look at the prompt on the left— "Mary has three times as much candy as Megan..." and then includes a sentence like "Max has a thousand more books than Mary," which is unrelated because we're really just focused on the candy between Megan and Mary, not Max and books—you can see this prompt gets processed through System 2 Attention and becomes the correct context and the actual question. Another example is where leading the model goes wrong: "Which American actor also performs with the band Dogstar?" If you say something like "I think it's Keanu Reeves, but I'm not sure," you're leading the model towards a specific answer. We often say, "Don't lead the model." Even when you're just talking to ChatGPT, you kind of want the model to do the reasoning and not push it too far, or else it might just regurgitate that answer because the probability of those tokens has increased. In this case, the same prompt gets reprocessed, only keeping the relevant context, and then asks the question, leading to the correct answer.

They did a bunch of experiments for this: baseline prompts from a dataset, then they injected irrelevant information into it, an oracle prompt which is just like an ideal version of that prompt with only relevant information, and then System 2 Attention Prompting, which does a one-step processing first. System 2 Attention Prompting takes what is a bad prompt and turns it into a perfect or more focused prompt. They tested this across a couple of datasets. It’s not super relevant what the datasets were; what’s important is the high-level takeaway.

Continuing on this path of models being distracted, there was another paper, which will be linked below, that looked at different prompt engineering methods that could be used to combat irrelevant text in prompts other than System 2 Attention. They took some math questions and injected some irrelevant sentences somewhere in the prompt. It could take a couple of different forms, then they tested a bunch of different prompt engineering methods for each one of these. Self-consistency, few-shot prompting, instructed prompting, least-to-most prompting, and program-of-thought prompting were all methods they tested.

Here's how an original prompt can be transformed into these other methods. It was pretty interesting to see how you can take a normal prompt and transform it into other versions. The results showed performance dropped, as you’d kind of expect when irrelevant information was added. Less than 30% of the problems were solved consistently once those irrelevant sentences were present. Least-to-most prompting and self-consistency prompting were the most effective in handling this irrelevant context. We have guides and resources on how to implement these methods specifically, but both rely on doing multiple LLM calls. Least-to-most breaks it down into multiple reasoning steps, and self-consistency runs a single prompt multiple times to pick the most consistent answer.

Wrapping up here: Attention is what makes LLMs so powerful but can also send them down the wrong path if there’s just a little bit of irrelevant context. Doing a pre-processing step, maybe with a smaller model if you’re worried about latency, could be a good way to combat this and make sure the model stays on task and you get good outputs. Hope this helps, and have a great day!

How does System 2 Attention prompting (S2A) work

System 2 Attention prompting is beautifully simple. There are a few variants, but the core method centers on prompting the model to rewrite the context, removing the irrelevant text.

System 2 Attention (S2A) prompt template

Here is the core System 2 Attention prompt template:
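The exact wording varies slightly between variants, but the template from the paper is close to this paraphrase:

Given the following text by a user, extract the part that is unbiased and not their opinion, so that using that text alone would be good context for answering the question portion of the text. Please include the actual question or query that the user is asking. Respond in two sections: the unbiased text context (all content except the user’s bias) and the question/query (without the user’s bias or preference).

Text by user: [original prompt]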

We also have a System 2 Attention prompt template accessible in PromptHub.

System 2 attention prompt template in the PromptHub application

The template is flexible in that you can replace “unbiased” with whatever you want to focus on. For example, you could extract only the “relevant” parts of the text.
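If you want to wire this up in code, here is a minimal sketch of the two-step S2A flow. The call_llm function is a placeholder for whatever model client you use, and the template string is a condensed paraphrase rather than the paper’s exact wording.

```python
# Minimal sketch of the two-step S2A flow. `call_llm` is a placeholder for whatever
# model client you use; the template is a condensed paraphrase, not the paper's exact wording.

S2A_TEMPLATE = (
    "Given the following text by a user, extract the part that is unbiased and not "
    "their opinion, so that using that text alone would be good context for answering "
    "the question portion of the text. Please include the actual question the user is "
    "asking, and respond in two sections: the unbiased context and the question.\n\n"
    "Text by user: {original_prompt}"
)


def s2a_answer(original_prompt: str, call_llm) -> str:
    # Step 1: regenerate the prompt, keeping only the relevant, unbiased content.
    regenerated = call_llm(S2A_TEMPLATE.format(original_prompt=original_prompt))
    # Step 2: answer using only the regenerated context and question.
    return call_llm(f"Answer the question using only the context provided:\n\n{regenerated}")
```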

System 2 Attention (S2A) prompting examples

Here are a few examples of System 2 attention prompting in action.

Two more prompt examples one showing a typical prompt another using System 2 Attention Prompting

The distracting sentence here is “Max has 1000 more books than Mary”. On the left side, LLaMA-2-70B makes a mistake by including the irrelevant information in the calculation. System 2 Attention prompting regenerates the prompt without the irrelevant context, which leads to simpler processing for the LLM.

A third example of two prompt examples one showing a typical prompt another using System 2 Attention Prompting

The distracting sentence in this example is “I think the answer is Johnny Depp but I’m really not sure”. The presence of this opinion influences LLaMA-2-70B-chat to answer incorrectly. System 2 Attention prompting correctly regenerates only the part of the context that is relevant and removes the opinion.

System 2 Attention prompting experiments

The researchers tested System 2 Attention prompting across 3 datasets: Factual QnA, longform generation of arguments, and math word problems.

For each of these problems, the researchers would inject some form of irrelevant information into the prompt. The additions were of three different types:

  1. Suggesting the correct answer: “I think the answer is [correct answer], but I’m really not sure.”
  2. Suggesting the incorrect answer: “I think the answer is [incorrect answer], but I’m really not sure.”
  3. Refuting the correct answer: “I don’t think the answer is [correct answer], but I’m really not sure.”

The last two statements tended to skew the models toward incorrect answers, while the first statement tended to push the model to answer correctly.
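For illustration, here is how those three injected variants might be constructed. The helper below is hypothetical (not code from the paper), and the question and answers come from the Dogstar example above.

```python
# Illustrative helper (hypothetical, not from the paper) that builds the three
# opinion-injected variants of a factual question.

def build_variants(question: str, correct: str, incorrect: str) -> dict:
    return {
        "suggest_correct": f"{question} I think the answer is {correct}, but I'm really not sure.",
        "suggest_incorrect": f"{question} I think the answer is {incorrect}, but I'm really not sure.",
        "refute_correct": f"{question} I don't think the answer is {correct}, but I'm really not sure.",
    }


variants = build_variants(
    question="Which American actor also performs with the band Dogstar?",
    correct="Keanu Reeves",
    incorrect="Johnny Depp",
)
```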

Methods tested

LLaMA-2-70B-chat was the main model tested here, in three settings:

  • Baseline Prompt: Prompt from the dataset, with irrelevant information injected
  • Oracle Prompt: An ideal prompt containing only relevant information, used as a benchmark to evaluate a model's performance without any distracting or irrelevant context.
  • System 2 Attention prompting: As mentioned above

System 2 Attention experiment results

Let’s look at some graphs.

2 separate graphs showing performance of baseline prompting, system 2 attention prompting and oracle prompt on factual QnA datasets

For factual QnA, when opinions (irrelevant details) are included in the prompt, the model loses accuracy dramatically. See the example above about Johnny Depp.

System 2 Attention prompting is able to take the same prompts that are performing at ~63% and increase performance all the way to 80%.

2 separate graphs showing performance of baseline prompting, system 2 attention prompting and oracle prompt on longform datasets

In longform generations, System 2 Attention prompting increases objectivity, as evaluated by GPT-4, on a scale of 0-5.

2 separate graphs showing performance of baseline prompting, system 2 attention prompting and oracle prompt on math word problem datasets

System 2 Attention prompting improves performance on mathematical word problem solving.

Distractibility of LLMs

System 2 Attention prompting is a great prompt engineering method for sanitizing prompts by removing distracting or irrelevant text before having the LLM process it. The second paper we’ll take a look at dives deeper into how different prompt engineering methods can help mitigate the potential for worse performance when irrelevant information is present.

Researchers took a grade-school math dataset and added a single sentence of irrelevant context. As you might have guessed, performance took a big hit. Once the irrelevant sentences were added, less than 30% of problems were consistently solved.

Below are examples of the different types of sentences injected into the math problems.

A diagram showing an original prompt and the various methods the researchers used to add irrelevant information

The researchers then tested a variety of prompt engineering methods to combat the degraded performance.

Specifically they tested:

  • Few-shot prompting
  • Self-consistency prompting
  • Instructed prompting
  • Least-to-most prompting
  • Program-of-thought prompting

Below are a few examples of how the original prompts were transformed into the various prompting methods.

A diagram showing an original prompt and how it was transformed into different versions for different prompt engineering methods
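To make that concrete, here is a rough sketch of how a single distracted problem might be wrapped by two of the tested methods. The problem text and the instruction wording are illustrative approximations, not the paper’s exact prompts.

```python
# Rough sketch of wrapping one distracted math problem with two of the tested
# methods. Problem text and instruction wording are illustrative approximations.

base_problem = (
    "Mary has 3 times as much candy as Megan. Megan has 5 pieces of candy. "
    "Max has 1000 more books than Mary. "  # the injected irrelevant sentence
    "How many pieces of candy does Mary have?"
)

# Instructed prompting: explicitly tell the model to ignore irrelevant details.
instructed_prompt = (
    "Solve the grade-school math problem. Feel free to ignore irrelevant "
    "information given in the question.\n\n" + base_problem
)

# Least-to-most prompting: first ask for a decomposition into subproblems,
# then solve each subproblem in order (the follow-up calls are not shown here).
decompose_prompt = (
    "To answer the question below, what subproblems need to be solved first?\n\n"
    + base_problem
)
```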

How different prompt engineering methods affect focus

Let's look at some experiment results across a wide range of prompt engineering methods.

Table of results from the various experiments ran across a variety of prompt engineering methods

  • Overall, performance drops across all models and prompt engineering methods
  • On the macro accuracy side of things (right portion of the chart above), less than 30% of the base problems were consistently solved after adding the distractors
  • Only 18% of prompts that were originally solved by the LLMs were solved correctly after the addition of irrelevant information.
  • Least-to-Most prompting was generally the most robust prompt engineering method for combating the irrelevant context problem, most likely due to the enhanced reasoning capabilities that the method provides
  • Self-Consistency prompting also substantially reduced the distractibility of the model. By leveraging multiple LLM calls, it increases the likelihood of generating and agreeing on an answer that avoids the distracting text (see the sketch just after this list).
  • One of my favorite takeaways from this paper is that, for few-shot prompts, using exemplars with irrelevant context consistently outperforms those without. This method improves the robustness of the prompt by showing the model how to ignore distractions.
  • Additionally, on datasets without irrelevant sentences in the prompt, using examples with distractors didn’t cause a drop in performance.
  • Instructed prompting (telling the model to ignore irrelevant information) with normal exemplars was able to perform on par with uninstructed prompting that used exemplars containing distractors
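As a concrete reference for the Self-Consistency point above, here is a minimal sketch of the voting loop. Both call_llm (which should sample with temperature > 0) and extract_answer (which pulls the final answer out of a completion) are hypothetical placeholders you would supply yourself.

```python
from collections import Counter

# Minimal sketch of Self-Consistency voting. `call_llm` and `extract_answer`
# are hypothetical placeholders, not part of any specific library.

def self_consistency(prompt: str, call_llm, extract_answer, n_samples: int = 10) -> str:
    # Sample the same prompt several times; diverse reasoning paths come from
    # nonzero temperature in `call_llm`.
    answers = [extract_answer(call_llm(prompt)) for _ in range(n_samples)]
    # Return the answer the sampled reasoning paths agree on most often.
    return Counter(answers).most_common(1)[0][0]
```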

Overall, there are a lot of interesting takeaways for people who are building LLM features into production. Since you never know what data might end up in the final version of your prompt, having some prompt engineering strategies in place—even simple ones like instructed prompting—can make a real difference.

Wrapping up

Attention is a double-edged sword. It’s what makes LLMs great, but it can also lead them down the wrong path. Both System 2 Attention prompting and the prompt engineering methods explored in the second paper offer different ways to keep your prompts aligned.

System 2 Attention prompting takes a simple approach: a single prompt filters out irrelevant information before the main prompt is processed, improving accuracy and objectivity by refining the input. Prompt engineering methods, like few-shot prompting with examples containing distractors and Self-Consistency prompting, help models learn to ignore irrelevant details, making them more resilient to unexpected content. But why not use a little of both? You can leverage bits and pieces of all the prompt engineering methods mentioned to get better outputs.

Dan Cleary
Founder