
Every team, user, and company trying to leverage AI wants to improve the consistency and accuracy of outputs. These are arguably the most important qualities to evaluate when pushing AI features live in production apps or internal tools.

One way to improve output consistency is a prompt engineering method called Self-Consistency Prompting. Originally introduced by Google researchers in March 2022 (view the full paper here), Self-Consistency Prompting increases performance by sampling multiple reasoning paths and taking the most common answer, rather than just selecting the first answer generated.

Later, the same researchers extended self-consistency prompting to tackle more types of tasks (like open-ended generation) in a method called Universal Self-Consistency (USC).

Let’s dive into how each of these methods works, practical steps to implement them, free templates, and when each is most effective.

What is Self-Consistency Prompting?

Self-Consistency Prompting is a prompt engineering method that enhances the reasoning capabilities of Large Language Models (LLMs) by generating multiple outputs and selecting the most consistent answer among them.

This approach leverages the idea that complex problems can be solved in more than one way. By evaluating several reasoning paths and their outputs, you can identify the most reliable and accurate solution, leading to improved performance.

Comparing Chain-of-Thought prompting and Self-Consistency prompting

Self-Consistency Prompting steps

  1. Chain-of-Thought (CoT) prompting: Start with Chain-of-Thought or few-shot prompting to demonstrate reasoning examples
  2. Sampling diverse paths: Rather than generating just a single output, generate a variety of outputs by running the prompt many times
  3. Choose the most popular answer: Select the most consistent answer from the final set of outputs (see the sketch below)
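
To make the three steps concrete, here is a minimal sketch in Python. This is not PromptHub's or the paper's implementation: it assumes the OpenAI Python SDK with an API key in your environment, and the model name, the 1-shot exemplar, and the "Answer: <number>" convention are illustrative choices.

```python
# Minimal self-consistency sketch (illustrative, not the paper's exact code).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

# Step 1: a Chain-of-Thought style prompt with one worked example.
COT_PROMPT = """Q: There are 15 trees in the grove. After the workers plant more, there are 21 trees. How many trees did they plant?
A: There were 15 trees and now there are 21, so they planted 21 - 15 = 6 trees. Answer: 6

Q: Sam had 10 socks. He threw away 3 old ones and bought 36 new ones. How many socks does he have?
A:"""

def sample_answers(prompt: str, n: int = 5, temperature: float = 0.7) -> list[str]:
    """Step 2: sample n diverse reasoning paths from the same prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",       # assumption: any chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,   # > 0 so the reasoning paths actually differ
        n=n,                       # n completions in a single request
    )
    answers = []
    for choice in response.choices:
        match = re.search(r"Answer:\s*(-?\d+)", choice.message.content or "")
        if match:
            answers.append(match.group(1))
    return answers

# Step 3: majority vote over the extracted final answers.
answers = sample_answers(COT_PROMPT)
winner, votes = Counter(answers).most_common(1)[0]
print(f"Sampled answers: {answers} -> most consistent: {winner} ({votes}/{len(answers)} votes)")
```

The non-zero temperature matters here: with greedy decoding, every sample would follow the same reasoning path and the vote would be meaningless.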

Hey everyone, how's it going? Dan here, co-founder of PromptHub, back today. It's a beautiful Friday here in New York. I hope everyone's doing well. We're going to talk about self-consistency prompting. With that out of the way, I'll make myself small and we can jump right in.

Self-consistency prompting is a relatively old prompt engineering method; it was first written about back in 2022. Essentially, what it does is instead of just sending the prompt once and getting an output, it sends the same prompt many times, generating multiple outputs and then selecting the most consistent answer among them.

For example, it might start with some Chain of Thought or few-shot prompting, where we're sending a Q&A showing the reasoning steps and then ending with a question. So that's kind of Chain of Thought-like. Instead of just sending it once, we're going to send it multiple times, then marginalize out the reasoning paths to aggregate the final answers and choose the most consistent one. In this case, 18 appears twice versus 26, which only appears once.

Typically, this would be done in multiple prompts. We have a template here that jams it into one prompt, which doesn't really do justice to the method. Typically, you want to run the same prompt five times, aggregate those answers, and then take the most consistent one. This is a good starting point.

Even though it was developed a while ago, let's look at the experiment setup. They used GPT-3 with a temperature of 0.7. The temperature here is important because we're running the prompt many times and we're looking for diverse reasoning outputs. We want the model to take different routes. If it gets to the same answer, that's great. If it gets to a different answer, that's okay too. They tested a bunch of datasets spanning commonsense, symbolic, and arithmetic reasoning. Let's jump to the results.

Self-consistency was tested against just plain Chain of Thought, and here are a couple of tables of results. Starting with the arithmetic datasets, the bolded numbers are the best performance on each dataset, one per column. GPT-3 was the strongest model at the time. Self-consistency takes the top score across the board, and it's always higher than Chain of Thought; there are no negative deltas. The same is true on the commonsense reasoning datasets.

Another interesting point is understanding how many outputs you need. The chart has a similar shape to the one for few-shot prompting: it plateaus quickly. Few-shot plateaus faster, after almost just two examples; here, the plateau sets in around the 20-sample range and flattens out further on. The biggest gains come from going from nothing to just a few samples, from zero to five.

Additionally, they plotted consistency versus accuracy. As you can see, the more consistent the outputs were, the higher the accuracy. It's analogous to a confidence score. If you know where you are on this chart, you can translate that into a confidence level that the output is accurate, which can be helpful for recognizing when the model doesn't know.

Self-consistency is great for tasks with a fixed answer set, like a math question, but it falls short on free-form summarization or generation tasks. After the initial paper, the same team from Google developed Universal Self-Consistency, which, as the name suggests, can be applied universally, including to open-ended and free-form text generation. The biggest difference is that rather than aggregating the outputs and looking for the most popular answer, it concatenates all the outputs and uses an LLM to figure out which one is the most consistent. It adds an additional prompt at the end.

You can see here, the prompt is fed to an LLM, generating multiple responses just like self-consistency. Then there's a prompt that says, "I've generated these outputs, select the most consistent one." It uses an LLM here instead of something rule-based, and then it gets a response.

We do have a template that is more useful. This is a good starting point to remember how the method works. It's free in PromptHub and linked in the notes below.

Here's another example. We're asking how many different three-digit numbers can A5 represent. The outputs are filled with text; a rule-based approach wouldn't work here. Because we're using Universal Self-Consistency, we concatenate them all together, send it to the LLM, ask which one is the most consistent, and get an output.

There were a bunch of experiment runs and results. The upshot was that Universal Self-Consistency either matched or outperformed self-consistency prompting, especially on summarization and open-ended Q&A. On common sense reasoning datasets, where self-consistency was already doing a good job, the performances were much closer.

What I like about this is its flexibility. Instead of a rule-based approach defined in code, it's an LLM call, and the prompt itself is very flexible. In this case, the last sentence in the prompt says, "Select the most consistent response based on majority consensus." The key word here is "consistent." In an ablation study, they tested using the wording "most detailed" instead of "most consistent" and saw increased performance across the board, up to 5% in some cases. Universal Self-Consistency allows for that flexibility and adaptability when working with prompts and LLMs. You can have it pick the most detailed, the most consistent, or whatever adjective fits your use case.

We see a similar effect here: a less steep rise when plotting the number of samples against performance.

That's it for today. This is one of my favorite methods. It's easy to implement, especially if you use a chaining tool to run prompts multiple times or a batching tool, both of which we have in PromptHub. It's a simple way to take a prompt you're already using, tweak it very little, and see what the outputs look like. I hope it helps. Have a good one.

Self-Consistency Prompting examples

Self-consistency prompting is effective for tasks with known outputs. Here are a few examples:

  • Mathematical problem solving: By generating multiple outputs for a math problem, self-consistency prompting can identify the most consistent solution, leading to higher accuracy in answers.
    Example: Solving a math problem like "Sam had 10 socks. If he threw away 3 old ones and bought 36 new ones, how many socks would he have?"
  • Commonsense reasoning: For tasks like answering commonsense questions, self-consistency prompting evaluates different reasoning paths to provide the most logical and consistent answer.
    Example: "The man laid on the soft moss and looked up at the trees, where was the man?"
  • Symbolic Reasoning: Tasks similar to solving puzzles can benefit from self-consistency prompting by generating multiple potential solutions and selecting the most consistent one.
    Example: "If the input is 'abcd', what is the output if we concatenate the first and last letters?"

Additionally, check out the self-consistency prompting examples below from the paper. They show how self-consistency prompting can lead to higher performance than just generating one output (Greedy Decoding).

A table of examples of prompts and outputs from Greedy Decoding and Self-Consistency Paths

Below is a 1-shot Self-Consistency prompt template. As mentioned, the prompt should typically be run separately multiple times, rather than generating multiple reasoning paths in a single output. But the template can be used as a starting point.

Self Consistency Prompting template in the PromptHub dashboard

Self-Consistency Prompting experiment setup

Even though the paper has been out for a while, we’ll take a quick look at how the experiments were run.

Models

  • GPT-3: with a temperature of 0.7, no top-k set

Tasks and Datasets

  1. Arithmetic Reasoning:
    • Datasets: Math Word Problem Repository, including AddSub, MultiArith, ASDiv, AQUA-RAT, GSM8K, and SVAMP.
    • Objective: Solve grade-school math problems and more complex math word problems.
  2. Commonsense Reasoning:
    • Datasets: CommonsenseQA, StrategyQA, and the AI2 Reasoning Challenge (ARC).
    • Objective: Answer questions that require commonsense knowledge and logical reasoning.
  3. Symbolic Reasoning:
    • Tasks: Last letter concatenation (e.g., input "Elon Musk," output "nk") and Coin problem.
    • Objective: Solve problems that involve pattern recognition and symbolic manipulation.

Methodology

  • Few-Shot Setting: All prompts leverage few-shot prompting
  • Sampling Technique: The same prompt was run multiple times to generate a diverse set of outputs
  • Output Sampling: For each experiment, 40 outputs were sampled per question and the results were averaged over 10 runs to ensure robustness (a rough sketch of this protocol follows this list).
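
As a rough illustration of that protocol (not the paper's code), the loop below reuses the `sample_answers` helper and `COT_PROMPT` from the earlier sketch; the one-item `DATASET` is a stand-in for a real benchmark such as GSM8K.

```python
# Sketch of the evaluation protocol above: sample 40 reasoning paths per
# question at temperature 0.7, majority-vote, and average accuracy over 10 runs.
# `sample_answers` and `COT_PROMPT` come from the earlier self-consistency sketch.
from collections import Counter
from statistics import mean

DATASET = [(COT_PROMPT, "43")]  # (prompt, gold final answer) -- toy stand-in for a benchmark

def self_consistency_accuracy(dataset, paths_per_question=40, runs=10):
    run_accuracies = []
    for _ in range(runs):
        correct = 0
        for prompt, gold in dataset:
            answers = sample_answers(prompt, n=paths_per_question, temperature=0.7)
            majority = Counter(answers).most_common(1)[0][0] if answers else None
            correct += int(majority == gold)
        run_accuracies.append(correct / len(dataset))
    return mean(run_accuracies)

print(f"Self-consistency accuracy (avg over {10} runs): {self_consistency_accuracy(DATASET):.1%}")
```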

Prompting Baseline

  • Prompt engineering methods: Self-consistency prompting was compared against CoT-prompting.

Self-Consistency Prompting experiment results

With all of that out of the way, let’s look at some charts and graphs.

A table of performance from multiple models using CoT prompting and self-consistency prompting across a variety of arithmetic reasoning datasets
Performance on the arithmetic datasets

A table of performance from multiple models using CoT prompting and self-consistency prompting across a variety of commonsense reasoning datasets
Performance on commonsense reasoning datasets

Across the two dataset groups above, self-consistency prompting consistently improved on chain-of-thought prompting and achieved state-of-the-art (SOTA) performance on several datasets.

Four charts next to each other, each comparing the relationship of # of sampled reasoning paths to accuracy on 4 different datasets
Sampled Reasoning Paths = number of times the prompt was run.

Increasing the number of 'Sampled Reasoning Paths' increased performance up until a plateau around 40. Similar to the relationship between the number of examples used in few-shot prompting and overall performance, most of the gains come from sampling any reasoning paths at all (going from 0 to 5).

A chart comparing accuracy versus consistency
Consistency = the % of prompt outputs agreeing with the final answer

As you can see above, consistency is highly correlated with accuracy. This means you could theoretically use consistency as a proxy for confidence. Self-Consistency Prompting gives the model some ability to recognize when it doesn’t “know” the answer: the lower the consistency, the lower your confidence in the answer should be.
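
Here is one way you might operationalize that, sketched under the same assumptions as the earlier example: treat the fraction of sampled answers that agree with the majority as a confidence score and abstain below a threshold (the 0.6 cutoff is arbitrary, not a value from the paper).

```python
# Consistency as a rough confidence score: the fraction of sampled answers that
# agree with the majority answer. Below the threshold, treat the output as
# "model doesn't know" and fall back to a human or another workflow.
# `sample_answers` and `COT_PROMPT` come from the earlier sketch.
from collections import Counter

def answer_with_confidence(prompt: str, n: int = 10, threshold: float = 0.6):
    answers = sample_answers(prompt, n=n)
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    consistency = votes / len(answers)
    return (winner if consistency >= threshold else None), consistency

answer, confidence = answer_with_confidence(COT_PROMPT)
print(f"answer={answer!r}, consistency={confidence:.0%}")
```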

Where Self-Consistency Prompting falls short

While self-consistency prompting can be highly effective for tasks with fixed answer sets, it falls short when applied to free-form generation tasks. Given this limitation, the same research team from Google published a follow-up paper to make the method more robust.

The new method, called Universal Self-Consistency, extends Self-Consistency Prompting to a broader range of applications, including those involving open-ended and free-form text generation.

What is Universal Self-Consistency Prompting?

Universal Self-Consistency (USC) is a prompt engineering method that extends the benefits of self-consistency to a broader range of use cases, including open-ended and free-form text generation.

Rather than aggregating output solutions and looking for the most popular, USC concatenates all outputs and uses an LLM to figure out which answer is the most consistent.

This approach enhances the model’s ability to handle tasks with variable and complex outputs, making it more versatile and effective.

Universal Self-Consistency Prompt Template in PromptHub Dashboard

Universal Self-Consistency Prompting steps

  1. Sampling Multiple Responses: Similar to self-consistency, USC begins by generating many outputs from a single prompt
  2. Concatenating Responses: All outputs are concatenated
  3. Selecting the Most Consistent Response: A prompt is sent to the LLM asking it to select the most consistent response from the concatenated outputs. This final step of calling the LLM to make the decision is what enables the method's more flexible use cases (a minimal sketch follows the flow example below).

Universal Self-Consistency Prompting flow example
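
Putting those steps together, here is a minimal USC sketch in Python. It again assumes the OpenAI SDK and an illustrative model name; the numbering format, selection wording, and regex parsing approximate the selection prompt described above rather than reproduce the paper's implementation exactly.

```python
# Minimal Universal Self-Consistency sketch: sample several free-form responses,
# concatenate them into one selection prompt, and ask the model which response
# is the most consistent.
import re
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: any capable chat model

def usc_select(task_prompt: str, n: int = 5) -> str:
    # Step 1: sample multiple responses to the same prompt.
    sampled = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": task_prompt}],
        temperature=0.7,
        n=n,
    )
    responses = [c.message.content or "" for c in sampled.choices]

    # Step 2: concatenate the responses, numbered so the model can refer to them.
    numbered = "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(responses))

    # Step 3: one more LLM call picks the most consistent response.
    selection_prompt = (
        f"I have generated the following responses to the question:\n\n{task_prompt}\n\n"
        f"{numbered}\n\n"
        "Evaluate these responses and select the most consistent response "
        "based on majority consensus. Answer in the form "
        "'The most consistent response is Response X.'"
    )
    verdict = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": selection_prompt}],
        temperature=0,
    ).choices[0].message.content

    match = re.search(r"Response\s+(\d+)", verdict or "")
    if match:
        idx = int(match.group(1)) - 1
        if 0 <= idx < len(responses):
            return responses[idx]
    return responses[0]  # fall back to the first sample if parsing fails

best = usc_select("Write a 3-sentence LinkedIn post about prompt engineering.")
print(best)
```

Because the final decision is just another prompt, the selection wording ("most consistent") can be swapped for whatever criterion you care about, which is exactly what the results section below explores.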

Universal Self-Consistency Prompting examples

We’ll look at two quick examples from the paper, the first being on an open-ended question answering task.

An example of universal self consistency being used on a problem that is a free-form Q&A

As you can see, for the given open-ended question, the final answer is an entity list where no two responses share the exact same prediction. The LLM correctly selects the response where the individual entities in the list appear most frequently across the responses.

An example of universal self consistency being used on a math problem

The output format in this example is diverse, which would make a rule-based method of extracting answers challenging. Concatenating the outputs and using an LLM to find the most consistent response allows for a more nuanced approach to increasing accuracy.

This method leverages the LLM's ability to evaluate the frequency and consistency of individual entities across responses, providing a more reliable final answer.

Implementing Universal Self-Consistency without code

Using prompt chains in PromptHub, we can implement the full flow of Universal Self-Consistency without needing to write a line of code.

We'll leverage Universal Self-Consistency to help pick the most consistent option for a LinkedIn post about prompt engineering.

Step 1 - Write a prompt to generate the LinkedIn post

We’ll grab this template in PromptHub.

Step 2 - Run the prompt multiple times

We'll create a prompt chain that runs the LinkedIn post generator prompt three times.

A picture of the steps in a prompt chain in PromptHub

Step 3 - Run the Universal Self-Consistency prompt to make the final decision

We'll inject the three outputs from the LinkedIn post generator prompt as inputs into the Universal Self-Consistency prompt template and let the model tell us which is the most consistent.

The universal self consistency step in a prompt chain in the PromptHub dashboard
The last step in the chain examines all the outputs and makes a final decision

That's it! We fully implemented Universal Self-Consistency in just a few clicks, and with no code.

Universal Self-Consistency Prompting experiment setup

Okay, back to the experiments.

Universal Self-Consistency went head to head with Self-Consistency prompting on a variety of tasks.

  • Mathematical Reasoning:
    • Benchmarks: GSM8K, MATH.
    • Objective: Solve complex math problems using sampled reasoning paths.
  • Code Generation:
    • Benchmarks: BIRD-SQL, ARCADE.
    • Objective: Generate accurate SQL queries and Python code snippets.
  • Long-Context Summarization:
    • Benchmarks: GovReport, SummScreen.
    • Objective: Summarize long documents and TV show transcripts.
  • Open-Ended Question Answering:
    • Benchmark: TruthfulQA.
    • Objective: Provide truthful and informative answers to open-ended questions.

Universal Self-Consistency Prompting experiment results

Overall, Universal Self-Consistency either matched or outperformed Self-Consistency Prompting, showing significant improvements, especially in summarization and open-ended Q&A.

Universal Self-Consistency’s flexibility makes it easy to iterate to get better outcomes. For example, the table below shows that tweaking the prompt used in step three to choose the most detailed response (rather than the most consistent one) led to an increase in performance. Iterating on the language in the final prompt provides another lever to increase performance.

A table showing results from two datasets when the wording in the universal self consistency prompt was changed to choose the most detailed output
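
If you want to experiment with that kind of wording change in the USC sketch above, one option is to expose the criterion as a parameter. This is an illustrative helper, not code from the paper:

```python
# An illustrative tweak to the USC sketch above: expose the selection criterion
# as a parameter so you can ask for the most "consistent", "detailed", or any
# other adjective that matches what you are optimizing for.
def selection_instruction(criterion: str = "consistent") -> str:
    return (
        f"Evaluate these responses and select the most {criterion} response. "
        f"Answer in the form 'The most {criterion} response is Response X.'"
    )

print(selection_instruction("detailed"))
```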

Two graphs showing the relationship between number of samples and accuracy across 4 datasets
A similar pattern we saw earlier when comparing number of samples to performance

Universal Self-Consistency Prompting benefits

Higher Accuracy: Universal Self-Consistency enhances the accuracy of LLM outputs by leveraging the model's ability to evaluate multiple responses and select the most consistent one.

Adaptability: USC can be applied to a wide range of tasks, including those with free-form answers. Additionally, you can easily tweak what you are optimizing for by adjusting the prompt that selects the final answer (e.g., ‘Detailed’ vs. ‘Consistent’).

Deeper Understanding: By exploring various outputs, USC facilitates a deeper exploration of topics.

Wrapping up

Both Self-Consistency Prompting and Universal Self-Consistency Prompting are excellent prompt engineering methods to enhance accuracy and reduce hallucinations. By sampling many outputs and selecting the most consistent response, you minimize the chance of getting a one-off answer that is incorrect. Add this tool to your toolbox and leverage our templates to get you up and running!

Dan Cleary
Founder