Every team, user, and company trying to leverage AI wants to improve the consistency and accuracy of outputs. Those are arguably the most important metrics to evaluate when pushing AI features live in production apps or internal tools.
One way to tackle this is a prompt engineering method called Self-Consistency Prompting. Originally made popular by Google researchers in March 2023 (view the full paper here), Self-Consistency Prompting increases performance by leveraging multiple reasoning paths and selecting the most common answer, rather than just accepting the first answer generated.
Just a few months later, the same researchers extended self-consistency prompting to tackle more types of tasks (like open-ended generations) in a method called Universal Self-Consistency (USC).
Let’s dive into how each of these methods works, practical steps to implement them, free templates, and when each is most effective.
What is Self-Consistency Prompting?
Self-Consistency Prompting is a prompt engineering method that enhances the reasoning capabilities of Large Language Models (LLMs) by generating multiple outputs and selecting the most consistent answer among them.
This approach builds on the idea that complex problems can be solved in more than one way. By evaluating multiple reasoning paths and outputs, you can identify the most reliable and accurate solution, leading to improved performance.
Self-Consistency Prompting steps
- Chain-of-Thought (CoT) prompting: Start the process with Chain-of-Thought prompting or few-shot prompting to demonstrate reasoning examples
- Sampling diverse paths: Rather than generating just a single output, generate a variety of outputs by running the prompt many times
- Choose the most popular answer: Select the most consistent answer from the final set of outputs (see the sketch below)
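To make these steps concrete, here is a minimal Python sketch of the sample-then-vote loop. The `sample_completion` wrapper is a hypothetical stand-in for whatever LLM client you use, and the answer extraction is deliberately naive, assuming a numeric final answer like the math examples below.

```python
import re
from collections import Counter


def sample_completion(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical wrapper around your LLM provider's API; any client
    that returns a free-text completion will work here."""
    raise NotImplementedError("plug in your provider's chat/completions call")


# Step 1: a Chain-of-Thought style prompt (question taken from the examples below).
COT_PROMPT = (
    "Q: Sam had 10 socks. If he threw away 3 old ones and bought 36 new ones, "
    "how many socks would he have?\n"
    "A: Let's think step by step."
)


def extract_final_answer(text: str) -> str | None:
    # Naive extraction: take the last number mentioned in the reasoning chain.
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else None


def self_consistency(prompt: str, n_samples: int = 10) -> str:
    # Step 2: sample diverse reasoning paths with a non-zero temperature.
    answers = []
    for _ in range(n_samples):
        answer = extract_final_answer(sample_completion(prompt, temperature=0.7))
        if answer is not None:
            answers.append(answer)
    if not answers:
        raise ValueError("no parsable answers were sampled")
    # Step 3: majority vote over the final answers.
    return Counter(answers).most_common(1)[0][0]
```

In practice the extraction step is where most of the engineering lives; for multiple-choice or classification tasks you would parse the option label instead of a number.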
Self-Consistency Prompting examples
Self-consistency prompting is effective for tasks with known outputs. Here are a few examples:
- Mathematical problem solving: By generating multiple outputs for a math problem, self-consistency prompting can identify the most consistent solution, leading to higher accuracy in answers (worked through just after this list).
Example: Solving a math problem like "Sam had 10 socks. If he threw away 3 old ones and bought 36 new ones, how many socks would he have?"
- Commonsense reasoning: For tasks like answering commonsense questions, self-consistency prompting evaluates different reasoning paths to provide the most logical and consistent answer.
Example: "The man laid on the soft moss and looked up at the trees, where was the man?"
- Symbolic reasoning: Tasks similar to solving puzzles can benefit from self-consistency prompting by generating multiple potential solutions and selecting the most consistent one.
Example: "If the input is 'abcd', what is the output if we concatenate the first and last letters?"
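To make the math example concrete: a correct reasoning path computes 10 - 3 + 36 = 43 socks. If one sampled path forgets the 3 discarded socks and answers 46, but two other paths independently arrive at 43, the majority vote still returns 43. That one-off error is exactly the failure mode self-consistency is designed to absorb.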
Additionally, check out the self-consistency prompting examples below from the paper. They show how self-consistency prompting can lead to higher performance than just generating one output (Greedy Decoding).
Below is a 1-shot Self-Consistency prompt template. As mentioned, the prompt should typically be run as multiple separate requests, rather than asking for multiple answers in a single output. But the template can be used as a starting point.
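Purely as an illustration of the shape such a template takes, a 1-shot chain-of-thought prompt might look something like this (the exemplar and the `{{question}}` variable are illustrative placeholders, not the PromptHub template itself):

```
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today.
After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: We started with 15 trees and ended with 21 trees, so the workers planted 21 - 15 = 6 trees.
The answer is 6.

Q: {{question}}
A: Let's think step by step.
```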
Self-Consistency Prompting experiment setup
Even though the data is old, we’ll take a quick look at the experiments run.
Models
- GPT-3: temperature of 0.7, no top-k set
Tasks and Datasets
- Arithmetic Reasoning:
  - Datasets: Math Word Problem Repository, including AddSub, MultiArith, ASDiv, AQUA-RAT, GSM8K, and SVAMP.
  - Objective: Solve grade-school math problems and more complex math word problems.
- Commonsense Reasoning:
  - Datasets: CommonsenseQA, StrategyQA, and the AI2 Reasoning Challenge (ARC).
  - Objective: Answer questions that require commonsense knowledge and logical reasoning.
- Symbolic Reasoning:
  - Tasks: Last letter concatenation (e.g., input "Elon Musk", output "nk") and the Coin Flip task.
  - Objective: Solve problems that involve pattern recognition and symbolic manipulation.
Methodology
- Few-Shot Setting: All prompts leverage few-shot prompting
- Sampling Technique: The same prompt was run multiple times to generate a diverse set of outputs
- Output Sampling: For each experiment, 40 outputs were sampled and the results were averaged over 10 runs to ensure robustness (see the sketch below).
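As a rough sketch of that evaluation loop, reusing the hypothetical `sample_completion` and `extract_final_answer` helpers from the earlier snippet and assuming a labeled dataset of (question, answer) pairs:

```python
from collections import Counter
from statistics import mean


def run_self_consistency_eval(dataset, n_paths: int = 40, n_runs: int = 10) -> float:
    """Approximate the paper's setup: sample n_paths reasoning paths per question,
    majority-vote the final answers, and average accuracy over n_runs repetitions."""
    run_accuracies = []
    for _ in range(n_runs):
        correct = 0
        for question, gold_answer in dataset:
            # sample_completion / extract_final_answer are the hypothetical
            # helpers sketched earlier in this post.
            answers = [
                extract_final_answer(sample_completion(question, temperature=0.7))
                for _ in range(n_paths)
            ]
            answers = [a for a in answers if a is not None]
            voted = Counter(answers).most_common(1)[0][0] if answers else None
            correct += int(voted == gold_answer)
        run_accuracies.append(correct / len(dataset))
    return mean(run_accuracies)
```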
Prompting Baseline
- Prompt engineering methods: Self-consistency prompting was compared against CoT-prompting.
Self-Consistency Prompting experiment results
With all of that out of the way, let’s look at some charts and graphs.
Across the two dataset groups above, self-consistency prompting consistently improves on chain-of-thought prompting and achieves state-of-the-art (SOTA) performance on several datasets.
Increasing the number of 'Sampled Reasoning Paths' increased performance up to a plateau around 40. Similar to the relationship between the number of examples used in few-shot prompting and overall performance, many of the gains come from sampling any reasoning paths at all (going from a single greedy output to 5 sampled paths).
As you can see above, consistency is highly correlated with accuracy. This means you could theoretically use consistency as a proxy for confidence. Self-Consistency Prompting gives the model some ability to signal when it doesn't "know" the answer: the lower the consistency, the lower your confidence in the answer should be.
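One practical way to use that relationship, sketched below: treat the share of sampled answers that agree with the majority answer as a rough confidence score, and route low-confidence answers to a fallback such as human review or a retry. The 0.6 threshold is an assumption to tune per task, not a value from the paper.

```python
from collections import Counter


def vote_with_confidence(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer plus the fraction of samples that agree with it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


answer, confidence = vote_with_confidence(["43", "43", "46", "43", "43"])
if confidence < 0.6:  # illustrative threshold; tune per task
    print(f"Low consistency ({confidence:.0%}) - flag for review")
else:
    print(f"Answer: {answer} (consistency {confidence:.0%})")
```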
Where Self-Consistency Prompting falls short
While self-consistency prompting can be highly effective for tasks with fixed answer sets, it falls short when applied to free-form generation tasks. Given this limitation, the same research team from Google published a follow-up paper a few months later to make the method more robust.
The new method, called Universal Self-Consistency, extends Self-Consistency Prompting to a broader range of applications, including those involving open-ended and free-form text generation.
What is Universal Self-Consistency Prompting?
Universal Self-Consistency (USC) is a prompt engineering method that extends the benefits of self-consistency to a broader range of use cases, including open-ended and free-form text generation.
Rather than aggregating output solutions and looking for the most popular, USC concatenates all outputs and uses an LLM to figure out which answer is the most consistent.
This approach enhances the model’s ability to handle tasks with variable and complex outputs, making it more versatile and effective.
Universal Self-Consistency Prompting steps
- Sampling Multiple Responses: Similar to self-consistency, USC begins by generating many outputs from a single prompt
- Concatenating Responses: All outputs are concatenated
- Selecting the Most Consistent Response: A prompt is sent to the LLM asking it to select the most consistent response from the concatenated outputs. This final step of calling the LLM to make a decision is what allows the method to cover more flexible use cases (see the sketch below).
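Here is a minimal sketch of that flow in Python, again leaning on the hypothetical `sample_completion` wrapper from the earlier snippet. The selection prompt wording is paraphrased to convey the idea, not copied from the paper:

```python
def universal_self_consistency(task_prompt: str, n_samples: int = 5) -> str:
    # Step 1: sample multiple candidate responses (sample_completion is the
    # hypothetical LLM wrapper sketched earlier in this post).
    responses = [sample_completion(task_prompt, temperature=0.7) for _ in range(n_samples)]

    # Step 2: concatenate the candidates into a single numbered block.
    numbered = "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(responses))

    # Step 3: ask the LLM to pick the most consistent response.
    selection_prompt = (
        f"I have generated the following responses to the same request:\n\n{numbered}\n\n"
        "Evaluate these responses and select the most consistent one, "
        "based on majority consensus. Reply with the response number only."
    )
    choice = sample_completion(selection_prompt, temperature=0.0)

    # Naive parse of the chosen index; production code should validate this.
    digits = "".join(ch for ch in choice if ch.isdigit())
    return responses[int(digits) - 1] if digits else responses[0]
```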
Universal Self-Consistency Prompting examples
We’ll look at two quick examples from the paper, the first being on an open-ended question answering task.
As you can see, for the given open-ended question, the final answer is an entity list where no two responses share the exact same prediction. The LLM correctly selects the response where the individual entities in the list appear most frequently across the responses.
The output format in this example is diverse, which would make a rule-based method of extracting answers challenging. Concatenating the outputs and using an LLM to find the most consistent response allows for a more nuanced approach to increasing accuracy.
This method leverages the LLM's ability to evaluate the frequency and consistency of individual entities across responses, providing a more reliable final answer.
Implementing Universal Self-Consistency without code
Using prompt chains in PromptHub, we can implement the full flow of Universal Self-Consistency without needing to write a line of code.
We'll leverage Universal Self-Consistency to help pick the most consistent option for a LinkedIn post about prompt engineering.
Step 1 - Write a prompt to generate the LinkedIn post
We’ll grab this template in PromptHub.
Step 2 - Run the prompt multiple times
We'll create a prompt chain that will run the LinkedIn post generator prompt three times.
Step 3 - Run the Universal Self-Consistency prompt to make the final decision
We'll inject the three outputs from the LinkedIn post generator prompt as inputs into the Universal Self-Consistency prompt template and let the model tell us which is the most consistent.
That's it! We fully implemented Universal Self-Consistency in just a few clicks, and with no code.
Universal Self-Consistency Prompting experiment setup
Okay, back to the experiments.
Universal Self-Consistency went head to head with Self-Consistency prompting on a variety of tasks.
- Mathematical Reasoning:
  - Benchmarks: GSM8K, MATH.
  - Objective: Solve complex math problems using sampled reasoning paths.
- Code Generation:
  - Benchmarks: BIRD-SQL, ARCADE.
  - Objective: Generate accurate SQL queries and Python code snippets.
- Long-Context Summarization:
  - Benchmarks: GovReport, SummScreen.
  - Objective: Summarize long documents and TV show transcripts.
- Open-Ended Question Answering:
  - Benchmark: TruthfulQA.
  - Objective: Provide truthful and informative answers to open-ended questions.
Universal Self-Consistency Prompting experiment results
Overall, Universal Self-Consistency either matched or outperformed Self-Consistency Prompting, showing significant improvements, especially in summarization and open-ended Q&A.
Universal Self-Consistency’s flexibility makes it easy to iterate to get better outcomes. For example, the table below shows that tweaking the prompt used in step three to choose the most detailed response (rather than the most consistent one) led to an increase in performance. Iterating on the language in the final prompt provides another lever to increase performance.
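In the sketch from earlier, that tweak is a one-line change: parameterize the criterion in the selection prompt. The wording below is illustrative rather than the paper's exact prompt:

```python
def build_selection_prompt(numbered_responses: str, criterion: str = "consistent") -> str:
    # Swap "consistent" for "detailed" (or any other criterion) to change
    # what the final selection step optimizes for.
    return (
        f"I have generated the following responses to the same request:\n\n{numbered_responses}\n\n"
        f"Evaluate these responses and select the most {criterion} one. "
        "Reply with the response number only."
    )
```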
Universal Self-Consistency Prompting benefits
Higher Accuracy: Universal Self-Consistency enhances the accuracy of LLM outputs by leveraging the model's ability to evaluate multiple responses and select the most consistent one.
Adaptability: USC can be applied to a wide range of tasks, including those with free-form answers. Additionally, you can easily tweak what you are optimizing for by adjusting the prompt that selects the final answer (e.g., ‘Detailed’ vs. ‘Consistent’).
Deeper Understanding: By exploring various outputs, USC facilitates a deeper exploration of topics.
Wrapping up
Both Self-Consistency Prompting and Universal Self-Consistency Prompting are excellent prompt engineering methods to enhance accuracy and reduce hallucinations. By sampling many outputs and selecting the most consistent response, you minimize the chance of getting a one-off answer that is incorrect. Add this tool to your toolbox and leverage our templates to get you up and running!