The timing of OpenAI’s latest release of o1-preview and o1-mini perfectly aligned with our latest blog posts about prompt chaining and Chain of Thought prompting. The major update in o1-preview is the built-in reasoning the model performs before generating an output.
These reasoning tokens, as OpenAI calls them, even come with a new request parameter that lets you cap how many tokens are available for thought generation.
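For reference, here is a minimal sketch of a request that caps the combined reasoning-plus-output budget, assuming the openai Python SDK (v1.x) and access to o1-preview; exact usage field names may differ by SDK version.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "How many prime numbers are there between 10 and 50?"}
    ],
    # Caps reasoning tokens plus the visible answer in a single budget
    max_completion_tokens=2000,
)

print(response.choices[0].message.content)
# Usage reporting breaks out how many tokens went to hidden reasoning
print(response.usage.completion_tokens_details.reasoning_tokens)
```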
Anyone who is familiar with Chain of Thought prompting will see that this is exactly what the latest models are doing, just by default.
But, as with regular Chain of Thought prompting, there are some issues, chiefly faithfulness. Sometimes the reasoning chains generated by LLMs don’t actually reflect how the model arrived at the answer, which can lead to explanations that sound correct but are in fact misleading. The reasoning chains can sometimes even reference entities completely unrelated to the task, aka hallucinations.
This issue existed before o1-preview, and users have reported it with the latest model as well. See this article where a user ran a variety of experiments and noted: “On examination, around about half the runs included either a hallucination or spurious tokens in the summary of the chain-of-thought.”
We’re going to dive into two papers that tested the limitations of Chain of Thought prompting and how a prompt engineering method called Faithful Chain of Thought can help combat this.
Potential ways Chain of Thought reasoning can be unfaithful
The first paper we’ll dive into is from Anthropic, titled “Measuring Faithfulness in Chain-of-Thought Reasoning”. The paper tested various ways in which LLMs may fail to faithfully follow their reasoning steps. The researchers tested several specific scenarios, each illustrating a potential way Chain of Thought reasoning can be unfaithful:
- Post-hoc reasoning: Situations in which the model generates reasoning after it has already decided what its output will be.
- Introducing mistakes: What happens when mistakes are inserted into the reasoning chain?
- Paraphrasing: Does rewording the reasoning chain affect output consistency?
- Filler tokens: Is the model’s performance boost driven by the reasoning itself, or by the extra computation time? What happens when the Chain of Thought reasoning steps are replaced with filler tokens?
Chain of Thought reasoning experiments
The researchers selected eight multiple-choice tasks to evaluate the faithfulness of Chain of Thought reasoning. For each task, they sampled 100 reasoning chains to analyze.
Post-hoc reasoning
The goal of this experiment was to test whether the reasoning chain provided by the model genuinely guided it to the final answer or if the reasoning was post-hoc.
The researchers truncated the reasoning chain halfway through and examined whether the model still arrived at the same final answer. If the final answer changed when the reasoning was truncated, it indicates that the reasoning was essential and not post-hoc. On the other hand, if the model consistently arrived at the same answer despite the truncated reasoning, it would suggest that the reasoning was post-hoc and not integral to the decision-making process.
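In code, the check looks roughly like this; `query_model` is a hypothetical helper wrapping whatever LLM is being evaluated, not the paper’s actual harness.

```python
# A rough sketch of the truncation check, not the paper's exact code.
from my_llm_client import query_model  # hypothetical wrapper around the model under test

def reasoning_was_loadbearing(question: str, reasoning_steps: list[str], original_answer: str) -> bool:
    """Truncate the chain halfway, re-ask, and report whether the answer changed."""
    first_half = reasoning_steps[: len(reasoning_steps) // 2]
    prompt = (
        f"{question}\n\n"
        + "\n".join(first_half)
        + "\n\nGiven the reasoning so far, the final answer is:"
    )
    truncated_answer = query_model(prompt)
    # True  -> the answer changed: the reasoning was doing real work (faithful)
    # False -> same answer anyway: evidence of post-hoc reasoning (unfaithful)
    return truncated_answer.strip() != original_answer.strip()
```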
Below are the results from running these tests across the eight datasets.
- Overall, the amount of post-hoc reasoning varied significantly across the datasets.
- For ARC (Easy), ARC (Challenge), and OpenbookQA, truncating the reasoning changed the final answer less than 10% of the time (unfaithful).
- For the AQuA dataset, truncating the reasoning led to changes in the final answer more than 60% of the time (faithful).
- The results highlight that post-hoc reasoning is highly task-dependent.
- Even in tasks where Chain of Thought prompting was largely post-hoc (unfaithful), the model still achieved performance improvements. This implies that even when the model isn’t faithful to the reasoning steps, reasoning can still improve performance.
Adding mistakes to the Chain of Thought reasoning steps
The next test of the faithfulness of Chain of Thought reasoning involved introducing mistakes into the reasoning chain to see how it affected the final answer.
Specifically, the researchers introduced a mistake in one step of the reasoning chain, then allowed the LLM to continue generating from that point.
If the model’s final answer changes after the mistake is introduced, this indicates that the model is genuinely relying on the Chain of Thought reasoning steps to generate its outcome. But if the final answer does not change, it suggests that the Chain of Thought reasoning steps didn’t significantly influence the outcome.
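A sketch of that check, again with a hypothetical `query_model` helper:

```python
# A sketch of the mistake-insertion check, not the paper's exact code.
from my_llm_client import query_model  # hypothetical wrapper around the model under test

def answer_shifts_after_mistake(
    question: str,
    reasoning_steps: list[str],
    step_index: int,
    corrupted_step: str,
    original_answer: str,
) -> bool:
    """Replace one step with a deliberate mistake, let the model continue, compare answers."""
    steps_with_mistake = reasoning_steps[:step_index] + [corrupted_step]
    prompt = (
        f"{question}\n\n"
        + "\n".join(steps_with_mistake)
        + "\n\nContinue the reasoning from here and give the final answer:"
    )
    new_answer = query_model(prompt)
    # True -> the model followed the (now wrong) chain, i.e. it relies on the reasoning
    return new_answer.strip() != original_answer.strip()
```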
- Across tasks, the results of adding mistakes were similar to the results from the previous tests on post-hoc reasoning.
- For some tasks, like AQuA and LogiQA, inserting a mistake into the Chain of Thought reasoning often caused the model to change its final answer, indicating that the model followed the reasoning chain more faithfully.
- On the other hand, tasks such as ARC (Easy) and ARC (Challenge) showed fewer changes in final answers after introducing mistakes, suggesting that the models relied less on the Chain of Thought reasoning for those tasks.
Adding filler tokens
The next hypothesis to test was whether the additional test-time computation provided by Chain of Thought reasoning was responsible for the performance boosts, rather than the actual content of the reasoning steps themselves.
To test this, the researchers replaced the Chain of Thought reasoning steps with “…” tokens.
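Roughly, the control looks like this (hypothetical `query_model` helper again):

```python
# A sketch of the filler-token control: same extra tokens, no reasoning content.
from my_llm_client import query_model  # hypothetical wrapper around the model under test

def answer_with_filler(question: str, num_filler_tokens: int) -> str:
    filler = " ".join(["..."] * num_filler_tokens)  # the "reasoning" is just dots
    return query_model(f"{question}\n\n{filler}\n\nThe final answer is:")
```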
- Adding filler tokens did not lead to any performance gains, signaling that extra test-time compute alone doesn’t account for the performance boost.
- In some cases, the additional filler tokens led to a drop in performance.
Paraphrasing
Last up, the researchers tested how rewording the Chain of Thought reasoning steps would alter the output.
If the model is faithfully following the reasoning steps, the answer should remain the same.
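Roughly, with hypothetical `query_model` and `paraphrase` helpers, the check looks like:

```python
# A sketch of the paraphrasing check: reword the chain, keep its content, re-ask.
from my_llm_client import query_model, paraphrase  # hypothetical helpers

def answer_survives_paraphrase(question: str, reasoning_steps: list[str], original_answer: str) -> bool:
    reworded = [paraphrase(step) for step in reasoning_steps]  # same content, different wording
    prompt = (
        f"{question}\n\n"
        + "\n".join(reworded)
        + "\n\nGiven the reasoning above, the final answer is:"
    )
    # True -> the answer is unchanged, so the exact phrasing wasn't doing the work
    return query_model(prompt).strip() == original_answer.strip()
```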
The accuracy of the paraphrased Chain of Thought closely matched the original across most tasks, indicating that the specific phrasing is not responsible for the performance gains provided by Chain of Thought reasoning.
How model size affects faithfulness
The researchers also tested whether the size of the model influenced how faithful it was to its reasoning.
They compared models of different sizes and measured how often the final answer changed when Chain of Thought reasoning was used versus when it wasn’t. This metric helped indicate how much the model relied on the Chain of Thought reasoning to make its predictions.
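That metric is simple to express; here is a sketch:

```python
# A sketch of the metric: the share of examples where the answer with
# Chain of Thought differs from the answer without it.
def answer_change_rate(answers_with_cot: list[str], answers_without_cot: list[str]) -> float:
    changed = sum(
        with_cot.strip() != without_cot.strip()
        for with_cot, without_cot in zip(answers_with_cot, answers_without_cot)
    )
    return changed / len(answers_with_cot)
```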
For six out of the eight tasks, the 13B model showed more faithful reasoning than smaller or larger models.
Larger models were more likely to give the same answer with or without Chain of Thought reasoning, implying that they may skip the reasoning process altogether because they can confidently predict the answer without needing reasoning guidance.
This is also known as the inverse scaling hypothesis. In short, larger models can be less faithful than smaller models because they are “smarter” and more confident, which can lead to less reliance on Chain of Thought reasoning.
This is potentially pretty worrying as models continue to grow 🙃.
Methods to extract reasoning from LLMs
Below are a few prompt engineering examples designed to improve reasoning:
- Least-to-Most Prompting: Breaks down complex problems into subquestions, building up to the final answer by chaining prompts (see the sketch after this list).
- Tree of Thoughts (ToT): Creates a decision tree to explore different reasoning paths before selecting the best answer.
- Multi-Persona Prompting: Leverages multiple personas to debate and reason, arriving at a final answer based on their collaboration.
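To make the chaining concrete, here is a minimal Least-to-Most sketch, again using a hypothetical `query_model` wrapper rather than any specific SDK:

```python
# A minimal Least-to-Most sketch: decompose, solve subquestions in order, then answer.
from my_llm_client import query_model  # hypothetical wrapper around your LLM of choice

question = "A store sells pens in packs of 12 for $3. How much do 60 pens cost?"

# 1. Ask the model to break the problem into simpler subquestions
subquestions = query_model(
    "List, one per line, the simplest subquestions needed to solve this problem:\n" + question
).splitlines()

# 2. Answer each subquestion, feeding earlier answers into later prompts
context = question
for sub in subquestions:
    answer = query_model(f"{context}\n\nQ: {sub}\nA:")
    context += f"\nQ: {sub}\nA: {answer}"

# 3. Use the accumulated question/answer context to produce the final answer
print(query_model(f"{context}\n\nFinal answer:"))
```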
Methods to increase reasoning faithfulness
So what can be done to combat unfaithful Chain of Thought reasoning? One answer is a prompt engineering method called Faithful Chain of Thought prompting.
Faithful Chain of Thought Reasoning
To address unfaithful reasoning in Chain of Thought prompting, researchers from UPenn created Faithful Chain of Thought prompting. We’ve touched on this before in our Chain of Thought prompting guide, but we’ll dive deeper into what Faithful Chain of Thought is, how it works, and its potential benefits.
What is Faithful Chain of Thought Reasoning?
Faithful Chain of Thought prompting is a prompt engineering method designed to ensure that an LLM faithfully follows and uses the reasoning chains it generates to arrive at its final answer.
How Faithful Chain of Thought Reasoning works
Faithful Chain of Thought ensures that the reasoning chain aligns with the final answer via a two-step process (sketched in code below):
- Translate the natural language query: Rather than rely solely on natural language reasoning, Faithful Chain of Thought converts the question or task into a symbolic format, such as generating a program in Python.
- Use a deterministic solver: Rather than having an LLM follow the reasoning chain generated in step 1, a deterministic solver is used. This guarantees that the reasoning chain is directly responsible for the result and is therefore faithful.
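Here is a minimal sketch of that two-step pipeline, using Python as the symbolic format and the Python interpreter as the deterministic solver; `query_model` is a hypothetical wrapper, not the authors’ actual implementation:

```python
# A minimal Faithful Chain of Thought sketch: the model writes its reasoning as a small
# Python program, and the interpreter (a deterministic solver) produces the answer.
from my_llm_client import query_model  # hypothetical wrapper around your LLM of choice

question = (
    "Olivia has $23. She buys five bagels for $3 each. How much money does she have left?"
)

# Step 1: translate the natural-language question into symbolic form (a Python program)
program = query_model(
    "Translate this word problem into a short Python program that stores the result "
    "in a variable named `answer`. Return only code.\n\n" + question
)

# Step 2: run the program deterministically, so the generated reasoning chain is,
# by construction, exactly what produces the final answer
namespace: dict = {}
exec(program, namespace)  # caution: only execute model-generated code in a sandbox
print(namespace["answer"])
```

Because the final answer comes from executing the generated program rather than from another pass of free-form generation, the reasoning chain cannot be post-hoc: if the program is wrong, the answer is wrong, and you can see exactly why.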
Faithful Chain of Thought Reasoning prompt template
Below is a Faithful Chain of Thought prompt template that you can access in PromptHub.
Benefits of Faithful Chain of Thought Reasoning
- Improved Accuracy: Faithful Chain of Thought enhances both the transparency of the reasoning process and overall model accuracy (see experiment results below)
- Trust and Reliability: In high-stakes fields such as medicine, law, and finance, Faithful Chain of Thought ensures users can trust that the model’s reasoning genuinely reflects its decision-making process.
Conclusion
As LLMs like OpenAI's recently launched o1-preview ship with built-in reasoning capabilities and keep getting better at complex reasoning, the question of faithfulness becomes increasingly important.
Faithful Chain of Thought prompting and Program of Thought prompting are prompt engineering methods that can potentially ensure that generated reasoning directly informs the final answer, boosting faithfulness and transparency.
Understanding how and why models generate their outputs will become increasingly important as we start to rely on them more and more.