The timing of OpenAI’s latest release of o1-preview and o1-mini aligned perfectly with our recent blog posts about prompt chaining and Chain of Thought prompting. The major update in o1-preview is the reasoning the model performs before generating an output.

These reasoning tokens, as OpenAI calls them, even come with a new request parameter that caps how many tokens the model can spend thinking before it produces its final answer.
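If you want to see this in practice, here is a minimal sketch of a request to o1-preview, assuming the OpenAI Python SDK (v1.x); the exact parameter and usage-field names may differ across SDK versions.

```python
# Minimal sketch (assumes the OpenAI Python SDK v1.x and an o1-class model).
# max_completion_tokens caps the total output budget, which includes the
# hidden reasoning tokens; the usage object reports how many tokens went to
# reasoning. Field names may vary by SDK version.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user",
               "content": "A train travels 60 miles in 1.5 hours. What is its average speed?"}],
    max_completion_tokens=2000,  # budget shared by reasoning + visible answer
)

print(response.choices[0].message.content)
print("Reasoning tokens used:",
      response.usage.completion_tokens_details.reasoning_tokens)
```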

Anyone who is familiar with Chain of Thought prompting will see that this is exactly what the latest models are doing, just by default.

But, as with regular Chain of Thought prompting, there are some issues, the main one being faithfulness. Sometimes the reasoning chains generated by LLMs don’t actually reflect how the model arrived at its answer, which can lead to explanations that sound correct but are in fact misleading. The reasoning chains can even reference entities completely unrelated to the task, aka hallucinations.

This issue existed before o1-preview, and users have reported it with the latest model as well. See this article where a user ran a variety of experiments and noted: “On examination, around about half the runs included either a hallucination or spurious tokens in the summary of the chain-of-thought.”

We’re going to dive into two papers that tested the limitations of Chain of Thought prompting and how a prompt engineering method called Faithful Chain of Thought can help combat this.

Potential ways Chain of Thought reasoning can be unfaithful

The first paper we’ll dive into is from Anthropic, titled “Measuring Faithfulness in Chain-of-Thought Reasoning”.  The paper tested various ways in which LLMs may fail to faithfully follow their reasoning steps. The researchers tested several specific scenarios, each illustrating a potential way Chain of Thought reasoning can be unfaithful:

  • Post-hoc reasoning: Situations in which the model generates reasoning after it has already decided what its output is going to be.
  • Introducing mistakes: What happens when mistakes are inserted into the reasoning chain?
  • Paraphrasing: Does rewording the reasoning chain affect output consistency?
  • Filler tokens: Is the model’s performance boost driven by the reasoning itself, or by the extra computation time? What happens when the Chain of Thought reasoning steps are replaced with filler tokens?

An image showing various versions of chain of thought faithfulness tests
A snapshot of the various tests used to measure Chain of Thought faithfulness

Hey everyone, how's it going? Dan here, co-founder of PromptHub. Hope everyone is having a great Friday. It’s a little late here in New York but just in time for another video.

Obviously, OpenAI recently came out with a new model that has really leaned into reasoning. If you use it in the ChatGPT interface or look at the API, you’ll see there are even new “reasoning tokens” that the model generates while it’s thinking to itself.

But a question arises around the faithfulness of those reasoning steps—how much you can trust them and how much they align with the final answer. There are a number of reasons why this is really important, and that’s what we’ll be looking at today.

Of course, we understand this is a preview of a new model, and it's great—it does really well on certain tasks. But I would suggest you check out this blog post, which is linked below, from the site LLM Mindset in the UK. They ran a bunch of tasks using the new models, and what they found was that around half of the runs included a hallucination or strange tokens in the Chain of Thought. In some cases, there were just wrong answers. For example, Alice gets a 10, which is correct, but Bob gets a 2, and the output is actually a 3.

This isn’t a new concept in terms of the faithfulness of Chain of Thought reasoning steps and how they can differ from what the model outputs as a final answer. This was explored in a paper by Anthropic back in July 2023. They tested how Chain of Thought reasoning could be unfaithful through several methods:

  • Early answering, where they truncated the Chain of Thought steps and tested whether the model still gave the same answer.
  • Adding mistakes, where they inserted mistakes into the reasoning steps to see if it returned the same answer.
  • Paraphrasing, where they paraphrased the reasoning steps.
  • Filler tokens, where they replaced the reasoning steps with irrelevant filler tokens to see if performance changed.

For early answering, they truncated the reasoning chains and checked if the model still arrived at the same answer. If it did, the reasoning was more likely to be unfaithful. If it gave a different answer, it was more faithful.

The results were mixed. In some cases, the model was quite unfaithful, but it still gained performance benefits from the Chain of Thought prompting, which is an interesting takeaway.

For adding mistakes, they inserted incorrect steps into the Chain of Thought and found similar results: the model sometimes arrived at the same answer despite the mistake, which indicates it was not following the reasoning. However, you could argue that the model was smart enough to ignore the mistake, but it still highlights the issue of unfaithful reasoning steps.

With filler tokens, they tested if the performance boost came from extra time to compute or the reasoning steps themselves. If it were just about time, the filler token version should have performed better than no Chain of Thought at all. It didn’t, which showed that the actual reasoning steps matter.

For paraphrasing, the accuracy stayed consistent, meaning the specific wording of the reasoning steps didn’t drive performance, but the gist of the reasoning did.

One final interesting result was that for 6 out of 8 tasks, the 13-billion-parameter model was the most faithful. Larger models, being more confident, often skipped the reasoning steps, a phenomenon known as the inverse scaling hypothesis. Larger models tend to rely less on reasoning since they’ve seen more data, but this can lead to overconfidence, which can be problematic.

So, what can we do? There are a few prompt engineering methods to help extract faithful reasoning. Least-to-most prompting breaks down tasks into subtasks and runs them sequentially via prompt chaining. Tree of Thoughts lets you explore multiple reasoning paths and pick the best one. Multi-persona prompting involves having personas debate to show more reasoning.

Faithful Chain of Thought reasoning, introduced by researchers at UPenn, tackles the problem of reasoning steps that don’t align with the final answer. This method converts natural language into symbolic format (like Python code) and uses a deterministic solver to derive the answer, ensuring alignment between reasoning and output.

That’s it for today! I hope this helps. Have a great weekend, and we'll talk soon.

Chain of Thought reasoning experiments

The researchers selected eight multiple-choice tasks to evaluate the faithfulness of Chain of Thought reasoning. For each task, they sampled 100 reasoning chains to analyze.

Post-hoc reasoning

The goal of this experiment was to test whether the reasoning chain provided by the model genuinely guided it to the final answer or if the reasoning was post-hoc.

The researchers truncated the reasoning chain halfway through and examined whether the model still arrived at the same final answer. If the final answer changed when the reasoning was truncated, it indicates that the reasoning was essential and not post-hoc. On the other hand, if the model consistently arrived at the same answer despite the truncated reasoning, it would suggest that the reasoning was post-hoc and not integral to the decision-making process.
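To make the setup concrete, here is a minimal sketch of that truncation test; the `ask_model` helper and prompt format are illustrative placeholders, not the paper's actual code.

```python
# Sketch of the early-answering (post-hoc) test, not the paper's implementation.
# ask_model() stands in for whatever chat-completion call you use; it should
# return the model's final multiple-choice answer as a string.

def truncate_chain(reasoning_steps, keep_fraction=0.5):
    """Keep only the first half of the reasoning steps."""
    cutoff = max(1, int(len(reasoning_steps) * keep_fraction))
    return reasoning_steps[:cutoff]

def early_answering_test(question, reasoning_steps, ask_model):
    full_prompt = question + "\n" + "\n".join(reasoning_steps) + "\nAnswer:"
    truncated_prompt = (
        question + "\n" + "\n".join(truncate_chain(reasoning_steps)) + "\nAnswer:"
    )
    full_answer = ask_model(full_prompt)
    truncated_answer = ask_model(truncated_prompt)
    # If the answer survives truncation, the chain looks post-hoc (unfaithful);
    # if it changes, the model was actually relying on the later steps.
    return full_answer == truncated_answer
```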

Below are the results from running these tests across the eight datasets.

4 graphs measuring the percentage of same answer as the complete chain of thought reasoning compared to when the chain was arbitrarily cut

  • Overall, the amount of post-hoc reasoning varied significantly across the datasets
  • For ARC (Easy), ARC (Challenge), and OpenbookQA, truncating the reasoning changed the final answer less than 10% of the time (unfaithful)
  • For the AQuA dataset, truncating the reasoning led to changes in the final answer more than 60% of the time (faithful)
  • The results highlight that post-hoc reasoning is highly task-dependent.
  • Even in tasks where Chain of Thought prompting was largely post-hoc (unfaithful), the model still achieved performance improvements. This implies that even when the model isn’t faithful to the reasoning steps, reasoning can still improve performance

Adding mistakes to the Chain of Thought reasoning steps

The next test of the faithfulness of Chain of Thought reasoning involved introducing mistakes into the reasoning chain to see how it affected the final answer.

Specifically, the researchers introduced a mistake in one step of the reasoning chain, then allowed the LLM to continue generating from that point.

An example of how a mistake was added to the flow
The added mistake is underlined.

If the model’s final answer changes after the mistake is introduced, this indicates that the model is genuinely relying on the Chain of Thought reasoning steps to generate its outcome. But if the final answer does not change, it suggests that the Chain of Thought reasoning steps didn’t significantly influence the outcome.
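Here is a rough sketch of how that perturbation can be set up; the helper below is illustrative and simply swaps in a pre-written incorrect step, whereas the paper uses a language model to generate the mistake.

```python
import random

# Sketch of the adding-mistakes perturbation (illustrative, not the paper's code).
# One reasoning step is replaced with a deliberately wrong version, and the
# model then continues generating from that point onward.

def insert_mistake(reasoning_steps, corrupted_step, position=None):
    """Return the reasoning prefix up to and including the corrupted step."""
    steps = list(reasoning_steps)
    if position is None:
        position = random.randrange(len(steps))
    steps[position] = corrupted_step
    # Everything after the mistake is regenerated by the model, so only the
    # perturbed prefix is fed back in.
    return steps[: position + 1]
```

The perturbed prefix is fed back to the model, which regenerates the rest of the chain and the final answer; a changed answer counts as evidence that the model is actually following its reasoning.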

4 graphs measuring the percentage of same answer as the complete chain of thought reasoning compared to when a mistake was inserted into the reasoning chain

  • Across tasks, the results of adding mistakes were similar to the results from the previous tests on post-hoc reasoning.
  • For some tasks, like AQuA and LogiQA, inserting a mistake into the Chain of Thought reasoning often caused the model to change its final answer, indicating that the model followed the reasoning chain more faithfully.
  • On the other hand, tasks such as ARC (Easy) and ARC (Challenge) showed fewer changes in final answers after introducing mistakes, suggesting that the models relied less on the Chain of Thought reasoning for those tasks.

Adding filler tokens

The next hypothesis to test was whether the additional test-time compute provided by Chain of Thought reasoning was responsible for the performance boosts, rather than the actual content of the reasoning steps themselves.

To test this, the researchers replaced the Chain of Thought reasoning steps with “…” tokens.
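A minimal sketch of this control, assuming the filler only needs to be roughly length-matched to the original reasoning:

```python
# Sketch of the filler-token control (illustrative). The reasoning steps are
# replaced by "..." tokens of roughly the same length, so the model gets the
# same amount of extra context/compute but none of the reasoning content.

def filler_prompt(question, reasoning_steps):
    filler_length = sum(len(step.split()) for step in reasoning_steps)
    filler = " ".join(["..."] * filler_length)
    return f"{question}\n{filler}\nAnswer:"
```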

A graph representing the accuracies on different datasets when filler tokens replace the chain of thought reasoning

  • Adding filler tokens did not lead to any performance gains, signaling that extra test-time compute alone doesn’t account for the performance boost
  • In some cases, the additional filler tokens led to a drop in performance

Paraphrasing

Last up, the researchers tested how rewording the Chain of Thought reasoning steps would alter the output.

If the model is faithfully following the reasoning steps, the answer should remain the same.
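A small sketch of the comparison; `paraphrase` and `ask_model` stand in for LLM calls and are not the paper's implementation.

```python
# Sketch of the paraphrasing test (illustrative). The reasoning steps are
# reworded by another model call and the final answer is compared against
# the answer produced with the original wording.

def paraphrase_test(question, reasoning_steps, paraphrase, ask_model):
    original = ask_model(question + "\n" + "\n".join(reasoning_steps) + "\nAnswer:")
    reworded = [paraphrase(step) for step in reasoning_steps]
    rephrased = ask_model(question + "\n" + "\n".join(reworded) + "\nAnswer:")
    return original == rephrased  # True -> the specific wording isn't doing the work
```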

Four graphs showing the performance of a variety of datasets when the chain of thought reasoning chains were paraphrased

The accuracy of the paraphrased Chain of Thought closely matched the original across most tasks, indicating that the specific phrasing is not responsible for the performance gains provided by Chain of Thought reasoning.

How model size affects faithfulness

The researchers also tested whether the size of the model influenced how faithful it was to its reasoning.

They compared models of different sizes and measured how often the final answer changed when Chain of Thought reasoning was used versus when it wasn’t. This metric indicates how much the model relies on the Chain of Thought reasoning to make its predictions.
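The underlying metric is simple; here is an illustrative sketch of it, with `ask_with_cot` and `ask_without_cot` as placeholder model calls.

```python
# Sketch of the faithfulness metric used in the scaling comparison: how often
# a model gives the same answer with and without Chain of Thought. A high
# same-answer rate means the reasoning is doing little work (less faithful).

def same_answer_rate(questions, ask_with_cot, ask_without_cot):
    same = sum(ask_with_cot(q) == ask_without_cot(q) for q in questions)
    return same / len(questions)
```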

Two graphs comparing the size of the model and the percentage of the time it gives the same answer with and without chain of thought reasoning
The chance of giving the same answer with and without CoT reasoning compared to model size

For six out of the eight tasks, the 13B model showed more faithful reasoning than smaller or larger models.

Larger models were more likely to give the same answer with or without Chain of Thought reasoning, implying that they may skip the reasoning process altogether because they can confidently predict the answer without needing reasoning guidance.

This is also known as the inverse scaling hypothesis. In short, larger models are less faithful than smaller models because they are “smarter” and more confident, which can lead to less reliance on Chain of Thought reasoning.

This is potentially pretty worrying as models continue to grow 🙃.

Methods to extract reasoning from LLMs

Below are a few prompt engineering methods designed to elicit more explicit reasoning:

  • Least-to-Most Prompting: Breaks down complex problems into subquestions, building up to the final answer via chaining prompts.
  • Tree of Thoughts (ToT): Creates a decision tree to explore different reasoning paths before selecting the best answer.
  • Multi-Persona Prompting:  Leverages multiple personas to debate and reason, arriving at a final answer based on their collaboration.
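As a concrete example of the first method, here is a minimal Least-to-Most chain; the prompts and the `ask_model` helper are illustrative placeholders rather than a fixed recipe.

```python
# Minimal Least-to-Most prompting sketch (illustrative; ask_model is a
# placeholder for your chat-completion call).

def least_to_most(question, ask_model):
    # Step 1: decompose the problem into simpler subquestions.
    decomposition = ask_model(
        "Break the following problem into a numbered list of simpler "
        f"subquestions, easiest first:\n{question}"
    )
    subquestions = [line for line in decomposition.splitlines() if line.strip()]

    # Step 2: answer each subquestion in order, feeding earlier answers forward.
    context = question
    for sub in subquestions:
        answer = ask_model(f"{context}\n\nSubquestion: {sub}\nAnswer concisely:")
        context += f"\n{sub}\n{answer}"

    # Step 3: use the accumulated chain to produce the final answer.
    return ask_model(f"{context}\n\nNow give the final answer to the original question.")
```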

Methods to increase reasoning faithfulness

So what can be done to combat unfaithful Chain of Thought reasoning? One prompt engineering method designed to address exactly this problem is Faithful Chain of Thought prompting.

Faithful Chain of Thought Reasoning

To address unfaithful reasoning in Chain of Thought prompting, researchers from UPenn created Faithful Chain of Thought prompting. We’ve touched on this before in our Chain of Thought prompting guide, but we’ll dive deeper into what Faithful Chain of Thought is, how it works, and its potential benefits.

What is Faithful Chain of Thought Reasoning?

Faithful Chain of Thought prompting is a prompt engineering method designed to ensure that an LLM faithfully follows and uses the reasoning chains it generates to arrive at its final answer.

An example where the chain of thought reasoning and the final answer from the model diverge
The Chain of Thought reasoning (blue) arrives at a different answer than the final answer (green)

How Faithful Chain of Thought Reasoning works

Faithful Chain of Thought ensures that the reasoning chain aligns with the final answer via a two-step process:

  1. Translate the natural language query: Rather than rely solely on natural language reasoning, Faithful Chain of Thought converts the question or task into a symbolic format, such as generating a program in Python.
  2. Use a deterministic solver: Rather than having an LLM follow the reasoning chain generated in step 1, a deterministic solver is used. This guarantees that the reasoning chain is directly responsible for the result and is therefore faithful.
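Here is a minimal sketch of that two-step flow, assuming the symbolic format is plain Python and the deterministic solver is simply the Python interpreter; the prompt and `ask_model` helper are illustrative, not the authors' implementation.

```python
# Sketch of Faithful Chain of Thought's two stages (illustrative).
# Stage 1: an LLM translates the question into executable Python.
# Stage 2: a deterministic interpreter runs that code, so the final answer
# is produced directly by the reasoning chain itself.

def faithful_cot(question, ask_model):
    # Stage 1: natural language -> symbolic reasoning chain (Python code that
    # stores its result in a variable named `answer`). Assumes the model
    # returns plain Python with no surrounding prose.
    code = ask_model(
        "Translate the problem into Python code that computes the result "
        f"and assigns it to a variable named `answer`:\n{question}"
    )

    # Stage 2: deterministic solver. Here we simply execute the generated code
    # in an isolated namespace; a production system would sandbox this.
    namespace = {}
    exec(code, namespace)
    return namespace["answer"]
```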

A few examples of the 2 step process for faithful chain of thought reasoning

Faithful Chain of Thought Reasoning prompt template

Below is a Faithful Chain of Thought prompt template that you can access in PromptHub.

Faithful chain of thought prompt template in PromptHub

Benefits of Faithful Chain of Thought Reasoning

  • Improved Accuracy: Faithful Chain of Thought enhances both the transparency of the reasoning process and overall model accuracy (see experiment results below)
  • Trust and Reliability: In high-stakes fields such as medicine, law, and finance, Faithful Chain of Thought ensures users can trust that the model’s reasoning genuinely reflects its decision-making process.

A table of experiment results comparing Faithful chain of thought against a few other prompt engineering methods
Accuracy of a few prompt engineering methods across 10 reasoning datasets.

Conclusion

As LLMs like OpenAI's recently launched o1-preview lean further into built-in reasoning for complex tasks, the question of faithfulness becomes increasingly important.

Faithful Chain of Thought prompting and Program of Thought prompting are prompt engineering methods that can potentially ensure that generated reasoning directly informs the final answer, boosting faithfulness and transparency.

Understanding how and why models generate their outputs will become increasingly important as we start to rely on them more and more.

Dan Cleary
Founder