Table of Contents

Chain of Thought prompting is by far one of the most well-known, and effective prompt engineering methods. I’ve been a bit intimidated to write about it, partly because of its importance, and partly because Chain of Thought prompting can be a little complex. Unlike other prompt engineering methods, there isn’t a single Chain of Thought prompt template.

Rather, there are various ways to implement Chain of Thought prompting. From adding a simple line of text like “think this through step-by-step” to setting up more elaborate prompts using Chain of Thought prompting combined with Self-Consistency prompting.

In this guide, we’re going to dive deep into what Chain of Thought prompting is all about. We’ll look at the original paper that introduced the concept and explore different ways to implement it, including newer methods like Auto Chain of Thought (Auto-CoT). We’ll include a ton of examples and templates that you can plug and play to get a better understanding of what Chain of Thought prompting actually is, and how you can use it to get better outputs from LLMs.

What is Chain of Thought prompting

Chain of Thought prompting is a prompt engineering method that enhances the reasoning capabilities of large language models (LLMs) by encouraging them to break down their reasoning into a series of intermediate steps. In addition to providing an answer, Chain of Thought prompting requires the model to explain how it arrived at that final answer, offering more transparency and potentially improving accuracy.

At its core, Chain of Thought prompting encourages the model to think through the problem in a step-by-step manner, which is supposed to mimic how humans break down complex problems.

The example below demonstrates a simple Chain of Thought prompting example. Both prompts use few-shot prompting, but the difference is on the left side, the answer is just given, without any reasoning steps ("The answer is 11."). On the right side, the answer is given along with the reasoning steps (highlighted in blue).

2 flows side by side of standard prompting and chain of thought prompting

Before diving into examples, we should probably directly answer the question of why you should care about Chain of Thought prompting and how it can help you.

Hey everyone, how's it going? Dan here, co-founder of PromptHub, and today we're going to talk about one of the more well-known, if not the most well-known, prompt engineering methods called Chain of Thought prompting. We’ll go over what it is, how it helps, look at a bunch of examples since there are many different ways to implement this method. Specifically, we’ll look at how to automate this type of prompt engineering method, how it differs from few-shot prompting, where the limitations are, and then we’ll wrap up.

The basis of a lot of this information comes from a paper out of Google back in 2022 called "Chain of Thought Prompting: Reasoning in LLMs," which will be linked below.

To start off, what is Chain of Thought prompting? Essentially, it's a prompting method that enhances the reasoning capabilities of LLMs by encouraging them to break down their reasoning and actually show their reasoning in their output—breaking down complex tasks into smaller pieces to solve. There are a lot of ways to implement this, and we’ll take a look at some examples.

Here is a classic example pulled directly from the paper. On the left, this is not Chain of Thought prompting, but few-shot prompting. It sends one example and then asks the question. The answer just says “the answer,” but on the other side, it runs through the reasoning: "Rapt started with five balls, then this happened, then this happened, so the answer is 11." That’s the difference—the blue highlighted text is that reasoning being shown.

Why is it helpful? Breaking down complex problems into smaller, more manageable subtasks is always helpful and can give you an insight into how the model is reasoning. Even when you push it to reason, those reasoning chains aren’t always faithful or correct, so this will give you an idea into how the model is coming to an answer. It’s widely applicable—basically anything that requires reasoning or thinking is a good use case for Chain of Thought, and it’s relatively straightforward to implement. There are a lot of different ways to implement it, including some easy ways.

Let’s look at some examples. Zero-shot Chain of Thought is maybe the most common. When you see people say “let’s think step by step,” that is just one form of Chain of Thought prompting, and it’s zero-shot because you’re just including it in the prompt. You send the question, and for the answer, you prepend “let’s think step by step,” prompting the model to generate reasoning steps. This text can take different shapes. A paper tested various types of prompt text, and they found that “take a deep breath and work through this step by step” worked well for some Google models. So, there are little variants you can try, and we have a small template in PromptHub, which will be linked below.

Next is few-shot Chain of Thought. This is pulled from the automatic Chain of Thought paper, which will also be linked below. What they found is that their automated method beats manual Chain of Thought, and few-shot prompting beats zero-shot Chain of Thought. Few-shot includes examples in your prompt of how reasoning steps should look. Everything highlighted here shows reasoning steps being sent to the model to demonstrate: "Here’s a question, here’s an answer, and here’s the reasoning."

Another method is to leverage Chain of Thought with self-consistency. Self-consistency prompting generates multiple outputs and then selects the most consistent one. You can combine this with Chain of Thought or with basically any prompting method.

Next up is not a direct example but more of a variant: step-back prompting. We’ve talked about this before on our blog. First, it tells the model to abstract key concepts and principles before diving in and solving the question, which encourages broader thinking.

Analogical prompting is similar to automatic Chain of Thought prompting. It tries to generate Chain of Thought examples (like the few-shot examples) by identifying concepts, recalling relevant problems, and generating solutions. This is all in the output versus in the input examples, so while it's more expensive, it might lead to better outputs.

Thread of Thought is another cousin of Chain of Thought and is designed to keep a coherent line of thought across a large context, especially in long dialogues or with retrieval-augmented generation (RAG). It prompts the model to walk through context step by step, summarizing and analyzing as it goes.

Contrastive Chain of Thought prompting shows a question with a correct explanation and a wrong explanation, which helps the model understand what not to do, instead of just saying "don’t do X, Y, or Z."

Faithful Chain of Thought prompting ensures that the reasoning and the final answer align. Sometimes the reasoning process diverges from the final answer, even if the answer is correct. This can lead to problems in other contexts. Faithful Chain of Thought tries to avoid this by translating the query into a symbolic reasoning chain and using a deterministic solver to derive the final answer.

Finally, tabular Chain of Thought uses markdown to output reasoning steps in a table format, keeping things more organized.

The last method is automatic Chain of Thought prompting, which helps overcome the problem of not wanting to manually write out examples for few-shot Chain of Thought. It clusters examples based on similarity and samples from those clusters to ensure diversity in examples. This method uses a zero-shot prompt to take questions from the dataset and generate reasoning chains, making it automatic.

In terms of limitations, the original paper found that performance gains from Chain of Thought occurred once the models were very large, around 100 billion parameters. Smaller models produced reasoning chains that seemed coherent but were actually wrong, leading to worse performance than standard prompting.

Faithfulness and reliability are also challenges, as LLMs can generate reasoning chains that look good but may diverge from the final answer. Lastly, implementing these methods can take time and effort, though methods like analogical prompting and automatic Chain of Thought can help with this.

That’s it for today. A little bit of a long one, but we have a bunch of free resources and links to the papers below. Thanks!

Why Chain of Thought prompting is helpful

Chain of Thought prompting provides four major benefits:

  1. Breaks down complex problems: Chain of Thought prompting enables LLMs to decompose complex problems into a series of intermediate steps. This step-by-step approach, in theory, allows the model to allocate more attention to each part of the problem, leading to more accurate reasoning.
  2. A glimpse into the model’s thought process: By seeing the reasoning steps that the model undertakes, users can better understand the model and debug if/when the reasoning paths go wrong.
  3. Widely applicable: Chain of Thought prompting has been successfully tested across a large and diverse set of tasks. It’s versatile enough to be applied to a variety of tasks that require any sort of reasoning.
  4. Easy implementation: While there is a wide range of ways to implement Chain of Thought prompting, there are a lot of very simple ways to do so.

Chain of Thought prompting examples

As mentioned before, Chain of Thought prompting is an extremely versatile prompt engineering method. It can be adapted in various ways, and several variants of this concept have been developed.

We’ll start with some of the more basic examples.

Zero-shot Chain of Thought example

The simplest way to implement Chain of Thought prompting is to include language that instructs the model to reason. The most popular version of this is adding the phrase "Let’s think step-by-step."

Other suggested and thoroughly tested thought-generating phrases include:

  • "Let’s work this out in a step-by-step way to be sure we have the right answer."
  • "First, let’s think about this logically."

Below is an example of zero-shot Chain of Thought prompting. No examples are used to demonstrate reasoning steps; only the reasoning phrase is added.

An example flow for a zero-shot Chain of Thought prompt

Below is a template in PromptHub you can use as well

Zero-shot Chain of Thought prompt template in PromptHub

Few-Shot Chain of Thought Example

Few-shot Chain of Thought prompting is when you provide the model with a few examples of reasoning steps in the prompt. The example reasoning steps included should be related to the problem you are having the model solve.

Few-shot Chain of Thought generally outperforms zero-shot Chain of Thought (see table below). Adding demonstrations can increase accuracy by up to 28.2% in some tasks.

A table of results comparing zero-shot Chain of Thought, manual Chain of Thought, and Auto Chain of Thought
Credit: Automatic Chain of Thought Prompting in Large Language Models

Want nine examples? See below.

The highlighted text shows the few-shot reasoning examples.

A grid of 9 examples of few-shot Chain of Thought prompts
Credit: Chain of Thought Prompting Elicits Reasoning in Large Language Models

Here’s another example for math word problems:

Plus a template in PromptHub:

Few-shot Chain of Thought prompt template in PromptHub

Chain of Thought with Self-Consistency

If you’re not familiar with Self-Consistency prompting, you can check out our guide: Self-Consistency and Universal Self-Consistency Prompting.

Chain of Thought with Self-Consistency prompting takes a Chain of Thought prompt (could be zero-shot or few-shot) and runs it multiple times to generate various outputs. A Self-Consistency prompt is then used to select the most consistent answer.

This helps mitigate one-off reasoning errors and increases the reliability of the output.

Self-Consistency prompt template in PromptHub

Step-Back prompting

Step-Back prompting follows a similar pattern to Chain of Thought in that it prompts the model to generate high-level information about relevant concepts and facts before diving into the task.

Step-Back Prompting prompt template in PromptHub

Analogical prompting

Analogical prompting is one of my personal favorite prompt engineering methods. It is very similar to Auto-Chain of Thought prompting but doesn’t require a dataset of examples with various clusters to choose from (more on this later). Instead, it relies on the model to generate relevant and distinct examples and explanations before solving the problem at hand.

Analogical Prompting prompt template in PromptHub

Thread of Thought

Thread of Thought (ThoT) builds on Chain of Thought prompting with an improved thought inducer that encourages the model to maintain a coherent line of thought across multiple turns.

This method is particularly helpful in longer question-and-answer settings and when using retrieval-augmented generation with large contexts. Other exampl use cases might involve dialogues or storytelling.

Rather than "let’s think step-by-step," the typical phrasing for Thread of Thought is "Walk me through this context in manageable parts, step by step, summarizing and analyzing as we go."

Thread-of-Thought prompt template in PromptHub

Contrastive Chain of Thought prompting

Contrastive Chain of Thought prompting is very interesting as it builds on the key principle of few-shot prompting: examples should be diverse. Contrastive Chain of Thought prompting adds both correct and incorrect examples to the Chain of Thought prompt to teach (show) the model how not to reason by demonstrating faulty logic next to correct reasoning.

Contrastive Chain of Thought Prompting template in PromptHub

Faithful Chain of Thought prompting

Faithful Chain of Thought prompting was designed to ensure that the reasoning chains generated accurately reflect the model’s actual process of arriving at an answer. The idea is that the generated reasoning in Chain of Thought prompts doesn’t always align with how the model actually computes the answer, leading to issues with faithfulness. For more info about when and why models aren't faithful to their reasoning chains—and why it matters—read our full article on Faithful Chain of Thought prompting.

An input and Chain of Thought output where the reasoning and final answer are not aligned
An example where the answer (in green) isn’t derived from the chain of reasoning (blue).

Faithful Chain of Thought addresses this by using both natural language and symbolic reasoning (e.g., Python code) to arrive at a final answer. It’s a two-step process:

  1. Translate the natural language query into a symbolic reasoning chain.
  2. Use a deterministic solver to derive the final answer.

Faithful Chain of Thought Prompting template in PromptHub

Four examples of different queries going through the two-step Faithful Chain of Thought process

Tabular Chain of Thought (Tabular CoT)

Tabular Chain of Thought prompting is similar to Thread-of-Thought prompting in that it uses a different format for the reasoning step. Specifically, it employs a zero-shot Chain of Thought prompt that instructs the model to provide its reasoning in a structured format, typically using markdown tables.

This structured approach helps to improve the clarity and organization of the model's output, making the reasoning process easier to follow.

Tabular Chain of Thought Prompting template in PromptHub

Chain of Thought prompting with reasoning models

Newer models like o1-preview and o1-mini from OpenAI have incorporated chain of thought prompting automatically via inference-time reasoning tokens. The way you prompt these models to reason is very different from non-reasoning models like gpt-4o.

Here are the most important things to know about chain of thought prompting with reasoning models. For more info, check out our full guide: Prompt Engineering with Reasoning Models.

Effectiveness of CoT Prompting:

Adding CoT prompts ("think step-by-step") for simple tasks can sometimes reduce performance by overcomplicating the reasoning process.

Below are the results from a code summarization task from the paper, Do Advanced Language Models Eliminate the Need for Prompt Engineering in Software Engineering?

table of results showing performance for GPT-4o and o1-mini

As you can see, the addition of CoT for o1-mini didn't change much.

Correlation between reasoning steps and performance:

Research has shown that increasing the number of reasoning steps for challenging tasks can improve performance in reasoning models. Explicitly instructing the model to "spend more time reasoning" correlates with higher accuracy and better outputs. This aligns with OpenAI’s guidance that more reasoning tokens often lead to better results.

“We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute”

For example, in a recent paper, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond the researchers tested two prompts templates:

  1. A prompt that instructs the model to respond with less reasoning tokens.
  2. A prompt that instructs the model to respond with more reasoning  tokens.

They found that as the number of reasoning steps increased, so did the accuracy.

line chart showing performance of quick prompt versus reasoning prompt

Here are the two templates they tested:


Task complexity matters:

Also from Do Advanced Language Models Eliminate the Need for Prompt Engineering in Software Engineering? the researchers found that for tasks requiring five or more reasoning steps, reasoning models like o1-preview greatly outperform their non-reasoning counterparts.

On the other side, for simpler tasks with fewer reasoning steps, non-reasoning models like GPT-4o, which can deliver concise and accurate results without unnecessary reasoning, performed better.

Here are the numbers, in relation to code generation tasks:

  • For CoT tasks with five or more steps, o1-mini outperforms GPT-4o by 16.67%.
  • For CoT tasks under five steps, o1-mini's advantage shrinks to 2.89%.
  • For simpler tasks with fewer than three steps, o1-mini underperformed GPT-4o in 24% of cases due to excessive reasoning.

  • Output consistency challenges:

    Reasoning models can sometimes include reasoning "leakage" in their outputs, where the reasoning tokens appears in the response. This is something to keep an eye out for when testing.

    For tasks requiring concise and structured outputs, such as code generation, this can require the need for post-processing, which can be annoying. Non-reasoning models, by contrast, are less prone to this issue.


    Automatic Chain of Thought prompting

    This method variant is important enough to require its own section!

    A key component of few-shot Chain of Thought prompting are the reasoning demonstrations sent along with the prompt.

    Automatic Chain-Thought (Auto-CoT) was designed to overcome the limitation of manually creating demonstrations for the Chain of Thought prompt. It generates the reasoning demonstrations automatically, eliminating the work needed to manually write them.

    The key idea behind Auto-Chain of Thought prompting is that diversity in the questions and reasoning demonstrations used is critical. Using diverse examples is a general best practice in few-shot prompting, as it ensures the model is prepared for a variety of new problems.

    Auto-Chain of Thought prompting first selects diverse examples from a dataset and then generates reasoning chains for the demonstrations.

    How Auto-Chain of Thought prompting Works

    Auto-CoT has 2 main steps:

    1. Question clustering: Questions from a given dataset of potential options are divided into clusters. Clustering the examples helps mitigate the risk of providing examples that are too similar, which could lead the model to overfit its output.
    2. Demonstration Sampling: A question from each cluster is chosen and a reasoning chain is generated using a zero-shot-Chain of Thought prompt.

    Here’s what the 2-step flow looks like:

    Flow chart for how Auto Chain of Thought's 2 step process works

    Auto-Chain of Thought prompting performance

    Let’s see how a few of these Chain of Thought prompting methods stack up across 10 public benchmarks! For some context Manual-CoT = Few-Shot Chain of Thought where the examples are manually written.

    A table of results comparing zero-shot Chain of Thought, manual Chain-of-Thought, and Auto Chain-of-Thought
    Auto-CoT > Manual-CoT > Zero-shot CoT

    Automated Chain of Thought Generation with AutoReason

    AutoReason builds on the strengths of Chain of Thought (CoT) prompting by dynamically generating reasoning traces tailored to any query or task.
    I personally like it much better than the Auto-CoT framework.

    The key idea behind AutoReason is to dynamically create reasoning steps for any query, making the reasoning process both scalable and transparent. To save on costs, you can use a stronger model to generate the reasoning chains and a weaker model to generate the final answer.

    How AutoReason Works

    AutoReason is a 2-step, prompt only framework:

    1. Rationale Generation:
      • A strong model generates step-by-step reasoning traces for a given query or task.
      • These traces break down complex tasks into logical, interpretable steps
    2. Final Answer Generation:
      • A cost-efficient model processes the query along with the generated reasoning traces to produce the final answer.

    Flow diagram for AutoReason

    Here is the AutoReason template if you'd like to test it out.

    AutoReason prompt template in PromptHub

    AutoReason Performance

    The researchers tested AutoReason across two datasets: StrategyQA and HotpotQA

    HotpotQA

    Results for AutoReason on HotpotQA dataset
    • AutoReason increases performance for GPT-3.5
    • AutoReason degrades performance for GPT-4-Turbo

    The reason why AutoReason performs worse with a more advanced model is because the questions in this dataset are relatively simple. Using a framework like AutoReason ends up confusing the model, which is why performance drops. This is very important to note because it is relevant for anyone writing promtps and using advanced models.

    StrategyQA

    Results for AutoReason on StrategyQA dataset

    • Since the questions in strategyQA are much more complex, AutoReason increases performance for both GPT-3.5, and GPT-4-Turbo

    Automatically add Chain of Thought reasoning to your prompt

    We recently launched prompt enhancers in PromptHub, including an option to generate chain of thought steps for any prompt. We took a look of inspiration from AutoReason when building this out. Feel free to try it out for free in PromptHub - it's available on all plans!

    A modal with a task input and a chain of thought reasoning steps for that task

    Difference Between Chain of Thought prompting and few-shot prompting

    Not all few-shot prompts use Chain of Thought reasoning, and not all implementations of Chain of Thought use few-shot prompting.

    For example, zero-shot Chain of Thought prompting, which we looked at above, is a Chain of Thought prompt that doesn’t use any examples. This version implements Chain of Thought prompting by adding a phrase like "let’s think step-by-step."

    On the other hand, you can make a few-shot Chain of Thought prompt where you include questions and reasoning chains to help show the model a typical reasoning process.

    Few-shot Chain of Thought prompts tend to outperform zero-shot Chain of Thought prompts. This is true for most types of prompts. Including examples via few-shot prompting is one of the best ways to enhance output quality.

    Here is an example of a few-shot prompt that doesn’t use Chain of Thought reasoning.

    Chain of Thought limitations

    Chain of Thought is extremely powerful, but like all prompt engineering methods, it has limitations.

    Model size requirement

    In the original research paper, the researchers found that performance gains from Chain of Thought prompting only occurred once model sizes were in the ~100 billion parameter range. This is because Chain of Thought prompting is seen as an emergent behavior of model scale. As shown in the graph below, sharp vertical increases in performance, representing gains, only occur once the model scale reaches ~100 billion parameters.

    The researchers also found that smaller-scale models were more likely to produce illogical yet coherent chains of reasoning, leading to performance that was lower than standard prompting.

    A grid of 6 charts showing how different models and their sizes affect performance

    Faithfulness and reliability

    Sometimes the reasoning chains produced by Chain of Thought prompting don’t accurately represent how the model arrived at the answer. This can lead to a misleading interpretation of the model’s ‘thought process’.

    As mentioned above, Faithful CoT is a nice tool to reach for if you’re worried about this problem.

    Below is an example from a math dataset where the answer (in green) isn’t actually derived from the chain of reasoning (blue).

    An input and Chain of Thought output where the reasoning and final answer are not aligned

    Prompt design and overfitting

    Writing out effective Chain of Thought prompts can be complex and time-consuming. Additionally, as with any few-shot prompt, there is the risk of overfitting to specific types of examples, which can limit the model’s ability to generalize to new problems.

    Methods like Auto-Chain of Thought prompting and Analogical prompting can help with these issues though!

    Wrapping up

    There’s a reason why Chain of Thought prompting is arguably the most well-known prompt engineering method: it’s simple, powerful, and versatile.

    It can be implemented with just a few words, or via few-shot prompting with some examples. You can auto-generate the examples using Auto-CoT or Analogical prompting, or implement a different flavor of reasoning like Step-Back prompting.

    The possibilities seem endless, which is great for anyone building on top of LLMs. There are so many ways to enhance reasoning to improve output quality.

    So many possibilities can also feel overwhelming. If you have any questions on how to best implement chain of thought prompting, just let us know!

    Headshot of PromptHub founder Dan Cleary
    Dan Cleary
    Founder