The launch of OpenAI’s o1 models and Google’s Gemini 2.0 Flash Thinking Mode has placed ‘reasoning’ models firmly at the top of AI benchmarks and leaderboards. The biggest change with these models is their ability to automatically generate Chain of Thought (CoT) reasoning steps at inference time. It has even changed how we should approach prompt engineering for these types of models (more info here: Prompt Engineering with Reasoning Models).
Traditional CoT methods, like writing out reasoning chain examples for few-shot prompting, are effective but require significant effort to craft high-quality chains. Zero-Shot CoT, with its simpler “Let’s think step by step” prompt, doesn’t always succeed in eliciting effective reasoning chains and can even degrade performance in some cases.
Over two years ago, Auto-CoT emerged as a solution to automate the generation of reasoning chains. As we discussed in our article about Chain of Thought prompting, Auto-CoT offered a way to streamline the process, but it required complex setup, including clustering, retrieval of diverse examples, and more. While innovative, its complexity limited its practical application.
More recently, a new framework called AutoReason has come onto the scene, offering a simpler yet still effective approach.
AutoReason is a 2-step framework that first generates reasoning chains for any query, and then includes the original query and reasoning chain in a second prompt for a final generation. Like OpenAI’s o1 models, AutoReason removes the need for any manual work in creating reasoning chains.
In this article, we’ll explore how AutoReason works, compare it to approaches like Auto-CoT, and share templates so you can get started testing it out right away.
The challenges with typical Chain of Thought prompting
Chain of Thought (CoT) prompting has long been one of the better prompt engineering methods for more challenging, multi-step tasks.
Implementing Chain of Thought manually relied on creating task-specific reasoning chains and passing them as few-shot examples. Using a generic prompt like “Let’s think step by step” offers a simpler alternative, but it is often less effective at breaking down complex problems into subparts.
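To make the contrast concrete, here is a hypothetical sketch of the two styles as Python prompt templates; the apple-counting example and the prompt wording are made up for illustration, not taken from any specific benchmark.

```python
# Hypothetical illustration of the two manual CoT styles described above.

# Few-shot CoT: a hand-written reasoning chain is included as an example.
few_shot_cot_prompt = """Q: A store had 23 apples. It sold 7 and received a shipment of 12. How many apples are there now?
A: Start with 23 apples. Selling 7 leaves 23 - 7 = 16. The shipment adds 12, so 16 + 12 = 28. The answer is 28.

Q: {question}
A:"""

# Zero-shot CoT: no examples, just a generic trigger phrase appended to the query.
zero_shot_cot_prompt = """Q: {question}
A: Let's think step by step."""
```

Few-shot CoT tends to produce better-structured reasoning but requires hand-crafting the example chains for each task; the zero-shot version needs no setup but gives you less control over how the model decomposes the problem.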
Newer models like OpenAI’s o1 and Google Gemini 2.0 Flash Thinking Mode have shifted to automated reasoning, bypassing the need for manual CoT setups. While this has unlocked a variety of use cases that require deeper reasoning, the downside is a lack of visibility into the reasoning process. Without explicit reasoning steps, it’s harder to understand what’s happening under the hood or troubleshoot when things go wrong.
AutoReason steps in as a dynamic framework that not only automates reasoning but also retains interpretability, generating explicit reasoning traces tailored to each query.
AutoReason
How AutoReason Works
AutoReason is a two-step framework that generates reasoning chains and then uses those reasoning chains, along with the initial query, to generate better outputs.
- Rationale generation: A stronger model, such as GPT-4, creates detailed reasoning traces for a given task or query.
- Answer generation: A smaller, cost-effective model, like GPT-3.5-Turbo, uses these rationales to produce the final answer.
The reasoning chains are generated dynamically based on the input query, which makes the framework adaptable and easy to use.
By separating the generation of reasoning steps from the final answer, AutoReason also provides a level of interpretability that models like OpenAI’s o1 currently lack.
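Here is a minimal sketch of what that two-step flow can look like in code, assuming the OpenAI Python SDK; the prompt wording below is illustrative, not the exact prompts used in the AutoReason paper.

```python
# A minimal sketch of the two-step AutoReason flow, assuming the OpenAI
# chat completions API. Prompt wording is illustrative only.
from openai import OpenAI

client = OpenAI()

def auto_reason(query: str) -> str:
    # Step 1: a stronger model produces explicit reasoning traces for the query.
    rationale = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Break the following question into explicit, step-by-step "
                f"reasoning traces. Do not answer it yet.\n\nQuestion: {query}"
            ),
        }],
    ).choices[0].message.content

    # Step 2: a cheaper model answers using the original query plus the rationale.
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {query}\n\nReasoning steps:\n{rationale}\n\n"
                "Using the reasoning steps above, give the final answer."
            ),
        }],
    ).choices[0].message.content

    return answer

print(auto_reason("Were Scott Derrickson and Ed Wood of the same nationality?"))
```

Because the rationale is returned as plain text before the final call, you can log it, inspect it, or edit it before the second step, which is where the interpretability benefit comes from.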
Why I like AutoReason
- Simplicity: Unlike frameworks like Auto-CoT, AutoReason doesn’t require clustering or dataset retrieval, streamlining its implementation. It’s just two prompts.
- Transparency: By generating explicit reasoning steps, you can see and troubleshoot the logic behind outputs.
- Cost-effectiveness: If you’re trying to cut costs, you can use a stronger model for generating the reasoning chains and a weaker model for generating the final answer.
You can try out AutoReason via the template in PromptHub here.
Automatic Chain of Thought Prompt Enhancer
We recently launched prompt enhancers in PromptHub, including an option to generate chain of thought steps for any prompt. We took a lot of inspiration from AutoReason when building this out. Feel free to try it out for free in PromptHub - it's available on all plans!
Experiment results
The researchers evaluated AutoReason on two benchmarks: StrategyQA and HotpotQA. HotpotQA, a simpler dataset, doesn’t require extensive task decomposition. For instance, a question like “Were Scott Derrickson and Ed Wood of the same nationality?” has a straightforward answer: “Yes.”
On the other hand, StrategyQA has more complex questions that demand implicit reasoning. For example, “What is the connection of James Clerk Maxwell to bank notifications?” requires breaking down the problem into multiple logical steps.
With that out of the way, let’s check out some results.
HotpotQA: Fact-Based Tasks
For HotpotQA, AutoReason showed mixed results:
- GPT-3.5-Turbo: Baseline (61.6%), CoT (58.3%), AutoReason (76.6%)
- GPT-4: Baseline (73.3%), CoT (63.3%), AutoReason (71.6%)
While AutoReason improved results for the weaker model (GPT-3.5-Turbo), performance degraded for the more advanced model (GPT-4).
This is an important lesson for anyone writing prompts: sometimes you can overdo it with prompt engineering, when all you need is clear, crisp instructions.
StrategyQA: Excelling in Complex Reasoning
On the StrategyQA dataset, AutoReason outperformed both baseline and traditional CoT prompting:
- GPT-3.5-Turbo: Baseline (55%), CoT (70.3%), AutoReason (76.6%)
- GPT-4: Baseline (71.6%), CoT (76.6%), AutoReason (91.6%)
By dynamically generating reasoning traces, AutoReason was able to better solve more challenging questions.
The harder the challenge, the better AutoReason performed.
AutoReason vs. Auto-CoT
AutoReason isn’t the first framework for automatically generating reasoning chains and rationales. Analogical prompting, Chain of Verification (CoVe), and Auto-CoT all tackle this problem as well.
AutoReason and Auto-CoT take different approaches to accomplishing similar goals.
Auto-CoT: A clustering-based approach
Auto-CoT focuses on creating diverse reasoning demonstrations by clustering questions from a dataset and selecting representative examples. Using the “Let’s think step by step” prompt, it generates reasoning chains for the examples inside the clusters. This keeps the examples diverse, which is a best practice for any kind of few-shot prompting (a rough sketch of this pipeline follows the list below).
- Requires preprocessing: Dataset creation, clustering and sampling (retrieval) steps demand time and can be challenging to set up.
- Best suited for static tasks: Works well for predefined datasets but is less adaptable to dynamic or query-specific tasks.
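For comparison, here is a rough sketch of the kind of preprocessing Auto-CoT requires, assuming OpenAI embeddings and scikit-learn for clustering; the cluster count and selection heuristic are simplified placeholders rather than the exact procedure from the Auto-CoT paper.

```python
# A rough sketch of Auto-CoT-style preprocessing: cluster dataset questions,
# pick a representative per cluster, and generate a Zero-Shot CoT chain for each.
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

def build_auto_cot_demos(questions: list[str], n_clusters: int = 4) -> list[str]:
    # Embed and cluster the dataset questions so the demonstrations are diverse.
    vectors = np.array([
        d.embedding
        for d in client.embeddings.create(
            model="text-embedding-3-small", input=questions
        ).data
    ])
    kmeans = KMeans(n_clusters=n_clusters, n_init="auto").fit(vectors)

    demos = []
    for cluster_id in range(n_clusters):
        # Pick the question closest to each cluster centroid as its representative.
        members = np.where(kmeans.labels_ == cluster_id)[0]
        center = kmeans.cluster_centers_[cluster_id]
        rep = questions[members[np.argmin(np.linalg.norm(vectors[members] - center, axis=1))]]

        # Generate a reasoning chain for the representative with Zero-Shot CoT.
        chain = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Q: {rep}\nA: Let's think step by step."}],
        ).choices[0].message.content
        demos.append(f"Q: {rep}\nA: Let's think step by step. {chain}")

    # Prepend these demonstrations to new queries as few-shot examples.
    return demos
```

Even in this simplified form, you can see why the setup cost is higher: you need a question dataset, an embedding step, and a clustering pass before you can prompt anything.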
AutoReason: A simple, prompt-only framework
As mentioned above, AutoReason takes a different approach by dynamically generating reasoning traces for any query without the need for clustering or dataset creation. Its two-step process—leveraging a stronger model for reasoning generation and a weaker model for final answers—provides several advantages:
- Easy to implement: No clustering or preprocessing steps required. Just two prompts.
- Adaptability: Tailors reasoning traces to individual queries.
- Transparency: Generates explicit reasoning steps, allowing teams to troubleshoot and understand model outputs more easily.
Conclusion
I love AutoReason because it is simple yet powerful. As a prompt-only framework, it’s easy to implement, highly adaptable, cost-efficient, and offers transparency into reasoning steps. Give it a try using the prompt templates available on PromptHub today!