If you frequent this blog, you know that we really like prompt engineering methods that increase performance while being easy to implement. That is why this recent DeepMind study stood out to us: Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models.
The study introduces a new prompting method called Step-Back Prompting, which showed improvements of up to 36% over chain-of-thought (CoT) prompting.
What is Step-Back Prompting?
Step-Back Prompting draws inspiration from the human tendency to pause and reflect when first faced with a challenging task or question. We look for higher-level concepts or principles to guide our thinking. For example, if tasked with figuring out the length of a side of a triangle, we may first recall the Pythagorean theorem.
Step-Back Prompting is motivated by the observation that many tasks that we assign to LLMs are full of implicit and explicit details. LLMs can have a hard time retrieving relevant facts when tackling these types of tasks.
How Step-Back Prompting works
Step-Back Prompting involves adding just one additional prompt, giving the model the freedom to do some abstract thinking before addressing the primary question.
Step-Back Prompting breaks down into two steps:
- Abstraction: Rather than addressing the question head-on, we first prompt the LLM to ask a more generic question about a high-level concept or principle that is still related to the main question.
- Reasoning: Using the step-back question and its answer as a grounding mechanism, the LLM can then reason more accurately about a solution to the main question.
For example, if the main question is 'What specific steps should I take to reduce my energy consumption at home?', the step-back question may be 'What are the general principles of energy conservation?'. Or, instead of diving straight into 'How do I fix the error in this specific line of code?', a step-back question may be 'What are the common causes of this type of error?'.
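To make the two steps concrete, here's a minimal sketch of what Step-Back Prompting could look like in code. It assumes the OpenAI Python SDK and the gpt-4o model, and the prompt wording is illustrative rather than the exact templates from the paper.

```python
# A minimal sketch of Step-Back Prompting using the OpenAI Python SDK.
# Model name and prompt wording are illustrative, not the paper's exact templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    """Send a single prompt and return the model's text response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def step_back_answer(question: str) -> str:
    # Step 1 (Abstraction): ask for the high-level concept behind the question,
    # and have the model answer that step-back question first.
    abstraction = ask(
        f"Here is a question: {question}\n"
        "Before answering it, write a more general 'step-back' question about the "
        "underlying concept or principle, then answer that step-back question."
    )

    # Step 2 (Reasoning): answer the original question, grounded in the abstraction.
    return ask(
        f"Background (step-back reasoning):\n{abstraction}\n\n"
        f"Using the background above, answer the original question: {question}"
    )


print(step_back_answer("What specific steps should I take to reduce my energy consumption at home?"))
```

The key design choice is simply that the model sees its own high-level answer before it tackles the specific question, so the final response is grounded in the relevant principles.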
A real-world example
Before diving into the experiment results, here's a quick example.
Let's say we want to know how many U.S. presidents were born in the United States.
We'll compare direct prompting and Step-Back Prompting side-by-side using PromptHub's testing tools.
Here are the prompts:
Here are the outputs:
The proof is in the pudding. Direct prompting misses Franklin Roosevelt. This just goes to show how a little bit of prompt engineering can go a long way toward getting better, more accurate results.
Want to try it out for yourself? Here's a single-shot template in PromptHub you can try.
If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.
Experiments Setup
The researchers tested Step-Back Prompting across three dataset categories, two models, and several baseline prompting methods (few-shot prompting, CoT, direct prompting, and more).
Datasets
- STEM (Science, Technology, Engineering, and Mathematics): Tasks that required analytical thinking and precision.
- Knowledge QA (Question Answering): Scenarios where the model had to retrieve and provide accurate information.
- Multi-Hop Reasoning: Complex questions necessitating the connection of multiple pieces of information to deduce the correct answer.
Models
- PaLM-2L
- GPT-4
Baseline methods
Step-Back Prompting was measured against a few prompting methods:
- Direct prompting
- Chain of Thought (CoT) Prompting
- Take a Deep Breath (TDB) Prompting
- Retrieval-Augmented Generation (RAG)
Experiments Results
STEM Tasks:
Takeaways:
- Step-Back Prompting improves the responses from PaLM-2L drastically
- Step-Back Prompting outperformed CoT and GPT-4
- It would be interesting to see the accuracy of GPT-4 + Step-Back!
Knowledge QA:
Takeaways:
- Step-Back Prompting performs well, especially on hard questions (see the "TQA Hard" column)
- GPT-4 outperformed Step-Back and Step-Back + RAG on the SituatedQA test set
- Step-Back + Retrieval-Augmented Generation (RAG) produced even better results than Step-Back alone, highlighting the value of combining prompt engineering methods (see the sketch after this list)
- Given how easy Step-Back Prompting is to integrate, it is almost surprising how much better the results are compared to direct prompting
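For those curious what combining Step-Back with RAG can look like in practice, here's a minimal sketch. It assumes the OpenAI Python SDK and a hypothetical retrieve() helper over your own document store, and it shows one straightforward way to wire the two together, not the paper's exact retrieval pipeline.

```python
# A minimal sketch of combining Step-Back Prompting with RAG.
# retrieve() is a hypothetical stub -- swap in your own vector store or search API.
from openai import OpenAI

client = OpenAI()


def ask(prompt: str) -> str:
    """Send a single prompt and return the model's text response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def retrieve(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever -- plug in your own retrieval backend here."""
    raise NotImplementedError("connect this to your document store")


def step_back_rag_answer(question: str) -> str:
    # Abstraction: generate a step-back question to broaden retrieval coverage.
    step_back_question = ask(
        f"Write a more general 'step-back' question behind this question: {question}"
    )

    # Retrieve passages for both the original and the step-back question.
    passages = retrieve(question) + retrieve(step_back_question)
    context = "\n".join(passages)

    # Reasoning: answer the original question grounded in the retrieved context.
    return ask(
        f"Context:\n{context}\n\nUsing the context above, answer: {question}"
    )
```

Retrieving for the step-back question as well as the original one is what lets the model pull in the broader, principle-level facts that direct retrieval often misses.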
Multi-Hop Reasoning:
Takeaways:
- Baseline performance of PaLM-2L and GPT-4 is low on MuSiQue because it requires multiple reasoning steps
- Step-Back Prompting outperforms GPT-4 on both datasets
Wrapping up: A Step forward with Step-Back Prompting
When looking through the latest research around prompt engineering, we always look for methods that are both easy to implement and effective. That’s why we love this method, along with other prompt engineering methods like “According to” and EmotionPrompt.
The effectiveness of this method lies in its simplicity. Hopefully this helps you get better, more reliable outputs from LLMs!