Large Language Models (LLMs) have essentially seen all the text available on the internet, making them particularly adept at tasks involving text.
One of the first breakout use cases for LLMs was code generation via GitHub Copilot. Given that code is just a collection of text, this made sense as a great initial practical application.
Both OpenAI's and Anthropic's models are great at generating code. Some would argue that code generation is one of the things LLMs do best at the moment. While there are some cases where LLMs run into issues generating code (which you can read about in our article here, Where LLMs Fail When Generating Code), in general, they're great at it.
Since LLMs are great at generating code, it raises an interesting question: can this skill be leveraged in prompt engineering? Specifically, is it possible to tap into an LLM's ability to write high-quality code for non-code-related tasks? This is precisely what researchers from the University of Waterloo, the Vector Institute, the University of California, Santa Barbara, and Google Research set out to explore with Program-of-Thoughts Prompting (PoT).
What is Program-of-Thought prompting?
Program of Thought (PoT) prompting is a prompt engineering method that enhances performance on numerical reasoning tasks by separating reasoning steps from the final computation. Unlike Chain-of-Thought (CoT) prompting, which uses the LLM for both tasks, Program of Thought prompting has the LLM generate its reasoning steps as programming language statements, which are then executed by an external interpreter, such as a Python interpreter.
How Program of Thought prompting differs from Chain-of-Thought
Chain-of-Thought uses LLMs for both reasoning and computation. Program of Thought uses LLMs for reasoning, but instead of using plain text for computations, it leverages code.
Let's look at a quick math-related example.
LLMs tend to struggle with math equations for a number of reasons. But by leveraging their ability to generate code and offloading execution to a code interpreter, many of those typical calculation errors can be avoided, shrinking the overall surface area for mistakes.
In the examples above, you can see that the LLM struggles to solve the equations when using CoT. With Program of Thought prompting, the model expresses the process in just a few lines of code, which a Python interpreter can execute to obtain the final answer.
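To make that concrete, here's a minimal sketch of the Program of Thought flow. The word problem and variable names below are our own illustration, not an example from the paper: the model's output is a short Python program, and a separate interpreter runs it to produce the answer.

```python
# Illustrative word problem: "A store sells pens at $1.25 each. If Maria buys 8 pens
# and pays with a $20 bill, how much change does she get?"

# With Program of Thought prompting, the LLM is asked to answer by writing code.
# A plausible model response looks like this (reasoning expressed as statements):
generated_program = """
price_per_pen = 1.25
num_pens = 8
total_cost = price_per_pen * num_pens
payment = 20
ans = payment - total_cost
"""

# The model never does the arithmetic itself; an external Python interpreter
# executes the generated statements and reads off `ans`.
namespace = {}
exec(generated_program, namespace)
print(namespace["ans"])  # 10.0
```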
Program of Thought prompting experiment setup
Datasets
Program of Thought prompting was tested on eight datasets, covering math word problems (MWP) and financial question-answering (QA) tasks. Note that Program of Thought prompting can only be applied to these types of quantitative tasks, rather than free-form tasks like summarization.
- Math Word Problem Datasets: GSM8K, AQuA, SVAMP, TabMWP, MultiArith
- Financial QA Datasets: FinQA, ConvFinQA, TATQA
Baselines
Program of Thought prompting was compared against several prompt engineering methods:
- Direct Answer
- Chain of Thought (CoT)
- CoT with External Calculator (CoTcalc)
- CoT with Self-Consistency (CoT-SC)
- PoT with Self-Consistency (PoT-SC)
Models
Various models were tested in the experiments (which were run in October 2023):
- OpenAI Codex (code-davinci-002), accessed via API
- GPT-3
- PaLM
- LaMDA
- CodeGen (codegen-16B-multi and codegen-16B-mono)
- CodeT5+
- XGen
For self-consistency decoding, a temperature of 0.4 and K=40 were used throughout the experiments.
Program of Thought experiment results
The researchers tested CoT against Program of Thought prompting in various ways. The first test we'll look at is how the methods performed when combined with another prompt engineering method, self-consistency (SC). Self-Consistency prompting involves the model producing multiple outputs and selecting the most consistent one. For more information, check out our rundown on self-consistency prompting here.
Program of Thought prompting + SC outperforms CoT + SC (which was state-of-the-art at the time) by an average of 10%.
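In practice, PoT + SC means sampling several candidate programs at a non-zero temperature, executing each one, and taking a majority vote over the executed answers. Here's a minimal sketch, where `generate_program` is a hypothetical stand-in for whatever LLM call you're making:

```python
from collections import Counter
from typing import Callable, Optional

def pot_self_consistency(
    question: str,
    generate_program: Callable[[str, float], str],  # hypothetical LLM call returning Python source
    k: int = 40,
    temperature: float = 0.4,
) -> Optional[float]:
    """Sample k candidate programs, execute each, and majority-vote the answers.
    The paper used K=40 and a temperature of 0.4 for self-consistency decoding."""
    answers = []
    for _ in range(k):
        program = generate_program(question, temperature)  # Python source ending in `ans = ...`
        namespace = {}
        try:
            exec(program, namespace)       # offload the arithmetic to the interpreter
            answers.append(namespace["ans"])
        except Exception:
            continue                       # skip programs that fail to execute
    # The most consistent executed answer across samples wins.
    return Counter(answers).most_common(1)[0][0] if answers else None
```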
Next, we’ll look at how few-shot Program of Thought prompting performed. Below is an example of how few-shot Program of Thought works compared to zero-shot.
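As a rough, illustrative sketch of the difference (not the exact prompts from the paper): a few-shot Program of Thought prompt includes one or more worked exemplars written as code, while a zero-shot prompt just pairs the question with an instruction to answer in Python.

```python
# Illustrative few-shot Program of Thought prompt: the exemplar answer is written
# as code, so the model learns to respond with executable statements.
few_shot_pot_prompt = """Question: A train travels at 60 miles per hour for 2.5 hours. How far does it travel?
# Answer this question by writing a Python program. Store the final result in `ans`.
speed_mph = 60
hours = 2.5
ans = speed_mph * hours

Question: {question}
# Answer this question by writing a Python program. Store the final result in `ans`.
"""

# Illustrative zero-shot Program of Thought prompt: no worked exemplar, just the
# question plus an instruction to express the reasoning as code.
zero_shot_pot_prompt = """Question: {question}
# Answer this question by writing a Python program. Store the final result in `ans`.
"""
```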
We’ll start with the few-shot Program of Thought experiment results.
- On average, Program of Thought prompting outperforms CoT by ~8% on the math datasets and 15% on the financial datasets.
- On the financial QA dataset (FinQA), Program of Thought prompting outperforms CoT by almost 20%. This dataset deals with large numbers (in the millions), which led to many miscalculations when using CoT.
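To see why, here's the kind of calculation FinQA requires, with made-up figures. It's trivial for an interpreter, but easy for an LLM to flub when it has to carry out the arithmetic token by token in plain text:

```python
# Hypothetical FinQA-style question: "Revenue grew from $1,284,000,000 in 2019 to
# $1,511,500,000 in 2020. What was the percentage growth?"
revenue_2019 = 1_284_000_000
revenue_2020 = 1_511_500_000

# Exact to machine precision when executed, whereas a CoT answer has to divide
# these large numbers in its head.
growth_pct = (revenue_2020 - revenue_2019) / revenue_2019 * 100
print(round(growth_pct, 2))  # 17.72
```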
Now let's check out the zero-shot results.
- On average, Program of Thought outperforms CoT by ~12% on the math datasets.
- Occasionally, the LLM would fall back to generating a reasoning chain in comments rather than writing executable code. To combat this, the researchers suppressed the '#' token, which denotes a comment in Python, by applying a small bias of -2 to decrease the probability of that token appearing (see the sketch after this list).
- The margin of improvement of Program of Thought prompting over CoT is even larger on zero-shot prompting compared to few-shot prompting
- On TabMWP, zero-shot Program of Thought (66.5) even outperforms few-shot CoT (63.4)
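Here's a minimal sketch of how that kind of token suppression can be applied through an OpenAI-style completions endpoint and its `logit_bias` parameter. The model name below is an assumption (the paper used code-davinci-002, which is no longer available), and the exact token id for '#' depends on the tokenizer:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo-instruct"  # assumption; the paper used code-davinci-002

# Look up the token id for '#' in this model's tokenizer.
enc = tiktoken.encoding_for_model(MODEL)
hash_token_id = enc.encode("#")[0]

prompt = (
    "Question: A jacket costs $80 and is discounted by 35%. What is the sale price?\n"
    "# Answer this question by writing a Python program. Store the final result in `ans`.\n"
)

response = client.completions.create(
    model=MODEL,
    prompt=prompt,
    max_tokens=256,
    temperature=0,
    logit_bias={str(hash_token_id): -2},  # nudge the model away from emitting comments
)
print(response.choices[0].text)
```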
Program of Thought prompt template
We extended the prompt used in the zero-shot Program of Thought template to be more robust. You can access the Program of Thought template in PromptHub for free here.
Wrapping up
By decoupling reasoning from computation, Program of Thought reduces errors and improves performance over traditional methods like Chain-of-Thought prompting.
Program of Thought prompting not only enhances the accuracy of LLMs but also expands their capabilities in solving more intricate problems.
This approach underscores the potential of combining natural language reasoning with programmatic execution to achieve better results when prompt engineering with LLMs.