
Large Language Models (LLMs) have been trained on a huge portion of the text available on the internet, making them particularly adept at tasks involving text.

One of the first breakout use cases for LLMs was code generation via GitHub Copilot. Given that code is just a collection of text, this made sense as a great initial practical application.

Both OpenAI's and Anthropic's models are great at generating code. Some would argue that code generation is one of the things LLMs do best at the moment. While there are some cases where LLMs run into issues generating code (which you can read about in our article, Where LLMs Fail When Generating Code), in general, they're great at it.

Since LLMs are great at generating code, it raises an interesting question: can this skill be leveraged in prompt engineering? Specifically, is it possible to tap into an LLM's ability to write high-quality code for non-code-related tasks? This is precisely what researchers from the University of Waterloo, the Vector Institute, the University of California, Santa Barbara, and Google Research set out to explore with Program-of-Thoughts Prompting (PoT).

What is Program-of-Thought prompting?

Program of Thought (PoT) prompting is a prompt engineering method that enhances performance on numerical reasoning tasks by separating reasoning steps from the final computation. Unlike Chain-of-Thought (CoT) prompting, which uses LLMs for both tasks, Program of Thought prompting has the LLM generate reasoning steps as programming language statements, which are then executed by an external interpreter, such as a Python interpreter.
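To make that split concrete, here is a minimal sketch of the PoT loop in Python. The `call_llm` helper and the hard-coded program it returns are hypothetical stand-ins for a real LLM API call; only the overall flow (model writes code, interpreter executes it) reflects the method itself.

```python
# Minimal sketch of the Program-of-Thought loop. `call_llm` is a hypothetical
# stand-in for whatever LLM API you use; it returns the kind of program a
# model typically produces for the question.

POT_PROMPT = """Question: {question}
Write Python code that computes the answer and stores it in a variable
named `ans`. Do not compute the answer yourself."""


def call_llm(prompt: str) -> str:
    # Placeholder for an API call; hard-coded here so the sketch runs as-is.
    return (
        "principal = 10000\n"
        "rate = 0.05\n"
        "years = 3\n"
        "ans = principal * (1 + rate) ** years"
    )


def program_of_thought(question: str) -> float:
    # 1) The LLM handles the reasoning: it translates the question into code.
    program = call_llm(POT_PROMPT.format(question=question))
    # 2) The computation is delegated to the Python interpreter.
    namespace: dict = {}
    exec(program, namespace)  # sandbox generated code in real use
    return namespace["ans"]


print(program_of_thought("What is $10,000 worth after 3 years at 5% annual interest?"))
```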

Hey everyone, how's it going? Dan here, co-founder of PromptHub. Hope everyone's doing well. It's a beautiful Saturday in New York, and today we're going to talk about a new prompting method called Program of Thoughts prompting.

You know, GitHub Copilot was really the first and probably the most successful use of LLMs. It was primarily targeted at generating code. LLMs are great at generating code because they've seen a lot of text, including a lot of code, on the internet. There's a lot of open-source repositories to train on, so on and so forth. However, it still comes with its problems when you're using LLMs to generate code. We wrote about this recently in terms of what you can do to overcome specific types of issues when running LLMs to generate code, so you can check that out linked below.

This prompting method taps into the idea that if LLMs are good at generating code, can we use that in any way to help with prompt engineering tasks that are not code-related? That's what this paper set out to do. It's about a year old at this point. Program of Thoughts prompting is very similar to Chain of Thought prompting in that it forces the model to do some reasoning before doing any computation or coming to a final answer. The only difference is it separates the reasoning step and the execution step. It will do the actual execution via an interpreter like Python.

Unlike Chain of Thought, which does it all in one go where you send it a bunch of questions and answers and then it will do the reasoning, Program of Thoughts is a little bit different and requires an external interpreter like Python. You can see two examples here. Program of Thoughts becomes especially helpful when you're dealing with really large tasks. Instead of going through the reasoning in just plain text, it generates some very quick Python code to do so. This has to be sent somewhere to be executed, so there is that additional step, but if you're working on really large datasets that involve math or financial analysis, it can be helpful.

The basis of the experiments is centered around math problems and financial datasets. They compared it to a couple of other prompting baselines. The first one was just head-to-head, and you can see Program of Thoughts outperforms Chain of Thought in a bunch of different math and financial datasets. They also combined Chain of Thought and Program of Thoughts with self-consistency, which we wrote about recently and have a couple of templates for. We see superior performance here as well. At the time, Chain of Thought plus self-consistency was kind of state-of-the-art, and Program of Thoughts was able to surpass that.

The next part of the experiment involved few-shot Program of Thoughts versus zero-shot. We've talked about few-shot prompting before, and you can check out our guide linked below for more information on that. Looking at the results from the few-shot testing, Program of Thoughts outperforms Chain of Thought by about 8% on math datasets and 15% on financial datasets. The financial datasets still deal with large numbers but are essentially mathematical equations at their core. On the FinQA dataset, which deals with really large numbers, like millions, we saw the biggest gain of Program of Thoughts versus Chain of Thought, about a 20% difference.

In terms of zero-shot prompting, Program of Thoughts outperforms Chain of Thought by 12%, which is a significant boost. On the TabMWP dataset, Program of Thoughts zero-shot actually outperformed Chain of Thought few-shot, which was surprising to see.

We put together a template for this so that you can have that code generation step for the reasoning. You still need to execute this, but it lets you get that one step out of the way. You can probably do the rest via just having an interpreter and then doing the final computation.

That's it for today. Happy prompting, guys. I'll see you.

How Program of Thought prompting differs from Chain-of-Thought

Chain-of-Thought uses LLMs for both reasoning and computation. Program of Thought uses LLMs for reasoning, but instead of using plain text for computations, it leverages code.

Let's look at a quick math-related example.

LLMs tend to struggle with math equations for a number of reasons. But by leveraging their ability to generate code and offloading execution to a code interpreter, many of these typical errors can be avoided, shrinking the surface area for mistakes.

Comparing the prompt flows for Chain of Thought versus Program of Thought

In the examples above, you can see that the LLM struggles to solve the equations when using CoT. With Program of Thought prompting, the model expresses the process in just a few lines of code, which a Python interpreter can execute to obtain the final answer.
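For illustration (this is not the exact problem from the figure above), here is the kind of program PoT prompting elicits for a large-number question that CoT frequently miscomputes in plain text:

```python
# Question (illustrative): "An investment of $2,537,000 grows by 4.7% per year.
# What is it worth after 8 years?"
#
# Under CoT, the model reasons in text and attempts the multiplication itself,
# which is where large-number arithmetic slips happen. Under PoT it emits:

principal = 2_537_000
rate = 0.047
years = 8
ans = principal * (1 + rate) ** years  # the interpreter does the arithmetic exactly
print(round(ans, 2))
```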

Program of Thought prompting experiment setup

Datasets

Program of Thought prompting was tested on eight datasets, covering math word problems (MWP) and financial question-answering (QA) tasks. Program of Thought prompting can only be applied to these types of quantitative tasks, rather than to free-form tasks like summarization.

  • Math Word Problem Datasets: GSM8K, AQuA, SVAMP, TabMWP, MultiArith
  • Financial QA Datasets: FinQA, ConvFinQA, TATQA

Baselines

Program of Thought prompting was compared against several prompt engineering methods:

  • Direct Answer
  • Chain of Thought (CoT)
  • CoT with External Calculator (CoTcalc)
  • CoT with Self-Consistency (CoT-SC)
  • PoT with Self-Consistency (PoT-SC)

Models

Various models were tested in the experiments (which were run in October 2023):

  • OpenAI Codex (code-davinci-002), via API
  • GPT-3
  • PaLM
  • LaMDA
  • CodeGen (codegen-16B-multi and codegen-16B-mono)
  • CodeT5+
  • XGen

Specifically, a temperature of 0.4 and K=40 were used throughout the experiments for self-consistency decoding.
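As a rough sketch of how that decoding setup combines with PoT: sample K programs from the model, execute each one, and majority-vote on the resulting answers. The `sample_program` callable below is a hypothetical wrapper around your LLM API sampling at temperature 0.4; only the K = 40 and temperature settings come from the paper.

```python
from collections import Counter

K = 40  # number of sampled programs, as in the paper's self-consistency setup


def run_program(program: str):
    """Execute one generated program and return its `ans` variable (None on failure)."""
    namespace: dict = {}
    try:
        exec(program, namespace)  # sandbox generated code in real use
        return namespace.get("ans")
    except Exception:
        return None  # discard programs that don't execute


def pot_self_consistency(prompt: str, sample_program) -> object:
    # `sample_program(prompt)` should call the LLM with temperature=0.4 and
    # return one candidate program per call (hypothetical helper).
    answers = [run_program(sample_program(prompt)) for _ in range(K)]
    answers = [a for a in answers if a is not None]
    # Majority vote: the most consistent answer across samples wins.
    return Counter(answers).most_common(1)[0][0] if answers else None
```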

Program of Thought experiment results

The researchers tested CoT against Program of Thought prompting in various ways. The first test we'll look at is how the methods performed when combined with another prompt engineering method, self-consistency (SC). Self-Consistency prompting involves the model producing multiple outputs and selecting the most consistent one. For more information, check out our rundown on self-consistency prompting here.

Comparing performance between chain of thoughts and program of thoughts with the addition of self-consistency

Program of Thought prompting + SC outperforms CoT + SC (which was state-of-the-art at the time) by an average of 10%.

Next, we’ll look at how few-shot Program of Thought prompting performed. Below is an example of how few-shot Program of Thought works compared to zero-shot.

Few-shot Program of Thought prompting on the left, versus zero-shot Program of Thought prompting on the right
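For concreteness, a few-shot PoT prompt prepends one or more worked question-to-program exemplars before the new question, while zero-shot PoT keeps only the instruction. The exemplar below is made up for illustration and is not taken from the paper's prompt set.

```python
# Illustrative few-shot PoT prompt. Zero-shot PoT would drop the exemplar and
# keep only the instruction plus the new question.
FEW_SHOT_POT_PROMPT = """Question: A store sells pens at $1.50 each. How much do 12 pens cost?
# Python program that computes the answer and stores it in `ans`
pens = 12
price_per_pen = 1.50
ans = pens * price_per_pen

Question: {question}
# Python program that computes the answer and stores it in `ans`
"""
```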

We’ll start with the few-shot Program of Thought experiment results.

Few-shot results

  • On average, Program of Thought prompting outperforms CoT by ~8% on the math datasets and 15% on the financial datasets
  • In the financial QA dataset (FinQA), Program of Thought prompting outperforms CoT by almost 20%. This dataset deals with large numbers (in the millions), which led to many miscalculations when using CoT.

Now let's check out the zero-shot results.

Zero-shot results

  • On average, Program of Thought outperforms CoT by ~12% on the math datasets
  • Occasionally, the LLM would fall back to generating a reasoning chain in comments rather than writing executable code. To combat this, the researchers suppressed the ‘#’ token, which denotes a comment in Python, by applying a small bias (of -2) to decrease the probability of the token appearing (see the sketch after this list)
  • The margin of improvement of Program of Thought prompting over CoT is even larger on zero-shot prompting compared to few-shot prompting
  • On TabMWP, zero-shot Program of Thought (66.5) even outperforms few-shot CoT (63.4)
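Here is a sketch of that comment-suppression trick using a logit bias, assuming the OpenAI Python client and tiktoken; the model name and prompt wording are assumptions, and only the -2 bias on the ‘#’ token comes from the paper.

```python
import tiktoken
from openai import OpenAI  # sketch assumes the official openai Python client

MODEL = "gpt-4o-mini"  # assumption: any OpenAI model that accepts `logit_bias`

# Look up the token id for '#' in this model's tokenizer so we can nudge its
# logit down and discourage comment-only "reasoning chains".
try:
    enc = tiktoken.encoding_for_model(MODEL)
except KeyError:
    enc = tiktoken.get_encoding("o200k_base")  # fallback if tiktoken doesn't know the model
hash_token_id = enc.encode("#")[0]

client = OpenAI()
response = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": "Question: What is 123456 * 789?\n"
                   "Write Python code that computes the answer and stores it in `ans`.",
    }],
    logit_bias={hash_token_id: -2},  # the small -2 bias reported by the researchers
    temperature=0,
)
program = response.choices[0].message.content
print(program)
```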

Program of Thought prompt template

We extended the prompt used in the zero-shot Program of Thought template to be more robust. You can access the Program of Thought template in PromptHub for free here.

Try out our Program of Thought prompting template in PromptHub

Wrapping up

By decoupling reasoning from computation, Program of Thought reduces errors and improves performance over traditional methods like Chain-of-Thought prompting.

Program of Thought prompting not only enhances the accuracy of LLMs but also expands their capabilities in solving more intricate problems.

This approach underscores the potential of combining natural language reasoning with programmatic execution to achieve better results when prompt engineering with LLMs.

Dan Cleary
Founder