Most of the research around Large Language Models (LLMs) and prompt engineering focuses on improving the quality of answers. We’ve covered a bunch of different prompt engineering methods (multi-persona prompting, tree-of-thoughts prompting, and “According to” prompting) that have had varying degrees of success at this.
That is why this paper from Microsoft and Tsinghua University caught my eye (you can check it out here). It introduces a new prompting method called Skeleton of Thought (SoT).
SoT is different from other methods because it is built not just to get better outputs, but to make the LLM generate them faster and more efficiently. Before we dive into this method, let's talk about a few challenges LLMs face.
Performance challenges with LLMs
LLMs are extremely powerful, but they have their faults when it comes to latency and efficiency.
- Latency Issues: When a model generates an output, it produces one token at a time (more on tokens here). This sequential decoding phase is extremely time-consuming.
- Resource Under-utilization: LLMs run on Graphics Processing Units (GPUs). GPUs are designed to handle multiple tasks at once. Since tokens are generated step-by-step, GPU power is often underused.
- Complexity and Efficiency Trade-off: The larger you make the model, the more it will “know,” but the more compute it needs for every token it generates, which slows down its outputs.
How Skeleton of Thought Prompting works
The Skeleton of Thought framework looks to reduce latency by enabling parallel processing.
At the heart of SoT is the idea of segmenting output generation. Instead of generating a response as one continuous sequence, SoT divides the content into distinct segments.
These segments are processed simultaneously, allowing for multiple parts of an answer to be crafted at once. It's like writing several sentences of a paragraph in parallel, rather than sequentially (this has drawbacks, which we'll touch on later).
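For intuition, here's a back-of-the-envelope latency model. The token counts and decoding speed below are illustrative assumptions, not measurements from the paper:

```python
# Back-of-the-envelope latency model; all numbers are illustrative assumptions,
# not measurements from the paper.
PER_TOKEN_SECONDS = 0.05   # assumed decoding speed (~20 tokens/sec)

answer_tokens = 600        # assumed length of a full, sequentially generated answer
skeleton_tokens = 60       # assumed length of the skeleton outline
point_tokens = 140         # assumed length of the longest expanded point

# Sequential decoding: latency grows with the total number of output tokens.
sequential_latency = answer_tokens * PER_TOKEN_SECONDS

# SoT: the skeleton decodes first, then all points decode concurrently,
# so wall-clock time is bounded by the slowest single point.
sot_latency = (skeleton_tokens + point_tokens) * PER_TOKEN_SECONDS

print(f"Sequential: ~{sequential_latency:.0f}s, SoT: ~{sot_latency:.0f}s")
# Sequential: ~30s, SoT: ~10s -- roughly a 3x speed-up under these assumptions
```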
Anatomy of the SoT Framework: Understanding the Prompts
SoT uses two prompts to guide the LLM to generate an output efficiently.
1. Skeleton prompt
The process begins with an initial prompt that instructs the model to produce a structured skeleton of the intended answer. Kind of like bullet points, or an outline.
2. Point-Expanding Stage
Next, the LLM is prompted to expand on each point from the list. This expansion happens in parallel, enabling those latency gains we discussed earlier. For API-based models like OpenAI’s, this means making a separate API call for each item in the list and issuing those calls in parallel.
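To make the two stages concrete, here's a minimal sketch of what an SoT pipeline could look like with the OpenAI Python client. The prompt wording, model name, and token limits are our own illustrative choices, not the paper's exact templates:

```python
# Minimal Skeleton-of-Thought sketch using the OpenAI Python client.
# Prompt wording, model name, and token limits are illustrative assumptions,
# not the exact templates from the paper.
import re
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
MODEL = "gpt-3.5-turbo"

def ask(prompt: str, max_tokens: int) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def skeleton_of_thought(question: str) -> str:
    # Stage 1: ask for a short skeleton (a numbered list of brief points).
    skeleton = ask(
        "You're an organizer responsible for only giving the skeleton "
        "(not the full content) for answering the question. Provide the "
        "skeleton as a numbered list of 3-10 points, 3-5 words per point.\n"
        f"Question: {question}",
        max_tokens=150,
    )
    points = re.findall(r"^\s*\d+\.\s*(.+)$", skeleton, flags=re.MULTILINE)

    # Stage 2: expand every point in parallel, one API call per point.
    def expand(point: str) -> str:
        return ask(
            f"Question: {question}\nSkeleton:\n{skeleton}\n"
            f"Write 1-2 sentences expanding ONLY on this point: {point}",
            max_tokens=120,
        )

    with ThreadPoolExecutor(max_workers=len(points) or 1) as pool:
        expansions = list(pool.map(expand, points))

    return "\n".join(
        f"{i + 1}. {p}\n{e}" for i, (p, e) in enumerate(zip(points, expansions))
    )

print(skeleton_of_thought("What are the key benefits of regular exercise?"))
```

Because the point expansions are independent requests, the wall-clock time is roughly the skeleton call plus the slowest single expansion, rather than the sum of all of them.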
We put together a template so you can try out this method easily in PromptHub (link here).
If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.
Experiments: Setup
The researchers put SoT to the test with a few experiments. The goal was to investigate how SoT reduces the end-to-end latency across different models and question types.
These experiments covered a wide range of tasks, from code generation to complex, multi-faceted writing.
- Datasets: the Vicuna-80 dataset, which consists of 80 questions spanning nine categories.
- Models: 11 models, 9 open-source and 2 API-based.
- Benchmarks: SoT was compared to other typical prompting methods.
The results
Speed-up breakdown: Models
The first experiment was designed to see how SoT reduced latency on different models.
What jumps out is that SoT obtains a >2x speed-up in 6 out of 11 models.
Speed-up breakdown: Question categories
Next, the researchers broke down the speed-up gains by question category.
Latency Breakdown: SoT stages
The graph below presents the absolute latencies of normal and SoT-generated responses.
The decoding (token generation) phase accounts for the majority of the end-to-end latency.
Overall Quality
Let’s take a look at how SoT compares to normal generation when it comes to quality of output. To compare the answer quality of normal prompting to SoT, the researchers used two LLM-based evaluation frameworks: FastChat and LLMZoo.
Each pair of answers is presented to an LLM judge (ChatGPT-3.5 in this case), which is asked which one it prefers.
As we can see, SoT performs better than or equal to normal prompting ~80% of the time.
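FastChat and LLMZoo ship their own judge prompts and scoring logic; a stripped-down sketch of this style of pairwise comparison might look like the following (the judge prompt and model here are our simplification, not the frameworks' actual templates):

```python
# Simplified pairwise LLM-judge comparison -- an illustration only, not the
# actual FastChat or LLMZoo prompts.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are evaluating two answers to the same question.\n\n"
        f"Question: {question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Which answer is better overall? Reply with exactly 'A', 'B', or 'Tie'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()

# Example usage: verdict = judge(question, normal_answer, sot_answer)
```

Judge-based evaluations like this are often run with the answers in both orders to reduce position bias.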
Quality Breakdown: Question Categories
Let’s see how SoT performs across different question categories.
SoT performs relatively well on generic, common-sense, knowledge, roleplay, and counterfactual questions, and relatively poorly on writing, Fermi, math, and coding questions.
Let’s take math as an example. Math questions require step-by-step thinking. Without knowing the previous steps, it is going to be really hard to figure out the next step. A method like Tree of Thoughts would perform better here. In contrast, SoT requires the model to come up with the skeleton of the solution from the start and then figure out each individual step independently without referring to previous results.
The categories that SoT performed well on (counterfactual, knowledge, common sense, generic) all share the same characteristic: the ideal answer covers several relatively independent points.
SoT performs well when the question can be answered in several points whose details can be expanded independently. If the question requires step-by-step thinking, SoT will perform poorly.
Results Quality Breakdown: Metrics
Last up, the researchers looked at which aspects of SoT can either enhance or detract from the quality of the answers.
As we can see, SoT improves the diversity and relevance, while hurting the immersion and coherence of outputs.
- Coherence: SoT underperforms because it breaks the answer into points that are expanded independently, so the pieces don't always fit together smoothly.
- Immersion: SoT has a hard time maintaining a consistent role given the way the framework breaks the answer down into a skeleton.
- Diversity: The skeleton stage in SoT encourages LLMs to think from multiple perspectives.
- Relevance: In the skeleton stage the model is forced to propose several points related to the specific (relevant) question. In the point-expanding stage, LLMs are required to only discuss these points.
Wrapping up
The Skeleton of Thought framework is mainly focused on reducing latency, rather than increasing quality. This fresh take is interesting but has drawbacks. Combining this approach with other prompting methods could marry the best of both worlds, but parallel processing of chunks will always have coherence issues.
If you’re interested in other prompt engineering methods I would recommend checking out our other articles, or trying the prompts directly in PromptHub.
- Exploring Multi-Persona Prompting for Better Outputs
- How Tree of Thoughts Prompting Works
- Improve accuracy and reduce hallucinations with a simple prompting technique
Happy prompting!