Prompt engineering has evolved significantly in the past year, and best practices are slowly being established. In-context learning (ICL) via few-shot prompting and Chain-of-Thought prompting are two of the most prominent. While these patterns are effective, they result in longer and longer prompts, which means higher costs and latencies.
In a recent article, we discussed a method to optimize long prompts, but that focused on output quality, rather than controlling length and latency. This is why a recent (December 2023) research paper from Microsoft caught our eye. Titled LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, the researchers looked at how we can compress long prompts without losing out on performance.
In some cases, the researchers were able to compress a prompt by a factor of 20 while experiencing little to no degradation in performance! It’s almost unbelievable, and there are some nuances to account for, so let's dive in.
What is LLMLingua?
LLMLingua is a prompt compression framework/method. It takes a prompt as an input, goes through some steps, and outputs a compressed version of that prompt.
It consists of three major components:
- The budget controller: Controls how much each part of the prompt gets compressed. For example, the few-shot examples should be compressed more than the instructions. It operates at the sentence or demonstration level.
- Token-level prompt compression algorithm: Divides the prompt into segments and compresses them iteratively at the token level until a compression target is reached.
- Instruction tuning method to align the LLMs used in the process: LLMLingua uses a small model for various processes. Instruction tuning better aligns this model with the black-box LLM that will be used for the final generation (OpenAI or Anthropic’s models). This improves the accuracy and efficiency of the compressed prompts.
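Before digging into the internals, it helps to see the shape of the end-to-end flow. The paper ships with an open-source companion package (`llmlingua`), and usage looks roughly like the sketch below; the argument names and return keys are from memory and may differ in the version you install, so treat this as an illustration rather than the definitive API.

```python
# A rough sketch of the end-to-end flow using the open-source llmlingua package.
# Assumption: exact argument names and return keys may vary between versions.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small open-source LM used for compression

demonstrations = [
    "Q: A pen costs $2 and a notebook costs $3. How much for both? A: 2 + 3 = 5. The answer is 5.",
    "Q: Sam has 4 apples and buys 6 more. How many apples? A: 4 + 6 = 10. The answer is 10.",
]

result = compressor.compress_prompt(
    demonstrations,                      # the part that gets compressed the most
    instruction="Answer the math question step by step.",
    question="Q: A box holds 12 eggs. How many eggs are in 3 boxes?",
    target_token=200,                    # rough budget for the compressed prompt
)

# The compressed prompt is what you send to the black-box LLM (e.g., GPT-3.5-Turbo).
print(result["compressed_prompt"])
```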
How LLMLingua works
Let’s dive into each of the three major components of LLMLingua.
1. Budget Controller
Before any compression, the prompt is separated into different components: The instructions, demonstrations, and the question. The budget controller dynamically allocates different compression ratios to each component.
Generally, the instruction and question in a prompt are more critical and influential. On the other hand, there is often some level of redundancy in the demonstrations. Thus, more budget (less compression) is applied to the instruction and question, and more compression is applied to the demonstrations. In practice, the demonstrations are compressed first, and any remaining budget is allocated to the other components of the prompt.
Let’s run through how this actually works in practice:
- Start with a set of demonstrations from the original prompt
- The compression rate is configurable by the user
- Use a small LLM to compute the perplexity (more on this below) of each demonstration
- Keep the higher-perplexity (more informative) demos until the demonstration budget is reached; the more redundant, low-perplexity demos are dropped
- Once the loop completes, the remaining compression budget is allocated to the instruction and question components of the prompt
Perplexity quantifies how well a model predicts a sample: how likely the model is to generate that sequence based on its learned probabilities. Formally, it's the exponential of the average negative log-likelihood of the tokens.
- Low perplexity: The model predicts the sequence with high accuracy, suggesting the sequence is consistent with what the model 'expects', based on its training.
- High perplexity: The sequence is less predictable. The model is more 'surprised' by it.
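To make the budget controller concrete, here's a minimal sketch (simplified, not the paper's exact algorithm): score each demonstration's perplexity with a small causal LM, then keep the most informative (highest-perplexity) demos until a demonstration token budget runs out. The model name and the budget-based selection loop are stand-ins for illustration.

```python
# A minimal sketch of the budget controller idea (simplified; not the paper's
# exact algorithm): score each demo by perplexity under a small LM, then keep
# the most "surprising" (informative) demos until a token budget is exhausted.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SMALL_MODEL = "gpt2"  # stand-in; the paper uses Alpaca-7B or GPT2-Alpaca
tokenizer = AutoTokenizer.from_pretrained(SMALL_MODEL)
model = AutoModelForCausalLM.from_pretrained(SMALL_MODEL)
model.eval()

def perplexity(text: str) -> float:
    """exp(mean negative log-likelihood of the tokens) under the small model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token NLL
    return math.exp(loss.item())

def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    """Keep high-perplexity (less redundant) demos until the budget is used up."""
    ranked = sorted(demos, key=perplexity, reverse=True)
    kept, used = [], 0
    for demo in ranked:
        n_tokens = len(tokenizer(demo).input_ids)
        if used + n_tokens > token_budget:
            break
        kept.append(demo)
        used += n_tokens
    return kept
```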
2. Token-level prompt compression algorithm
At this point, we've dropped the redundant, low-perplexity demos and kept the more informative ones within the demonstration budget.
Now we move from sentence-level compression to token-level compression.
Here are the steps (a rough code sketch follows the list):
- Start with an empty set T (which will later be filled with tokens)
- Take the prompt, now made up of the original instruction and question along with the compressed demos, and divide it into segments
- Iterate over the segments and calculate the conditional probabilities for the tokens within
- If the conditional probability of a token exceeds a certain threshold (i.e., the token is highly predictable), it is dropped; the compressed segment is then added to set T. This process removes tokens that contribute less to the overall context of the prompt.
- This process continues until all segments have been evaluated
- The final prompt is assembled by combining all the compressed segments
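Here's a rough sketch of that loop, under the same simplifying assumptions as before: the small LM scores each token of a segment conditioned on everything kept so far, and tokens the model finds highly predictable are dropped. The probability threshold and the way segments are handled are illustrative, not the paper's exact values.

```python
# A rough sketch of iterative token-level compression (simplified): score each
# token of a segment conditioned on everything kept so far, and drop tokens the
# small model finds highly predictable. Threshold and segmentation are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in small model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def compress_segment(kept_text: str, segment: str, prob_threshold: float = 0.9) -> str:
    """Keep only the tokens of `segment` whose conditional probability, given
    the text kept so far, is below `prob_threshold` (i.e., the surprising ones)."""
    prefix_ids = tokenizer(kept_text, return_tensors="pt").input_ids
    seg_ids = tokenizer(segment, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, seg_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    offset = prefix_ids.shape[1]
    kept_tokens = []
    for i in range(seg_ids.shape[1]):
        # logits at position j predict the token at position j + 1
        probs = torch.softmax(logits[0, offset + i - 1], dim=-1)
        if probs[seg_ids[0, i]].item() < prob_threshold:
            kept_tokens.append(seg_ids[0, i].item())
    return tokenizer.decode(kept_tokens)

def iterative_compress(instruction: str, segments: list[str]) -> str:
    """Walk over segments, compressing each one conditioned on what was kept.
    Assumes a non-empty instruction as the starting prefix."""
    compressed = instruction
    for seg in segments:
        compressed += " " + compress_segment(compressed, seg)
    return compressed
```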
3. Instruction tuning for model alignment
The final step ensures that the smaller model used in the budget controller and token-level compressor aligns with the larger model that will process the final, compressed prompt.
This step essentially involves transferring knowledge from a larger, more capable model to a smaller, more efficient one, ensuring the smaller model performs tasks similarly to the larger one.
Benefits of better alignment:
- Consistency in Probability Estimation: The small model estimates the conditional probabilities of token segments during compression. Aligning it with the behavior of the larger model ensures more accurate probability estimates.
- Improved Compression Decisions: The compression algorithm uses probability estimates to decide which text parts to compress. An aligned small model will make more effective decisions, preserving the semantic integrity of the compressed text.
- Transfer of Knowledge: Large models have a large knowledge base. Aligning the small model to the large model enables the transfer of this understanding.
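The mechanics here are essentially distillation-style instruction tuning. As a loose sketch of the data-generation half (an assumption about the workflow, not the paper's exact recipe): collect responses from the target black-box model for a set of instructions, then run standard supervised fine-tuning of the small model on those pairs.

```python
# A loose sketch of the alignment idea (assumption: details differ from the paper):
# use the black-box target LLM to generate responses for a set of instructions,
# then fine-tune the small model on those (instruction, response) pairs so its
# token probabilities better mirror the large model's behavior.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def build_alignment_data(instructions: list[str], out_path: str = "alignment_data.jsonl"):
    with open(out_path, "w") as f:
        for instr in instructions:
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": instr}],
            )
            record = {"instruction": instr, "response": resp.choices[0].message.content}
            f.write(json.dumps(record) + "\n")

# The resulting JSONL is then used for standard supervised fine-tuning of the
# small model (e.g., with the Hugging Face Trainer), which is omitted here.
```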
Experiment Setup
LLMLingua was put to the test across various datasets.
Datasets
- GSM8K: A dataset focused on mathematical reasoning, testing LLMLingua's ability to compress prompts in domains requiring logical and numerical understanding.
- BBH (Big-Bench Hard): A dataset that includes tasks that require complex reasoning, testing LLMLingua in contexts that demand high cognitive capabilities.
- ShareGPT: This dataset is centered around conversational tasks, evaluating LLMLingua's performance in compressing prompts for dialogue-based scenarios.
- Arxiv-March 2023: A summarization dataset derived from scientific articles, this set tests LLMLingua's ability to effectively condense and convey information.
Evaluations
For GSM8K and BBH: Exact match was used as the primary evaluation metric. This measures whether the generated response exactly matches the expected answer.
For ShareGPT and Arxiv-March23: A combination of BLEU, ROUGE, and BERTScore metrics was used to assess the quality of outputs in comparison to human-generated texts.
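For reference, here's roughly how those two styles of scoring can be computed. This is generic metric code (exact match plus ROUGE via the Hugging Face `evaluate` library), not the paper's evaluation scripts.

```python
# Generic scoring sketch: exact match for GSM8K/BBH-style answers, and ROUGE
# (via the Hugging Face `evaluate` library) for the generation-style tasks.
import re
import evaluate

def exact_match(prediction: str, reference: str) -> bool:
    """Normalize whitespace and case, then compare the answers directly."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(prediction) == norm(reference)

rouge = evaluate.load("rouge")

def generation_scores(predictions: list[str], references: list[str]) -> dict:
    """ROUGE scores for summarization/dialogue outputs vs. references."""
    return rouge.compute(predictions=predictions, references=references)
```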
Implementation details
- Models used: GPT-3.5-Turbo-0301 and Claude-v1.3
- Small model used for compression: Alpaca-7B or GPT2-Alpaca
Baselines
The team compared LLMLingua against several baseline methods.
- GPT4-Generation: Directly instructing GPT-4 to compress the original prompt
- Random Selection: Randomly selects which elements of the original prompt to compress
- Selective-Context: Utilizing phrase-level self-information from a small language model, this method filters out less informative content from the prompt. It aims to retain the most informative or critical parts while removing less significant content (more info here).
Experiment results
Let’s start with the ShareGPT and Arxiv-March23 datasets. A few notes for reading the results:
- "Sentence Selection" = the Random Selection baseline
- "2x constraint" = compress the original prompt to half its size
- "3x constraint" = compress the original prompt to a third of its size
Now for some takeaways:
- LLMLingua achieved acceleration ratios of 9x and 3.3x (the end-to-end process was 9 and 3.3 times faster, respectively)
- High BS F1 scores indicate successful retention of semantic info from the original prompts
- Random sentence selection ("Sentence Selection" in the table) outperformed LLMLingua twice and comes relatively close in many other cases
- Under the 2x constraint, all three baselines perform similarly, with an average difference of about 4%. This suggests that at lower compression levels, for use cases related to comparing or summarizing texts, any of these methods could work.
- LLMLingua is less sensitive to higher compression ratios: when moving from 2x to 4x compression, its performance decreases the least
Next up: GSM8K and BBH, the reasoning and in-context learning benchmarks
- "1-shot constraint" = the model was given one example in the prompt
- 1/t = compression ratio
Some learnings:
- With a 1-shot constraint, the LLMLingua compressed prompt achieved slightly higher results than the full-shot prompt at compression ratios of 5x and 3x.
- As compression ratios increased under half-shot and quarter-shot constraints, there was a slight decline in performance. On the GSM8K dataset, the Exact Match (EM) scores decreased by 1.44 and 1.52, respectively, at 14x and 20x compression ratios. Seems like a small degradation given the level of compression.
- In contrast to the first set of results, LLMLingua easily beats the other compression baselines
- Even at 20x compression, GSM8K EM scores remain high, dropping by less than 2 points
- These points suggest that LLMLingua’s effectiveness varies based on the task. It appears to be very effective on reasoning tasks (GSM8K and BBH), while only being moderately better on conversational and summarization tasks (ShareGPT and Arxiv-March2023).
Don’t forget about Claude!
For “cost reasons” the researchers only tested Claude-v1.3 on the GSM8K dataset. They also buried it deep in the paper and left it off the main table of results.
- LLMLingua showed improvements over the simple prompt by 0.8 and 1.7 EM points at compression ratios of 5x and 14x, respectively.
- It's worth noting that the Simple Prompt score here is higher than the Simple Prompt score in the table above (74.9), showing that with just a simple prompt, Claude beats out GPT-3.5-Turbo in this case.
- Maybe we shouldn’t be surprised that the Microsoft researchers buried this, but it looks like LLMLingua was most effective with Claude-v1.3 compared to GPT-3.5-Turbo
Ablation study
Now for my favorite part. The researchers tested several ablated variants of LLMLingua to see which components contribute to the overall performance.
- LLMLingua w/o Iterative Token-level Compression: This variant performs token-level compression in a single step rather than iteratively. The EM score decreased significantly from 79.08 to 72.93, indicating that iterative compression is important.
- LLMLingua w/o Budget Controller: This variant applies the same compression ratio across all prompt components. The EM score dropped to 73.62, showing that dynamically allocating compression ratios to different parts of the prompt is worthwhile.
- LLMLingua w/o Dynamic Compression Ratio: This variant uses a static compression ratio for all components, resulting in a lower EM score of 77.26. Not a huge drop.
- LLMLingua w/ Random Selection in Budget Controller: Instead of selecting sentences based on perplexities and conditional probabilities, this variant randomly selects them. The EM score took a big drop to 72.78.
- LLMLingua w/o Distribution Alignment: By removing the distribution alignment component, the model directly uses the pre-trained LLaMA-7B small language model. The slight decrease in the EM score to 78.62 indicates that the alignment process may not be critical.
- LLMLingua w/ Remove Stop Words: Removes stop words from the original prompts.
Other findings and limitations
- As the compression ratio increases, the length of the output decreases (with some variance)
- This could be good: it reduces work in the generation stage, which is the chief contributor to latency (see here)
- This could be bad: You may lose out on some of the good stuff!
LLMLingua has its limitations and reaches a compression plateau.
- Big performance drop when reaching really high compression ratios
- LLMLingua’s ("Ours") drop occurs at comparatively higher compression ratios
Wrapping up
Let’s finish with an example.
Say you have a prompt that is 2,000 tokens long, you're using GPT-4, which currently costs $0.03 per 1,000 prompt tokens, and you make 2,000 requests per month. Using LLMLingua, let’s compress it by 10x.
Initial prompt
- Length: 2,000 tokens
- Cost: 2,000 tokens * $0.03 per 1,000 tokens * 2,000 requests/month = $120.00/month
Compressed prompt - 10x compression
- Length: 200 tokens
- Cost: 200 tokens * $0.03 per 1,000 tokens * 2,000 requests/month = $12.00/month
That’s a 10x reduction in cost! Of course, you’d need to ensure performance is stable. You may need to reduce the compression rate, or maybe you can go even higher.
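For anyone who wants to plug in their own numbers, the arithmetic above boils down to a one-line helper; the price and request volume below are just the assumptions from this example.

```python
# The cost arithmetic above, as a small helper. The $0.03/1K-token price and
# the request volume are just the assumptions from this example.
def monthly_prompt_cost(prompt_tokens: int, requests_per_month: int,
                        price_per_1k_tokens: float = 0.03) -> float:
    return prompt_tokens / 1000 * price_per_1k_tokens * requests_per_month

original = monthly_prompt_cost(2000, 2000)    # $120.00
compressed = monthly_prompt_cost(200, 2000)   # $12.00
print(f"${original:.2f}/month -> ${compressed:.2f}/month")
```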
LLMLingua can have a huge impact for anyone building prompts into production, but there are several nuances that come along with it. We’ve taken the time to iron out these nuances (what if your prompt doesn’t have clear distinctions between instructions and questions?) and are launching these compression capabilities directly into PromptHub. Right now it is early access only, so reach out if you're interested!