If you’ve spent any time writing prompts, you’ve probably noticed just how sensitive LLMs are to minor changes in the prompt. For example, look at the two prompts below. The semantic differences are minor, but the performance difference is huge. Try to guess which one is better.
This is why prompt testing is so critical. It is hard to predict how these little changes will affect performance; the knife can cut in either direction.
Luckily, there has been a recent flurry of papers related to prompt sensitivity. In this article, we’ll dive deep into the latest research, the implications of prompt sensitivity, and what you need to do if you’re using LLMs in any type of application.
For reference, these are the three papers we’ll be pulling data and insights from:
- How are Prompts Different in Terms of Sensitivity?
- What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering
- On the Worst Prompt Performance of Large Language Models
What is prompt sensitivity?
Prompt sensitivity refers to the model’s acute responsiveness to even minor variations in the prompt. The higher the sensitivity, the greater the variation in the output. Every model experiences some level of prompt sensitivity.
For example, the chart below shows how even a minor syntactic rephrasing of the prompt can lead to a complete change in the distribution of outputs.
What is prompt consistency?
Prompt consistency measures how uniform the model's predictions are across different samples of the same class.
What’s the difference between consistency and sensitivity?
Sensitivity measures the variation in a model's predictions due to different prompts, while consistency assesses how uniform these predictions are across samples of the same class.
High consistency combined with low sensitivity indicates stable model performance.
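To make these definitions concrete, here is a minimal sketch of how you might estimate both quantities for a classification task. It is not the papers' exact formulation (they define sensitivity more formally), and `classify` is a stand-in for whatever model call you use:

```python
from collections import Counter
from typing import Callable

def sensitivity(classify: Callable[[str, str], str],
                prompt_variants: list[str], text: str) -> float:
    """Fraction of prompt paraphrases whose prediction disagrees with the
    majority label for the same input. 0.0 = every paraphrase agrees."""
    preds = [classify(p, text) for p in prompt_variants]
    majority_count = Counter(preds).most_common(1)[0][1]
    return 1 - majority_count / len(preds)

def consistency(classify: Callable[[str, str], str],
                prompt: str, same_class_samples: list[str]) -> float:
    """Fraction of same-class samples that receive the majority label under
    a single prompt. Higher = more uniform predictions."""
    preds = [classify(prompt, s) for s in same_class_samples]
    majority_count = Counter(preds).most_common(1)[0][1]
    return majority_count / len(preds)
```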
How different prompt engineering methods affect sensitivity
Across all three papers, there were various experiments that analyzed how different models and prompt engineering methods related to sensitivity and performance.
We’ll begin with the different prompt engineering methods, starting with the paper How are Prompts Different in Terms of Sensitivity?
The researchers tested 8 methods:
Let’s look at the results, broken down by model.
You’ll see that there is a strong negative correlation between accuracy and sensitivity: when sensitivity goes up, accuracy goes down.
Impact of human-designed vs. LM-generated prompts on sensitivity
The researchers then tested how human-designed prompts compared to LM-generated prompts in regard to accuracy and sensitivity. Base_b was the human-designed prompt, and APE (Automatic Prompt Engineer) was the LM-generated prompt.
As you can see, overall, the two prompts lead to similar accuracy and sensitivity on the given datasets, signaling that human-designed and LM-generated prompts have similar effects.
Generated Knowledge Prompting
The next prompting method analyzed was Generated Knowledge Prompting (GKP). With GKP, you first have the LLM generate relevant knowledge and then include that knowledge as additional context in the prompt.
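Here's a minimal sketch of that two-step pipeline, with a generic `llm` completion function standing in for your model call; the prompt wording is illustrative, not taken from the paper:

```python
from typing import Callable

def generated_knowledge_prompt(llm: Callable[[str], str],
                               question: str, n_facts: int = 3) -> str:
    # Step 1: ask the model to generate background knowledge about the question.
    knowledge = llm(
        f"Generate {n_facts} short, relevant facts about the following question:\n"
        f"{question}"
    )
    # Step 2: prepend that knowledge to the final prompt as extra context.
    return (
        f"Knowledge:\n{knowledge}\n\n"
        f"Using the knowledge above, answer the question:\n{question}"
    )
```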
As you can see from the results, GKP led to higher accuracy and lower sensitivity most of the time.
This suggests that including instructions and generated knowledge has cumulative effects on performance.
Chain-of-Thought Prompting
Next up was Chain-of-Thought (CoT) prompting, one of the more popular techniques. This approach involves structuring prompts to guide the model through a logical reasoning process, potentially enhancing its ability to derive correct conclusions.
The table above shows that CoT leads to similar accuracy but higher sensitivity compared to the base_b prompt.
Some more CoT data:
As you can see in the graph above, CoT_base_a outperforms base_a, but is worse than base_b in most cases. This suggests that for these datasets, reasoning chains do improve performance, but they are not as effective as instructions (GKP).
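For reference, a zero-shot CoT prompt is often nothing more than the base prompt plus an explicit reasoning cue. A minimal sketch (the wording is illustrative, not the paper's template):

```python
def cot_prompt(question: str) -> str:
    # Zero-shot CoT: the base prompt plus an explicit reasoning cue.
    return (
        f"Q: {question}\n"
        "A: Let's think step by step, then give the final answer."
    )
```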
Simple, detailed, and 1-shot prompting
We’ll continue our analysis of different prompting strategies and how they relate to sensitivity and, by proxy, accuracy, turning now to a different paper: What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering
This paper tested three different prompt engineering methods across a variety of models and datasets (a quick sketch of all three follows the list):
- Simple: The prompt consists of just the task description.
- Detail: A detailed description of the task is provided.
- 1-shot: Similar to simple, but includes one example to illustrate the task.
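To make the distinction concrete, here's a rough sketch of how the three variants might be assembled; the task description, detailed instructions, and example are placeholders, not the paper's actual templates.

```python
# Placeholder task, instructions, and example -- not the paper's actual templates.
TASK = "Classify the sentiment of the review as positive or negative."
DETAIL = (
    "You will be given a product review. Read it carefully and decide whether "
    "the overall sentiment is positive or negative. Respond with a single word."
)
EXAMPLE = 'Review: "Battery died after two days." -> negative'

def build_prompt(review: str, style: str) -> str:
    if style == "simple":   # task description only
        return f"{TASK}\n\nReview: {review}"
    if style == "detail":   # detailed description of the task
        return f"{DETAIL}\n\nReview: {review}"
    if style == "1-shot":   # simple prompt plus one worked example
        return f"{TASK}\n\n{EXAMPLE}\n\nReview: {review}"
    raise ValueError(f"unknown style: {style}")
```

Here's how the three methods fared: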
- Simple and Detail prompting are more effective across all metrics for Llama3 and GPT-3.5
- Detail and 1-shot tend to work better for Mixtral 8x7B and GPT-4o.
- There was no consistent pattern for which method delivered the best sensitivity, consistency, and F1
It's mentioned above, but it's worth repeating: look again at Llama3 and GPT-4o. The best-performing method for each is completely different, reinforcing the idea that one size does not fit all.
This highlights an important point for developers and teams using LLMs: You need to extensively test your prompts when switching from one LLM to another. A prompt that worked well for one model might lead to instability and decreased performance with another.
Self-refinement, voting and distillation prompting
Turning to our third and last paper, On the Worst Prompt Performance of Large Language Models, we’ll look at a few more prompting methods.
In an effort to enhance the performance of prompts that underperformed due to high sensitivity, the researchers tested several prompt engineering methods:
Raw: This method uses the original, unaltered prompts to establish a baseline performance.
Self-refinement: This method involves the LLM iteratively refining prompts based on the model's previous outputs to enhance performance.
Voting: This approach aggregates the outputs from multiple variations of the prompt and lets the model vote for the best result to improve reliability.
Distillation: This technique involves training the model to generalize better by distilling knowledge from multiple training iterations into a single, more robust model.
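Of these, voting is the easiest to reproduce yourself: run the same query through several paraphrased prompts and keep the most common answer. Here's a minimal sketch of that plain majority-vote variant (not necessarily the paper's exact procedure), with a generic `llm` function standing in for your model call:

```python
from collections import Counter
from typing import Callable

def majority_vote(llm: Callable[[str], str],
                  prompt_paraphrases: list[str], question: str) -> str:
    """Run the same question through each paraphrased prompt and keep the
    answer that appears most often."""
    answers = [llm(f"{p}\n\n{question}") for p in prompt_paraphrases]
    return Counter(answers).most_common(1)[0][0]
```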
- Self-Refinement: Significantly decreased performance for Llama-2-7b, Llama-2-13b, and Llama-2-70b chat models, with declines of 10.04%, 13.15%, and 13.53% respectively
- Voting: Boosted the worst-case performance significantly (e.g., a 21.98% increase for Llama-2-70B-chat), though it reduced the best and average performances by 6.71% for Llama-2-13b-chat.
- Distillation: Improved consistency but reduced overall performance, likely due to overfitting to lower-quality, self-generated outputs, showcasing the difficulty of balancing refinement with the risk of bias or errors.
Which parts of the prompt are the most sensitive?
As we saw above, different prompt engineering methods focus on different components, such as instructions, examples, and chains of reasoning.
The researchers broke down which components of the prompt affect the output the most, i.e., which are the most sensitive.
The image below displays these sensitivity scores:
- S_input (4.33): Shows moderate sensitivity, indicating that direct inputs to the model have a substantial impact on the output.
- S_knowledge (2.56): Demonstrates (surprisingly) lower sensitivity
- S_option (6.37): Indicates a higher sensitivity, which implies that the options or choices presented within the prompt are critical in shaping the model's response
- S_prompt (12.86): Exhibits the highest sensitivity, underscoring the significant effect of the overall prompt structure on the model's behavior.
The main takeaway here is that the prompt instructions will have the most impact in guiding the model’s response. This is why we always tell teams that writing clear and specific instructions is step 1 in the prompt engineering process.
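You can probe this on your own prompts with a simple ablation: hold everything else fixed, paraphrase one component at a time, and count how often the answer flips. A rough sketch (this is not the paper's exact sensitivity metric, and `classify` stands in for your model call):

```python
from typing import Callable

def component_sensitivity(classify: Callable[[str], str],
                          prompt_parts: dict[str, str],
                          component: str,
                          paraphrases: list[str]) -> float:
    """Fraction of paraphrases of one component that flip the prediction,
    with every other component held fixed."""
    base_pred = classify("\n".join(prompt_parts.values()))
    flips = 0
    for alt in paraphrases:
        # Swap in an alternative wording for just this component.
        perturbed = {**prompt_parts, component: alt}
        if classify("\n".join(perturbed.values())) != base_pred:
            flips += 1
    return flips / len(paraphrases)
```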
Which models are the most sensitive?
We’ve taken a deep dive into how different prompt engineering methods affect sensitivity, but what about different models?
We’ll turn our focus to On the Worst Prompt Performance of Large Language Models.
The researchers created their own dataset, ROBUSTALPACAEVAL, to better reflect real-world user queries than other popular benchmarks do. It covers a broad range of phrasings by pairing each query with semantically equivalent paraphrases.
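Evaluating on a dataset like this boils down to scoring the model on every paraphrase of a query and then summarizing the spread. Here is a minimal sketch of that summary step; the per-paraphrase scoring function is assumed, and the example numbers are placeholders apart from the 0.094 and 0.549 endpoints reported for Llama-2-70B-chat below:

```python
import statistics

def robustness_summary(scores_per_paraphrase: list[float]) -> dict[str, float]:
    """Summarize one model's scores across the paraphrases of a query."""
    return {
        "worst": min(scores_per_paraphrase),
        "best": max(scores_per_paraphrase),
        "average": statistics.mean(scores_per_paraphrase),
        "stdev": statistics.stdev(scores_per_paraphrase),  # higher = less robust
    }

# A wide worst-to-best gap is the signature of high prompt sensitivity.
print(robustness_summary([0.094, 0.21, 0.35, 0.549]))
```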
The table below shows the model performance on their own dataset ROBUSTALPACAEVAL.
- A larger gap between the worst and best performance indicates higher sensitivity in the model
- Llama-2-70B-chat had a large range, from 0.094 to 0.549. This huge range shows how sensitive LLMs can be: semantically equivalent prompts can lead to vastly different results.
- Although scaling up model sizes enhances performance, it does not necessarily improve robustness or decrease sensitivity
- For instance, the Llama-2-7b, Llama-2-13b, and Llama-2-70b chat models show improved instruction-following performance, rising from 0.195 to 0.292; however, robustness slightly declines, as indicated by an increase in standard deviation from 0.133 to 0.156.
- Similarly, scaling up the Gemma models increases average performance (from 0.153 in the 2b model to 0.31 in the 7b model) but results in lower robustness (0.191 compared to 0.118 in the 2b model).
Having identified the worst-performing prompts for different models, the study next explored specific trends, particularly:
- Whether the worst prompts overlapped across models
- Whether prompt rankings were consistent across various models
The following graph tells the story:
- The consistency between the worst prompts across all models (the red line) is really low. This shows that there is no such thing as a “bad prompt”; it is all relative to the model.
- You see somewhat better consistency when looking within the same family of models, but the rate is still low. This suggests that even individual models within the same family have their own unique strengths and weaknesses.
- It is essentially impossible to characterize the worst prompts without knowing the model
For more information on how models differ in relation to price, performance, and other key metrics, check out our LLM Model Card Directory.
Who is better at picking the better prompt, humans or LLMs?
Remember that first example we looked at, where I asked you to see if you could guess which prompt scored better? If you got it wrong, don’t feel bad; you probably did about as well as ChatGPT.
The researchers tested the model’s ability to discern prompt quality by presenting it with two prompts and asking it to pick the one that would “be more likely to yield a more helpful, accurate, and comprehensive response”. I was shocked at the performance.
All the models were right around the 50% mark, which is the same performance you would get if you just guessed randomly. So if the models can’t discern what a better performing prompt looks like, how could we depend on them to write prompts for us?
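If you want to run this sanity check on your own prompts, the setup is simple: show the model two candidate prompts, ask it to pick the likely winner, and compare its picks against measured performance. A rough sketch, with a generic `llm` function standing in for your model call and the instruction wording adapted from the quote above:

```python
from typing import Callable

def pick_better_prompt(llm: Callable[[str], str],
                       prompt_a: str, prompt_b: str) -> str:
    """Ask the model which of two prompts it expects to perform better."""
    choice = llm(
        "Here are two prompts for the same task.\n\n"
        f"Prompt A:\n{prompt_a}\n\n"
        f"Prompt B:\n{prompt_b}\n\n"
        "Which prompt would be more likely to yield a more helpful, accurate, "
        "and comprehensive response? Answer with exactly one letter: A or B."
    )
    # Naive parse of the judge's answer.
    return prompt_a if choice.strip().upper().startswith("A") else prompt_b
```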
Wrapping up
Prompts are tricky, LLMs are tricky, and this whole new stack brings a set of problems that we aren’t used to solving in traditional software development. Slight changes in a prompt can significantly impact performance. This underscores the importance of having a prompt management tool (we can help with that) where you can easily test, version, and iterate on prompts as new models continue to come out.