If you’ve spent any time writing prompts, you’ve probably noticed just how sensitive LLMs are to minor changes in the prompt. For example, look at the two prompts below. The semantic differences are minor, but the performance difference is huge. Try to guess which one is better.

Two similar prompts on top of each other
Find the answer in my LinkedIn post

This is why prompt testing is so critical. It is hard to know just how these little changes will affect performance; the knife can cut in either direction.

Luckily, there has been a recent flurry of papers related to prompt sensitivity. In this article, we’ll dive deep into the latest research, the implications of prompt sensitivity, and what you need to do if you’re using LLMs in any type of application.

For reference, these are the three papers we’ll be pulling data and insights from:

  • How are Prompts Different in Terms of Sensitivity?
  • What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering
  • On the Worst Prompt Performance of Large Language Models

What is prompt sensitivity

Prompt sensitivity refers to the model’s acute responsiveness to even minor variations in the prompt. The higher the sensitivity, the greater the variation in the output. Every model experiences some level of prompt sensitivity.

For example, the chart below shows how even a minor syntactic rephrasing of the prompt can lead to a complete change in the distribution of outputs.

Two bar charts showing prompt performance on top of each other
Different prompt variants lead to extremely different output distributions
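One lightweight way to put a number on this is to compare the label distributions you get back from two paraphrased prompts. Below is a minimal sketch; the prediction lists and the total-variation metric are illustrative choices of mine, not the exact measure used in the papers.

```python
# A minimal sketch of one way to quantify prompt sensitivity: compare the label
# distributions produced by two semantically equivalent prompt variants.
# The prediction lists below are made-up placeholders, not data from the papers.
from collections import Counter


def label_distribution(predictions):
    """Turn a list of predicted labels into a probability distribution."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(dist_a, dist_b):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    labels = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(l, 0.0) - dist_b.get(l, 0.0)) for l in labels)


# Hypothetical predictions from running the same dataset through two prompt variants.
variant_8_preds = ["positive", "positive", "negative", "positive", "neutral"]
variant_10_preds = ["negative", "negative", "negative", "positive", "negative"]

shift = total_variation(label_distribution(variant_8_preds), label_distribution(variant_10_preds))
print(f"Distribution shift between variants: {shift:.2f}")
```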

What is prompt consistency

Prompt consistency measures how uniform the model's predictions are across different samples of the same class.

What’s the difference between consistency and sensitivity?

Sensitivity measures the variation in a model's predictions due to different prompts, while consistency assesses how uniform these predictions are across samples of the same class.

High consistency combined with low sensitivity indicates stable model performance.
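As a companion to the sensitivity sketch above, here is one way to operationalize consistency: how often predictions agree with the majority prediction within each gold class. The data and the exact formula are illustrative assumptions, not the paper’s definition verbatim.

```python
# A minimal sketch of prompt consistency: for each gold class, how often does the
# model give its most common prediction for samples of that class?
# The labels below are made up purely for illustration.
from collections import Counter, defaultdict


def consistency(gold_labels, predictions):
    """Average agreement with the majority prediction within each gold class."""
    by_class = defaultdict(list)
    for gold, pred in zip(gold_labels, predictions):
        by_class[gold].append(pred)
    per_class = [
        Counter(preds).most_common(1)[0][1] / len(preds) for preds in by_class.values()
    ]
    return sum(per_class) / len(per_class)


gold = ["spam", "spam", "spam", "ham", "ham"]
preds = ["spam", "spam", "ham", "ham", "ham"]
print(f"Consistency: {consistency(gold, preds):.2f}")  # 1.00 would be perfectly uniform
```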

Hey everyone, how's it going? This is Dan here from PromptHub. I'm enjoying this Saturday here in New York, and today we're going to talk a little bit about prompt sensitivity and model consistency. Generally, what that means is that little variations in a prompt can lead to wildly different effects, which, if you've done any prompt engineering or worked on any AI systems, I'm sure you can empathize with. But we will start off with a little quiz here.

There are two prompts, one of which performs 10% better than the other. Do you think you can guess which one it is? I'll give you a quick second to read them through. If you guessed B, you're correct. I got this wrong when I was first reading it in a paper some time ago, and I think it really just points to the fact that these are semantically the same, right? They're very, very similar. I think the key difference here in the performance comes down to the verb usage: classify versus categorize, which maybe to you or me didn't really seem like a big deal, but obviously to the model made a huge difference. A 10% performance gain is huge.

So, we're going to talk a lot about this topic of sensitivity, and luckily there was a flurry of papers around this that all came out within a week of each other, so there's a lot of data to fall back on. Setting the stage: what is prompt sensitivity? It refers to the model's acute responsiveness to even minor changes—little changes that cause big output differences. Here's an example of two different prompts, variant number 10 and variant number 8, and their predictions when running through a dataset. The difference in these prompt variants was very minor; the wording changed but the meaning stayed the same, yet you can see how it can flip the output almost directly on its head.

We'll start with the first paper, How are Prompts Different in Terms of Sensitivity?, and look at different prompt engineering methods and how they relate to prompt sensitivity. They tested a whole bunch of different prompts: base prompts and minor variants of them, chain of thought, APE (Automatic Prompt Engineer, which uses an LLM to write the prompt), GKP (Generated Knowledge Prompting), and more. We'll look at each in turn.

Before doing any head-to-head comparisons, we'll look at things at the model level. Overall, you'll see a strong negative correlation between accuracy and sensitivity: as sensitivity goes up, accuracy goes down. These are the models that were used in the experiments here. No GPT-4, which is unfortunate, but I think a lot of the learnings can be extrapolated.

A couple of head-to-heads here: looking at human-designed versus LLM-generated prompts on sensitivity. APE stands for Automatic Prompt Engineer, and base B was just a basic prompt they wrote, nothing crazy. For some models the two are pretty similar accuracy-wise and close on sensitivity; for others there's a bigger difference on the accuracy side, but base B is also much less sensitive. So it's a give and take; I think overall they more or less even out, which is an important point to note.

Next up is Generated Knowledge Prompting, which we just wrote an article about, linked below. Basically, what Generated Knowledge Prompting does is it has the model generate some knowledge about whatever the task or question is before answering it. It's kind of like reasoning in that way, but slightly different. If the question was about which country is the largest in Africa, you would first tell it to think about population sizes in Africa across different countries and then generate an answer. We can see it generally outperforms the base: higher accuracy, lower sensitivity. This was definitely one of the higher performing ones, suggesting that generating that knowledge has a solid effect on performance.

Next up was Chain of Thought, which led to higher accuracy but also higher sensitivity, which kind of goes against the negative correlation we were seeing before. So it's definitely worth noting. Part of this could be because Chain of Thought has a lot more tokens in the prompt, and I think as you add more tokens, you're just opening up to more surface area for prompt sensitivity.

Some more Chain of Thought breakdowns: Chain of Thought Base A typically outperforms normal Base A, but is worse than Base B in many cases. These are just different prompting methods listed in that table, suggesting that for certain datasets, reasoning chains may help performance, but they are not as effective as Generated Knowledge Prompting.

Moving on to the next paper, which looked at a few different prompting methods: Simple (straightforward vanilla prompting), Detail (much more detailed instructions), and One Shot (simple plus an example). The breakdown shows Simple and Detail are more effective for Llama 3 and GPT-3.5, while for GPT-4o it's almost all in the One Shot category and Mixtral leans toward Detail and One Shot. The red marks the best-performing method for that dataset, not specifically for the model. Again, we're seeing that different methods have varying effects on performance, sensitivity, and consistency. This reinforces something we talk about a lot: what works for one model is not necessarily the best method for another; there are lots of intricacies even within the same family.

Moving on, we'll look at our last paper, which was interesting. They created a dataset called Robust Alpaca Eval, taking known datasets and trying to make the queries more reflective of real-world use cases. Many datasets can be cherry-picked and not really map to production use cases. The researchers tried to pull stuff that was more real-world focused and made a bunch of semantically equivalent options. They tested a few methods here: self-refinement, voting, and distillation. We see the original performance, the worst, the best, the average, and the standard deviation for various models. Self-refinement showed big drops for Llama 2 7B, the 7 billion parameter model. Voting showed some wins for the largest version of the model.

As we discussed earlier, different prompt engineering methods lead to different outcomes. Generated Knowledge versus Chain of Thought: reasoning chains versus additional knowledge had different effects on the outputs. They did an additional breakdown where they looked at the sensitivity of each prompt component. The instructions had the highest sensitivity, meaning they matter most in shaping the output. Knowledge and input were lower on the totem pole.

Coming back to the dataset and performances across models, the standard deviation and the delta between the worst and best performance show how sensitive the model is. Llama 2 70B has a huge gap, from roughly 0.09 to 0.55, the largest of any model. Scaling up enhances performance but doesn't necessarily improve robustness or decrease sensitivity. Bigger models don't always solve problems.

The researchers then asked if there were any overlaps in prompts that don't perform well. If certain prompt types didn't work well across models, that would be great to know. However, the overlap was basically at zero, even within the same family. No universally bad prompt exists; it's very model-dependent. You need to continue testing across the board.

Lastly, they added an analysis where they asked the model to select the better prompt from a pair. The performance was around 50%, like flipping a coin. It raises the question of whether models can figure out which prompt is better, and whether we can depend on them to write better prompts than we can. With additional context and reward loops, we might improve this, but core prompting knowledge hasn't made it into these foundational models yet.

It's a longer one today, but the headline is these things are tricky and non-deterministic. It's hard to figure out what's going to work and what's not, so it's important to test these things. That's it for today, thanks!

How different prompt engineering methods affect sensitivity

Across all three papers, there were various experiments that analyzed how different models and prompt engineering methods related to sensitivity and performance.

We’ll start by taking a look at different prompt engineering methods, starting with the paper How are Prompts Different in Terms of Sensitivity?

The researchers tested 8 methods:

A list of prompt engineering methods in a table

Let’s look at the results, broken down by model.

6 graphs showing prompt engineering performance on different models
The average accuracy and sensitivity of each model using various prompts across different datasets.

You’ll see that there is a strong negative correlation between accuracy and sensitivity. I.e., when sensitivity goes up, accuracy goes down.

Impact of human-designed vs. LM-generated prompts on sensitivity

The researchers then tested how human-designed prompts compared to LM-generated prompts in regard to accuracy and sensitivity. Base_b was the human-designed prompt, and APE (Automatic Prompt Engineer) was the LM-generated prompt.

A table showing the accuracy and sensitivity of human-written prompt versus LLM-generated prompt
Comparing the performance of models using base_b (human written-prompt) and APE (LLM generated-prompts)

As you can see, overall, the two prompts lead to similar accuracy and sensitivity on the given datasets, signaling that human-designed and LM-generated prompts had similar effects.

Generated Knowledge Prompting

The next prompting method analyzed was Generated Knowledge Prompting (GKP). GKP is when you leverage knowledge generated by the LLM to give more information in the prompt.
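If you want to try GKP, the flow is roughly two calls: one to generate knowledge, one to answer with that knowledge in the prompt. The sketch below assumes a placeholder `call_llm` helper and illustrative prompt wording, not the exact templates from the paper.

```python
# A rough sketch of the two-step Generated Knowledge Prompting flow.
# `call_llm` is a placeholder; swap in whatever completion API you actually use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your model/provider call here.")


def generated_knowledge_answer(question: str, n_facts: int = 3) -> str:
    # Step 1: ask the model to produce background knowledge about the question.
    knowledge = call_llm(
        f"Generate {n_facts} relevant facts that would help answer the question below.\n"
        f"Question: {question}\nFacts:"
    )
    # Step 2: answer the question with that knowledge included in the prompt.
    return call_llm(
        "Use the following knowledge to answer the question.\n"
        f"Knowledge:\n{knowledge}\n\nQuestion: {question}\nAnswer:"
    )
```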

A table of results, broken down by model, comparing a basic prompt and generated knowledge prompt

As you can see from the results, GKP led to higher accuracy and lower sensitivity most of the time.

This suggests that including instructions and generated knowledge has cumulative effects on performance.

Chain-of-Thought Prompting

Next up was Chain-of-Thought (CoT) prompting, one of the more popular techniques. This approach involves structuring prompts to guide the model through a logical reasoning process, potentially enhancing its ability to derive correct conclusions.
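As a rough illustration, zero-shot CoT can be as simple as appending a reasoning trigger and then asking for a final answer. The `call_llm` helper and the trigger phrase below are assumptions for the sketch, not the paper's exact prompts.

```python
# A minimal sketch of zero-shot chain-of-thought prompting.
# `call_llm` is a placeholder; the trigger phrase is the common "Let's think step
# by step" wording, not necessarily the exact prompt used in the paper.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your model/provider call here.")


def chain_of_thought_answer(question: str) -> str:
    # Elicit intermediate reasoning before asking for the final answer.
    reasoning = call_llm(f"{question}\nLet's think step by step.")
    return call_llm(f"{question}\nReasoning:\n{reasoning}\nTherefore, the final answer is:")
```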

A table of results, broken down by model, comparing a basic prompt and CoT prompt

The table above shows that CoT leads to similar accuracy but higher sensitivity compared to the base_b prompt.

Some more CoT data:

6 bar charts showing prompt sensitivity for 6 models

As you can see in the graph above, CoT_base_a outperforms base_a, but is worse than base_b in most cases. This suggests that for these datasets, reasoning chains do help improve performance, but they are not as effective as instructions (GKP).

Simple, detailed, and 1-shot prompting

We’ll continue on with our analysis of different prompting strategies and how they relate to sensitivity and, by proxy, accuracy. We’ll turn to a different paper now:  What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering

This paper tested three different prompt engineering methods across a variety of models and datasets (sketched in code after the list below):

  1. Simple: The prompt consists of just the task description.
  2. Detail: A detailed description of the task is provided.
  3. 1-shot: Similar to simple, but includes one example to illustrate the task.
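Here is a rough sketch of what those three strategies look like as prompt templates; the task description and the example are illustrative assumptions, not the paper's actual templates.

```python
# A sketch of the three strategies as plain string templates. The task wording
# and the example are illustrative assumptions, not the paper's actual templates.
TASK = "Classify the sentiment of the text as positive or negative."


def simple_prompt(text: str) -> str:
    # Simple: just the task description plus the input.
    return f"{TASK}\nText: {text}\nLabel:"


def detail_prompt(text: str) -> str:
    # Detail: a more thorough description of how to perform the task.
    detailed_task = (
        f"{TASK} Consider the overall tone, word choice, and any sarcasm. "
        "Respond with exactly one word: 'positive' or 'negative'."
    )
    return f"{detailed_task}\nText: {text}\nLabel:"


def one_shot_prompt(text: str) -> str:
    # 1-shot: the simple prompt plus a single worked example.
    example = "Text: I loved every minute of it.\nLabel: positive"
    return f"{TASK}\n{example}\nText: {text}\nLabel:"
```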

A large table of results of different prompting methods, datasets, and models, measuring sensitivity, consistency, and F1 scores
Sensitivity Sτ (lower is better), average Consistency Cy (higher is better), and Micro-F1 scores (higher is better) for various datasets, models, and prompting strategies.

  • Simple and Detail prompting are more effective across all metrics for Llama3 and GPT-3.5
  • Detail and 1-shot tend to work better for Mixtral and GPT-4o.
  • No consistent pattern of best sensitivity, consistency, and F1

It's mentioned above, but it's worth repeating. Look again at Llama3 and GPT-4o. The best-performing method is completely different, reinforcing the idea that one size does not fit all.

This highlights an important point for developers and teams using LLMs: You need to extensively test your prompts when switching from one LLM to another. A prompt that worked well for one model might lead to instability and decreased performance with another.

Self-refinement, voting and distillation prompting

Turning to our third and last paper, On the Worst Prompt Performance of Large Language Models, we’ll look at a few more prompting methods.

In an effort to enhance the performance of prompts that underperformed due to high sensitivity, the researchers tested several prompt engineering methods:

  • Raw: This method uses the original, unaltered prompts to establish a baseline performance.
  • Self-refinement: This method involves the LLM iteratively refining prompts based on the model's previous outputs to enhance performance.
  • Voting: This approach aggregates the outputs from multiple variations of the prompt and lets the model vote for the best result to improve reliability (see the sketch after this list).
  • Distillation: This technique involves training the model to generalize better by distilling knowledge from multiple training iterations into a single, more robust model.
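Of these, voting is the most straightforward to sketch. The snippet below runs several semantically equivalent prompt variants and keeps the most common answer; the `call_llm` helper and the paraphrases are placeholders, and the paper's exact aggregation may differ.

```python
# A minimal sketch of voting across prompt paraphrases: run several semantically
# equivalent prompts and keep the most common answer. `call_llm` is a placeholder
# and the paraphrases are made up; the paper's exact aggregation may differ.
from collections import Counter


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your model/provider call here.")


def vote(prompt_variants: list[str]) -> str:
    answers = [call_llm(p) for p in prompt_variants]
    # Majority vote over raw answers; real pipelines usually normalize them first.
    return Counter(answers).most_common(1)[0][0]


variants = [
    "Classify the sentiment of this review: 'Great value for the price.'",
    "Categorize the sentiment of this review: 'Great value for the price.'",
    "Is this review positive or negative? 'Great value for the price.'",
]
# answer = vote(variants)  # uncomment once call_llm is wired up
```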

A large table of results showing the different performance metrics and changes based on which prompt engineering method was used
Model performance after prompt engineering (Self-refinement and Voting) and prompt consistency regularization (Distillation). The red numbers indicate a decrease in performance, the green numbers represent an improvement.

  • Self-Refinement: Significantly decreased performance for Llama-2-7/13/70B-chat models, with declines of 10.04%, 13.15%, and 13.53% respectively
  • Voting Method: The voting method boosted the worst-case performance significantly (e.g., a 21.98% increase for Llama-2-70B-chat), though it reduced the best and average performances by 6.71% for Llama-2-13b-chat.
  • Distillation: Improved consistency but reduced overall performance, likely due to overfitting to lower-quality, self-generated outputs, showcasing the difficulty of balancing refinement with the risk of bias or errors.

Which parts of the prompt are the most sensitive?

As we saw above, different prompt engineering methods focus on different components, such as instructions, examples, and chains of reasoning.

The researchers did a breakdown on which components of the prompt affect the output the most. I.e., which are the most sensitive.

The image below displays these sensitivity scores:

A small, 1-row table, showing the saliency scores of four different prompt components
The average mean saliency scores of prompt components

  • S_input (4.33): Shows moderate sensitivity, indicating that direct inputs to the model have a substantial impact on the output.
  • S_knowledge (2.56): Demonstrates (surprisingly) lower sensitivity
  • S_option (6.37): Indicates a higher sensitivity, which implies that the options or choices presented within the prompt are critical in shaping the model's response
  • S_prompt (12.86): Exhibits the highest sensitivity, underscoring the significant effect of the overall prompt structure on the model's behavior.

The main takeaway here is that the prompt instructions will have the most impact in guiding the model’s response. This is why we always tell teams that writing clear and specific instructions is step 1 in the prompt engineering process.

Which models are the most sensitive?

We've taken a deep dive into how different prompt engineering methods affect sensitivity, but what about different models?

We’ll turn our focus to On the Worst Prompt Performance of Large Language Models.

The researchers created their own dataset, ROBUSTALPACAEVAL, to better match real-world user queries than other popular benchmarks, which can be cherry-picked and don’t always map to production use cases. ROBUSTALPACAEVAL addresses this by generating semantically equivalent paraphrases of each query to cover a broad range of phrasings.

The table below shows the model performance on their own dataset ROBUSTALPACAEVAL.

A table showing performance metrics for a few different models
Results on the ROBUSTALPACAEVAL benchmark. The model order is arranged according to their original performance.

  • A larger gap between the worst and best performance indicates higher sensitivity in the model (a quick sketch of this breakdown follows the list below)
  • Llama-2-70B-chat had a large range, from 0.094 to 0.549. This huge range of values shows how sensitive LLMs can be. Semantically identical prompts can lead to vastly different results.
  • Although scaling up model sizes enhances performance, it does not necessarily improve robustness or decrease sensitivity
  • For instance, Llama-2-7B/13B/70B-chat shows improved instruction-following performance, rising from 0.195 to 0.292; however, robustness slightly declines, as indicated by an increase in standard deviation from 0.133 to 0.156
  • Similarly, scaling up the Gemma models increases average performance (from 0.153 in the 2b model to 0.31 in the 7b model) but results in lower robustness (0.191 compared to 0.118 in the 2b model).
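For reference, the worst/best/average/standard-deviation breakdown in the table is easy to reproduce once you have per-paraphrase scores; the numbers below are made-up placeholders, not values from the paper.

```python
# A minimal sketch of the worst/best/average/std-dev breakdown: score a model on
# several semantically equivalent paraphrases and summarize the spread.
# The scores below are made-up placeholders, not numbers from the paper.
from statistics import mean, pstdev

paraphrase_scores = [0.09, 0.21, 0.35, 0.42, 0.55]  # hypothetical per-paraphrase scores

summary = {
    "worst": min(paraphrase_scores),
    "best": max(paraphrase_scores),
    "average": round(mean(paraphrase_scores), 3),
    "std_dev": round(pstdev(paraphrase_scores), 3),
}
print(summary)  # a large best-worst gap or std dev signals high prompt sensitivity
```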

Having identified the worst-performing prompts for different models, the study next explored specific trends, particularly:

  • Whether the worst prompts overlapped across models
  • Whether prompt rankings were consistent across various models

The following graph tells the story:

A bar chart with four lines
The overlap rate of model-agnostic worst-k prompts across different models

  • The consistency between the worst prompts across all models (the red line) is really low. This shows that there is no such thing as a “bad prompt”; it is all relative to the model.
  • You see some better consistency when looking within the same family of models, but the rate is still low. This suggests that even individual models within the same family will have their own unique strengths and weaknesses.
  • It is essentially impossible to characterize the worst prompts without knowing the model

Who is better at picking the better prompt, humans or LLMs?

Remember that first example we looked at, where I asked you to see if you could guess which prompt scored better? If you got it wrong, don’t feel bad; you probably did about as well as ChatGPT.

The researchers tested the model’s ability to discern prompt quality by presenting it with two prompts and asking it to pick the one that would “be more likely to yield a more helpful, accurate, and comprehensive response". I was shocked at the performance.

A table with four models and their scores on guessing which prompt performed best

All the models were right around the 50% mark, which is the same performance you would get if you just guessed randomly. So if the models can’t discern what a better performing prompt looks like, how could we depend on them to write prompts for us?
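For reference, here is roughly what that pairwise check looks like in code; only the quoted criterion comes from the paper, while the `call_llm` helper and the surrounding framing are assumptions for the sketch.

```python
# A rough sketch of asking a model to pick the better of two prompts.
# `call_llm` and the framing text are placeholders; only the quoted criterion
# ("more helpful, accurate, and comprehensive") comes from the paper.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your model/provider call here.")


def pick_better_prompt(prompt_a: str, prompt_b: str) -> str:
    judgement = call_llm(
        "You will see two prompts, A and B. Pick the one that would be more likely "
        "to yield a more helpful, accurate, and comprehensive response. "
        "Answer with 'A' or 'B'.\n\n"
        f"Prompt A: {prompt_a}\n\nPrompt B: {prompt_b}\n\nAnswer:"
    )
    return prompt_a if judgement.strip().upper().startswith("A") else prompt_b
```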

Wrapping up

Prompts are tricky, LLMs are tricky, and this whole new stack brings a set of problems that we aren’t used to solving in traditional software development. Slight changes in a prompt can significantly impact performance. This underscores the importance of having a prompt management tool (we can help with that) where you can easily test, version, and iterate on prompts as new models continue to come out.

Headshot of PromptHub Founder Dan Cleary
Dan Cleary
Founder