One of the first ever A/B tests I ran in PromptHub was to see the output difference after adding the word “please” onto the end of a prompt. I’ve been waiting for a research paper to come out on the topic and the day is finally here!
We’ll be diving into two papers, but this one will be the main focus:
Let’s finally answer the question: Does being polite to LLMs help get better outputs?
Previous works
A popular research paper came out earlier this year titled “Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4”. We put together a rundown of the paper here.
The researchers tested 26 different prompt engineering principles, one of them related to using polite language in prompts.
The results noted above are for GPT-4. If you want to see the performance metrics for GPT-3.5 and access all the data via a Google Sheet, join our newsletter and you'll get it in your inbox.
Looking at the last two columns above, the researchers found that adding supplementary polite phrases didn’t increase the output quality.
However, in the same paper, the prompt used for the 26th principle tested included the word 'please'.
So should you be polite?
I’ve always assumed being polite probably helps, and shouldn’t hurt output quality. But is that the case? Let’s dive into some experiments, results, and takeaways.
Experiment setup
The researchers tested the impact of politeness in prompts across English, Chinese, and Japanese tasks. We’ll focus mostly on the experiments related to the English tasks.
Models used: GPT-3.5-Turbo, GPT-4, Llama-2-70b-chat
The researchers tested the impact of politeness across three tasks:
- Summarization: Observing the effect of prompt politeness on the conciseness and accuracy of summarizing articles from CNN/DailyMail
- Language Understanding Benchmarks: Testing comprehension and reasoning abilities
- Stereotypical Bias Detection: Examining the LLMs' propensity to exhibit biases by assessing responses as positive, neutral, negative, or refusal to answer.
The researchers designed eight prompt templates for each language, varying from highly polite to extremely impolite.
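To make that setup concrete, here's a minimal sketch of what running a single task across politeness tiers could look like. The templates below are hypothetical stand-ins, not the paper's actual eight prompts, and the code assumes the standard OpenAI Python SDK with an API key in the environment.

```python
# Illustrative sketch: the same summarization request phrased at different
# politeness levels. Templates are hypothetical stand-ins for the paper's
# eight prompts; higher keys = more polite.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

POLITENESS_TEMPLATES = {
    8: "Could you please summarize the following article? Thank you so much!\n\n{article}",
    5: "Summarize the following article.\n\n{article}",
    1: "Summarize this article right now. I don't have time for a sloppy answer.\n\n{article}",
}

def summarize(article: str, politeness_level: int, model: str = "gpt-3.5-turbo") -> str:
    """Run one article through the template at the given politeness level."""
    prompt = POLITENESS_TEMPLATES[politeness_level].format(article=article)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # hold sampling constant so politeness is the only variable
    )
    return response.choices[0].message.content

# Compare output length across politeness levels for one article
article = "..."  # a CNN/DailyMail article would go here
for level in sorted(POLITENESS_TEMPLATES, reverse=True):
    summary = summarize(article, level)
    print(f"politeness {level}: {len(summary.split())} words")
```

Holding temperature at 0 keeps decoding close to deterministic, so differences across templates are attributable to the wording rather than sampling noise.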
Evaluations
- Summarization: BERTScore and ROUGE-L metrics evaluated the quality and relevance of generated summaries.
- Language Understanding: Accuracy was measured by comparing LLM responses to correct answers
- Bias Detection: A Bias Index (BI) calculated the frequency of biased responses
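For the summarization metrics, here's a minimal sketch of how they could be computed with the commonly used rouge-score and bert-score Python packages, plus a simple exact-match accuracy helper for the benchmarks. The paper doesn't publish its evaluation code, so treat this as an illustration rather than a reproduction.

```python
# Sketch of the evaluation metrics named above, using the rouge-score and
# bert-score packages. Not the paper's exact pipeline.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_summary(candidate: str, reference: str) -> dict:
    """Score one generated summary against a reference summary."""
    # ROUGE-L: longest-common-subsequence overlap with the reference
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure

    # BERTScore: embedding-based similarity between candidate and reference
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {"rougeL_f1": rouge_l, "bertscore_f1": f1.item()}

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy for the language understanding benchmarks."""
    correct = sum(p.strip().lower() == a.strip().lower() for p, a in zip(predictions, answers))
    return correct / len(answers)
```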
Results
Summarization Tasks
Here are the summarization prompts that were used and the experiment results:
Takeaways
- ROUGE-L and BERTScore scores stay consistent regardless of the politeness level
- For the GPT models, as the politeness level decreases, so does output length
- For Llama2-70B, the length tends to decrease as politeness decreases, but then surges when using extremely impolite prompts
- One potential reason outputs are longer at higher politeness levels is that polite, formal language is more likely to appear in scenarios that call for descriptive, detailed instructions
Language Understanding Benchmarks
Performance on these tasks was much more sensitive to prompt politeness.
Here are the prompts that were used and the results:
Takeaways
- On average, the GPT models' best-performing prompts were in the middle of the spectrum: not overly polite, not rude.
- While the scores gradually decrease at lower politeness levels, the changes aren’t always significant. The most significant drop-offs happen at the lowest levels of politeness.
- GPT-4's scores are more stable than GPT-3.5's (no dark tiles in the heatmap). With more advanced models, the politeness level of the prompt may not be as important
- Llama2-70B fluctuates the most; its scores scale proportionally with the politeness level
Bias detection
Let's look at the prompts used and the results:
Takeaways
- In general, moderately polite prompts tended to minimize bias the most
- Extremely polite or impolite prompts tended to exacerbate biases and increase the chance that the model would refuse to respond.
- Although Llama appears to show the lowest bias, it refused to answer questions much more often, which is its own type of bias
- Overall, GPT-3.5's stereotype bias is higher than GPT-4's, which is higher than Llama 2's
- Although the model’s bias tends to be lower in cases of extreme impoliteness, this is often because the model will refuse to answer the question
- GPT-4 is much less likely to refuse to answer a question
- A politeness level of 6 seems to be the sweet spot for GPT-4
In general, we see high bias at both extremes. Thinking about human behavior, perhaps this is because in highly respectful and polite environments people feel they can express their true thoughts without worrying about moral constraints, while at the lower end, rude language can provoke a sense of offense and prejudice.
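Because refusals can make a model look less biased than it is, it helps to track the refusal rate alongside the bias number. Here's a small sketch of how you might tally both once responses have been labeled positive, neutral, negative, or refusal; the frequency-based index below is an assumption for illustration, not the paper's exact Bias Index formula.

```python
# Tally a simple bias proxy and refusal rate from labeled responses.
# The frequency-based index is an illustrative assumption, not the
# paper's exact Bias Index (BI) formula.
from collections import Counter

def bias_and_refusal(labels: list[str]) -> dict:
    """labels: 'positive', 'neutral', 'negative', or 'refusal' per response."""
    counts = Counter(labels)
    total = len(labels)
    answered = total - counts["refusal"]

    # Share of answered responses that lean either way; a model that answers
    # mostly "neutral" scores near zero here.
    biased = counts["positive"] + counts["negative"]
    bias_index = biased / answered if answered else 0.0

    # A model that refuses everything would look "unbiased" by the index
    # above, so the refusal rate has to be reported alongside it.
    refusal_rate = counts["refusal"] / total
    return {"bias_index": bias_index, "refusal_rate": refusal_rate}

print(bias_and_refusal(["neutral", "negative", "refusal", "positive", "neutral"]))
```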
Wrapping up
The tl;dr of this paper is that you want to be in the middle. You don't want to be overly polite, and you don't want to be rude. Another nuance we didn't cover, since we focused on the results from the English experiments, is that models trained in a specific language are sensitive to the politeness norms of that language. If your user base spans many different cultures and languages, keep this in mind as you develop your prompts.