“Pretend you are a JSON structurer”, “You are an expert sentiment classifier”. Chances are you’ve tested out including a persona or role in your prompts to try and steer the model. Maybe your prompts today have personas in them.
I’ve done this a lot as well. It seemed like a no-brainer best practice, and I never really questioned it. So I decided to dive deeper and see just how effective this method actually is. That’s when I found that the research on role prompting is divided.
For example, When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models, and Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks both make a strong case against role prompting, saying it can even lead to a degradation in performance.
On the other side are papers like ExpertPrompting: Instructing Large Language Models to be Distinguished Experts and Better Zero-Shot Reasoning with Role-Play Prompting, which show increased performance with specific types of role prompting.
By the end of this post, I aim to provide a clear picture of when and where persona prompting can offer an edge, and where it falls short. We’ll break down the findings across these studies, share various prompt templates to automate persona creation, and give practical guidance on implementing persona prompting effectively.
What is persona prompting?
Persona prompting is a prompt engineering technique where you assign a specific role or persona to a Large Language Model (LLM) to influence how it responds. The goal of assigning the model a specific persona, such as a 'math expert' or a 'supportive mentor,' is to guide its tone, style, or reasoning approach to better align with the task at hand.
For insights on how major AI companies handle personas, see our post, What We Can Learn from OpenAI, Perplexity, TLDraw, and Vercel's System Prompts.
You can also check out these examples of using roles in the system prompts that power the Claude.ai interface via our templates (below), or in Anthropic’s documentation.
Is role prompting effective for accuracy-based tasks?
Alright, with definitions out of the way, let’s dive into the details. Our goal is to determine whether persona prompting can improve performance and, if so, on what types of tasks. We’ll look at a variety of sources on both sides.
Starting with the results from Better Zero-Shot Reasoning with Role-Play Prompting, researchers were able to increase accuracy on the AQuA dataset (a collection of algebraic and word problems) from 53.5% to 63.8% using GPT-3.5 (gpt-3.5-turbo-0613).
The researchers implemented persona prompting in a somewhat novel way. Instead of using a simple prompt like “pretend you’re a mathematician,” they used a two-stage role immersion approach that includes a Role-Setting Prompt and a Role-Feedback Prompt.
- Role-Setting Prompt: A user-designed prompt that assigns the persona.
- Role-Feedback Prompt: The model’s response to the Role-Setting Prompt. It serves as the model’s acknowledgement of the role it has been assigned, with the goal of further anchoring the model in that role.
One important thing to note: to get the best Role-Feedback Prompt, the researchers ran the Role-Setting Prompt multiple times and essentially hand-picked the best response. Here’s an example of how this works in practice.
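For illustration, here is roughly what the two prompts could look like for a math task. This is a paraphrased sketch, not the paper’s exact wording:

# Illustrative only: paraphrased role-immersion prompts, not the paper's exact text
role_setting_prompt = (
    "From now on, you are an excellent math teacher and always teach your students "
    "to solve math problems correctly, explaining your reasoning step by step."
)
role_feedback_prompt = (
    "Understood. As an experienced math teacher, I'll work through each problem "
    "carefully, step by step, and finish with a clear final answer."
)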
So now, with each user query, three prompts are being sent: the Role-Setting Prompt, the Role-Feedback Prompt, and the user message. Here’s what the request looks like:
import openai  # legacy OpenAI SDK (<1.0) chat interface

prompt_1 = role_setting_prompt   # Role-Setting Prompt
prompt_2 = role_feedback_prompt  # Role-Feedback Prompt (pre-selected model acknowledgement)
conversation = [
    {"role": "user", "content": prompt_1},
    {"role": "assistant", "content": prompt_2},
    {"role": "user", "content": question},  # the actual user query
]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=conversation,
)
answer = response["choices"][0]["message"]["content"]
Below is a comparison of different prompt designs when assigning the role of a math teacher.
Experiment results
The researchers tested GPT-3.5 across four task categories: arithmetic, commonsense reasoning, symbolic reasoning, and data understanding and tracking.
They also tested three sizes of Llama 2-Chat (7B, 13B, and 70B) on three datasets.
The takeaway
The researchers showed that their implementation of role-play prompting consistently outperformed both few-shot prompting and Chain-of-Thought prompting. However, a few key points are worth noting:
- The Role-Setting Prompt is hand-crafted, which can be challenging, as it's unclear what the most effective role is.
- The Role-Feedback Prompt required multiple runs to finalize a response, adding complexity.
- Sending three messages per request could be costly in terms of both expense and latency.
- This three-message approach isn’t typically what most people think of as "role prompting."
- Testing was limited to GPT-3.5; it’s unclear if these results would hold with newer models like GPT-4 or Claude 3.5 Sonnet.
Now let’s turn to the other side of the argument with the next paper, When “A Helpful Assistant” is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models.
Perhaps the most telling fact about this paper is that it used to be called “Is ‘A Helpful Assistant’ the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts”.
Originally, the abstract stated “Through extensive analysis of 3 popular LLMs and 2457 questions, we show that adding interpersonal roles in prompts consistently improves the models' performance over a range of questions.”
Now the abstract says “Through extensive analysis of 4 popular families of LLMs and 2,410 factual questions, we demonstrate that adding personas in system prompts does not improve model performance across a range of questions compared to the control setting where no persona is added.”
While I’ve already given away the headline, let’s dig in here.
The researchers tested a variety of personas across thousands of factual questions (MMLU) and 4 model families. The findings were:
- Overall, adding personas in system prompts didn’t improve performance. In some cases, it led to negative effects.
- In the cases where a persona prompt did lead to better performance, there was no reliable strategy for picking that persona in advance; none of the persona-selection strategies tested outperformed random selection.
- Regarding the effect of different personas, the researchers found that gender-neutral, in-domain, and work-related roles led to better performance, but with relatively small effect sizes.
- The effect size of domain alignment (such as assigning a "lawyer" persona for legal tasks) was quite small, suggesting that it only has a minor impact on LLM performance.
- The similarity between the persona and the question is the strongest predictor of final performance, but figuring out exactly what that persona should be remains a challenge.
The researchers designed two types of prompts for all of their experiments:
- Speaker-Specific Prompt: A prompt that assigns a role to the model, e.g., “You are a doctor.”
- Audience-Specific Prompt: A prompt that specifies the audience of the conversation, e.g., “You are talking to a lawyer.”
As you can see in the image above, the prompt template was very simple:
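In case the image doesn’t come through, here’s a minimal sketch of what those two templates boil down to. The wording and the example persona are paraphrased placeholders, not the paper’s exact templates:

# Paraphrased sketch of the two prompt styles; not the paper's exact templates
persona = "doctor"  # hypothetical example persona

speaker_specific = f"You are a {persona}."               # assigns the role to the model
audience_specific = f"You are talking to a {persona}."   # specifies who the model is speaking to

messages = [
    {"role": "system", "content": speaker_specific},
    {"role": "user", "content": "Which vitamin deficiency causes scurvy?"},
]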
The researchers tested the prompts across nine open-source models: FLAN-T5-XXL (11B), Llama-3-Instruct-8B, Llama-3-Instruct-70B, Mistral-7B-Instruct-v0.2, and Qwen2.5-Instruct (3B to 72B).
Below is a graph showing the ten best (yellow) and worst (green) personas. Notice that all values are below zero, indicating that none of the personas led to statistically significant improvements in model performance.
The takeaway
Clearly, the persona prompts tested in this paper didn't lead to consistent performance gains. More importantly, the findings show that predicting which role might yield the best improvement in performance is extremely difficult and not intuitive.
The fact that this paper originally supported persona prompting but now advises against it for accuracy-based tasks tells the whole story.
If I had to push back and advocate for role prompting, one thing I would note is that the templates used were very basic. It would have been interesting to see whether a more detailed persona prompt, with additional context, might have produced different results.
The next paper we’ll look at is Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks, which, as the name suggests, presents both pros and cons regarding the use of personas in prompts.
One key issue this paper addresses is the challenge of assigning a persona that will perform well. As seen in the previous paper, it can be difficult to predict the optimal persona, and LLMs are often highly sensitive to the roles they’re assigned. To address this, the researchers developed a framework they called Jekyll & Hyde.
The framework has a few steps:
- Persona generator (template below): Use an LLM to generate a persona for a task
- Solver: Generate two solutions to the task using two prompts, one that includes the persona and one that doesn’t
- Evaluator: Both outputs are sent through an LLM evaluator, and the better solution is chosen
You can access the persona generator template in PromptHub here.
Here's the flow of information through the framework:
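Since the framework is just a sequence of LLM calls, here’s a rough sketch of that flow in Python. The call_llm helper is hypothetical (stand in whatever model API you use), and the prompt wording is paraphrased rather than taken from the paper:

# Rough sketch of the Jekyll & Hyde flow; `call_llm` is a hypothetical helper
# and the prompt wording is paraphrased, not the paper's exact templates.
def jekyll_and_hyde(question: str, call_llm) -> str:
    # 1. Persona generator: ask an LLM which expert persona suits this task
    persona = call_llm(
        f"Suggest the single most suitable expert persona for solving this problem:\n{question}"
    )
    # 2. Solver: produce two candidate solutions, with and without the persona
    persona_answer = call_llm(f"{persona}\n\n{question}")
    base_answer = call_llm(question)
    # 3. Evaluator: a third LLM call picks the better of the two solutions
    verdict = call_llm(
        "You are given a problem and two candidate solutions. "
        "Reply with '1' if Solution 1 is better, otherwise '2'.\n\n"
        f"Problem: {question}\n\nSolution 1: {persona_answer}\n\nSolution 2: {base_answer}"
    )
    return persona_answer if verdict.strip().startswith("1") else base_answer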
Here are the main results from the experiments.
- In some cases (e.g., the AQuA dataset), the Jekyll & Hyde framework leads to large gains
- In other cases, the difference in performance is quite small, and sometimes the Base prompt outperforms the Persona prompt (GPT-4: MultiArith, Llama 3-8B, SingleEq)
- On average, Jekyll & Hyde outperforms the baselines by 9.98% when using GPT-4 as the backbone model
- The fact that Jekyll & Hyde outperforms the other methods isn’t that impressive considering it requires multiple LLM calls (ensembling plus an evaluator step)
- Every category tested contains a dataset in which Base outperforms Persona, showing again that role-playing prompts can degrade performance
- Look at how small the gap is between “Persona” and “Base” for GPT-4.
Takeaways
This paper shows that even with a framework like Jekyll & Hyde, persona prompting can degrade model performance. Additionally, this study used newer models, like GPT-4, and illustrated that the gap between “Base” prompting and “Persona” prompting is minimal.
This reinforces the takeaway from the previous paper: “typical” persona prompting (“You are a lawyer”) isn’t going to improve performance on accuracy-based tasks, especially for newer models.
Taking persona prompting a step further with a framework like Jekyll & Hyde might help increase performance, but since there are multiple components in the framework, it is hard to determine if the gains are due to the persona itself or the other elements (ensembling or the evaluator).
Next up is ExpertPrompting: Instructing Large Language Models to be Distinguished Experts. While this paper came out in late May of 2023, I think there are still some valuable insights.
ExpertPrompting follows a similar process as the Jekyll & Hyde framework, but in a simpler way, with better prompt engineering and a clearer focus.
ExpertPrompting consists of two steps:
- An instruction is passed to an LLM to generate an expert identity
- The generated identity and original instruction are then sent to a model
The researchers focus on three major aspects when developing ExpertPrompting:
- Distinguished: The persona description should be customized to each specific instruction, and the expert should be specialized in that area
- Informative: The description should be detailed and comprehensive to cover all the necessary information related to the expert agent
- Automatic: The creation of the expert agent description should be automatic and simple
For each task, ExpertPrompting defines an expert identity, which is prepended to the task.
The reason I say that this method has better prompt engineering is that the prompt template (copied below) used to generate the expert identity is much more robust, using In-Context Learning to help guide the model.
You can access the ExpertPrompt template in PromptHub here.
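Put together, the flow looks something like the sketch below. As before, call_llm is a hypothetical helper, and EXPERT_IDENTITY_TEMPLATE is a simplified stand-in for the full few-shot template linked above:

# Sketch of the two ExpertPrompting steps; `call_llm` is a hypothetical helper and
# EXPERT_IDENTITY_TEMPLATE is a simplified stand-in for the paper's few-shot template.
EXPERT_IDENTITY_TEMPLATE = (
    "Write a detailed description of an expert who would be exceptionally well "
    "qualified to carry out the following instruction:\n{instruction}"
)

def expert_prompting(instruction: str, call_llm) -> str:
    # Step 1: generate a customized, detailed expert identity for this instruction
    expert_identity = call_llm(EXPERT_IDENTITY_TEMPLATE.format(instruction=instruction))
    # Step 2: prepend the expert identity to the original instruction and answer as that expert
    return call_llm(f"{expert_identity}\n\nNow, as this expert, respond to the instruction:\n{instruction}")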
The researchers tested three prompt engineering methods:
- Vanilla Prompting: The basic prompt
- Vanilla Prompting + Static DESC: The basic prompt plus a basic persona prompt addition:
(“Imaging you are an expert in the regarding field, try to answer the following instruction as professional as possible.{Instruction}”)
- Expert Prompting: Uses the LLM-generated persona via the template above
They tested across a few models, including GPT-3.5.
The results are below.
- Vanilla Prompting and Vanilla Prompting + Static DESC performed very similarly, which reinforces findings from previous studies that a basic persona prompt doesn’t lead to improved performance.
- ExpertPrompting destroys the other methods
While we don’t know how the results would look with a newer model like GPT-4o-mini, this paper still shows that a more elaborate and detailed persona prompt outperforms a simple one.
The last resource we’ll look at is from the team at Learn Prompting, who were also skeptical that persona prompting improves performance on accuracy-based tasks.
Their findings can be summed up in just a few sentences and one graph.
The team tested twelve role prompts on 2,000 MMLU questions using GPT-4-turbo. The results (below) were pretty consistent across all the personas. Plus, the “genius” persona performed worse than the “idiot” persona.
When role prompting is most useful
While persona prompting shows mixed results in accuracy-based tasks, it is undoubtedly effective on open-ended tasks, like creative writing. If you tell ChatGPT to talk like a pirate, it will talk like a pirate.
So for tasks like content creation or engagement-focused interactions, persona prompting can be a great tool to ensure the model’s response is more aligned with the desired tone and style.
Additionally, persona prompting can support security efforts. Establishing guardrails through a persona in a system prompt is one way to help ensure safer and more controlled outputs.
How to construct effective personas for role prompts
If you decide you want to test out using a persona, I think the ExpertPrompting framework is an effective way to go about it.
Your persona definition should be specific, detailed, and automated.
- Specific: The role description should be in the same domain as the task, and should be as specific as possible.
- Detailed: The role description should be detailed and comprehensive to cover all the necessary information related to the expert persona. Simple descriptions like “You are a mathematician” don’t perform well.
- Automated (and simple): LLM-generated personas outperform human-written ones, and LLMs are much more scalable.
Leveraging the ExpertPrompt framework to generate your persona is a great starting point.
You can access the ExpertPrompt template in PromptHub here.
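To make the “specific and detailed” point concrete, here’s an illustrative contrast. Both strings are my own examples, not prompts taken from any of the papers above:

# Illustrative contrast only; neither persona is taken from the papers discussed above
simple_persona = "You are a mathematician."

detailed_persona = (
    "You are a veteran high-school math teacher with 20 years of experience preparing "
    "students for algebra competitions. You solve word problems by restating the given "
    "quantities, setting up equations, solving them step by step, and double-checking "
    "the result before stating a final answer."
)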
Conclusion
To put it briefly, after diving into all the research, here are the takeaways:
- Persona prompting is effective on open-ended tasks (Ex: creative writing)
- Persona prompting probably won’t help much on accuracy-based tasks (Ex: classification), especially for newer models
- If using a persona, it should be specific, detailed, and ideally automated, with the ExpertPrompting framework as a strong starting point