“Pretend you are a JSON structurer”, “You are an expert sentiment classifier”. Chances are you’ve tested out including a persona or role in your prompts to try and steer the model. Maybe your prompts today have personas in them.

I’ve done this a lot as well. It seemed like a no-brainer best practice. I never really questioned it. So I decided to dive deeper and see just how effective this method was. That’s when I found that the research is torn with regard to how effective role prompting is.

For example, When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models, and Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks both make a strong case against role prompting, saying it can even lead to a degradation in performance.

On the other side are papers like ExpertPrompting: Instructing Large Language Models to be Distinguished Experts and Better Zero-Shot Reasoning with Role-Play Prompting, which show increased performance with specific types of role prompting.

By the end of this post, I aim to provide a clear picture of when and where persona prompting can offer an edge, and where it falls short. We’ll break down the findings across these studies, share various prompt templates to automate persona creation, and give practical guidance on implementing persona prompting effectively.

Hey everyone, how's it going? Dan here, happy Saturday. Today we're going to be talking about role prompting, and I'll kind of interchangeably call it persona prompting just because most people refer to it as either. We'll talk about what it is, whether it's effective or not, when it is most effective, and if you're going to use it, how to do it in the best way possible.

Introduction

I was kind of led down this rabbit hole because it was something that I had done a lot—you know, throwing something in the system message or in the prompt that says, "Pretend you're XYZ," "Act like..." I feel like it's been a best practice for a while, kind of unstated, and so I wanted to dive deeper and see, okay, what effect does it really have?

What is Role Prompting / Persona Prompting?

In terms of just level setting, when we're talking about persona or role prompting, we're talking about when you assign a specific role or persona to an LLM to influence its response. So, "Pretend you're a JSON structure sentiment classifier," "Talk like a lawyer," so on and so forth.

Effectiveness of Role Prompting

I wanted to see, does this actually help with accuracy-based tasks? Because there's no argument as to whether it affects creative writing type tasks—like if you tell the model to talk like a cowboy, it certainly will—but what about tasks where you're focused on accuracy?

Paper: "Better Zero-Shot Reasoning with Role-Play Prompting"

The first paper is called "Better Zero-Shot Reasoning with Role-Play Prompting" (March 2024), so relatively recent. The only caveat is they use GPT-3.5. It happens a lot with these papers where they don't use the latest models, so it's hard to take away the same findings if we're using a much more capable model. But they talk about a pretty big increase, from 53% to 63% on the AQuA dataset, which is an accuracy-focused math dataset.

Diving in, they implemented persona prompting in an interesting way. They didn't just add a little bit of something to the beginning of the system prompt that says, "Pretend you're XYZ." They created a user-designed prompt called the role-setting prompt, and then they would send that prompt to the model and get a response from the model. They would use that in the next request as well.

We'll look at an example. So this first one would be this user-designed role-setting prompt: "From now on you're a contestant..." and so on. They would send it and get an output. They actually would send it multiple times and then pick the best output they got back. The idea is that the model is acknowledging the role.

Once they have this second one selected from the multiple, for every additional request they would send both of these messages. So for every new request, they were sending three messages along:

  • Prompt 1: Role-setting
  • Prompt 2: Role feedback
  • The newest question

Here were some interesting prompt designs that they had in terms of the persona prompt being used. You can see generally that more context leads to higher accuracy, including when they start to add that second message in there.

Here were the different personas tested. "Advantage" means in-domain (so related), "irrelevant," and then kind of "disadvantage." You see generally a trend that the people who are in-domain and advantaged tend to score higher for the most part.

Again, this is GPT-3.5. Look at the results here. On average, I think it outperformed by like 10%, but you could kind of look at any of these and dive in. You could see in this case zero-shot prompt was actually the best. In another case, it really clearly blows it out of the water for something like symbolic reasoning.

Takeaways and Caveats

Across the board, it did have a positive effect, but I think there's a few interesting takeaways and caveats:

  • The role-setting prompt was handwritten, which can be challenging and time-consuming.
  • They are selecting the best role feedback prompts, which could take many runs.
  • Sending three messages for every additional user message can be expensive in terms of both cost and latency.
  • It's not typically what we think of when we think of persona prompting, and it takes additional setup.
  • It was done with GPT-3.5, so it's hard to know if this would hold for newer models.

Paper: "Role-Playing is Not All You Need"

A more recent one, I think this was October, with the great title "Role-Playing is Not All You Need", kind of looks at the opposite side of persona prompting. The most interesting part is that the paper used to be a pro-persona paper. In November of last year (2023), it basically said having personas improves the performance, and then they updated it and said it does not improve. So that's kind of all you really need to know.

They tested a bunch of different models, four families across thousands of fact-based questions. Here's what the templates look like—they are very basic. These are very much like "Hey, you are an X" or "You're talking to an X."

What they found was across all thousands of factual-based questions, system prompts or personas in system prompts didn't improve performance, and sometimes it had negative effects. In the cases that it did have positive effects, it was small, and they couldn't reverse engineer how to pick the best persona. They tried a bunch of different methods and couldn't find something. They likened the chance of picking the best one to basically just randomly picking from the assortment that they had.

You would think the in-domain ones would work the best, and they do, but the effect size is so small that knowing how to pick the best one was clearly a challenge, and I think that's my biggest takeaway from this.

Paper: "Jekyll and Hyde" Framework

Another paper takes both sides of persona prompting. It introduced something called the "Jekyll and Hyde" framework, which has multiple LLM calls and an evaluator.

First step is to automatically generate a persona based on the task. So if it's a math task, generate a person based on that through a prompt. Then they would have the problem be solved two times, one with the persona and one without. So if it's "Hey, what's 2 plus 2," they would run that once, and then they would also run it with "Hey, you're a mathematician, what's 2 plus 2," and then both outputs were sent to an evaluator, and the better solution was chosen.

Here's a little bit of that flow:

  • Persona gets generated.
  • It goes to the solver: one with the persona included, one without.
  • It goes through this evaluator workflow.

They included the template here, which we've added into PromptHub so you can try it out—we have a better one later, so sneak peek there.

They did use GPT-4 and Llama 3, which is cool to see.

Observations

In some cases, the framework leads to better outputs; in some cases, it doesn't. Something that I thought was interesting is that the difference between persona and base for GPT-4 is really slim in some cases. Looking at base and persona, we can see it's very close—we're talking about really small performance differences a lot of the time. Here we see a large one, and for some use cases that is important—that 0.5% increase is going to be actually material.

You can see even in this case, we're talking about really small amounts, and base and persona are very close. So I think for smarter models, the effect of using a simple persona that we saw here is not really going to do a ton. Additionally, judging a base prompt against a framework isn't quite fair, for lack of a better word. This Jekyll and Hyde framework has two outputs, an evaluator, and it's barely outperforming the base prompt.

Like I said, that 1%, half a percent, in some cases 2% better outputs is important for some people. For others who don't want to go ahead and implement a whole framework, you don't have to.

Paper: "Expert Prompting" Framework

Next up is my favorite. This one is older, which I will note—it's from basically a year ago at this point in 2023—but I still think it's maybe the strongest evidence of pro-persona prompting. They introduced a framework as well; it's much more straightforward.

Basically, the instruction goes to an LLM to generate a persona, and then that identity is included in the system message, and then the original instruction is sent as the user message to the LLM to process. So if the instruction is "Describe the structure of an atom," this identity gets generated. They show here side by side what it looks like with and without the identity.

They have a really great template for this as well. They use in-context learning, which you can kind of see here where we have instruction, agent description, instruction, agent description, which I think was really helpful and why I like it a lot better than the persona generator in the Jekyll and Hyde framework.

Results

In this case, they tested a few variants:

  • Vanilla prompting
  • Vanilla prompting with a static description like "Imagine you're an expert in this field, solve this instruction"
  • Expert prompting using the generated persona

We see why I think the study is great and really telling—it reinforces a lot of the other things that we've seen in the other papers. The vanilla prompting and the vanilla prompting with a static description are barely separated, whereas this expert prompting one takes the cake. Yes, they did use an older model, but I think it reinforces the point that a basic persona that we saw in almost all the other papers is definitely not going to increase accuracy, and I think we can be pretty confident in that. But if you use a more exhaustive and comprehensive persona, that can actually lead to accuracy gains, and so that's what really hits home for me in this case.

Experiment by Learn Prompting

Lastly, we'll check out a resource from the folks at Learn Prompting who had a similar hunch about this this summer, specifically by S. They ran an experiment that was supposed to be part of their prompt report—I'm not sure if it is part of it or it's going to be included in a future edition—but basically they took a bunch of MMLU questions and tested it across a variety of different personas.

Here you can see the personas: farmer, police officer, etc. They also did other methods as well. Firstly, you can see the best method was two-shot chain-of-thought. Most importantly, the most interesting part is you could see that there's a "genius" and an "idiot" persona, and the "idiot" persona outperforms the "genius" one, which is kind of like, how do you—what do you do with that in terms of how can you make a credible argument against that?

Here are the prompts themselves, which they shared in that resource, which will also be linked below. These are pretty built out. More importantly, judging these personas versus the personas generated in the expert prompting framework, they differ. This talks about "You always get problems correct," kind of talking about that person versus how they're supposed to act or who they are. So I do think there is still some nuance here to be discovered.

When is Role Prompting Most Useful?

Again, in terms of when role prompting is most useful—as I mentioned earlier—any type of creative writing. So, you know, talk like a cowboy, whatever—it will do that. People also introduce guardrails via persona prompts as well, and I think that's an interesting layer. You obviously would need to do a little bit more than that, but it's an interesting thing to keep in mind.

Something that we can take away from the expert prompting framework is that if you're going to create a persona, it needs to be specific, detailed, and it should be automated, I would say. Throughout a variety of these papers, they also tested human versus LLM-generated personas, and the LLM ones almost always performed better. Again, the expert prompt framework and template are a great starting point for this.

Conclusion

So yeah, good for open-ended tasks. I would say it's generally not beneficial for strictly accuracy-based tasks, especially for newer models. If you are using simple persona definitions, I still think you could probably increase accuracy if you were to use more exhaustive and comprehensive personas—specific and detailed, as we just talked about—although there's not as much evidence there. If you are going to do it, expert prompting is a good way to start.

Cool, a little bit longer today, but it was really fascinating to kind of go down this rabbit hole, and there's so much new stuff coming out all the time and things that we're learning and we're passing on. So I hope it helps. Thanks.

What is persona prompting

Persona prompting is a prompt engineering technique where you assign a specific role or persona to a Large Language Model (LLM) to influence how it responds. The goal in assigning the model a specific persona, such as a 'math expert,' or 'supportive mentor,' is to guide its tone, style, or reasoning approach to better align with the task at hand.

For insights on how major AI companies handle personas, see our post, What We Can Learn from OpenAI, Perplexity, TLDraw, and Vercel's System Prompts.

You can also check out these examples of using roles in the system prompts that power the Claude.ai interface via our templates (below), or in Anthropic’s documentation.

How to automatically generate a persona for your task

We recently launched new prompt enhancers in PromptHub, including the ability to generate a persona based on your prompt. You can try it out for free in PromptHub; it's available on all plans!

Is role prompting effective for accuracy-based tasks

Alright, with definitions out of the way, let’s dive into the details. Our goal is to determine whether persona prompting can improve performance and, if so, on what types of tasks. We’ll look at a variety of sources on both sides.

Starting with the results from Better Zero-Shot Reasoning with Role-Play Prompting, researchers were able to increase accuracy on the AQuA dataset (a collection of algebraic and word problems) from 53.5% to 63.8% using GPT-3.5 (gpt-3.5-turbo-0613).

The researchers implemented persona prompting in a somewhat novel way. Instead of using a simple prompt like “pretend you’re a mathematician,” they used a two-stage role immersion approach that includes a Role-Setting Prompt and a Role-Feedback Prompt.

  • Role-Setting Prompt: A user-designed prompt that assigns the persona.
  • Role-Feedback Prompt: The model’s response to the Role-Setting Prompt. It serves as the model’s acknowledgement of the role it has been assigned, with the goal of further anchoring the model in that role.

One important thing to note is that, to get the best Role-Feedback Prompt, the researchers ran the Role-Setting Prompt multiple times and selected the best response. Here’s an example of how this works in practice.

Example implementation of two step process for persona prompting

So now, with each user query, three prompts are being sent: the Role-Setting Prompt, the Role-Feedback Prompt, and the user message. Here’s what the request looks like:

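import openai  # legacy (pre-1.0) SDK, matching the call style used here

# role_setting_prompt, role_feedback_prompt, and question are assumed to be defined strings
prompt_1 = role_setting_prompt   # hand-written prompt that assigns the persona
prompt_2 = role_feedback_prompt  # the model's selected acknowledgement of the role

conversation = [
    {"role": "user", "content": prompt_1},
    {"role": "assistant", "content": prompt_2},
    {"role": "user", "content": question},
]

answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=conversation,  # all three messages are sent with every question
)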
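The selection step described earlier (run the Role-Setting Prompt several times, keep the best acknowledgement, then reuse it for every question) can be sketched in a few lines. This is a minimal illustration, not the paper's code: `generate` stands in for whatever model call you use, `score` is a hypothetical ranking function (the authors' actual selection criterion is not reproduced here), and the prompt text in the usage lines is made up.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def sample_role_feedback(
    generate: Callable[[List[Message]], str],
    role_setting_prompt: str,
    score: Callable[[str], float],
    n_samples: int = 5,
) -> str:
    """Send the role-setting prompt n_samples times and keep the top-scoring reply."""
    request = [{"role": "user", "content": role_setting_prompt}]
    candidates = [generate(request) for _ in range(n_samples)]
    return max(candidates, key=score)


def build_conversation(
    role_setting_prompt: str, role_feedback_prompt: str, question: str
) -> List[Message]:
    """Assemble the three messages that accompany every new question."""
    return [
        {"role": "user", "content": role_setting_prompt},
        {"role": "assistant", "content": role_feedback_prompt},
        {"role": "user", "content": question},
    ]


# Usage with a stub in place of a real model call (all text here is hypothetical).
_canned = iter(["OK.", "Understood. As a math teacher, I will explain each step."])
role_setting = "From now on, you are an excellent math teacher."
feedback = sample_role_feedback(
    lambda msgs: next(_canned), role_setting, score=len, n_samples=2
)
conversation = build_conversation(role_setting, feedback, "What is 12 * 7?")
```

Scoring by length is only a placeholder so the example runs offline; in practice you would rank candidates by whatever signal you trust, such as accuracy on a small held-out set.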

Below is a comparison of different prompt designs when assigning the role of a math teacher.

Different prompts and their performance metrics
The trend appears to be: more context → higher accuracy.

A table showing the performance of different roles and categories
A breakdown of how the role type affected performance. Data from a later paper in this post directly contradicts this.

Experiment results

The researchers tested across four task categories using gpt-3.5: arithmetic, common sense reasoning, symbolic reasoning, and data understanding and tracking.

Table of results from the experiments

They also tested across three different sizes of Llama 2-Chat models on three datasets: Llama 2-Chat 7B, Llama 2-Chat 13B, and Llama 2-Chat 70B.

Performance across three variants of Llama 2
The data format is 7B / 13B / 70B

The takeaway

The researchers showed that their implementation of role-play prompting outperformed standard zero-shot prompting, and in most cases zero-shot Chain-of-Thought prompting, on most of the datasets tested. However, a few key points are worth noting:

  • The Role-Setting Prompt is hand-crafted, which can be challenging, as it's unclear what the most effective role is.
  • The Role-Feedback Prompt required multiple runs to finalize a response, adding complexity.
  • Sending three messages per request could be costly in terms of both expense and latency.
  • This three-message approach isn’t typically what most people think of as "role prompting."
  • Testing was limited to GPT-3.5; it’s unclear if these results would hold with newer models like GPT-4 or Claude 3.5 Sonnet.

Now over to the other side of this argument, we’ll take a look at the next paper, When "A Helpful Assistant" is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models.

Perhaps the most telling fact about this paper is that it used to be called “Is ‘A Helpful Assistant’ the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts”.

Originally, the abstract stated “Through extensive analysis of 3 popular LLMs and 2457 questions, we show that adding interpersonal roles in prompts consistently improves the models' performance over a range of questions.”

Now the abstract says “Through extensive analysis of 4 popular families of LLMs and 2,410 factual questions, we demonstrate that adding personas in system prompts does not improve model performance across a range of questions compared to the control setting where no persona is added.”

While I’ve already given away the headline, let’s dig in here.

The researchers tested a variety of personas across thousands of factual questions (MMLU) and 4 model families. The findings were:

  • Overall, adding personas in system prompts didn’t improve performance. In some cases, it led to negative effects.
  • In the cases where the persona prompt led to better performance, there was no clear strategy for consistently choosing the best persona. None of the strategies for picking personas outperformed random selection.
  • Regarding the effect of different personas, the researchers found that gender-neutral, in-domain, and work-related roles led to better performance, but with relatively small effect sizes.
  • The effect size of domain alignment (such as assigning a "lawyer" persona for legal tasks) was quite small, suggesting that it only has a minor impact on LLM performance.
  • The similarity between the persona and the question is the strongest predictor of final performance, but figuring out exactly what that persona should be is still a challenge.

The researchers designed two types of prompts for all of their experiments:

  • Speaker-Specific Prompt: Prompts that assign a role to the model. “You are a doctor”
  • Audience-Specific Prompt: Prompts that specify the audience of the conversation. “You are talking to a lawyer”.
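To make the two template types concrete, here is a minimal sketch of how they could be built alongside a no-persona control. The function names and the message layout are assumptions for illustration; only the two sentence patterns ("You are a ..." and "You are talking to a ...") come from the paper's description.

```python
from typing import Dict, List, Optional


def persona_system_prompt(role: Optional[str], mode: str = "speaker") -> Optional[str]:
    """Return a speaker- or audience-specific system prompt, or None for the control."""
    if role is None:
        return None  # control setting: no persona at all
    article = "an" if role[0].lower() in "aeiou" else "a"
    if mode == "speaker":
        return f"You are {article} {role}."
    if mode == "audience":
        return f"You are talking to {article} {role}."
    raise ValueError(f"unknown mode: {mode!r}")


def build_messages(
    question: str, role: Optional[str] = None, mode: str = "speaker"
) -> List[Dict[str, str]]:
    """Put the persona (if any) in the system message and the question in the user message."""
    messages: List[Dict[str, str]] = []
    system = persona_system_prompt(role, mode)
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    return messages


# The same question under the three conditions being compared.
question = "Which organ produces insulin?"
control = build_messages(question)
speaker = build_messages(question, role="doctor", mode="speaker")
audience = build_messages(question, role="lawyer", mode="audience")
```

Keeping the control as a first-class condition matters here, since the paper's headline result is a comparison against no persona at all.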

Prompt template examples

As you can see in the image above, the prompt template was very simple.

The researchers tested the prompts across nine open-source models: FLAN-T5-XXL (11B), Llama-3-Instruct-8B, Llama-3-Instruct-70B, Mistral-7B-Instruct-v0.2, and Qwen2.5-Instruct (3B to 72B).

Below is a graph showing the ten best (yellow) and worst (green) personas. Notice that all values are below zero, indicating that none of the personas led to statistically significant improvements in model performance.

A graph showing the performance of the ten best and worst personas in a prompt

A series of bar charts showing the performance effect of personas on different models
Most of the personas have no impact or a negative impact on the model's performance.

Persona gender performance
Gender neutral roles outperformed masculine roles, which outperformed feminine roles, but with a very small effect size.

The Takeaway

Clearly, the persona prompts tested in this paper didn't lead to consistent performance gains. More importantly, the findings show that predicting which role might yield the best improvement in performance is extremely difficult and not intuitive.

The fact that this paper originally supported persona prompting but now advises against it for accuracy-based tasks tells the whole story.

If I had to push back and advocate for role prompting, one thing I would note is that the templates used were very basic. It would have been interesting to see whether a more detailed persona prompt, with additional context, might have produced different results.

The next paper we’ll look at is Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks, which, as the name suggests, presents both pros and cons regarding the use of personas in prompts.

One key issue this paper addresses is the challenge of assigning a persona that will perform well. As seen in the previous paper, it can be difficult to predict the optimal persona, and LLMs are often highly sensitive to the roles they’re assigned. To address this, the researchers developed a framework they called Jekyll & Hyde.

The framework has a few steps:

  • Persona generator (template below): Use an LLM to generate a persona for a task
  • Solver: Generate solutions to the task through two prompts, one that includes the persona, one that doesn’t
  • Evaluator: Both outputs are sent through an LLM evaluator, and the better solution is chosen
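The three steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: `llm` is any prompt-in, text-out callable you supply, the prompt wording is hypothetical rather than the paper's templates, and the evaluator is reduced to one call that picks "A" or "B" (the paper's evaluator workflow may be more involved).

```python
from typing import Callable

LLM = Callable[[str], str]  # prompt in, text out (supplied by the caller)


def jekyll_and_hyde(task: str, llm: LLM) -> str:
    """Solve a task with and without a generated persona, then let an evaluator choose."""
    # 1. Persona generator: ask the model for a persona suited to the task.
    persona = llm(
        f"Describe, in one sentence, an expert persona suited to solving this task:\n{task}"
    )

    # 2. Solver: one run without the persona, one run with it prepended.
    base_answer = llm(task)
    persona_answer = llm(f"{persona}\n\n{task}")

    # 3. Evaluator: a further LLM call picks the better of the two solutions.
    verdict = llm(
        "Two candidate solutions to the same task are given. "
        "Reply with 'A' or 'B' to indicate the better one.\n"
        f"Task: {task}\nA: {base_answer}\nB: {persona_answer}"
    )
    return persona_answer if verdict.strip().upper().startswith("B") else base_answer
```

Note that every task costs four model calls in this sketch, which is worth keeping in mind when comparing the framework against a single base prompt.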

You can access the persona generator template in PromptHub here.

Persona Generator prompt template in PromptHub

Here's the flow of information through the framework:

Jekyll & Hyde persona prompting framework

Here are the main results from the experiments.

Results from the arithmetic datasets for persona prompting

  • In some cases (AQuA dataset) the Jekyll and Hyde framework leads to large gains
  • In other cases, the difference in performance can be quite small, and sometimes the Base prompt outperforms the Persona prompt (GPT-4 on MultiArith, Llama 3-8B on SingleEq).
  • Jekyll & Hyde outperforms the baselines by an average of 9.98% when using GPT-4 as the backbone model.
  • It isn’t that impressive that Jekyll & Hyde outperforms the other methods considering it includes multiple LLM calls (ensembling and an evaluator step)
  • Every category tested contains a dataset in which Base outperforms Persona, proving again that role-playing prompts can degrade performance
  • Look at how small the gap is between “Persona” and “Base” for GPT-4.

Different persona prompt templates
Interestingly, persona outperformed persona + task description.

Takeaways

This paper shows that even with a framework like Jekyll & Hyde, persona prompting can degrade model performance. Additionally, this study used newer models, like GPT-4, and illustrated that the gap between “Base” prompting and “Persona” prompting is minimal.

This reinforces the idea from the previous paper, that “typical” persona prompting (“You are a lawyer”) isn’t going to improve performance on accuracy-based tasks, especially for newer models.

Taking persona prompting a step further with a framework like Jekyll & Hyde might help increase performance, but since there are multiple components in the framework, it is hard to determine if the gains are due to the persona itself or the other elements (ensembling or the evaluator).

Next up is ExpertPrompting: Instructing Large Language Models to be Distinguished Experts. While this paper came out in late May of 2023, I think there are still some valuable insights.

ExpertPrompting follows a similar process to the Jekyll & Hyde framework, but in a simpler way, with better prompt engineering and a clearer focus.

ExpertPrompting consists of two steps:

  1. An instruction is passed to an LLM to generate an expert identity
  2. The generated identity and original instruction are then sent to a model

The researchers focus on three major aspects when developing ExpertPrompting:

  1. Distinguished: The persona description should be customized to each specific instruction, and the expert should be specialized in that area
  2. Informative: The description should be detailed and comprehensive to cover all the necessary information related to the expert agent
  3. Automatic: The creation of the expert agent description should be automatic and simple

For each task, ExpertPrompting defines an expert identity, which is prepended to the task.
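A minimal sketch of that two-step flow is below. The in-context exemplars, the prompt wording, and the function names are hypothetical stand-ins, not the paper's template, and `llm` is any prompt-in, text-out callable you provide. Placing the identity in the system message is one reasonable choice; prepending it to the instruction in a single message would work the same way.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical (instruction, expert description) pairs used as in-context examples.
EXEMPLARS: Sequence[Tuple[str, str]] = (
    (
        "Explain how vaccines train the immune system.",
        "You are an immunologist with years of experience researching how the "
        "body builds immunity and explaining it clearly to non-specialists.",
    ),
    (
        "List the main risks in a commercial lease.",
        "You are a commercial real estate attorney who reviews leases daily and "
        "knows which clauses most often cause disputes.",
    ),
)


def identity_prompt(instruction: str, exemplars: Sequence[Tuple[str, str]] = EXEMPLARS) -> str:
    """Step 1 prompt: few-shot examples followed by the new instruction."""
    parts = ["For each instruction, write a detailed description of an expert agent who could answer it.", ""]
    for ex_instruction, ex_description in exemplars:
        parts += [f"[Instruction]: {ex_instruction}", f"[Agent Description]: {ex_description}", ""]
    parts += [f"[Instruction]: {instruction}", "[Agent Description]:"]
    return "\n".join(parts)


def expert_prompting_messages(instruction: str, llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Step 2: pair the generated identity with the original instruction."""
    identity = llm(identity_prompt(instruction)).strip()
    return [
        {"role": "system", "content": identity},
        {"role": "user", "content": instruction},
    ]
```

Because the identity is generated per instruction, you can cache it for repeated tasks and only pay for the extra call once.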

Example of a generated persona prompt through the ExpertPrompting framework

The reason I say that this method has better prompt engineering is that the prompt template (copied below) used to generate the expert identity is much more robust, using In-Context Learning to help guide the model.

You can access the ExpertPrompt template in PromptHub here.

ExpertPrompt Template in PromptHub

The researchers tested three prompt engineering methods:

  • Vanilla Prompting: The basic prompt
  • Vanilla Prompting + Static DESC: The basic prompt plus a basic persona prompt addition:
    (“Imaging you are an expert in the regarding field, try to answer the following instruction as professional as possible. {Instruction}”)
  • Expert Prompting: Uses the LLM-generated persona via the template above

They tested across a few models, including GPT-3.5.

The results are below.

Small table of results showing how three different role prompting methods performed
While the models used are much less powerful than what we have today, there are still interesting takeaways:

  • Vanilla Prompting and Vanilla Prompting + Static DESC performed very similarly, which reinforces findings from previous studies that a basic persona prompt doesn’t lead to improved performance.
  • Expert Prompting clearly outperforms both of the other methods.

While we don’t know how the results would look with a newer model like GPT-4o-mini, this paper still shows that a more elaborate and detailed persona prompt outperforms a simple one.

The last resource we’ll look at is from the team at Learn Prompting who were also skeptical of persona prompting improving performance on accuracy based tasks.

Their findings can be summed up in just a few sentences and one graph.

A series of bar charts that represent performance for a variety of different personas and prompt engineering methods

The team tested twelve role prompts on 2,000 MMLU questions using GPT-4-turbo. The results (above) were pretty consistent across all the personas. Plus, the “genius” persona performed worse than the “idiot” persona.
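If you want to run this kind of comparison on your own task, a small harness is enough. The sketch below is an assumption-laden outline, not the Learn Prompting code: `answer` is a callable you supply that takes an optional system prompt plus a question and returns the model's chosen letter, and the persona strings and questions are placeholders.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Question = Tuple[str, str]  # (question text, correct answer letter)


def accuracy_by_persona(
    questions: Sequence[Question],
    personas: Dict[str, Optional[str]],
    answer: Callable[[Optional[str], str], str],
) -> Dict[str, float]:
    """Score every persona (None means no persona) on the same set of questions."""
    if not questions:
        raise ValueError("at least one question is required")
    results: Dict[str, float] = {}
    for name, system_prompt in personas.items():
        correct = sum(
            answer(system_prompt, text).strip().upper() == gold.upper()
            for text, gold in questions
        )
        results[name] = correct / len(questions)
    return results


# Usage with a stub model that ignores the persona and always answers "B".
sample_questions = [("Placeholder question 1", "B"), ("Placeholder question 2", "C")]
sample_personas = {"control": None, "genius": "You are a genius.", "idiot": "You are an idiot."}
scores = accuracy_by_persona(sample_questions, sample_personas, lambda system, q: "B")
```

Always include a no-persona control in the `personas` dictionary; without it, differences between personas say nothing about whether any persona helps at all.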

When role prompting is most useful

While persona prompting shows mixed results in accuracy-based tasks, it is undoubtedly effective on open-ended tasks, like creative writing. If you tell ChatGPT to talk like a pirate, it will talk like a pirate.

So for tasks like content creation or engagement-focused interactions, persona prompting can be a great tool to ensure the model’s response is more aligned with the desired tone and style.

Additionally, persona prompting can support security efforts. Establishing guardrails through a persona in a system prompt is one way to help ensure safer and more controlled outputs.

How to construct effective personas for role prompts

If you decide you want to test out using a persona, I think the ExpertPrompting framework is an effective way to go about it.

Your persona definition should be specific, detailed, and automated.

  1. Specific: The role description should be in the same domain as the task, and should be as specific as possible.
  2. Detailed: The role description should be detailed and comprehensive to cover all the necessary information related to the expert persona. Simple descriptions like “You are a mathematician” don’t perform well.
  3. Automated (and simple): LLM-generated personas outperform human-written ones, and generating them with an LLM is much more scalable.

Leveraging the ExpertPrompt framework to generate your persona is a great starting point.

You can access the ExpertPrompt template in PromptHub here.

Conclusion

To put it briefly, after diving into all the research, here are the takeaways:

  1. Persona prompting is effective on open-ended tasks (Ex: creative writing)
  2. Persona prompting probably won’t help much on accuracy-based tasks (Ex: classification), especially for newer models
  3. If using a persona, it should be specific, detailed, and ideally automated, with the ExpertPrompting framework as a strong starting point
Dan Cleary
Founder