One of the best ways to get better outputs from LLMs is to include examples in your prompt. This method is called few-shot prompting (a “shot” is an example). By providing examples in your prompt, you're showing the model exactly what you're looking for in terms of output structure, tone, and style.
This guide dives deep into everything related to few-shot prompting (also known as few-shot learning or in-context learning). We’ll cover the different ways to use this method, when you should use it, common questions (like how many examples works best), and limitations and biases.
We’ll also be looking at tons of concrete examples and will provide free templates.
Whether you’re new to prompt engineering or currently running prompts in production, you’ll be able to get value from this guide. After reading this, you will walk away with actionable tactics you can use to enhance your prompts and get better outputs from LLMs.
What is few shot prompting?
Few-shot prompting is a prompt engineering technique where you insert examples into your prompt, showing the model what you want the output to look and sound like.
This method builds on LLMs’ ability to learn and generalize from a small amount of data, which makes it particularly helpful when you don’t have enough data to fine-tune a model.
Here is a very simple example of a few-shot prompt.
The goal of the prompt is for the LLM to determine the sentiment of a movie review.
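Notice that there are no explicit instructions, just labeled example pairs (the reviews here are illustrative):

```
Review: The plot was predictable and the acting was flat.
Sentiment: negative

Review: An absolute triumph. I was hooked from the first scene.
Sentiment: positive

Review: It was fine. Some good moments, some slow ones.
Sentiment: neutral

Review: The cinematography alone makes this worth watching.
Sentiment:
```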
As you can see, we send three example pairs of data. This approach not only helps the model learn what we would deem positive, negative, or neutral, but it also shows that our desired output format is a single word, all lowercase.
Zero shot and one shot prompting
You may also hear about “zero-shot prompting” or “one-shot prompting”. These are prompts with zero or one example, respectively, rather than a few.
Zero shot vs few shot prompting
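A zero-shot version of the same sentiment task has no examples at all, just the instruction:

```
Classify the sentiment of the following movie review as positive, negative, or neutral.

Review: The cinematography alone makes this worth watching.
Sentiment:
```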
One shot vs few shot prompting
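A one-shot version includes exactly one example pair before the input we care about:

```
Review: The plot was predictable and the acting was flat.
Sentiment: negative

Review: The cinematography alone makes this worth watching.
Sentiment:
```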
Few shot prompting examples
Content creation
Let’s say you work at a digital marketing firm and you want to use AI to generate customized content for each of your clients. Let’s use few-shot prompting to create a template that both:
- Creates content in the correct tone and style for the client
- Is scalable and adaptable enough to be used for any client
Here’s what the prompt might look like:
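```
You are a content writer for {{client_name}}. Write content for the brief below, matching the tone and style of the client's previous work.

Here are examples of previous briefs and the content we delivered for them:

"""
Brief: {{example_brief_1}}
Content: {{example_content_1}}
"""

"""
Brief: {{example_brief_2}}
Content: {{example_content_2}}
"""

Now write the content for this brief:

Brief: {{brief}}
```

The {{variable}} placeholders are the pieces you'd swap out per client; the double-curly syntax is just the convention I'm using here.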
By passing along previous briefs and the content generated from those briefs, the model gets an understanding of the tone and style for the specific client. I wrapped the examples in delimiters (three quotation marks) to format the prompt and help the model understand which parts of the prompt are examples versus instructions.
This prompt, while basic, is adaptable to any client; all we would need to do is update the variables. You could even turn the prompt into a PromptHub form and share it with your team. By surfacing only the variables that need to change based on the client, anyone on your team can run the few-shot prompt just by updating which client they are working on and what the brief is.
Code generation few shot prompt
Let’s say we want to use an LLM to write a Python function that calculates the factorial of a number.
Here's a simple zero-shot prompt we could use:
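```
Write a Python function that calculates the factorial of a number.
```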
And here’s a few-shot version. The example function we include is illustrative; what matters is that it demonstrates the docstring and input-validation style we want:
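```
Write a Python function that calculates the factorial of a number.

Here is an example of the style I'm looking for:

def add(a, b):
    """Return the sum of two integers.

    Args:
        a (int): The first number.
        b (int): The second number.

    Returns:
        int: The sum of a and b.
    """
    if not isinstance(a, int) or not isinstance(b, int):
        raise TypeError("Both arguments must be integers")
    return a + b
```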
The zero-shot prompt returned something along these lines:
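```python
def factorial(n):
    # Recursive approach; note there is no handling for negative inputs
    if n == 0:
        return 1
    return n * factorial(n - 1)
```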
The few-shot prompt returned something more thorough:
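```python
def factorial(n):
    """Calculate the factorial of a non-negative integer.

    Args:
        n (int): The number to calculate the factorial of.

    Returns:
        int: The factorial of n.

    Raises:
        ValueError: If n is negative.
    """
    if not isinstance(n, int):
        raise TypeError("Input must be an integer")
    if n < 0:
        raise ValueError("Input must be a non-negative integer")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
```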
Taking a look at the outputs, a few things stand out.
- The zero-shot prompt produced a succinct recursive factorial function, but didn't add input validation for negative integers.
- The few-shot prompt output, however, included input checks and used an iterative approach with a docstring for clarity, aligning with Python’s preference for readability.
Overall, the few-shot prompt returned a more robust function: it's more reliable, more maintainable, and validates its inputs.
This is a quick example of how just a little bit of work via few-shot prompting can make a material difference in the outputs you get from LLMs.
Few shot prompting with multiple prompts
Another, slightly more complex, way to implement few shot prompting is by using multiple prompts to provide the examples.
This involves “pre-baking” a few user and AI messages before sending the final message, which is the one we want the AI to respond to. We’ll use PromptHub’s chat testing feature to do this.
Following the movie sentiment example, this is how you would implement it with multiple messages.
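If you're implementing this in code rather than in the PromptHub UI, it's just a list of alternating user and assistant messages. Here's a minimal sketch using the OpenAI Python SDK (the model name is a placeholder; use whichever model you're testing):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # Three "pre-baked" example exchanges
        {"role": "user", "content": "The plot was predictable and the acting was flat."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "An absolute triumph. I was hooked from the first scene."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "It was fine. Some good moments, some slow ones."},
        {"role": "assistant", "content": "neutral"},
        # The message we actually want classified
        {"role": "user", "content": "The cinematography alone makes this worth watching."},
    ],
)

print(response.choices[0].message.content)  # e.g. "positive"
```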
In this case, we're sending the model multiple messages. Rather than just showing it how it should respond, we've gone ahead and actually written the responses ourselves. All these messages are sent at once, giving the model an even better understanding of how it should respond.
So which method is better? It depends, but here are a few things to keep in mind.
Multiple messages might be better when:
- Simulating Interaction: The real-world application involves a back-and-forth interaction, such as in a customer service chatbot, where the model needs to understand and respond within the flow of a conversation.
- Contextual Continuity: You're aiming to maintain a narrative or contextual continuity over several interactions, and the model needs to generate responses that are coherent within the ongoing sequence.
- Incremental Complexity: The task benefits from a step-by-step buildup of context, where each message might add a layer of complexity or nuance to the conversation that a single prompt might not encapsulate.
A single prompt might be better when:
- Streamlining Processing: Efficiency is a priority, and you want the model to process the examples and generate a response in one go.
- Uniformity in Output: You're seeking a consistent style or format in the outputs, which may be more reliably produced if all examples are provided in a single prompt.
I’d recommend testing both and seeing how they perform. You can use PromptHub’s testing tools to run the two methods side by side.
Few shot chain-of-thought
Another common way of implementing few-shot prompting is alongside chain-of-thought prompting. In fact, most chain-of-thought prompts leverage few-shot examples to help show the model how to reason.
For example, let's say the task at hand is to concatenate the last letter of each word in a list. For an input of "heat, fish, basketball", the output would be "thl".
Here's an example of a few-shot chain-of-thought prompt you could use for this task:
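```
Q: Concatenate the last letters of the words in "cake, apple, sun".
A: The last letter of "cake" is "e". The last letter of "apple" is "e". The last letter of "sun" is "n". Concatenating them gives "een". The answer is een.

Q: Concatenate the last letters of the words in "bird, ramp".
A: The last letter of "bird" is "d". The last letter of "ramp" is "p". Concatenating them gives "dp". The answer is dp.

Q: Concatenate the last letters of the words in "heat, fish, basketball".
A:
```

Each example walks through the reasoning step by step before committing to a final answer, which is exactly the behavior we want the model to imitate on the last question.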
Common questions about few shot prompting
How many examples should I include?
Adding more examples does not necessarily improve accuracy; in some cases it can actually reduce it. Multiple research papers point to major gains from the first two examples, followed by a plateau. After two examples, you might just be burning tokens.
Does the order of examples matter?
Yes, the order matters, and the extent to which it affects output quality depends on the model you’re using. The paper Calibrate Before Use: Improving Few-Shot Performance of Language Models demonstrated this by altering the order of the same examples in a prompt for GPT-3.
The researchers found that the model's predictions varied dramatically based on the sequence of examples. In some instances, the right permutation of examples led to near state-of-the-art performance, while others fell to near-chance levels. It's probably safe to assume that ‘smarter’ models are influenced less by ordering. The graph below shows more details.
One strategy worth testing is placing your most critical example last in the order. LLMs have been known to place significant weight on the last piece of information they process.
What about prompt format: should the examples or the instructions come first?
While the more typical approach is to lead with the instructions followed by the examples, either approach can work, and the best method might vary based on the model.
If you place the examples second and it seems like the model is either overemphasizing the last example or 'forgetting' the instructions, then consider having the instructions come last.
Another approach is to omit the instructions completely, like we did for our movie sentiment classifier. If the task is simple enough that the model can infer what to do, then basic instructions may not be necessary at all.
Few shot prompting with reasoning models
It is unclear how effective few shot prompting is for newer reasoning models like o1-preview and o1-mini from OpenAI.
For example, a recent paper, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond, showed that 5-shot prompting actually reduced performance compared to a minimal prompt baseline.
This actually aligns with OpenAI's guidance:
“Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.”
So if you want to try few-shot prompting with reasoning models, start with just an example or two and see how things go.
For more info on prompt engineering with reasoning models and more data, check out our post: Prompt Engineering with Reasoning Models.
Five basic principles of few shot prompting for beginners
This is the part of the post where I say, "if you can only remember one section, remember this".
Here are the five major design principles to keep in mind when using few shot prompting.
- Use at least 2 examples, but you probably don't need more than 5
- Your examples need to be diverse
- Use both positive and negative examples - the LLM can learn a lot from what a "bad" output looks like
- Randomly order your examples (see the sketch after this list)
- Make sure your few shot examples follow a common format
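To make a few of these principles concrete, here's a small helper that takes a diverse set of labeled examples, shuffles them, and renders them in a common format. The formatting choices are just one reasonable option:

```python
import random

# A diverse set of labeled examples (principle #2), covering every label
examples = [
    ("The plot was predictable and the acting was flat.", "negative"),
    ("An absolute triumph. I was hooked from the first scene.", "positive"),
    ("It was fine. Some good moments, some slow ones.", "neutral"),
]

def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from labeled examples plus the real input."""
    shuffled = random.sample(examples, k=len(examples))  # random order (principle #4)
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in shuffled]
    blocks.append(f"Review: {query}\nSentiment:")  # the actual input goes last
    return "\n\n".join(blocks)  # every example shares one format (principle #5)

print(build_few_shot_prompt(examples, "The cinematography alone makes this worth watching."))
```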
When to use few shot prompting
Okay great, but when should you use few shot prompting? Luckily, few shot prompting can be applied to almost any prompt and will help you get better and more consistent outputs. Here are a few use cases where few shot prompting can be particularly helpful.
- Specialized Domains: When working in specialized fields such as legal, medical, or technical domains, where gathering vast amounts of data can be difficult, few shot prompting allows for high-quality, domain-specific outputs without the need for extensive datasets.
- Dynamic Content Creation: Ideal for tasks like content generation, where a consistent style and tone are paramount.
- Strict Output Structure Requirements: Few shot prompting is particularly helpful in showing the model how you’d like your outputs to be structured.
- Customized User Experiences: In personalized applications, such as chatbots or recommendation systems, where the AI needs to quickly adjust to individual user preferences and inputs.
Why use few shot prompting
Here are some of the top reasons to try out few shot prompting:
- Resource Efficiency: Few-shot prompting requires only a few pieces of example data.
- Time Savings: It accelerates the model's ability to adapt to new tasks, which means quicker deployment times and faster time-to-market for AI-powered features and products.
- Cost Reduction: Compared to the time spent gathering and labeling data to fine-tune a model, few-shot prompting is considerably cheaper, especially for smaller teams.
- Small Lift, Big Gains: Setting up and testing few shot prompting is relatively easy and can help you get much better outputs.
An example from the research
It wouldn’t be a PromptHub article if we didn’t dive deep into some research. We’ll be checking out this paper from April 2024, from researchers at the University of London: The Fact Selection Problem in LLM-Based Program Repair.
Overview
The paper centers on the use of various “facts” (examples) in prompts used to fix bugs in open-source projects on GitHub.
Methodology
- Fact Collection: The researchers gathered a set of bug-related examples. These included details about buggy code, error messages, and other types of documentation that could be helpful when solving future bugs.
- Prompt Construction: The examples were incorporated into the prompts using few-shot prompting.
- Evaluating Impact: The researchers then evaluated how different combinations of these examples affected the model’s ability to correctly solve the bugs.
Findings
- Utility of Examples: Each example contributed uniquely, highlighting the importance of having a diverse set of examples.
- More examples doesn’t mean better outputs: Interestingly, adding more examples didn’t always lead to better outcomes, and sometimes degraded performance when the prompt became too cluttered or complex (see graph below).
- Fact Selection Model: The researchers built a statistical model named MANIPLE that algorithmically selected the most effective subset of facts for each bug, optimizing the prompt's effectiveness.
Overview of MANIPLE
The primary goal of MANIPLE is to maximize the gains from few shot prompting by identifying the optimal subset of examples that are most relevant and effective for each bug.
How MANIPLE works
- Statistical Modeling: MANIPLE examines patterns from past bug fixes to decide which examples to include to get successful outcomes.
- Probabilistic Inference: The model isolates each example to determine how much it contributed positively to the successful bug fix.
- Fact Selection Optimization: Based on these probabilities, MANIPLE selects the subset of facts that maximizes the likelihood of a successful repair (a toy sketch of this kind of selection loop follows below).
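The paper's actual model is more involved than this, but here's a toy sketch of the general shape of the idea: greedily grow a subset of facts, keeping a fact only if it improves an estimated probability of a successful fix. Everything here, including the `estimate_success` scoring function, is a hypothetical illustration, not the paper's implementation:

```python
def select_facts(facts, estimate_success, max_facts=5):
    """Greedily pick a subset of facts that maximizes estimated repair success.

    `estimate_success(subset)` stands in for a learned statistical model that
    scores a candidate subset of facts; it is assumed here, not provided.
    """
    selected = []
    best_score = estimate_success(selected)
    while len(selected) < max_facts:
        candidates = [f for f in facts if f not in selected]
        if not candidates:
            break
        # Score each remaining fact and keep the one with the biggest gain
        score, fact = max(
            ((estimate_success(selected + [f]), f) for f in candidates),
            key=lambda pair: pair[0],
        )
        if score <= best_score:
            break  # no remaining fact improves the estimate, so stop adding
        selected.append(fact)
        best_score = score
    return selected
```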
The MANIPLE framework led to a 17% increase in bug fixes. While this setup is a little advanced, it is a great example of how you can extend few shot prompting to achieve more significant results.
Limitations and challenges of few shot prompting
As great as few shot prompting is, it isn’t perfect. The biggest limitation is its dependency on the quality and variety of the examples provided. Garbage in, garbage out, as they say.
Sometimes the examples can even degrade performance, or send the model in the wrong direction.
There is also the risk of overfitting, where the model fails to generalize from the examples and produces outputs that mimic them too closely. Additionally, there are some biases to be aware of:
- Majority Label Bias: The model tends to favor answers that are more frequent in the prompt. For example, going back to our movie sentiment task, if our prompt includes more positive than negative examples, the model may be biased towards predicting a positive sentiment. The magnitude of this bias varies based on the model.
- Recency Bias: LLMs are known to favor the last chunk of information they receive. Revisiting our movie sentiment prompt, if the last few examples in a prompt are negative, the model may be more likely to predict a negative sentiment.
Wrapping up
There you have it: a comprehensive guide to what we believe is the most effective prompt engineering method out there. Few-shot prompting offers the best bang for your buck, given how accessible it is and how drastically it can enhance output quality. We hope this helps you get better outputs!