One of the best ways to get better outputs from LLMs is to include examples in your prompt. This method is called few-shot prompting (a “shot” is an example). By providing examples in your prompt, you're showing the model exactly what you're looking for in terms of output structure, tone, and style.
This guide dives deep into everything related to few-shot prompting (also known as few-shot learning or in-context learning). We’ll cover the different ways to use this method, when you should use it, common questions (like how many examples works best), and limitations and biases.
We’ll also be looking at tons of concrete examples and will provide free templates.
Whether you’re new to prompt engineering or currently running prompts in production, you’ll be able to get value from this guide. After reading this, you will walk away with actionable tactics you can use to enhance your prompts and get better outputs from LLMs.
What is few shot prompting?
Few-shot prompting is a prompt engineering technique where you insert examples into your prompt, showing the model what you want the output to look and sound like.
This method builds on LLMs’ ability to learn and generalize from a small amount of data, which makes it particularly helpful when you don’t have enough data to fine-tune a model.
Here is a very simple example of a few-shot prompt.
The goal of the prompt is for the LLM to determine the sentiment of a movie review.
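Notice that there are no explicit instructions, just labeled example pairs (the reviews here are illustrative):

```
Review: The plot was predictable and the acting was flat.
Sentiment: negative

Review: An absolute triumph. I was hooked from the first scene.
Sentiment: positive

Review: It was fine. Some good moments, some slow ones.
Sentiment: neutral

Review: The cinematography alone makes this worth watching.
Sentiment:
```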
As you can see, we send three example pairs of data. This approach not only helps the model learn what we would deem positive, negative, or neutral, but it also shows that our desired output format is a single word, all lowercase.
Zero shot and one shot prompting
You may also hear about “zero-shot prompting” or “one-shot prompting”. These are prompts with zero or one example, respectively, rather than a few.
Zero shot vs few shot prompting
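A zero-shot version of the same sentiment task has no examples at all, just the instruction:

```
Classify the sentiment of the following movie review as positive, negative, or neutral.

Review: The cinematography alone makes this worth watching.
Sentiment:
```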
One shot vs few shot prompting
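A one-shot version includes exactly one example pair before the input we care about:

```
Review: The plot was predictable and the acting was flat.
Sentiment: negative

Review: The cinematography alone makes this worth watching.
Sentiment:
```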
Few shot prompting examples
Content creation
Let’s say you work at a digital marketing firm and you want to use AI to generate customized content for each of your clients. Let’s use few-shot prompting to create a template that both:
- Creates content in the correct tone and style for the client
- Is scalable and adaptable enough to be used for any client
Here’s what the prompt might look like:
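```
You are a content writer for {{client_name}}. Write content for the brief below, matching the tone and style of the client's previous work.

Here are examples of previous briefs and the content we delivered for them:

"""
Brief: {{example_brief_1}}
Content: {{example_content_1}}
"""

"""
Brief: {{example_brief_2}}
Content: {{example_content_2}}
"""

Now write the content for this brief:

Brief: {{brief}}
```

The {{variable}} placeholders are the pieces you'd swap out per client; the double-curly syntax is just the convention I'm using here.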
By passing along previous briefs and the content generated from those briefs, the model gets an understanding of the tone and style for the specific client. I wrapped the examples in delimiters (three quotation marks) to format the prompt and help the model understand which parts of the prompt are examples versus instructions.
This prompt, while basic, is adaptable to any client; all we would need to do is update the variables. You could even turn the prompt into a PromptHub form and share it with your team. By surfacing only the variables that need to change based on the client, anyone on your team can run the few-shot prompt just by updating which client they are working on and what the brief is.
Code generation few shot prompt
Let’s say we want to use an LLM to write a Python function that calculates the factorial of a number.
Here's a simple zero-shot prompt we could use:
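```
Write a Python function that calculates the factorial of a number.
```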
And here’s a few-shot version. The example function we include is illustrative; what matters is that it demonstrates the docstring and input-validation style we want:
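```
Write a Python function that calculates the factorial of a number.

Here is an example of the style I'm looking for:

def add(a, b):
    """Return the sum of two integers.

    Args:
        a (int): The first number.
        b (int): The second number.

    Returns:
        int: The sum of a and b.
    """
    if not isinstance(a, int) or not isinstance(b, int):
        raise TypeError("Both arguments must be integers")
    return a + b
```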
The zero-shot prompt returned something along these lines:
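```python
def factorial(n):
    # Recursive approach; note there is no handling for negative inputs
    if n == 0:
        return 1
    return n * factorial(n - 1)
```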
The few-shot prompt returned something more thorough:
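```python
def factorial(n):
    """Calculate the factorial of a non-negative integer.

    Args:
        n (int): The number to calculate the factorial of.

    Returns:
        int: The factorial of n.

    Raises:
        ValueError: If n is negative.
    """
    if not isinstance(n, int):
        raise TypeError("Input must be an integer")
    if n < 0:
        raise ValueError("Input must be a non-negative integer")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
```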
Taking a look at the outputs, a few things stand out.
- The zero-shot prompt produced a succinct recursive factorial function, but didn't add input validation for negative integers.
- The few-shot prompt output, however, included input checks and used an iterative approach with a docstring for clarity, aligning with Python’s preference for readability.
Overall, the few-shot prompt returned a more robust function: it's more reliable, more maintainable, and validates its inputs.
This is a quick example of how just a little bit of work via few-shot prompting can make a material difference in the outputs you get from LLMs.
Few shot prompting with multiple prompts
Another, slightly more complex, way to implement few shot prompting is by using multiple prompts to provide the examples.
This involves “pre-baking” a few user and AI messages before sending the final message, which is the one we want the AI to respond to. We’ll use PromptHub’s chat testing feature to do this.
Following the movie sentiment example, this is how you would implement it with multiple messages.
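If you're implementing this in code rather than in the PromptHub UI, it's just a list of alternating user and assistant messages. Here's a minimal sketch using the OpenAI Python SDK (the model name is a placeholder; use whichever model you're testing):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # Three "pre-baked" example exchanges
        {"role": "user", "content": "The plot was predictable and the acting was flat."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "An absolute triumph. I was hooked from the first scene."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "It was fine. Some good moments, some slow ones."},
        {"role": "assistant", "content": "neutral"},
        # The message we actually want classified
        {"role": "user", "content": "The cinematography alone makes this worth watching."},
    ],
)

print(response.choices[0].message.content)  # e.g. "positive"
```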
In this case, we're sending the model multiple messages. Rather than just showing it how it should respond, we've gone ahead and actually written the responses ourselves. All these messages are sent at once, giving the model an even better understanding of how it should respond.
So which method is better? It depends, but here are a few things to keep in mind.
Multiple messages might be better when:
- Simulating Interaction: The real-world application involves a back-and-forth interaction, such as in a customer service chatbot, where the model needs to understand and respond within the flow of a conversation.
- Contextual Continuity: You're aiming to maintain a narrative or contextual continuity over several interactions, and the model needs to generate responses that are coherent within the ongoing sequence.
- Incremental Complexity: The task benefits from a step-by-step buildup of context, where each message might add a layer of complexity or nuance to the conversation that a single prompt might not encapsulate.
A single prompt might be better when:
- Streamlining Processing: Efficiency is a priority, and you want the model to process the examples and generate a response in one go.
- Uniformity in Output: You're seeking a consistent style or format in the outputs, which may be more reliably produced if all examples are provided in a single prompt.
I’d recommend testing both and seeing how they perform. You can use PromptHub’s testing tools to run the two methods side by side.
Few shot chain-of-thought
Another common way of implementing few-shot prompting is alongside chain-of-thought prompting. In fact, most chain-of-thought prompts leverage few-shot examples to help show the model how to reason.
For example, let's say the task at hand is to concatenate the last letter of each word in a list. For an input of "heat, fish, basketball", the output would be "thl".
Here's an example of a few-shot chain-of-thought prompt you could use for this task:
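```
Q: Concatenate the last letters of the words in "cake, apple, sun".
A: The last letter of "cake" is "e". The last letter of "apple" is "e". The last letter of "sun" is "n". Concatenating them gives "een". The answer is een.

Q: Concatenate the last letters of the words in "bird, ramp".
A: The last letter of "bird" is "d". The last letter of "ramp" is "p". Concatenating them gives "dp". The answer is dp.

Q: Concatenate the last letters of the words in "heat, fish, basketball".
A:
```

Each example walks through the reasoning step by step before committing to a final answer, which is exactly the behavior we want the model to imitate on the last question.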
Common questions about few shot prompting
How many examples should I include?
Adding more examples does not necessarily improve accuracy; in some cases it can actually reduce it. Multiple research papers point to major gains from the first two examples, followed by a plateau. After two examples, you might just be burning tokens.
Does the order of examples matter?
Yes, the order matters, and the extent to which it affects output quality depends on the model you’re using. The paper Calibrate Before Use: Improving Few-Shot Performance of Language Models demonstrated this by altering the order of the same examples in a prompt for GPT-3.
The researchers found that the model's predictions varied dramatically based on the sequence of examples. In some instances, the right permutation of examples led to near state-of-the-art performance, while others fell to near-chance levels. It's probably safe to assume that ‘smarter’ models are influenced less by ordering. The graph below shows more details.
One strategy worth testing is placing your most critical example last in the order. LLMs have been known to place significant weight on the last piece of information they process.
What about prompt format: should the examples or the instructions come first?
While the more typical approach is to lead with the instructions followed by the examples, either approach can work, and the best method might vary based on the model.
If you place the examples second and it seems like the model is either overemphasizing the last example or 'forgetting' the instructions, then consider having the instructions come last.
Another approach is to omit the instructions completely, like we did for our movie sentiment classifier. If the task is simple enough that the model can infer what to do, then basic instructions may not be necessary at all.
Few shot prompting with reasoning models
It is unclear how effective few shot prompting is for newer reasoning models like o1-preview and o1-mini from OpenAI.
For example, a recent paper, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond, showed that 5-shot prompting actually reduced performance compared to a minimal prompt baseline.
This actually aligns with OpenAI's guidance:
“Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.”
So if you want to try few-shot prompting with reasoning models, start with just an example or two and see how things go.
For more info on prompt engineering with reasoning models and more data, check out our post: Prompt Engineering with Reasoning Models.
Five basic principles of few shot prompting for beginners
This is the part of the post where I say, "if you can only remember one section, remember this".
Here are the five major design principles to keep in mind when using few shot prompting.
- Use at least 2 examples, but you probably don't need more than 5
- Your examples need to be diverse
- Use both positive and negative examples - the LLM can learn a lot from what a "bad" output looks like
- Randomly order your examples (see the sketch after this list)
- Make sure your few shot examples follow a common format
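To make a few of these principles concrete, here's a small helper that takes a diverse set of labeled examples, shuffles them, and renders them in a common format. The formatting choices are just one reasonable option:

```python
import random

# A diverse set of labeled examples (principle #2), covering every label
examples = [
    ("The plot was predictable and the acting was flat.", "negative"),
    ("An absolute triumph. I was hooked from the first scene.", "positive"),
    ("It was fine. Some good moments, some slow ones.", "neutral"),
]

def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from labeled examples plus the real input."""
    shuffled = random.sample(examples, k=len(examples))  # random order (principle #4)
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in shuffled]
    blocks.append(f"Review: {query}\nSentiment:")  # the actual input goes last
    return "\n\n".join(blocks)  # every example shares one format (principle #5)

print(build_few_shot_prompt(examples, "The cinematography alone makes this worth watching."))
```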
When to use few shot prompting
Okay great, but when should you use few shot prompting? Luckily, few shot prompting can be applied to almost any prompt and will help you get better and more consistent outputs. Here are a few use cases where few shot prompting can be particularly helpful.
- Specialized Domains: When working in specialized fields such as legal, medical, or technical domains, where gathering vast amounts of data can be difficult, few shot prompting allows for high-quality, domain-specific outputs without the need for extensive datasets.
- Dynamic Content Creation: Ideal for tasks like content generation, where a consistent style and tone are paramount.
- Strict Output Structure Requirements: Few shot prompting is particularly helpful in showing the model how you’d like your outputs to be structured.
- Customized User Experiences: In personalized applications, such as chatbots or recommendation systems, where the AI needs to quickly adjust to individual user preferences and inputs.
Why use few shot prompting
Here are some of the top reasons to try out few shot prompting:
- Resource Efficiency: Few-shot prompting requires only a few pieces of example data.
- Time Savings: It accelerates the model's ability to adapt to new tasks, which means quicker deployment times and faster time-to-market for AI-powered features and products.
- Cost Reduction: Compared to the time spent gathering and labeling data to fine-tune a model, few-shot prompting is considerably cheaper, especially for smaller teams.
- Small Lift, Big Gains: Setting up and testing few shot prompting is relatively easy and can help you get much better outputs.
An example from the research
It wouldn’t be a PromptHub article if we didn’t dive deep into some research. We’ll be checking out this paper from April 2024, from researchers at the University of London: The Fact Selection Problem in LLM-Based Program Repair.
Overview
The paper centers on the use of various “facts” (examples) in prompts used to fix bugs in open-source projects on GitHub.
Methodology
- Fact Collection: The researchers gathered a set of bug-related examples. These included details about buggy code, error messages, and other types of documentation that could be helpful when solving future bugs.
- Prompt Construction: The examples were incorporated into the prompts using few-shot prompting.
- Evaluating Impact: The researchers then evaluated how different combinations of these examples affected the model’s ability to correctly solve the bugs.
Findings
- Utility of Examples: Each example contributed uniquely, highlighting the importance of having a diverse set of examples.
- More examples doesn’t mean better outputs: Interestingly, adding more examples didn’t always lead to better outcomes, and sometimes degraded performance when the prompt became too cluttered or complex (see graph below).
- Fact Selection Model: The researchers built a statistical model named MANIPLE that algorithmically selected the most effective subset of facts for each bug, optimizing the prompt's effectiveness.
Overview of MANIPLE
The primary goal of MANIPLE is to maximize the gains from few shot prompting by identifying the optimal subset of examples that are most relevant and effective for each bug.
How MANIPLE works
- Statistical Modeling: MANIPLE examines patterns from past bug fixes to decide which examples to include to get successful outcomes.
- Probabilistic Inference: The model isolates each example to determine how much it contributed positively to the successful bug fix.
- Fact Selection Optimization: Based on these probabilities, MANIPLE selects the subset of facts that maximizes the likelihood of a successful repair (a toy sketch of this kind of selection loop follows below).
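The paper's actual model is more involved than this, but here's a toy sketch of the general shape of the idea: greedily grow a subset of facts, keeping a fact only if it improves an estimated probability of a successful fix. Everything here, including the `estimate_success` scoring function, is a hypothetical illustration, not the paper's implementation:

```python
def select_facts(facts, estimate_success, max_facts=5):
    """Greedily pick a subset of facts that maximizes estimated repair success.

    `estimate_success(subset)` stands in for a learned statistical model that
    scores a candidate subset of facts; it is assumed here, not provided.
    """
    selected = []
    best_score = estimate_success(selected)
    while len(selected) < max_facts:
        candidates = [f for f in facts if f not in selected]
        if not candidates:
            break
        # Score each remaining fact and keep the one with the biggest gain
        score, fact = max(
            ((estimate_success(selected + [f]), f) for f in candidates),
            key=lambda pair: pair[0],
        )
        if score <= best_score:
            break  # no remaining fact improves the estimate, so stop adding
        selected.append(fact)
        best_score = score
    return selected
```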
The MANIPLE framework led to a 17% increase in bug fixes. While this setup is a little advanced, it is a great example of how you can extend few shot prompting to achieve more significant results.
Limitations and challenges of few shot prompting
As great as few shot prompting is, it isn’t perfect. The biggest limitation is its dependency on the quality and variety of the examples provided. Garbage in, garbage out, as they say.
Sometimes the examples can even degrade performance, or send the model in the wrong direction.
There is also the risk of overfitting, where the model fails to generalize from the examples and produces outputs that mimic them too closely. Additionally, there are some biases to be aware of:
- Majority Label Bias: The model tends to favor answers that are more frequent in the prompt. For example, going back to our movie sentiment task, if our prompt includes more positive than negative examples, the model may be biased towards predicting a positive sentiment. The magnitude of this bias varies based on the model.
- Recency Bias: LLMs are known to favor the last chunk of information they receive. Revisiting our movie sentiment prompt, if the last few examples in a prompt are negative, the model may be more likely to predict a negative sentiment.
Wrapping up
There you have it: a comprehensive guide to what we believe is the most effective prompt engineering method out there. Few-shot prompting offers the best bang for your buck, given how accessible it is and how drastically it can enhance output quality. We hope this helps you get better outputs!