
Large Language Models (LLMs) have an extensive knowledge base, having been trained on virtually all text available on the internet. When we prompt LLMs, they go to specific parts of that knowledge base to retrieve information. Sometimes all it takes is a misused word or two to send the model in the wrong direction; models are sensitive.

One strategy to better leverage an LLM’s knowledge is to have it generate related knowledge before answering the question at hand or completing the task. This method is known as Generated Knowledge Prompting.

What is Generated Knowledge Prompting?

Generated Knowledge Prompting is a prompt engineering method that first prompts the LLM to generate useful knowledge related to the task, and then incorporates that knowledge into the prompt alongside the question or task description.

Generated Knowledge Prompting was first written about in a paper from 2022. It is particularly helpful for tasks that require a deep understanding of context, like generating code inside a codebase, but it can be used across a wide range of tasks.

Let’s look at a quick example: a customer is using a chatbot to ask about rebooking a flight.

Customer question

"What are the rebooking options if my flight from New York to London is canceled?"

Prompt to generate knowledge

"Retrieve current UK travel restrictions for passengers flying from New York and check the availability of the next flights from New York to London."

Final integrated prompt

Knowledge: "The current UK travel restrictions allow only limited flights. The next available flight from New York to London is on [date].
User Query: What are the rebooking options for a passenger whose flight has been canceled?"

Originally, Generated Knowledge Prompting was designed as a two-step process involving separate prompts.

Information flow for generated knowledge prompting
Original design of Generated Knowledge Prompting: knowledge was generated in one prompt, and then included as context in a second prompt
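For concreteness, here is a minimal sketch of that two-step flow using the flight example above. It assumes the OpenAI Python SDK; the model name and the exact prompt wording are illustrative, not prescribed by the paper.

```python
# A minimal sketch of the original two-step flow, using the flight example above.
# Assumes the OpenAI Python SDK; model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

question = "What are the rebooking options if my flight from New York to London is canceled?"

# Step 1: prompt the model to generate knowledge related to the question.
knowledge = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "Generate relevant background knowledge (current UK travel restrictions, "
            f"availability of flights from New York to London) for this question:\n{question}"
        ),
    }],
).choices[0].message.content

# Step 2: include the generated knowledge as context alongside the original question.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Knowledge: {knowledge}\n\nUser Query: {question}",
    }],
).choices[0].message.content

print(answer)
```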

However, it's possible to streamline this into a single prompt, which aligns closely with another method we really like called Analogical Prompting.

Analogical prompting template in PromptHub
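As a rough illustration, here is what a hypothetical single-prompt version of the flight example could look like, with the knowledge-generation step folded into the same request (the wording is ours, not from the paper).

```python
# Hypothetical single-prompt variant: knowledge generation and the final answer
# are requested in one prompt instead of two. The wording is illustrative.
single_prompt = (
    "Step 1: List the facts you know that are relevant to the question below "
    "(current UK travel restrictions, availability of flights from New York to "
    "London, typical airline rebooking policies).\n"
    "Step 2: Using those facts, answer the question.\n\n"
    "Question: What are the rebooking options if my flight from New York to "
    "London is canceled?"
)
```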

Benefits of Generated Knowledge Prompting

Generated Knowledge Prompting can be helpful in a few ways:

  • Higher Accuracy: The additional context helps the model provide more precise and relevant answers
  • Adaptability: Generated Knowledge Prompting enables models to adapt to new information quickly without needing extensive retraining or fine-tuning
  • Depth of Understanding: With the proper guardrails in place, models can explore topics in greater depth

How the model generates knowledge

The original researchers generated knowledge by prompting the model with an instruction, a few demonstrations, and new questions with placeholders.

2 examples of prompts used to generate knowledge in generated knowledge prompting
On both sides you can see that the model is being shown how to generate knowledge via few-shot prompting, and then is prompted to generate new knowledge for the given question.

The demonstrations in this case were human-written, but you could use an LLM for this, as long as you verify the data.
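To make that concrete, here is a rough sketch of a knowledge-generation prompt in that style: an instruction, a few demonstrations, and the new question left for the model to complete. The demonstrations below are our own illustrations, not the ones used in the paper.

```python
# Sketch of a few-shot knowledge-generation prompt: an instruction, a couple of
# demonstrations, and a placeholder for the new question. The demonstrations are
# illustrative, not taken from the original paper.
knowledge_prompt_template = """Generate some knowledge about the input.

Input: Penguins have wings, so they can fly.
Knowledge: Penguins are flightless birds; their wings are adapted for swimming rather than flight.

Input: Greece is larger than Mexico.
Knowledge: Greece is roughly 132,000 sq km, while Mexico is roughly 1,964,000 sq km, so Mexico is far larger than Greece.

Input: {question}
Knowledge:"""

print(knowledge_prompt_template.format(question="Do hamsters provide food for any animals?"))
```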

Hey everyone, how's it going? This is Dan here for PromptHub, and we have a super cool new prompt engineering method to go over today called Generated Knowledge Prompting. What's interesting is that although this paper came out a while ago—which in AI years is almost like two decades—it's actually from 2022. There have been a bunch of other remixes of this type of prompting, but I believe this is the first version of it.

Generally, Generated Knowledge Prompting is a prompt engineering method that uses a prompt to first tell the model to generate some useful knowledge, information related to the task or question at hand, and then there's a second prompt that basically pastes in that knowledge and then asks the actual question. So it’s a two-step process, though you could kind of push it into one, but they have it separated as two.

For example, if a customer asked a question about rebooking a flight from New York to London, the first prompt is sent just looking at current travel restrictions in the UK, and so on. Then you take that output, inject it as the knowledge, and have your user query here. The second prompt takes the output from the first, adds it in where it’s relevant to the question being asked, and then asks the question.

Again, this is what that flow looks like: there's a question, it generates some knowledge based on that question, plugs in that knowledge, integrates it, and brings the question as well, and then you get an answer. It’s very similar to something we talked about before, which is Analogical Prompting. As you can see, we have a template for this, so you can access it, but basically, Analogical Prompting says, 'Here's the problem, here are the instructions, first identify some core concepts, recall three problems, for each problem describe and explain the solution, then solve the initial problem.' These kinds of tutorial and relevant problem steps are the knowledge part.

This method has been remixed in other ways as well. The benefits are generally higher accuracy, that’s what this study found, and it has been replicated across the Analogical Prompting study and others. This additional context is a really lightweight way to give the model the information it needs to retrieve without having to set up any complex pipelines, making it really adaptive. You can go to very specific parts of the model’s knowledge by letting it generate this information first. It even seems like it can get to a layer deeper of understanding since it’s already done that cursory search of like blanket information, which I think has really interesting implications.

As I mentioned, there are a lot of different ways you could generate this knowledge. Here’s how they did it: basically, they would send some examples and say, 'Hey, here’s the input, and then here’s a knowledge-generated input,' and then the LLM would kind of fill in the blank here with the given knowledge.

We will still look at the results even though they're quite old. Basically, they tested the Generated Knowledge Prompting against a couple of different baselines: no knowledge; random sentences, which had basically random, not related sentences injected into the prompt; contextual sentences, which had proper context but weren't actual knowledge; template knowledge, which used templates to extract knowledge statements from the models; retrieval-based knowledge, which was getting stuff from sources like Google or Wikipedia; and then answers, which directly prompt the model to generate answers. It’s kind of similar to Analogical Prompting in that way, using a single answer to assess few-shot performance. It’s not super important if that doesn’t make complete sense, we’ll just take a look at the results here.

So, we’re obviously seeing a lot here, you know, ours being the Generated Knowledge Prompting, and we see a lot of bolded results across the board, which is typical for a paper. It outperformed the baseline by a fair amount, outperformed few-shot prompting on most, and I think interestingly it outperformed the retrieval-based prompting on a few of the sets but not all of them. It also shows that the model has a lot of information in there, and extracting it is something that you could turn to before looking to do retrieval-based stuff, just because that has an additional layer of complexity and an additional step where things can get messed up.

I think that's an interesting takeaway from this that is probably still true today, especially if you think the models are getting better, then you would think this difference would actually grow versus get smaller. And I think the last thing to note here is, obviously, on a couple of these datasets, specifically on the QASC dataset, the retrieval-based method outperformed the Generated Knowledge by a fair amount. Something to note here is that the retrieved knowledge for this source wasn’t just like Google or Wikipedia, like a generic source; it was kind of like a gold-standard knowledge base that was specifically designed to construct this dataset, so it was hyper-relevant. That’s the way to go, and having hyper-relevant retrieval-based systems is great.

I think if you want to try that out, that’s really worth testing. Having more loosely related retrieval-based pipelines, I don’t think is going to make a huge difference. I think that’s something that is probably still true today, and there’s an additional study at the end in terms of how much knowledge is needed. This graph looks very similar to the graphs showing how adding more examples in few-shot prompting increases performance, where there’s a big jump up, and then here we basically plateau from 1 to 50, and it’s similar for few-shot prompting, except it plateaus right away, around like maybe four to six.

So, I think it’s something to take note of as well. But yeah, we hope this helps, we have templates on PromptHub to try this out for free, so feel free to go check it out there, and yeah, happy prompting!

Experiment results

While the study is slightly outdated at this point (GPT-3 was the top OpenAI model at the time), there are still some insights to take away. Before we get to the charts and tables, let's set the scene.

The researchers tested Generated Knowledge Prompting against a few baselines:

  1. No Knowledge (Ø): The vanilla baseline
  2. Random Sentences (R): Involves sampling random sentences from the LLM without tailoring them to the specific question
  3. Context Sentences (C): Consists of sampling text continuations from the context of the question, which means generating sentences that logically follow the question's content
  4. Template-generated Knowledge (T): Uses manually-designed templates to extract knowledge statements from the models
  5. Retrieval-based Knowledge (IR): Retrieves knowledge from external sources like Wikipedia, Google, etc
  6. Answers (A): Directly prompts the model to generate answers, using a single answer to assess few-shot performance or 20 answers to prompt SOTA models

Alright, now with that out of the way, let’s look at some results.

A table of results comparing various prompting methods across a variety of datasets

Takeaways

  • Zero-shot settings: Solid improvements of 7%-10% across NumerSense, CSQA, and QASC
  • Generated knowledge > few-shot: Generated knowledge outperformed few-shot, improving performance by 14% to 20% across commonsense reasoning tasks.
  • Generated knowledge > retrieval-based: Generated knowledge outperformed retrieval-based knowledge from large sources (Wikipedia, Google), improving performance by up to ~9%, showing that tailored generated knowledge is more effective than loosely-related retrieved knowledge.
  • Generated knowledge < retrieval-based: On the QASC dataset, the retrieval-based method outperforms generated knowledge prompting. This is because the retrieved knowledge is sourced directly from a gold-standard knowledge base specifically designed to construct the dataset, making it highly relevant.

How much knowledge is needed?

Does quantity matter? The graph below looks very similar to the curve of performance gains relative to the number of examples included in a few-shot prompt (you can check it out in our Few Shot Prompting Guide).

Bar chart showing performance versus knowledge quantity

As shown above, performance generally increases with the quantity of knowledge statements included, but the majority of the gains come from the inclusion of any knowledge statements.
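If you want to experiment with knowledge quantity yourself, one simple approach is to sample several knowledge statements at a higher temperature and include them all as context. This concatenation is a simplification of how the paper integrates multiple statements; the model name and prompt wording below are illustrative.

```python
# Simplified sketch: sample several knowledge statements and include them all as
# context. Concatenating statements is a simplification of the paper's approach;
# the model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()
question = "What are the rebooking options if my flight from New York to London is canceled?"

statements = []
for _ in range(5):  # number of knowledge statements to generate
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,  # temperature > 0 encourages varied statements
        messages=[{
            "role": "user",
            "content": f"Generate one short knowledge statement relevant to: {question}",
        }],
    )
    statements.append(response.choices[0].message.content)

final_prompt = "Knowledge:\n" + "\n".join(statements) + f"\n\nUser Query: {question}"
print(final_prompt)
```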

Wrapping up

Generated Knowledge Prompting, while an older method, is highly effective. It is still very useful with today’s models, and has even been remixed into other methods like Analogical Prompting. Generated Knowledge Prompting is extremely adaptable and can help across a wide variety of tasks. Keep this one top of mind. It’s easy to implement, very flexible, and extremely effective.

Dan Cleary
Founder