Who's better at writing prompts, humans or LLMs? I've found that a blend of human and LLM input works best.
One thing I know for sure is that the way you instruct a model with a prompt is extremely important. Even single-word differences in a prompt can have an outsized effect.
That's why this research from the folks over at DeepMind stuck out to me: Large Language Models as Optimizers.
They developed a method called Optimization by PROmpting (OPRO) to help with the prompt optimization process. At its core, OPRO leverages LLMs to iteratively evaluate outputs and optimize prompts.
Why OPRO?
While prompt engineering is relatively new, there has been a flurry of research studies resulting in countless methods and techniques. These range from multi-person collaboration to the Tree of Thoughts method and many others. Each of these methods has its own set of advantages and disadvantages.
OPRO stands out among these methods by leveraging LLMs as prompt optimizers. OPRO iteratively optimizes prompts to continuously generate new solutions, refining its outputs based on the problem description and previously discovered solutions. (we’ll walk through the steps)
Its dynamic, feedback-driven process ensures that the optimization is not only accurate but also adaptive to the nuances of the specific task.
How OPRO works
At the heart of OPRO is a framework designed to integrate two LLMs (an optimizer and an evaluator) to iteratively improve the prompts generated from a meta-prompt.
Problem Setup:
Every optimization journey starts with a clear problem setup. This involves representing the task using datasets that have both training and test splits. The training set refines the optimization process, while the test set evaluates the efficacy of the optimized method.
(We have a full concrete example below.)
Meta-Prompt Creation:
Central to OPRO, the meta-prompt contains three pieces of information: previously generated prompts with their training accuracies, a few examples from the task, and a description of the optimization problem.
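To make that structure concrete, here's a minimal Python sketch of how a meta-prompt could be assembled. The function name and exact wording are ours for illustration, not the paper's, but the three components match the description above.

```python
# A minimal sketch of assembling an OPRO-style meta-prompt.
# The wording and ordering here are illustrative; the paper's meta-prompts
# follow the same three-part structure.

def build_meta_prompt(task_description, scored_prompts, examples):
    """scored_prompts: list of (prompt_text, training_accuracy) pairs."""
    # 1. Previously generated prompts with their training accuracies,
    #    sorted so the best-performing ones appear last.
    history = "\n\n".join(
        f"text:\n{p}\nscore:\n{acc}"
        for p, acc in sorted(scored_prompts, key=lambda x: x[1])
    )
    # 2. A few input/output examples from the training split.
    shots = "\n\n".join(f"input: {x}\noutput: {y}" for x, y in examples)
    # 3. The optimization problem description plus an instruction to
    #    propose a new, higher-scoring prompt.
    return (
        f"{task_description}\n\n"
        f"Here are previous prompts and their scores:\n{history}\n\n"
        f"Here are example problems:\n{shots}\n\n"
        "Write a new prompt that is different from the ones above "
        "and achieves a higher score."
    )
```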
Iterative Solution Generation:
OPRO’s strength lies in its iterative nature. The process is continuous. Every subsequent attempt at optimizing the prompt and output takes into account the previous attempts.
New prompts and outputs are added to the meta-prompt (see blue text in image above).
Harnessing the power of trajectory
Since this framework takes into account previous solutions and their scores, the LLM can excel at what it does best: identifying patterns and trends.
Feedback and refinement
After each optimization step, the generated solutions are evaluated, and this feedback is looped back into the meta-prompt.
Putting OPRO into practice
Before jumping into the experiments and the results of the study, let's look at a quick example.
Let’s assume we’re a venture capital firm. Our goal is to categorize startups we've invested in or communicated with, based on their stage (e.g., pre-seed, seed, Series A) or sector (e.g., healthcare, fintech).
Step 1: Define the Task
Classify startups into predefined categories based on the content of their emails.
Step 2: Data Collection and Preprocessing
Now we’ll gather a dataset of emails from the startups and label them with the correct classifications. This labeled data will be important for training and evaluation.
We don’t need anything complex; a simple Google Sheet will do. Column A will contain the email content, and Column B will contain the startup’s category.
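If you export that sheet as a CSV, loading and splitting it takes only a few lines of Python. This is just a sketch; the file name and header names ("email", "category") are assumptions about how you'd label the two columns.

```python
import csv
import random

# Load the labeled emails exported from the Google Sheet (e.g. startups.csv).
# Assumes a header row mapping Column A -> "email" and Column B -> "category".
with open("startups.csv", newline="", encoding="utf-8") as f:
    rows = [(r["email"], r["category"]) for r in csv.DictReader(f)]

# Shuffle and hold out a test split, mirroring OPRO's train/test setup:
# the training split drives optimization, the test split checks the result.
random.seed(0)
random.shuffle(rows)
split = int(0.8 * len(rows))
train_set, test_set = rows[:split], rows[split:]
```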
Step 3: Create the Initial Meta-Prompt
Now we’ll write an initial meta-prompt that describes the task at hand, along with a few examples from our labeled dataset.
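A simplified, hypothetical version might look something like this (the example emails are invented for illustration):

```
You are classifying startups based on the content of their emails.
The categories are: pre-seed, seed, Series A, healthcare, fintech.

Here are some examples:

Email: "We just closed our first angel check and are building an MVP for automated patient intake."
Category: pre-seed

Email: "Our payments API now processes over $2M a month and we're preparing to raise a Series A."
Category: fintech

Classify the following email into one of the categories above.
```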
Step 4: Iterative Optimization with OPRO
The meta-prompt is fed into the optimizer LLM, which interprets it to understand the classification task, reviewing the examples and the type of data it’s dealing with.
Next comes the optimization step: the optimizer LLM analyzes the meta-prompt and generates a set of new prompts that it believes could improve classification results.
The responses will be a list of prompts to test:
- "Analyze the primary product or service mentioned in the email. Determine its industry relevance and classify the startup based on its current stage and sector."
- "Consider the startup's product, user base, and partnerships. Classify it into a category that best represents its industry and growth phase."
- "Based on the email's content, identify the startup's core offering and its market traction. Classify it into the most fitting category."
- "Evaluate the startup's product, its target audience, and any mentioned achievements. Assign a category that best encapsulates its industry and stage."
- "Review the email for clues about the startup's main product, user engagement, and collaborations. Classify the startup accordingly."
After these prompts are tested, their results are appended to the meta-prompt.
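For illustration, the appended prompt/score pairs might look something like this (the accuracy numbers here are placeholders, not real results, roughly in the text/score format the paper uses):

```
text:
Consider the startup's product, user base, and partnerships. Classify it into a category that best represents its industry and growth phase.
score:
72

text:
Based on the email's content, identify the startup's core offering and its market traction. Classify it into the most fitting category.
score:
78
```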
From here, the iterative process continues. After evaluating new prompts, feedback is used to refine the meta-prompt, and the process is repeated until you’re satisfied with the prompt’s performance.
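If you wanted to wire up this loop yourself, a minimal Python sketch might look like the following. It reuses the build_meta_prompt helper sketched earlier; generate_prompts and classify are stand-ins for calls to whichever optimizer and scorer LLMs you're using (with the temperatures the paper recommends: 1.0 for the optimizer, 0 for the scorer).

```python
def accuracy(prompt, dataset, classify):
    """Fraction of emails the candidate prompt labels correctly, as a percentage.
    classify(prompt, email) is assumed to call the scorer LLM at temperature 0."""
    correct = sum(classify(prompt, email) == label for email, label in dataset)
    return round(100 * correct / len(dataset), 1)


def opro_loop(task_description, examples, train_set, classify, generate_prompts,
              steps=10, keep_top=20):
    """generate_prompts(meta_prompt) is assumed to call the optimizer LLM at
    temperature 1.0 and return a handful of new candidate prompts."""
    scored = []  # (prompt, training accuracy) pairs kept in the meta-prompt
    for _ in range(steps):
        meta_prompt = build_meta_prompt(task_description, scored, examples)
        for candidate in generate_prompts(meta_prompt):
            scored.append((candidate, accuracy(candidate, train_set, classify)))
        # Keep only the best-scoring prompts so the meta-prompt stays manageable.
        scored = sorted(scored, key=lambda x: x[1])[-keep_top:]
    return max(scored, key=lambda x: x[1])[0]  # best prompt found so far
```

In practice you’d stop once training accuracy plateaus, then sanity-check the winning prompt against the held-out test split from Step 2.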
Experiment setup
OPRO was put to the test across a diverse set of tasks, ranging from classic optimization problems like linear regression to more contemporary challenges in movie recommendations and various natural language processing tasks.
Setup
The researchers used a range of LLMs for both optimization and scoring:
- Optimizer LLMs:
  - Pre-trained PaLM 2-L (Anil et al., 2023)
  - Instruction-tuned PaLM 2-L (denoted as PaLM 2-L-IT)
  - Text-bison
  - GPT-3.5-turbo
  - GPT-4
- Scorer LLMs:
  - Pre-trained PaLM 2-L
  - Text-bison
The temperature of the scorer (evaluator) LLM was set to 0, since its task is deterministic: simply evaluating accuracies.
However, the temperature for the optimizer LLM was set to 1.0 to allow for more creative and diverse prompt generations.
Experiment Results
Efficient Prompt Optimization: Even when using only a fraction (3.5%) of the training data, OPRO outperformed many other prompting baselines.
Superior Performance with Zero-Shot Prompting: OPRO's top instructions matched chain-of-thought performance and exceeded zero-shot, human-crafted prompts by 8%.
Diverse Optimization Styles: The prompts that performed best for each model varied greatly in length. Check out the instructions from PaLM 2-L-IT and text-bison compared to GPT-4 (last in the list).
Sensitivity to Word Choice: Even slight variations in semantically similar instructions led to significant differences in accuracies. Word choice matters in prompt engineering!
Critical Role of Examples: Examples significantly impacted optimization, especially going from 0 to 3 examples; the gains diminished from 3 to 10. This emphasizes the importance of balancing how many examples you include in your prompts.
Balancing Exploration and Exploitation with Temperature: A temperature of 1.0 produced the best results. Lower temperatures led to a lack of exploration, resulting in stagnant optimization curves, while higher temperatures often ignored the trajectory of previous instructions.
Implications and looking forward
A couple of last points:
- Effects on Prompt Engineering: OPRO highlights the importance of iterative, feedback-driven approaches to prompt engineering. We strongly believe in this and have seen firsthand how a little iterative work goes a long way. PromptHub makes this easy.
- Training and Fine-Tuning: OPRO was extremely efficient with very limited training data.
- Versatility Across Domains: OPRO can be applied across domains and may reduce the need for domain-specific fine-tuning.
OPRO reinforces a belief that I already had: LLMs greatly speed up the prompt writing process. OPRO's systematic approach is promising and can be a huge point of leverage.