When working with teams on LLM integrations, we've found that they often jump directly into writing and iterating on prompts. While prompt engineering is obviously a key part of building and launching an LLM-based feature, jumping straight into it can make success much harder.
For example, how are you even going to know if the prompt is working “well” if you haven’t defined what success looks like?
This post is all about what to do before writing your prompt; consider it your pre-prompt checklist. It's broken down into two major sections: defining success criteria, and developing test cases and evaluations.
In this first portion, we will cover:
- What makes good success criteria: We'll dive into the qualities that make success criteria effective.
- Good and bad examples: We'll look at examples of both good and bad success criteria to show how well-defined goals can make all the difference.
- Example metrics and ways to measure success: We'll walk through success metrics and practical ways to measure them, making your criteria more concrete and applicable.
By the end of this post, you'll have a clear understanding of why defining success criteria is essential for LLM projects and how to create criteria that drive meaningful progress.
Why establishing success criteria is important
Success criteria act as the north star for your LLM project. They help you understand when you're on the right track and when you need to spend more time prompt engineering or tweaking other parts of your stack.
Whether you're building a chatbot to handle customer support, an LLM to summarize content, or any other application, having clear success criteria lets your team know if you're progressing effectively and what aspects need improvement.
The best part is, it doesn't have to be complex. Moving from no success criteria to even a basic set is a huge win.
What makes good success criteria
Everyone loves a good acronym, so here is a little framework for defining success criteria called SMART. This approach helps ensure your criteria are well-structured and actionable:
- Specific: The more specific your criteria, the better. Instead of saying "Improve chatbot responses," you could say "Increase the accuracy of chatbot responses to customer questions by 20%."
- Measurable: Aim for quantitative metrics when possible. For example, accuracy rate, response time, or user satisfaction scores.
- Achievable: Given your constraints, your goals need to be realistic based on the current capabilities of the model you're using. Use industry benchmarks, the current performance of the process you’re replacing, or insights from research papers to set attainable goals.
- Relevant: The success criteria should align with the use case and value that your application is generating.
- Time-bound: Deadlines are helpful when working on any type of project. Establishing a timeline for initially hitting success criteria helps align teams, keep projects moving forward, and indicate when the app or feature is ready to be shipped.
So remember, before you start any LLM project, get SMART! (And yes, there's probably a better joke here.)
Good and bad examples of success criteria
Continuing with the trend, below are examples of both good and bad success criteria, including one for PromptHub's prompt generator:
- PromptHub prompt generator
- Chatbot to help internal users make order requests to vendors
- Sentiment classification task
- LinkedIn post generator based on tech news
Common success criteria metrics and measurement methods
Here are a few more example metrics that apply across most tasks; a short code sketch after the list shows how a few of them can be computed.
- Task Fidelity: How well does the model perform on the core task?
- Consistency: Are similar inputs producing similar outputs?
- Relevance and Coherence: Does the model directly address user queries in a logical and coherent way?
- Tone and Style: Does the output align with the expected tone and style?
- Privacy Preservation: How effectively does the model handle sensitive information (PII)?
- Context Utilization: How well does the model use context from earlier in the conversation?
- Latency: Is the response time acceptable given your use case?
- Price: Is your current model and prompt set up affordable as it scales?
- Accuracy: How often are the model's predictions or outputs correct? For classification tasks, use metrics like the F1 score.
- User Satisfaction: What do users actually think about the quality of the model's output? Nothing replaces talking to users.
- Error Rate: How frequently does the model make mistakes?
- Scalability: Can the model and your system maintain performance under increased load? Track metrics like throughput to ensure scalability.
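To make a few of these concrete, here is a minimal Python sketch that computes accuracy, F1, error rate, and average latency from a handful of logged results. The log format is a made-up example; adapt it to however you actually store outputs.

```python
# Minimal sketch: computing a few of the metrics above from logged results.
# The (golden label, prediction, latency) log format is a hypothetical example.
from statistics import mean
from sklearn.metrics import accuracy_score, f1_score

logged_results = [
    # (golden label, model prediction, response latency in ms)
    ("positive", "positive", 820),
    ("negative", "positive", 940),
    ("negative", "negative", 610),
    ("positive", "positive", 1150),
]

labels = [r[0] for r in logged_results]
predictions = [r[1] for r in logged_results]
latencies = [r[2] for r in logged_results]

print("Accuracy:", accuracy_score(labels, predictions))
print("F1 (positive class):", f1_score(labels, predictions, pos_label="positive"))
print("Error rate:", 1 - accuracy_score(labels, predictions))
print("Mean latency (ms):", mean(latencies))
```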
Six common mistakes when defining success criteria
Here are the six most frequent mistakes we see when teams are defining their success criteria.
- Setting vague goals: Goals like "improve performance" or "make responses better" are too ambiguous.
- Ignoring edge cases: Weird things will happen in production. It's important to consider edge cases where inputs differ greatly from typical ones, and your success criteria should account for them.
- Misalignment with user value: Avoid focusing only on technical metrics without considering their impact on user value.
- Lack of measurable metrics: Goals that cannot be measured can't be improved.
- Not setting a timeline: Set deadlines to keep the project on track and provide clear milestones for evaluation.
- Failing to iterate: Whether it be because of lack of tooling or process, failing to iterate on your prompts to reach the threshold of your success criteria will obviously lead to failure. It’s critical to have the proper testing and infrastructure in place from the beginning of a project (we can help with this).
Practical steps to validate success criteria
Defining your success criteria is one thing; actively monitoring them is another.
Here are a few ways you can actively validate and monitor your success criteria.
- Talk to your users! Talking to your users is the most effective way to determine if what you've built is actually working. Surveys, screen recordings, and dashboards only get you so far.
- Run A/B tests: Compare different versions of prompts to see which performs better according to your criteria. PromptHub's Git-like versioning and monitoring features make it easy for developers to serve different prompt variants to users based on branches or hash values (a simple bucketing sketch follows this list).
- Collect user feedback: Enable passive feedback collection in the form of surveys or ratings to understand user experiences.
- Monitor metrics in real-time: PromptHub makes it easy to monitor metrics in real-time for all your prompts. These metrics can help supplement qualitative data from users.
- Perform edge case testing: Test your prompts at scale across a wide range of cases to validate performance. Batch testing your prompts with predefined datasets is an easy way to do this in PromptHub.
- Analyze errors and failures: Regularly dig into LLM requests and outputs to understand production performance. Flagging incorrect or poor outputs can help build more robust rules. In PromptHub, you can track requests, associate them with specific users, and monitor performance under the logs tab.
- Iterate based on results: Shipping LLM features isn’t a one-and-done scenario; it’s iterative. Use data from A/B tests, user feedback, and request logs to continuously improve your prompts and refine success criteria.
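For the A/B testing step, here is a minimal, generic sketch of hash-based bucketing so the same user always sees the same prompt variant. This is an illustration of the pattern, not PromptHub's API, and the prompt texts are placeholders.

```python
# Minimal sketch of hash-based A/B bucketing for prompt variants.
# Generic illustration only; the prompt texts below are placeholders.
import hashlib

PROMPT_VARIANTS = {
    "A": "Summarize the following support ticket in two sentences:\n{ticket}",
    "B": "You are a support analyst. Briefly summarize this ticket:\n{ticket}",
}

def assign_variant(user_id: str) -> str:
    """Deterministically assign a user to a prompt variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"

variant = assign_variant("user-123")
prompt = PROMPT_VARIANTS[variant].format(ticket="My order arrived damaged.")
# Log the variant alongside the output so results can be compared against your success criteria.
```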
Step 2: Evaluations and test cases
Now that we've defined what success looks like, the next step is figuring out how to measure whether your LLM-based feature or application is meeting those goals. This is where developing test cases and evaluations (or, as the cool kids say, "evals") comes into play.
Creating and running evals makes it much easier to track your model's performance against the success criteria you've set. Importantly, it's not just about testing with standard inputs; you need to stress test your prompts with data that is representative of the real world.
In this section, we’ll walk through:
- How to choose the right eval for your use case: Different tasks require different types of evaluations
- Common eval grading methods: From code-based exact matches to more complex LLM-based grading, you’ll learn how to score your evals efficiently.
- Developing test cases to ensure your prompts are well-tested: We’ll explore how to cover edge cases and tricky scenarios, making sure your prompts perform reliably across a wide range of inputs.
By the end of this article, you’ll be equipped to build and run test cases that not only measure the performance of your LLM but also help you continuously improve its accuracy, relevance, and robustness.
What are evals and why are they important?
Evaluations, also known as evals, provide a structured way to measure whether the model is meeting the success criteria you defined and help identify areas of improvement for your prompt.
Evals are an ongoing process, not a one-time test. Models change without us knowing, new models get released all the time, and apps are constantly changing, so you need to be running your evals continuously to ensure you’re still delivering a great experience for your users.
The four key parts of an eval
A well-structured eval consists of four components (a minimal data structure sketch follows the list):
- Prompt: The prompt sent to the model
- Output: The response generated by the LLM
- Golden Answer: The ideal or expected response that the output is compared against.
- Score: A quantitative measure of how well the output aligns with the golden answer
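The four parts map naturally onto a small record. Here is a minimal Python sketch; the field names are illustrative, not a required schema.

```python
# A minimal sketch of how the four parts of an eval can be represented in code.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    prompt: str           # the prompt sent to the model
    output: str           # the response generated by the LLM
    golden_answer: str    # the ideal or expected response
    score: float          # how well the output matches the golden answer

record = EvalRecord(
    prompt="Classify the sentiment of: 'The movie was fantastic.'",
    output="positive",
    golden_answer="positive",
    score=1.0,
)
```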
Types of eval grading methods
In general, there are three primary methods for grading evaluations.
1. Code-Based Grading
Code-based grading is the fastest and most reliable method for tasks with clear, rule-based outputs. It involves checking for exact matches or specific key phrases in the model's output (see the sketch below).
- Best for: Simple, rule-based tasks (e.g., sentiment analysis, classifications)
- Example: For a task like identifying the sentiment of movie reviews, you can easily check if the output matches the correct answer.
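Here is a minimal sketch of what code-based grading can look like for the sentiment example; the test cases are made up.

```python
# Minimal sketch of code-based grading: exact-match scoring for a sentiment classifier.
def grade_exact_match(output: str, golden_answer: str) -> int:
    """Return 1 if the model's output matches the golden answer, else 0."""
    return int(output.strip().lower() == golden_answer.strip().lower())

test_cases = [
    ("positive", "positive"),
    ("negative", "positive"),  # a miss
]
scores = [grade_exact_match(output, golden) for output, golden in test_cases]
print(f"Accuracy: {sum(scores) / len(scores):.2f}")
```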
2. Human Grading
Human grading is the most flexible and allows for the most learning, but it is the slowest option.
- Best for: Complex or subjective tasks (e.g., content generation, customer service tone).
- Example: If you ask the model to create a workout plan, a human grader would assess whether the plan includes the required exercises, reps, and structure.
3. LLM-Based Grading
LLM-based grading bridges the gap between automated and human grading. There is some debate about how much you should rely on LLMs to grade their own outputs, with one of the variables to manage being the prompt given to the evaluator LLM. In general, we think this option is great to have in your setup. For more on using LLMs as evaluators, check out our post: Can You Use LLMs as Evaluators? An LLM Evaluation Framework. A rough code sketch follows the bullets below.
- Best for: Tasks requiring complex evaluations (e.g., tone, relevance, coherence).
- Example: A model could be asked to evaluate a response to a customer inquiry for empathy, rating it on a scale of 1-5 using a prompt like: "On a scale of 1-5, rate the empathy of this response: {{response}}."
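Here is a rough sketch of that pattern using the OpenAI SDK; the model name and grading prompt are illustrative, and any evaluator model works the same way.

```python
# Rough sketch of LLM-based grading with the OpenAI SDK.
# The model name and grading prompt are illustrative; swap in whatever evaluator you use.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_empathy(response_text: str) -> str:
    grading_prompt = (
        "On a scale of 1-5, rate the empathy of this response. "
        f"Reply with only the number.\n\nResponse: {response_text}"
    )
    result = client.chat.completions.create(
        model="gpt-4o",  # example evaluator model
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return result.choices[0].message.content.strip()

print(grade_empathy("I'm so sorry your order arrived late. Let's fix this right away."))
```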
Choosing the right method
To put it briefly:
- Code-based grading works well for tasks that have known answers
- Human grading should be used when tasks involve creativity, subjectivity, or require some level of nuanced understanding.
- LLM-based grading pairs nicely with human grading and offers a scalable way to evaluate tasks that fall between rigid rules and subjective judgments.
Eval design principles
If you're just getting started with evaluations, here are a few principles to keep in mind:
- Tailor to your task: Design evaluations that closely reflect your specific use cases, including typical scenarios and edge cases.
- Automate when possible: If possible, try to leverage automated grading in some way (PromptHub can help with this).
- Favor quantity over perfection: Use a larger number of simpler evals rather than a few complex ones, so you cover a wider range of behavior.
Example evaluations
In this section, we’ll walk through various evaluations, each paired with a specific use case and grading method.
Tone and coherence for a legal assistant – LLM-Based coherence evaluation
What it measures: This evaluates how well the model maintains a coherent and logical tone as a legal assistant drafting documents.
Context utilization for a support bot – LLM-Based Likert scale
What it measures: Evaluates how well the chatbot maintains context from earlier interactions and uses it to inform responses, rated on a 1-5 scale for context depth.
Consistency for a customer query bot – ROUGE-L evaluation
What it measures: Assesses how consistent and similar the responses are when paraphrased or repeated queries are asked, focusing on capturing the core content using ROUGE-L scoring.
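A minimal sketch of this check using the rouge-score package might look like the following; the example responses are made up, and in practice you would compare outputs generated for paraphrased versions of the same query.

```python
# Minimal sketch of a consistency check using ROUGE-L (via the rouge-score package).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

response_original = "You can return items within 30 days with a receipt."
response_to_paraphrased_query = "Items can be returned within 30 days if you have a receipt."

scores = scorer.score(response_original, response_to_paraphrased_query)
print("ROUGE-L F1:", round(scores["rougeL"].fmeasure, 3))  # closer to 1.0 = more consistent
```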
Task accuracy for a survey classifier – LLM-Based binary classification
What it measures: Evaluates the accuracy of a survey classifier model, ensuring it categorizes responses correctly as binary (e.g., positive/negative) outcomes.
Relevance and fidelity for a news summarizer – Cosine similarity evaluation
What it measures: Assesses how relevant and accurate a news summarizer is by comparing generated summaries to original articles using cosine similarity.
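Here is a minimal sketch of this comparison using TF-IDF vectors and scikit-learn's cosine similarity. Embedding models generally capture semantics better, but TF-IDF keeps the example dependency-light; the article and summary text are made up.

```python
# Minimal sketch of a relevance check using TF-IDF vectors and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = "The city council approved a new budget that increases funding for public transit."
summary = "The council's new budget boosts public transit funding."

vectors = TfidfVectorizer().fit_transform([article, summary])
similarity = cosine_similarity(vectors[0], vectors[1])[0][0]
print("Cosine similarity:", round(similarity, 3))  # higher = summary closer to the article
```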
Now that we've looked at several ways to evaluate outputs, let's generate some test cases to run these evaluations against.
Generating test cases
Evaluations are only as good as the test cases you evaluate against. Well-constructed test cases should cover a wide range of inputs and ensure the model can handle both standard and edge-case scenarios.
Start with common scenarios
- Begin by creating test cases that reflect the most common or typical inputs your model will encounter. These represent the "happy path" scenarios.
- Example: For a customer service chatbot, you might include queries like "What is your return policy?" or "How can I track my order?"
Include edge cases
- Test cases should also include unusual, ambiguous, or extreme inputs; a small example set follows this list.
- Examples of edge cases:
- Ambiguous inputs: "What is the return policy for items bought two months ago if I lost my receipt?"
- Irrelevant or nonexistent data: Input queries with irrelevant details or missing information to see how the model responds.
- Contradictory information: "I want to return this product, but I also want to keep it."
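Put together, a small test set mixing happy-path and edge-case inputs might look something like this sketch for a support chatbot; the inputs and expected behaviors are illustrative.

```python
# A small illustrative test set mixing happy-path and edge-case inputs for a support chatbot.
# The "expects" field is a loose description a human or LLM grader could check against.
test_cases = [
    # happy path
    {"input": "What is your return policy?", "expects": "states the return policy accurately"},
    {"input": "How can I track my order?", "expects": "explains order tracking steps"},
    # edge cases
    {"input": "What is the return policy for items bought two months ago if I lost my receipt?",
     "expects": "handles the ambiguity without inventing a policy"},
    {"input": "I want to return this product, but I also want to keep it.",
     "expects": "asks a clarifying question instead of guessing"},
]
```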
Leverage LLMs for test case generation
- Use LLMs to generate a broad range of test cases quickly. Start with a few manually crafted examples and let the model generate variations (a rough code sketch follows this list).
- Example: Provide a few baseline questions for an FAQ bot and have the LLM create paraphrased or alternative versions.
- Prompt Example: "Generate variations of this question: 'How do I return an item I purchased?'"
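Here is a rough sketch of that idea using the OpenAI SDK; the model name and prompt wording are illustrative, and any provider works the same way.

```python
# Rough sketch of using an LLM to expand a seed question into test case variations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

seed_question = "How do I return an item I purchased?"
result = client.chat.completions.create(
    model="gpt-4o",  # example model
    messages=[{
        "role": "user",
        "content": f"Generate 5 varied paraphrases of this question, one per line: '{seed_question}'",
    }],
)
variations = result.choices[0].message.content.strip().split("\n")
print(variations)
```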
Using a PromptHub form we built, you can generate test cases that cover happy paths and edge cases. Try it below or open it up in a new tab via this link.
Test real-world user behavior
- Gather real-world data from your users and create test cases that mirror actual queries.
Iteratively expand your test suite
- As you identify cases where your prompt fails, edit the prompt and add new test cases that cover those failures.
- Example: If the model struggles with sarcastic inputs in sentiment analysis, include more cases that contain sarcasm: "Oh, great. Another broken product."
Wrapping up
Before you write a single prompt, you should define what success looks like and develop at least a few test cases and evaluation methods. These will pay off down the road, letting you benchmark performance and track how your prompt engineering efforts translate into real improvements.
If you only remember three things from this post, remember these:
- Evaluations are not a one-time task; they are part of a continuous improvement cycle.
- Automation can improve the speed and scalability of your evals (PromptHub can help with this).
- Generate and test across diverse test cases.
With your success criteria, evals, and test cases in place, you can move on to the fun part: prompt engineering. To jump-start the process, try out our automatic prompt generator and learn some of the basic principles in our other post: 10 Best Practices for Prompt Engineering with Any Model.