When working with teams on LLM integrations, we've found that they often jump directly into writing and iterating on prompts. While prompt engineering is obviously a key part of building and launching an LLM-based feature, jumping straight into it can make success much harder.
For example, how are you even going to know if the prompt is working “well” if you haven’t defined what success looks like?
This post is all about what to do before writing your prompt; consider it your pre-prompt checklist. It's broken down into two major sections: defining success criteria, and developing test cases and evaluations.
In this first portion, we will cover:
- What makes good success criteria: We'll dive into the qualities that make success criteria effective.
- Good and bad examples: We'll look at examples of both good and bad success criteria to show how well-defined goals can make all the difference.
- Example metrics and ways to measure success: We'll walk through success metrics and practical ways to measure them, making your criteria more concrete and applicable.
By the end of this post, you'll have a clear understanding of why defining success criteria is essential for LLM projects and how to create criteria that drive meaningful progress.
Why establishing success criteria is important
Success criteria act as the north star for your LLM project. They help you understand when you're on the right track and when you need to spend more time prompt engineering or tweaking other parts of your stack.
Whether you're building a chatbot to handle customer support, an LLM to summarize content, or any other application, having clear success criteria lets your team know if you're progressing effectively and what aspects need improvement.
The best part is, it doesn't have to be complex. Moving from no success criteria to even a basic set is a huge win.
What makes good success criteria
Everyone loves a good acronym, so here is a little framework for defining success criteria called SMART. This approach helps ensure your criteria are well-structured and actionable:
- Specific: The more specific your criteria, the better. Instead of saying "Improve chatbot responses," you could say "Increase the accuracy of chatbot responses to customer questions by 20%."
- Measurable: Aim for quantitative metrics when possible. For example, accuracy rate, response time, or user satisfaction scores.
- Achievable: Given your constraints, your goals need to be realistic based on the current capabilities of the model you're using. Use industry benchmarks, the current performance of the process you’re replacing, or insights from research papers to set attainable goals.
- Relevant: The success criteria should align with the use case and value that your application is generating.
- Time-bound: Deadlines are helpful when working on any type of project. Establishing a timeline for initially hitting success criteria helps align teams, keep projects moving forward, and indicate when the app or feature is ready to be shipped.
So remember, before you start any LLM project, get SMART! (And yes, there's probably a better joke here.)
Good and bad examples of success criteria
Continuing with the trend, below are examples of both good and bad success criteria, including one for PromptHub's prompt generator:
- PromptHub prompt generator
- Chatbot to help internal users make order requests to vendors
- Sentiment classification task
- LinkedIn post generator based on tech news
Common success criteria metrics and measurement methods
Here are a few more example metrics that apply across most tasks; a short code sketch after the list shows how a few of them can be computed.
- Task Fidelity: How well does the model perform on the core task?
- Consistency: Are similar inputs producing similar outputs?
- Relevance and Coherence: Does the model directly address user queries in a logical and coherent way?
- Tone and Style: Does the output align with the expected tone and style?
- Privacy Preservation: How effectively does the model handle sensitive information (PII)?
- Context Utilization: How well does the model use context from earlier in the conversation?
- Latency: Is the response time acceptable given your use case?
- Price: Is your current model and prompt set up affordable as it scales?
- Accuracy: How often are the model's predictions or outputs correct? For classification tasks, use metrics like the F1 score.
- User Satisfaction: What do users actually think about the quality of the model's output? Nothing replaces talking to users.
- Error Rate: How frequently does the model make mistakes?
- Scalability: Can the model and your system maintain performance under increased load? Track metrics like throughput to ensure scalability.
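To make a few of these concrete, here is a minimal Python sketch that computes accuracy, F1, error rate, and average latency from a handful of logged results. The log format is a made-up example; adapt it to however you actually store outputs.

```python
# Minimal sketch: computing a few of the metrics above from logged results.
# The (golden label, prediction, latency) log format is a hypothetical example.
from statistics import mean
from sklearn.metrics import accuracy_score, f1_score

logged_results = [
    # (golden label, model prediction, response latency in ms)
    ("positive", "positive", 820),
    ("negative", "positive", 940),
    ("negative", "negative", 610),
    ("positive", "positive", 1150),
]

labels = [r[0] for r in logged_results]
predictions = [r[1] for r in logged_results]
latencies = [r[2] for r in logged_results]

print("Accuracy:", accuracy_score(labels, predictions))
print("F1 (positive class):", f1_score(labels, predictions, pos_label="positive"))
print("Error rate:", 1 - accuracy_score(labels, predictions))
print("Mean latency (ms):", mean(latencies))
```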
Six common mistakes when defining success criteria
Here are the six most frequent mistakes we see when teams are defining their success criteria.
- Setting vague goals: Goals like "improve performance" or "make responses better" are too ambiguous.
- Ignoring edge cases: Weird things will happen in production. It's important to consider edge cases where inputs differ greatly from typical ones, and your success criteria should account for them.
- Misalignment with user value: Avoid focusing only on technical metrics without considering their impact on user value.
- Lack of measurable metrics: Goals that cannot be measured can't be improved.
- Not setting a timeline: Set deadlines to keep the project on track and provide clear milestones for evaluation.
- Failing to iterate: Whether it be because of lack of tooling or process, failing to iterate on your prompts to reach the threshold of your success criteria will obviously lead to failure. It’s critical to have the proper testing and infrastructure in place from the beginning of a project (we can help with this).
Practical steps to validate success criteria
Defining your success criteria is one thing; actively monitoring them is another.
Here are a few ways you can actively validate and monitor your success criteria.
- Talk to your users! Talking to your users is the most effective way to determine if what you've built is actually working. Surveys, screen recordings, and dashboards only get you so far.
- Run A/B tests: Compare different versions of prompts to see which performs better according to your criteria. PromptHub's Git-like versioning and monitoring features make it easy for developers to serve different prompt variants to users based on branches or hash values (a simple bucketing sketch follows this list).
- Collect user feedback: Enable passive feedback collection in the form of surveys or ratings to understand user experiences.
- Monitor metrics in real-time: PromptHub makes it easy to monitor metrics in real-time for all your prompts. These metrics can help supplement qualitative data from users.
- Perform edge case testing: Test your prompts at scale across a wide range of cases to validate performance. Batch testing your prompts with predefined datasets is an easy way to do this in PromptHub.
- Analyze errors and failures: Regularly dig into LLM requests and outputs to understand production performance. Flagging incorrect or poor outputs can help build more robust rules. In PromptHub, you can track requests, associate them with specific users, and monitor performance under the logs tab.
- Iterate based on results: Shipping LLM features isn’t a one-and-done scenario; it’s iterative. Use data from A/B tests, user feedback, and request logs to continuously improve your prompts and refine success criteria.
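For the A/B testing step, here is a minimal, generic sketch of hash-based bucketing so the same user always sees the same prompt variant. This is an illustration of the pattern, not PromptHub's API, and the prompt texts are placeholders.

```python
# Minimal sketch of hash-based A/B bucketing for prompt variants.
# Generic illustration only; the prompt texts below are placeholders.
import hashlib

PROMPT_VARIANTS = {
    "A": "Summarize the following support ticket in two sentences:\n{ticket}",
    "B": "You are a support analyst. Briefly summarize this ticket:\n{ticket}",
}

def assign_variant(user_id: str) -> str:
    """Deterministically assign a user to a prompt variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"

variant = assign_variant("user-123")
prompt = PROMPT_VARIANTS[variant].format(ticket="My order arrived damaged.")
# Log the variant alongside the output so results can be compared against your success criteria.
```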
Step 2: Evaluations and test cases
Now that we've defined what success looks like, the next step is figuring out how to measure whether your LLM-based feature or application is meeting those goals. This is where developing test cases and evaluations (or, as the cool kids say, "evals") comes into play.
Creating and running evals makes it much easier to track your model's performance against the success criteria you've set. Importantly, it's not just about testing with standard inputs; you need to stress test your prompts with data that is representative of the real world.
In this section, we’ll walk through:
- How to choose the right eval for your use case: Different tasks require different types of evaluations
- Common eval grading methods: From code-based exact matches to more complex LLM-based grading, you’ll learn how to score your evals efficiently.
- Developing test cases to ensure your prompts are well-tested: We’ll explore how to cover edge cases and tricky scenarios, making sure your prompts perform reliably across a wide range of inputs.
By the end of this article, you’ll be equipped to build and run test cases that not only measure the performance of your LLM but also help you continuously improve its accuracy, relevance, and robustness.
What are evals and why are they important?
Evaluations, also known as evals, provide a structured way to measure whether the model is meeting the success criteria you defined and help identify areas of improvement for your prompt.
Evals are an ongoing process, not a one-time test. Models change without us knowing, new models get released all the time, and apps are constantly changing, so you need to be running your evals continuously to ensure you’re still delivering a great experience for your users.
The four key parts of an eval
A well-structured eval consists of four components (a minimal data structure sketch follows the list):
- Prompt: The prompt sent to the model
- Output: The response generated by the LLM
- Golden Answer: The ideal or expected response that the output is compared against.
- Score: A quantitative measure of how well the output aligns with the golden answer
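The four parts map naturally onto a small record. Here is a minimal Python sketch; the field names are illustrative, not a required schema.

```python
# A minimal sketch of how the four parts of an eval can be represented in code.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    prompt: str           # the prompt sent to the model
    output: str           # the response generated by the LLM
    golden_answer: str    # the ideal or expected response
    score: float          # how well the output matches the golden answer

record = EvalRecord(
    prompt="Classify the sentiment of: 'The movie was fantastic.'",
    output="positive",
    golden_answer="positive",
    score=1.0,
)
```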
Types of eval grading methods
In general, there are three primary methods for grading evaluations.
1. Code-Based Grading
Code-based grading is the fastest and most reliable method for tasks with clear, rule-based outputs. It involves checking for exact matches or specific key phrases in the model's output (see the sketch below).
- Best for: Simple, rule-based tasks (e.g., sentiment analysis, classifications)
- Example: For a task like identifying the sentiment of movie reviews, you can easily check if the output matches the correct answer.
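Here is a minimal sketch of what code-based grading can look like for the sentiment example; the test cases are made up.

```python
# Minimal sketch of code-based grading: exact-match scoring for a sentiment classifier.
def grade_exact_match(output: str, golden_answer: str) -> int:
    """Return 1 if the model's output matches the golden answer, else 0."""
    return int(output.strip().lower() == golden_answer.strip().lower())

test_cases = [
    ("positive", "positive"),
    ("negative", "positive"),  # a miss
]
scores = [grade_exact_match(output, golden) for output, golden in test_cases]
print(f"Accuracy: {sum(scores) / len(scores):.2f}")
```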
2. Human Grading
Human grading is the most flexible and allows for the most learning, but it is the slowest option.
- Best for: Complex or subjective tasks (e.g., content generation, customer service tone).
- Example: If you ask the model to create a workout plan, a human grader would assess whether the plan includes the required exercises, reps, and structure.
3. LLM-Based Grading
LLM-based grading bridges the gap between automated and human grading. There is some debate about how much you should rely on LLMs to grade their own outputs, with one of the variables to manage being the prompt given to the evaluator LLM. In general, we think this option is great to have in your setup. For more on using LLMs as evaluators, check out our post: Can You Use LLMs as Evaluators? An LLM Evaluation Framework. A rough code sketch follows the bullets below.
- Best for: Tasks requiring complex evaluations (e.g., tone, relevance, coherence).
- Example: A model could be asked to evaluate a response to a customer inquiry for empathy, rating it on a scale of 1-5 using a prompt like: "On a scale of 1-5, rate the empathy of this response: {{response}}."
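Here is a rough sketch of that pattern using the OpenAI SDK; the model name and grading prompt are illustrative, and any evaluator model works the same way.

```python
# Rough sketch of LLM-based grading with the OpenAI SDK.
# The model name and grading prompt are illustrative; swap in whatever evaluator you use.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_empathy(response_text: str) -> str:
    grading_prompt = (
        "On a scale of 1-5, rate the empathy of this response. "
        f"Reply with only the number.\n\nResponse: {response_text}"
    )
    result = client.chat.completions.create(
        model="gpt-4o",  # example evaluator model
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return result.choices[0].message.content.strip()

print(grade_empathy("I'm so sorry your order arrived late. Let's fix this right away."))
```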
Choosing the right method
To put it briefly:
- Code-based grading works well for tasks that have known answers
- Human grading should be used when tasks involve creativity, subjectivity, or require some level of nuanced understanding.
- LLM-based grading pairs nicely with human grading and offers a scalable way to evaluate tasks that fall between rigid rules and subjective judgments.
Eval design principles
If you're just getting started with evaluations, here are a few principles to keep in mind:
- Tailor to your task: Design evaluations that closely reflect your specific use cases, including typical scenarios and edge cases.
- Automate when possible: If possible, try to leverage automated grading in some way (PromptHub can help with this).
- Favor quantity over perfection: Use a larger number of simpler evals rather than a few complex ones, so you cover a wider range of behavior.
Example evaluations
In this section, we’ll walk through various evaluations, each paired with a specific use case and grading method.
Tone and coherence for a legal assistant – LLM-Based coherence evaluation
What it measures: This evaluates how well the model maintains a coherent and logical tone as a legal assistant drafting documents.
Context utilization for a support bot – LLM-Based Likert scale
What it measures: Evaluates how well the chatbot maintains context from earlier interactions and uses it to inform responses, rated on a 1-5 scale for context depth.
Consistency for a customer query bot – ROUGE-L evaluation
What it measures: Assesses how consistent and similar the responses are when paraphrased or repeated queries are asked, focusing on capturing the core content using ROUGE-L scoring.
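A minimal sketch of this check using the rouge-score package might look like the following; the example responses are made up, and in practice you would compare outputs generated for paraphrased versions of the same query.

```python
# Minimal sketch of a consistency check using ROUGE-L (via the rouge-score package).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

response_original = "You can return items within 30 days with a receipt."
response_to_paraphrased_query = "Items can be returned within 30 days if you have a receipt."

scores = scorer.score(response_original, response_to_paraphrased_query)
print("ROUGE-L F1:", round(scores["rougeL"].fmeasure, 3))  # closer to 1.0 = more consistent
```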
Task accuracy for a survey classifier – LLM-Based binary classification
What it measures: Evaluates the accuracy of a survey classifier model, ensuring it categorizes responses correctly as binary (e.g., positive/negative) outcomes.
Relevance and fidelity for a news summarizer – Cosine similarity evaluation
What it measures: Assesses how relevant and accurate a news summarizer is by comparing generated summaries to original articles using cosine similarity.
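Here is a minimal sketch of this comparison using TF-IDF vectors and scikit-learn's cosine similarity. Embedding models generally capture semantics better, but TF-IDF keeps the example dependency-light; the article and summary text are made up.

```python
# Minimal sketch of a relevance check using TF-IDF vectors and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = "The city council approved a new budget that increases funding for public transit."
summary = "The council's new budget boosts public transit funding."

vectors = TfidfVectorizer().fit_transform([article, summary])
similarity = cosine_similarity(vectors[0], vectors[1])[0][0]
print("Cosine similarity:", round(similarity, 3))  # higher = summary closer to the article
```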
Now that we've looked at several ways to evaluate outputs, let's generate some test cases to run these evaluations against.
Generating test cases
Evaluations are only as good as the test cases you evaluate against. Well-constructed test cases should cover a wide range of inputs and ensure the model can handle both standard and edge-case scenarios.
Start with common scenarios
- Begin by creating test cases that reflect the most common or typical inputs your model will encounter. These represent the "happy path" scenarios.
- Example: For a customer service chatbot, you might include queries like "What is your return policy?" or "How can I track my order?"
Include edge cases
- Test cases should also include unusual, ambiguous, or extreme inputs; a small example set follows this list.
- Examples of edge cases:
- Ambiguous inputs: "What is the return policy for items bought two months ago if I lost my receipt?"
- Irrelevant or nonexistent data: Input queries with irrelevant details or missing information to see how the model responds.
- Contradictory information: "I want to return this product, but I also want to keep it."
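Put together, a small test set mixing happy-path and edge-case inputs might look something like this sketch for a support chatbot; the inputs and expected behaviors are illustrative.

```python
# A small illustrative test set mixing happy-path and edge-case inputs for a support chatbot.
# The "expects" field is a loose description a human or LLM grader could check against.
test_cases = [
    # happy path
    {"input": "What is your return policy?", "expects": "states the return policy accurately"},
    {"input": "How can I track my order?", "expects": "explains order tracking steps"},
    # edge cases
    {"input": "What is the return policy for items bought two months ago if I lost my receipt?",
     "expects": "handles the ambiguity without inventing a policy"},
    {"input": "I want to return this product, but I also want to keep it.",
     "expects": "asks a clarifying question instead of guessing"},
]
```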
Leverage LLMs for test case generation
- Use LLMs to generate a broad range of test cases quickly. Start with a few manually crafted examples and let the model generate variations (a rough code sketch follows this list).
- Example: Provide a few baseline questions for an FAQ bot and have the LLM create paraphrased or alternative versions.
- Prompt Example: "Generate variations of this question: 'How do I return an item I purchased?'"
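Here is a rough sketch of that idea using the OpenAI SDK; the model name and prompt wording are illustrative, and any provider works the same way.

```python
# Rough sketch of using an LLM to expand a seed question into test case variations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

seed_question = "How do I return an item I purchased?"
result = client.chat.completions.create(
    model="gpt-4o",  # example model
    messages=[{
        "role": "user",
        "content": f"Generate 5 varied paraphrases of this question, one per line: '{seed_question}'",
    }],
)
variations = result.choices[0].message.content.strip().split("\n")
print(variations)
```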
Using a PromptHub form we built, you can generate test cases that cover happy paths and edge cases. Try it below or open it up in a new tab via this link.
Test real-world user behavior
- Gather real-world data from your users and create test cases that mirror actual queries.
Iteratively expand your test suite
- As you identify cases where your prompt fails, edit the prompt and add new test cases that cover those failures.
- Example: If the model struggles with sarcastic inputs in sentiment analysis, include more cases that contain sarcasm: "Oh, great. Another broken product."
Wrapping up
Before you write a single prompt, you should define what success looks like and develop at least a few test cases and evaluation methods. These will pay off down the road, letting you benchmark performance and track how your prompt engineering efforts translate into real improvements.
If you only remember three things from this post, remember these:
- Evaluations are not a one-time task; they are part of a continuous improvement cycle.
- Automation can improve the speed and scalability of your evals (PromptHub can help with this).
- Generate and test across diverse test cases.
With your success criteria, evals, and test cases in place, you can move on to the fun part: prompt engineering. To jump-start the process, try out our automatic prompt generator and learn some of the basic principles in our other post: 10 Best Practices for Prompt Engineering with Any Model.