Evaluating your prompts just got smarter. Today, we're excited to introduce our brand new Evaluations feature, designed to empower your team with a quantitative, systematic approach to assessing prompt performance. Judging your prompts by hand is valuable, but it doesn't scale well. Using both the "eye test" and automated evaluations delivers the best of both worlds.

Key Points

  • Quantitative & Systematic: Gain objective insights into your prompts through string-based rules ("does the output contain X") and an LLM-as-a-judge (see the sketch just below this list).
  • Team-Friendly: Designed to be used by anyone on your team, regardless of technical ability.
  • Runs using your model configuration: When using an LLM evaluator, it will run using the configuration set in the playground—unless you select a prompt from your library as the evaluator (more details in our docs).
  • Available on the Team and Enterprise plans: Evaluations are currently available only on our Team and Enterprise plans.
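
To make the string-based rules concrete, here is a minimal, hypothetical sketch of the kind of check a "regex contains" or "does not contain" rule performs. It's plain Python written for illustration only, not PromptHub's implementation:

```python
import re

def regex_contains(output: str, pattern: str) -> bool:
    # Passes if the regex pattern matches anywhere in the model output.
    return re.search(pattern, output) is not None

def does_not_contain(output: str, text: str) -> bool:
    # Passes if the given text never appears in the model output.
    return text not in output

# Example: require a number in the answer and forbid apologies.
output = "Final answer: 42"
print(regex_contains(output, r"\d+"))         # True -> check passes
print(does_not_contain(output, "I'm sorry"))  # True -> check passes
```

Because checks like these are deterministic, they're cheap to run across every output in a batch test.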

Benefits

  • Data-Driven Decisions: Make informed improvements based on measurable criteria, not just vibes
  • Leverage evaluator prompts from the community: Search the PromptHub community for evaluator prompts you can draw inspiration from
  • Leverage LLM Intelligence: Utilize the power of LLMs to provide nuanced feedback that complements human evaluations

How It Works

Our Evaluations feature is built for simplicity and flexibility:

  • Setting Up Evaluations:
    Navigate to the Evals tab, where you can create evaluations. You have two evaluation methods to choose from:
    • String-Based Evaluations:
      Use operations such as “regex contains” or “does not contain” to verify if your output meets specific criteria.
    • LLM-as-a-Judge:
      Write a custom evaluator prompt or select one from your library to let an LLM assess your output (see the conceptual sketch after this list).
  • Running Evaluations:
    Go to the playground for the prompt you want to test, scroll down, and check the box next to any evaluator you want to run (you can run multiple evaluators!). Then click Run Test. Evaluations can run on single requests, in batch tests, or in batch tests that include datasets.
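
For reference, here is a rough, conceptual sketch of what an LLM-as-a-judge evaluator does behind the scenes. The evaluator prompt, the gpt-4o model choice, and the use of the OpenAI Python SDK are all illustrative assumptions for this example; in PromptHub you simply write or select the evaluator prompt, and it runs with your playground configuration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Hypothetical evaluator prompt -- in PromptHub you would write this in the
# Evals tab or select an existing prompt from your library instead.
EVALUATOR_PROMPT = """You are grading a model output.
Criteria: the answer must be polite and must directly address the user's question.
Reply with a single word: PASS or FAIL."""

def llm_judge(output_to_grade: str) -> bool:
    # Ask the judge model to grade the output against the criteria above.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": EVALUATOR_PROMPT},
            {"role": "user", "content": output_to_grade},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

print(llm_judge("Sure! To reset your password, click 'Forgot password' on the login page."))
```

The verdict can be as simple as PASS/FAIL or as nuanced as a rubric score; the point is that the evaluation criteria live in a prompt you control.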

Demos

See Evaluations in action with these demo videos:

These demos walk you through various use cases and illustrate how the Evaluations feature can streamline your prompt refinement process.

Notes

  • Available on Team and Enterprise plans
  • Supports string-based and LLM-as-a-judge evals today, with more evaluation types planned

Get started with evaluations!

Dan Cleary
Founder