Evaluating your prompts just got smarter. Today, we're excited to introduce our brand new Evaluations feature, designed to empower your team with a quantitative, systematic approach to assessing prompt performance. Judging your prompts by hand is valuable, but it doesn't scale well. Using both the "eye test" and automated evaluations delivers the best of both worlds.

Key Points

  • Quantitative & Systematic: Gain objective insights into your prompts through string-based rules ("does the output contain X") and an LLM-as-a-judge (see the sketch just below this list).
  • Team-Friendly: Designed to be used by anyone on your team, regardless of technical ability.
  • Runs using your model configuration: When using an LLM evaluator, it will run using the configuration set in the playground—unless you select a prompt from your library as the evaluator (more details in our docs).
  • Available on the Team and Enterprise plans: Evaluations are currently available only on our Team and Enterprise plans.
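
To make the string-based rules concrete, here is a minimal, hypothetical sketch of the kind of check a "regex contains" or "does not contain" rule performs. It's plain Python written for illustration only, not PromptHub's implementation:

```python
import re

def regex_contains(output: str, pattern: str) -> bool:
    # Passes if the regex pattern matches anywhere in the model output.
    return re.search(pattern, output) is not None

def does_not_contain(output: str, text: str) -> bool:
    # Passes if the given text never appears in the model output.
    return text not in output

# Example: require a number in the answer and forbid apologies.
output = "Final answer: 42"
print(regex_contains(output, r"\d+"))         # True -> check passes
print(does_not_contain(output, "I'm sorry"))  # True -> check passes
```

Because checks like these are deterministic, they're cheap to run across every output in a batch test.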

Benefits

  • Data-Driven Decisions: Make informed improvements based on measurable criteria, not just vibes
  • Leverage evaluator prompts from the community: Search the PromptHub community for evaluator prompts you can draw inspiration from
  • Leverage LLM Intelligence: Utilize the power of LLMs to provide nuanced feedback that complements human evaluations

How It Works

Our Evaluations feature is built for simplicity and flexibility:

  • Setting Up Evaluations:
    Navigate to the Evals tab, where you can create evaluations. You have two evaluation methods to choose from:
    • String-Based Evaluations:
      Use operations such as “regex contains” or “does not contain” to verify if your output meets specific criteria.
    • LLM-as-a-Judge:
      Write a custom evaluator prompt or select one from your library to let an LLM assess your output (see the conceptual sketch after this list).
  • Running Evaluations:
    Go to the playground for the prompt you want to test, scroll down, and check the box next to any evaluator you want to run (you can run multiple evaluators!). Then click Run Test. Evaluations can run on single requests, in batch tests, or in batch tests that include datasets.
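
For reference, here is a rough, conceptual sketch of what an LLM-as-a-judge evaluator does behind the scenes. The evaluator prompt, the gpt-4o model choice, and the use of the OpenAI Python SDK are all illustrative assumptions for this example; in PromptHub you simply write or select the evaluator prompt, and it runs with your playground configuration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Hypothetical evaluator prompt -- in PromptHub you would write this in the
# Evals tab or select an existing prompt from your library instead.
EVALUATOR_PROMPT = """You are grading a model output.
Criteria: the answer must be polite and must directly address the user's question.
Reply with a single word: PASS or FAIL."""

def llm_judge(output_to_grade: str) -> bool:
    # Ask the judge model to grade the output against the criteria above.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": EVALUATOR_PROMPT},
            {"role": "user", "content": output_to_grade},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

print(llm_judge("Sure! To reset your password, click 'Forgot password' on the login page."))
```

The verdict can be as simple as PASS/FAIL or as nuanced as a rubric score; the point is that the evaluation criteria live in a prompt you control.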

Demos

See Evaluations in action with these demo videos:

These demos walk you through various use cases and illustrate how the Evaluations feature can streamline your prompt refinement process.

Notes

  • Available on Team and Enterprise plans
  • Supports string-based and LLM-as-a-judge evals today, with more evaluation types planned

Get started with evaluations!

Dan Cleary
Founder