How to Run Human Annotations on Your LLM Application

Human evaluation lets you assess your LLM application's performance with human judgment instead of automated metrics.

⏯️ Watch a short demo of the human evaluation feature.

Why use human evaluation?

Automated metrics can't capture everything. Sometimes you need human experts to evaluate results and identify why errors occur.

Human evaluation helps you:

  • Get expert feedback to compare different versions of your application
  • Collect human feedback and insights to improve your prompts and configuration
  • Collect annotations to bootstrap automated evaluation

How human evaluation works

Human evaluation follows the same process as automatic evaluation:

  1. Choose a test set
  2. Select the versions you want to evaluate
  3. Pick your evaluators
  4. Start the evaluation

The only difference is that humans provide the evaluation scores instead of automated systems.

Single model evaluation

Start a new evaluation

  1. Go to the Evaluations page
  2. Select the Human annotation tab
  3. Click Start new evaluation

Configure your evaluation

  1. Select your test set - Choose the data you want to evaluate against
  2. Select your revision - Pick the version of your application to test
warning

Your test set columns must match the input variables in your revision. If they don't match, you'll see an error message (see the sketch after this list).

  3. Choose evaluators - Select how you want to measure performance
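
For example, if the revision you selected uses the (hypothetical) input variables country and question in its prompt template, the test set needs columns with exactly those names. A minimal Python sketch for preparing such a test set as a CSV file:

```python
import csv

# Hypothetical input variables: a revision whose prompt template uses
# {country} and {question}. The CSV column names must match these
# variable names exactly, otherwise the evaluation setup reports a
# column/variable mismatch.
rows = [
    {"country": "France", "question": "What is the capital?"},
    {"country": "Japan", "question": "Name the largest city."},
]

with open("test_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "question"])
    writer.writeheader()
    writer.writerows(rows)
```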

Create evaluators (optional)

If you don't have evaluators yet, click Create new in the Evaluator section.

Each evaluator has:

  • Name - What you're measuring (e.g., "correctness")
  • Description - What the evaluator does
  • Feedback types - How annotators will score responses

For example, a "correctness" evaluator might have:

  • is_correct - A yes/no question about accuracy
  • error_type - A multiple-choice field for categorizing mistakes

Available feedback types:

  • Boolean - Yes/no questions
  • Integer - Whole number ratings
  • Decimal - Precise numerical scores
  • Single-choice - Pick one option
  • Multi-choice - Pick multiple options
  • String - Free-text comments or notes
tip

Evaluators can include multiple related feedback types. For example:

Correctness evaluator:

  • is_correct - Yes/no question about accuracy
  • error_type - Multiple-choice field to categorize mistakes (only if incorrect)

Style adherence evaluator:

  • is_adherent - Yes/no question about style compliance
  • comment - Text field explaining why the style doesn't match (if needed)

This grouping helps you evaluate different aspects of your LLM's performance in an organized way.
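
To make the structure concrete, here is an illustrative sketch (plain Python, not a platform API call) of the correctness evaluator described above; the keys simply mirror the Name, Description, and Feedback types fields you fill in through the UI:

```python
# Illustrative only -- this mirrors the UI fields, it is not an SDK call.
correctness_evaluator = {
    "name": "correctness",
    "description": "Checks whether the response is factually accurate.",
    "feedback_types": [
        # Yes/no question about accuracy.
        {"key": "is_correct", "type": "boolean"},
        # Category for the mistake; only meaningful when is_correct is False.
        # (Use multi-choice instead if a response can show several error types.)
        {
            "key": "error_type",
            "type": "single-choice",
            "options": ["factual error", "hallucination", "formatting", "off-topic"],
        },
    ],
}
```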

Run the evaluation

After creating your evaluators:

  1. Select the evaluators you want to use
  2. Click Start evaluation
  3. You'll be redirected to the annotation interface
  4. Click Run all to generate outputs and begin evaluation

Annotate responses

For each test case:

  1. Review the input and output
  2. Use the evaluation form on the right to score the response
  3. Click Annotate to save your assessment
  4. Click Next to move to the next test case
tip

Select the Unannotated tab to see only the test cases you haven't reviewed yet.

Review results

After completing all annotations:

  • View results in the Results section
  • Compare performance with other experiments
  • Export results to CSV using Export results
  • Save annotated data as a new test set with Save test set
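
Once exported to CSV, the annotations can be analyzed outside the platform. A small sketch, assuming the export contains the is_correct and error_type columns from the evaluator above (the actual column names in your export may differ):

```python
import csv
from collections import Counter

# Assumed file name and column names; adjust to match your actual export.
with open("evaluation_results.csv", newline="") as f:
    rows = list(csv.DictReader(f))

correct = sum(
    1 for r in rows if r.get("is_correct", "").strip().lower() in ("true", "yes", "1")
)
print(f"Correct responses: {correct} of {len(rows)}")

# Distribution of error categories among annotated mistakes.
error_counts = Counter(r["error_type"] for r in rows if r.get("error_type"))
for error, count in error_counts.most_common():
    print(f"{error}: {count}")
```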

A/B testing evaluation

A/B testing lets you compare two versions of your application side-by-side. For each test case, you choose which version performs better.

Set up A/B testing

  1. Select two versions you want to compare
  2. Choose your test set
  3. For each test case, decide which version is better (or if they're equal)

Collaborate with your team

You can invite team members to help with A/B testing by sharing the evaluation link. Team members must be added to your workspace first.

A/B testing features

During A/B evaluation, you can:

  • Compare variants - Score which version performs better for each test case
  • Add notes - Include context or detailed feedback
  • Export results - Download your evaluation data for further analysis
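
As with single-model evaluation, exported A/B data can be tallied outside the platform. A sketch, assuming the export has a preferred_variant column recording each choice (the actual header may differ):

```python
import csv
from collections import Counter

# Assumed file name and column name; adjust to your actual export.
with open("ab_test_results.csv", newline="") as f:
    votes = Counter(row["preferred_variant"] for row in csv.DictReader(f))

for variant, count in votes.most_common():
    print(f"{variant}: {count} preferences")
```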