How to Run Human Annotations on Your LLM Application

Human evaluation lets you assess your LLM application's performance with human judgment instead of automated metrics.

⏯️ Watch a short demo of the human evaluation feature.

Why use human evaluation?

Automated metrics can't capture everything. Sometimes you need human experts to evaluate results and identify why errors occur.

Human evaluation helps you:

  • Get expert feedback to compare different versions of your application
  • Collect human feedback and insights to improve your prompts and configuration
  • Collect annotations to bootstrap automated evaluation

How human evaluation works

Human evaluation follows the same process as automatic evaluation:

  1. Choose a test set
  2. Select the versions you want to evaluate
  3. Pick your evaluators
  4. Start the evaluation

The only difference is that humans provide the evaluation scores instead of automated systems.

Single model evaluation

Start a new evaluation

  1. Go to the Evaluations page
  2. Select the Human annotation tab
  3. Click Start new evaluation

Configure your evaluation

  1. Select your test set - Choose the data you want to evaluate against
  2. Select your revision - Pick the version of your application to test
warning

Your test set columns must match the input variables in your revision. If they don't match, you'll see an error message (see the sketch after this list).

  3. Choose evaluators - Select how you want to measure performance
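
For example, if the revision you selected uses the (hypothetical) input variables country and question in its prompt template, the test set needs columns with exactly those names. A minimal Python sketch for preparing such a test set as a CSV file:

```python
import csv

# Hypothetical input variables: a revision whose prompt template uses
# {country} and {question}. The CSV column names must match these
# variable names exactly, otherwise the evaluation setup reports a
# column/variable mismatch.
rows = [
    {"country": "France", "question": "What is the capital?"},
    {"country": "Japan", "question": "Name the largest city."},
]

with open("test_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "question"])
    writer.writeheader()
    writer.writerows(rows)
```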

Create evaluators (optional)

If you don't have evaluators yet, click Create new in the Evaluator section.

Each evaluator has:

  • Name - What you're measuring (e.g., "correctness")
  • Description - What the evaluator does
  • Feedback types - How annotators will score responses

For example, a "correctness" evaluator might have:

  • is_correct - A yes/no question about accuracy
  • error_type - A multiple-choice field for categorizing mistakes

Available feedback types:

  • Boolean - Yes/no questions
  • Integer - Whole number ratings
  • Decimal - Precise numerical scores
  • Single-choice - Pick one option
  • Multi-choice - Pick multiple options
  • String - Free-text comments or notes
tip

Evaluators can include multiple related feedback types. For example:

Correctness evaluator:

  • is_correct - Yes/no question about accuracy
  • error_type - Multiple-choice field to categorize mistakes (only if incorrect)

Style adherence evaluator:

  • is_adherent - Yes/no question about style compliance
  • comment - Text field explaining why the style doesn't match (if needed)

This grouping helps you evaluate different aspects of your LLM's performance in an organized way.
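
To make the structure concrete, here is an illustrative sketch (plain Python, not a platform API call) of the correctness evaluator described above; the keys simply mirror the Name, Description, and Feedback types fields you fill in through the UI:

```python
# Illustrative only -- this mirrors the UI fields, it is not an SDK call.
correctness_evaluator = {
    "name": "correctness",
    "description": "Checks whether the response is factually accurate.",
    "feedback_types": [
        # Yes/no question about accuracy.
        {"key": "is_correct", "type": "boolean"},
        # Category for the mistake; only meaningful when is_correct is False.
        # (Use multi-choice instead if a response can show several error types.)
        {
            "key": "error_type",
            "type": "single-choice",
            "options": ["factual error", "hallucination", "formatting", "off-topic"],
        },
    ],
}
```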

Run the evaluation

After creating your evaluators:

  1. Select the evaluators you want to use
  2. Click Start evaluation
  3. You'll be redirected to the annotation interface
  4. Click Run all to generate outputs and begin evaluation

Annotate responses

For each test case:

  1. Review the input and output
  2. Use the evaluation form on the right to score the response
  3. Click Annotate to save your assessment
  4. Click Next to move to the next test case
tip

Select the Unannotated tab to see only the test cases you haven't reviewed yet.

Review results

After completing all annotations:

  • View results in the Results section
  • Compare performance with other experiments
  • Export results to CSV using Export results
  • Save annotated data as a new test set with Save test set
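
Once exported to CSV, the annotations can be analyzed outside the platform. A small sketch, assuming the export contains the is_correct and error_type columns from the evaluator above (the actual column names in your export may differ):

```python
import csv
from collections import Counter

# Assumed file name and column names; adjust to match your actual export.
with open("evaluation_results.csv", newline="") as f:
    rows = list(csv.DictReader(f))

correct = sum(
    1 for r in rows if r.get("is_correct", "").strip().lower() in ("true", "yes", "1")
)
print(f"Correct responses: {correct} of {len(rows)}")

# Distribution of error categories among annotated mistakes.
error_counts = Counter(r["error_type"] for r in rows if r.get("error_type"))
for error, count in error_counts.most_common():
    print(f"{error}: {count}")
```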

A/B testing evaluation

A/B testing lets you compare two versions of your application side-by-side. For each test case, you choose which version performs better.

Set up A/B testing

  1. Select two versions you want to compare
  2. Choose your test set
  3. For each test case, decide which version is better (or if they're equal)

Collaborate with your team

You can invite team members to help with A/B testing by sharing the evaluation link. Team members must be added to your workspace first.

A/B testing features

During A/B evaluation, you can:

  • Compare variants - Score which version performs better for each test case
  • Add notes - Include context or detailed feedback
  • Export results - Download your evaluation data for further analysis
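
As with single-model evaluation, exported A/B data can be tallied outside the platform. A sketch, assuming the export has a preferred_variant column recording each choice (the actual header may differ):

```python
import csv
from collections import Counter

# Assumed file name and column name; adjust to your actual export.
with open("ab_test_results.csv", newline="") as f:
    votes = Counter(row["preferred_variant"] for row in csv.DictReader(f))

for variant, count in votes.most_common():
    print(f"{variant}: {count} preferences")
```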