How to Run Human Annotations on your LLM Application
Human evaluation lets you evaluate your LLM application's performance using human judgment instead of automated metrics.
⏯️ Watch a short demo of the human evaluation feature.
Why use human evaluation?
Automated metrics can't capture everything. Sometimes you need human experts to evaluate results and identify why errors occur.
Human evaluation helps you:
- Get expert feedback to compare different versions of your application
- Collect human feedback and insights to improve your prompts and configuration
- Collect annotations to bootstrap automated evaluation
How human evaluation works
Human evaluation follows the same process as automatic evaluation:
- Choose a test set
- Select the versions you want to evaluate
- Pick your evaluators
- Start the evaluation
The only difference is that humans provide the evaluation scores instead of automated systems.
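To make the moving pieces concrete, here is a minimal sketch of that configuration as plain Python data. The class and field names are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class HumanEvaluationRun:
    """Illustrative model of a human evaluation run; names are hypothetical."""
    test_set: str                # the test set to evaluate against
    revisions: list[str]         # the application version(s) under evaluation
    evaluators: list[str]        # the evaluators humans will score with
    annotations: list[dict] = field(default_factory=list)  # filled in by annotators

# A single-model run scored by two evaluators
run = HumanEvaluationRun(
    test_set="customer-support-v2",
    revisions=["app-v3"],
    evaluators=["correctness", "style_adherence"],
)
```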
Single model evaluation
Start a new evaluation
- Go to the Evaluations page
- Select the Human annotation tab
- Click Start new evaluation
Configure your evaluation
- Select your test set - Choose the data you want to evaluate against
- Select your revision - Pick the version of your application to test
Your test set columns must match the input variables in your revision. If they don't match, you'll see an error message (see the example after this list).
- Choose evaluators - Select how you want to measure performance
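For instance, if a revision's prompt template uses the input variables `country` and `question`, the test set needs columns with exactly those names. Here is a minimal sketch of building such a test set; the variable names and file name are made up for illustration.

```python
import csv

# Hypothetical revision whose prompt uses the input variables
# {country} and {question}; the test set columns must match them exactly.
rows = [
    {"country": "France", "question": "What is the capital city?"},
    {"country": "Japan", "question": "Which currency is used?"},
]

with open("test_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "question"])
    writer.writeheader()
    writer.writerows(rows)
```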
Create evaluators (optional)
If you don't have evaluators yet, click Create new in the Evaluator section.
Each evaluator has:
- Name - What you're measuring (e.g., "correctness")
- Description - What the evaluator does
- Feedback types - How annotators will score responses
For example, a "correctness" evaluator might have:
- is_correct - A yes/no question about accuracy
- error_type - A multiple-choice field for categorizing mistakes
Available feedback types:
- Boolean - Yes/no questions
- Integer - Whole number ratings
- Decimal - Precise numerical scores
- Single-choice - Pick one option
- Multi-choice - Pick multiple options
- String - Free-text comments or notes
Evaluators can include multiple related feedback types. For example:
Correctness evaluator:
- is_correct - Yes/no question about accuracy
- error_type - Multiple-choice field to categorize mistakes (only if incorrect)
Style adherence evaluator:
- is_adherent - Yes/no question about style compliance
- comment - Text field explaining why the style doesn't match (if needed)
This grouping helps you evaluate different aspects of your LLM's performance in an organized way.
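One way to picture these groupings is as structured data, with each evaluator bundling a few named feedback fields and their types. The sketch below mirrors the two examples above; it is illustrative only, not the platform's actual configuration format.

```python
# Illustrative only: evaluators grouped with their feedback fields.
evaluators = {
    "correctness": {
        "description": "Is the answer factually correct?",
        "feedback": {
            "is_correct": "boolean",       # yes/no question about accuracy
            "error_type": "multi-choice",  # categorize mistakes (only if incorrect)
        },
    },
    "style_adherence": {
        "description": "Does the answer follow the style guide?",
        "feedback": {
            "is_adherent": "boolean",  # yes/no question about style compliance
            "comment": "string",       # explain why the style doesn't match (if needed)
        },
    },
}
```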
Run the evaluation
After creating your evaluators:
- Select the evaluators you want to use
- Click Start evaluation
- You'll be redirected to the annotation interface
- Click Run all to generate outputs and begin evaluation
Annotate responses
For each test case:
- Review the input and output
- Use the evaluation form on the right to score the response
- Click Annotate to save your assessment
- Click Next to move to the next test case
Select the Unannotated tab to see only the test cases you haven't reviewed yet.
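Each saved annotation is essentially a set of scores keyed to one test case. Here is a hedged sketch of what that record might look like, reusing the hypothetical correctness evaluator from above; this is not the platform's actual payload format.

```python
# Illustrative annotation for one test case, mirroring the evaluation form.
annotation = {
    "test_case": 12,
    "evaluator": "correctness",
    "scores": {
        "is_correct": False,
        "error_type": "hallucination",  # hypothetical category
    },
    "note": "Cites a regulation that does not exist.",
}
```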
Review results
After completing all annotations:
- View results in the Results section
- Compare performance with other experiments
- Export results to CSV using Export results (see the analysis sketch after this list)
- Save annotated data as a new test set with Save test set
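Once results are exported to CSV, ordinary tooling is enough for a quick summary. Below is a minimal sketch using pandas, assuming the export has one row per annotated test case with the feedback fields as columns; the file name and column names are assumptions.

```python
import pandas as pd

# Assumed export layout: one row per annotated test case,
# with a boolean "is_correct" column and an "error_type" column.
df = pd.read_csv("human_eval_results.csv")

accuracy = df["is_correct"].mean()
errors_by_type = df.loc[~df["is_correct"], "error_type"].value_counts()

print(f"Correct responses: {accuracy:.0%}")
print(errors_by_type)
```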
A/B testing evaluation
A/B testing lets you compare two versions of your application side-by-side. For each test case, you choose which version performs better.
Set up A/B testing
- Select two versions you want to compare
- Choose your test set
- For each test case, decide which version is better (or if they're equal)
Collaborate with your team
You can invite team members to help with A/B testing by sharing the evaluation link. Team members must be added to your workspace first.
A/B testing features
During A/B evaluation, you can:
- Compare variants - Score which version performs better for each test case
- Add notes - Include context or detailed feedback
- Export results - Download your evaluation data for further analysis
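After exporting, a win-rate summary is usually the first thing to compute. Here is a sketch assuming each exported row has a "winner" column containing "A", "B", or "tie"; the file and column names are made up.

```python
import pandas as pd

# Assumed export layout: one row per test case with a "winner" column.
df = pd.read_csv("ab_test_results.csv")

counts = df["winner"].value_counts()
decided = counts.get("A", 0) + counts.get("B", 0)

print(counts)
if decided:
    print(f"Variant A win rate (excluding ties): {counts.get('A', 0) / decided:.0%}")
```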