Quick Start
Overview
Human evaluation lets you assess your LLM application's performance using human judgment instead of automated metrics.
⏯️ Watch a short demo of the human evaluation feature.
Why use human evaluation?
Automated metrics can't capture everything. Sometimes you need human experts to evaluate results and identify why errors occur.
Human evaluation helps you:
- Get expert feedback to compare different versions of your application
- Collect human feedback and insights to improve your prompts and configuration
- Collect annotations to bootstrap automated evaluation
How human evaluation works
Human evaluation follows the same process as automatic evaluation:
- Choose a test set
- Select the versions you want to evaluate
- Pick your evaluators
- Start the evaluation
The only difference is that humans provide the evaluation scores instead of automated systems.
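To make that parallel concrete, here is a minimal Python sketch, assuming a hypothetical `EvaluationRecord` structure (not the platform's actual data model): the run is organized exactly like an automated evaluation, and only the source of the score changes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical data model, for illustration only: an evaluation run pairs
# each test-set row with a variant's output and a score slot.
@dataclass
class EvaluationRecord:
    test_case: dict                # row from the selected test set
    variant: str                   # application version being evaluated
    output: str                    # response generated for this test case
    score: Optional[float] = None  # filled in by an evaluator

def automated_score(record: EvaluationRecord) -> float:
    # Automated evaluation: code computes the score, e.g. an exact match check.
    return float(record.output == record.test_case.get("expected"))

def human_score(record: EvaluationRecord) -> float:
    # Human evaluation: an annotator supplies the same score through the UI;
    # a console prompt stands in for that step here.
    return float(input(f"Score this output (0 or 1): {record.output!r} > "))
```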
Quick workflow
- Start evaluation: Go to Evaluations → Human annotation → Start new evaluation
- Select test set: Choose the data you want to evaluate against
- Select variant: Pick the version of your application to test
- Configure evaluators: Create or select evaluators (boolean, integer, multi-choice, etc.); a configuration and scoring sketch follows this list
- Run: Click "Start evaluation" to generate outputs
- Annotate: Review each response and provide feedback
- Review results: Analyze aggregated scores and export data
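As a rough illustration of the Configure evaluators, Annotate, and Review results steps, the sketch below uses hypothetical evaluator definitions and annotation values (the real configuration, annotation, and aggregation all happen in the UI) to show what boolean, integer, and multi-choice feedback looks like and how it rolls up into aggregated scores.

```python
from statistics import mean

# Hypothetical evaluator definitions (illustration only; evaluators are
# actually created in the UI when you configure the evaluation).
evaluators = [
    {"name": "is_correct", "type": "boolean"},
    {"name": "fluency", "type": "integer", "min": 1, "max": 5},
    {"name": "failure_mode", "type": "multi_choice",
     "choices": ["none", "hallucination", "formatting", "off_topic"]},
]

# Hypothetical annotations: one record of evaluator -> value per reviewed response.
annotations = [
    {"is_correct": True,  "fluency": 5, "failure_mode": "none"},
    {"is_correct": False, "fluency": 3, "failure_mode": "hallucination"},
    {"is_correct": True,  "fluency": 4, "failure_mode": "none"},
]

# Reviewing results: aggregate the collected scores across annotations.
accuracy = mean(a["is_correct"] for a in annotations)   # share marked correct
avg_fluency = mean(a["fluency"] for a in annotations)   # average 1-5 rating
print(f"correct: {accuracy:.0%}, avg fluency: {avg_fluency:.1f}")
```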
Next steps
- Learn about configuring evaluators
- Understand how to run evaluations
- Explore viewing results
- Try A/B testing