Quick Start
Overview
Human evaluation lets you assess your LLM application's performance using human judgment instead of automated metrics.
⏯️ Watch a short demo of the human evaluation feature.
Why use human evaluation?
Automated metrics can't capture everything. Sometimes you need human experts to evaluate results and identify why errors occur.
Human evaluation helps you:
- Get expert feedback to compare different versions of your application
- Collect human feedback and insights to improve your prompts and configuration
- Collect annotations to bootstrap automated evaluation
How human evaluation works
Human evaluation follows the same process as automatic evaluation:
- Choose a test set
- Select the versions you want to evaluate
- Pick your evaluators
- Start the evaluation
The only difference is that humans provide the evaluation scores instead of automated systems.
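To make that parallel concrete, here is a minimal Python sketch, assuming a hypothetical `EvaluationRecord` structure (not the platform's actual data model): the run is organized exactly like an automated evaluation, and only the source of the score changes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical data model, for illustration only: an evaluation run pairs
# each test-set row with a variant's output and a score slot.
@dataclass
class EvaluationRecord:
    test_case: dict                # row from the selected test set
    variant: str                   # application version being evaluated
    output: str                    # response generated for this test case
    score: Optional[float] = None  # filled in by an evaluator

def automated_score(record: EvaluationRecord) -> float:
    # Automated evaluation: code computes the score, e.g. an exact match check.
    return float(record.output == record.test_case.get("expected"))

def human_score(record: EvaluationRecord) -> float:
    # Human evaluation: an annotator supplies the same score through the UI;
    # a console prompt stands in for that step here.
    return float(input(f"Score this output (0 or 1): {record.output!r} > "))
```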
Quick workflow
- Start evaluation: Go to Evaluations → Human annotation → Start new evaluation
- Select test set: Choose the data you want to evaluate against
- Select variant: Pick the version of your application to test
- Configure evaluators: Create or select evaluators (boolean, integer, multi-choice, etc.); a configuration and scoring sketch follows this list
- Run: Click "Start evaluation" to generate outputs
- Annotate: Review each response and provide feedback
- Review results: Analyze aggregated scores and export data
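As a rough illustration of the Configure evaluators, Annotate, and Review results steps, the sketch below uses hypothetical evaluator definitions and annotation values (the real configuration, annotation, and aggregation all happen in the UI) to show what boolean, integer, and multi-choice feedback looks like and how it rolls up into aggregated scores.

```python
from statistics import mean

# Hypothetical evaluator definitions (illustration only; evaluators are
# actually created in the UI when you configure the evaluation).
evaluators = [
    {"name": "is_correct", "type": "boolean"},
    {"name": "fluency", "type": "integer", "min": 1, "max": 5},
    {"name": "failure_mode", "type": "multi_choice",
     "choices": ["none", "hallucination", "formatting", "off_topic"]},
]

# Hypothetical annotations: one record of evaluator -> value per reviewed response.
annotations = [
    {"is_correct": True,  "fluency": 5, "failure_mode": "none"},
    {"is_correct": False, "fluency": 3, "failure_mode": "hallucination"},
    {"is_correct": True,  "fluency": 4, "failure_mode": "none"},
]

# Reviewing results: aggregate the collected scores across annotations.
accuracy = mean(a["is_correct"] for a in annotations)   # share marked correct
avg_fluency = mean(a["fluency"] for a in annotations)   # average 1-5 rating
print(f"correct: {accuracy:.0%}, avg fluency: {avg_fluency:.1f}")
```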
Next steps
- Learn about configuring evaluators
- Understand how to run evaluations
- Explore viewing results
- Try A/B testing