A/B Testing

Overview

A/B testing lets you compare two versions of your application side by side. For each test case, you choose which version performs better.

Setting up A/B testing

  1. Select two versions you want to compare
  2. Choose your test set
  3. For each test case, decide which version is better (or mark the result as a tie)
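
If you prepare the two versions' outputs outside the product, one simple way to organize them is a record per test case that holds both outputs plus the eventual judgment. The structure below is a generic sketch, not a prescribed import format; the field names (prompt, output_a, output_b, winner, note) are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class ABTestCase:
    """One test case: the same input run through both variants."""
    prompt: str
    output_a: str          # response from variant A
    output_b: str          # response from variant B
    winner: str = "tie"    # "A", "B", or "tie" once judged
    note: str = ""         # optional annotator feedback

# Illustrative example: comparing two prompt wordings on the same input.
cases = [
    ABTestCase(
        prompt="Summarize this support ticket.",
        output_a="Customer reports login failures after updating to v2.3.",
        output_b="Login broken.",
        winner="A",
        note="Variant A preserves the relevant detail.",
    ),
]
```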

A/B testing features

During A/B evaluation, you can:

  • Compare variants - Record which version performs better for each test case
  • Add notes - Include context or detailed feedback
  • Export results - Download your evaluation data for further analysis
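
Exported results can be analyzed with standard tooling. The snippet below assumes a CSV export with a column recording each judgment; the file name and the "winner" column name are assumptions, so adjust them to match your actual export.

```python
import csv
from collections import Counter

def count_judgments(path: str) -> Counter:
    """Tally how often each variant was preferred in an exported CSV."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["winner"]] += 1   # expected values: "A", "B", "tie"
    return counts

# e.g. Counter({'A': 14, 'B': 9, 'tie': 2})
print(count_judgments("ab_test_results.csv"))
```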

Collaborating on A/B tests

You can invite team members to help with A/B testing by sharing the evaluation link. Team members must be added to your workspace first.

This is particularly useful for:

  • Getting diverse perspectives on performance
  • Reducing individual bias
  • Speeding up evaluation with multiple annotators

Interpreting A/B test results

After completing the A/B test, you'll see:

  • Win/loss/tie counts for each variant
  • Percentage of cases where each variant performed better
  • Specific test cases where variants differed significantly
  • Notes and comments from annotators
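
To turn raw win/loss/tie counts into the percentages described above, divide each count by the total number of judged cases. A minimal sketch:

```python
def win_rates(winners: list[str]) -> dict[str, float]:
    """Percentage of test cases won by each variant (ties reported separately)."""
    total = len(winners)
    return {label: round(100 * winners.count(label) / total, 1)
            for label in ("A", "B", "tie")}

# 10 judged cases: variant A wins 6, variant B wins 3, 1 tie
print(win_rates(["A"] * 6 + ["B"] * 3 + ["tie"]))
# -> {'A': 60.0, 'B': 30.0, 'tie': 10.0}
```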

Use cases

A/B testing is ideal for:

  • Prompt optimization: Compare different prompt wordings
  • Model selection: Evaluate different LLMs (GPT-4 vs. Claude vs. others)
  • Parameter tuning: Test different temperature or max_tokens settings (see the sketch after this list)
  • Feature comparison: Compare variants with different features enabled
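
For the parameter-tuning case, the two variants often differ only in sampling settings. The sketch below is hypothetical: `run_variant` is a stub standing in for whatever client call your application makes, and the model name and parameter values are examples only.

```python
# Hypothetical variant configurations differing only in sampling parameters.
VARIANT_A = {"model": "gpt-4", "temperature": 0.2, "max_tokens": 256}
VARIANT_B = {"model": "gpt-4", "temperature": 0.9, "max_tokens": 256}

def run_variant(config: dict, prompt: str) -> str:
    """Stub: replace with a real call to your LLM provider using `config`."""
    return f"[{config['model']} @ T={config['temperature']}] response to: {prompt}"

# Produce paired outputs for each prompt in the test set, ready for
# side-by-side judging in the A/B evaluation.
test_prompts = [
    "Summarize this support ticket.",
    "Draft a polite reply declining a refund.",
]
pairs = [
    (prompt, run_variant(VARIANT_A, prompt), run_variant(VARIANT_B, prompt))
    for prompt in test_prompts
]
```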

Next steps