Concepts
What is evaluation?
The key to building production-ready LLM applications is a tight feedback loop between prompt engineering and evaluation. Whether you are optimizing a chatbot, building a Retrieval-Augmented Generation (RAG) pipeline, or fine-tuning a model for a text generation task, evaluation is a critical step in ensuring consistent performance across different inputs, models, and parameters.
Key concepts
Evaluators
Evaluators are functions that assess the output of an LLM application.
They typically take as input:
- The output of the LLM application
- (Optional) The reference answer (i.e., expected output or ground truth)
- (Optional) The inputs to the LLM application
- (Optional) Any other relevant data, such as the retrieved context in a RAG application
The type of result depends on the evaluator. Simple evaluators return a single value, such as a boolean (true/false) or a numeric score. Evaluators with schemas (such as LLM-as-a-Judge or custom evaluators) can return structured results with multiple fields, letting you capture several aspects of the evaluation in a single result.
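For example, a basic custom evaluator can be sketched as a plain Python function. The names below (`exact_match_evaluator`, `output`, `correct_answer`, `success`, `score`) are illustrative assumptions rather than a specific SDK contract; the point is simply that an evaluator maps an output (plus optional reference data) to one or more result fields.

```python
from typing import Any, Optional


def exact_match_evaluator(
    output: str,
    correct_answer: Optional[str] = None,
    inputs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Illustrative evaluator: compare the app's output to the reference answer.

    Returning a dict shows how schema-based evaluators can report several
    fields at once; a simpler evaluator could return just the boolean.
    """
    matched = (
        correct_answer is not None
        and output.strip().lower() == correct_answer.strip().lower()
    )
    return {
        "success": matched,                 # boolean verdict
        "score": 1.0 if matched else 0.0,   # numeric score
    }
```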
Test sets
Test sets are collections of test cases used to evaluate your LLM application. Each test case contains the following fields (an example follows this list):
- Inputs: The data your LLM application expects (required)
- Ground Truth: The expected output for the given inputs (optional, often stored as "correct_answer")
- Annotations: Additional metadata or rules about the test case (optional)
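As an illustrative sketch (the field names mirror the structure above; the actual storage format, whether CSV, JSON, or the web UI, is up to you), a small test set could look like this:

```python
# A small, illustrative test set: each test case has inputs (required),
# an optional ground truth stored as "correct_answer", and optional annotations.
test_set = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "correct_answer": "Paris",
        "annotations": {"difficulty": "easy"},
    },
    {
        "inputs": {"question": "Summarize the refund policy in one sentence."},
        # No single correct answer; this case relies on an LLM judge or human review.
        "annotations": {"requires_judgment": True},
    },
]
```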
Test sets are critical for:
- Evaluating your application systematically
- Finding edge cases
- Preventing regressions
- Measuring improvements over time
Evaluation workflows
Agenta supports multiple evaluation workflows:
- Automated Evaluation (UI): Run evaluations from the web interface with configurable evaluators
- Automated Evaluation (SDK): Run evaluations programmatically, for example to integrate them into CI/CD pipelines (a minimal sketch follows this list)
- Online Evaluation: Run evaluations on new traces as they are generated by your LLM application
- Human Evaluation: Collect expert feedback and annotations for qualitative assessment
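To make the SDK-style workflow concrete, here is a minimal sketch of what a programmatic evaluation loop boils down to: load a test set, run the application on each case, apply an evaluator, and aggregate the scores. `run_llm_app` is a placeholder for your own application code, and none of the function names here are taken from the Agenta SDK.

```python
from statistics import mean


def run_llm_app(inputs: dict) -> str:
    """Placeholder for your LLM application (e.g., a prompt plus a model call)."""
    return f"Answer to: {inputs['question']}"


def evaluate(test_set: list[dict], evaluator) -> dict:
    """Run the app on every test case, score each output, and aggregate."""
    results = []
    for case in test_set:
        output = run_llm_app(case["inputs"])
        results.append(
            evaluator(
                output=output,
                correct_answer=case.get("correct_answer"),
                inputs=case["inputs"],
            )
        )
    return {
        "results": results,                               # per-case details
        "mean_score": mean(r["score"] for r in results),  # aggregate metric
    }


# Using the sketches above:
# report = evaluate(test_set, exact_match_evaluator)
# print(report["mean_score"])
```

In a CI/CD pipeline, the aggregate score can then be compared against a threshold to catch regressions before deployment.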
Next steps
- Configure evaluators for your use case
- Create test sets from various sources
- Run your first evaluation from the UI