Configure Evaluators
This guide shows you how to configure evaluators for your LLM application.
Configuring evaluators
To create a new evaluator, click the Create New button on the Evaluators page.
Selecting evaluators
Agenta offers a growing list of pre-built evaluators suitable for most use cases. You can also create custom evaluators by writing your own Python function or by using webhooks for evaluation.
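If the built-in options don't cover your use case, a custom code evaluator is a small Python function that receives the application output along with the test-case data and returns a score. The snippet below is a minimal sketch; the function name, parameter names, and signature are assumptions for illustration, so check the Custom Code Evaluation page for the exact interface Agenta expects.

```python
from typing import Dict

# Minimal sketch of a custom code evaluator.
# Assumption: the platform calls a function named `evaluate` with the app
# configuration, the test-case inputs, the generated output, and the expected
# answer; verify the exact signature in the Custom Code Evaluation docs.
def evaluate(
    app_params: Dict[str, str],   # variant configuration (e.g. prompt, model)
    inputs: Dict[str, str],       # inputs of the test case being evaluated
    output: str,                  # output generated by the application
    correct_answer: str,          # ground-truth answer from the test set
) -> float:
    # Return a score between 0 and 1: here, a simple case-insensitive match.
    return 1.0 if output.strip().lower() == correct_answer.strip().lower() else 0.0
```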
Available evaluators
| Evaluator Name | Use Case | Type | Description |
|---|---|---|---|
| Exact Match | Classification/Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| Contains JSON | Classification/Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| Regex Test | Classification/Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| JSON Field Match | Classification/Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| JSON Diff Match | Classification/Entity Extraction | Similarity Metrics | Compares generated JSON with a ground truth JSON based on schema or values. |
| Similarity Match | Text Generation / Chatbot | Similarity Metrics | Compares the generated output with the expected result using Jaccard similarity. |
| Semantic Similarity Match | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| Starts With | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| Ends With | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| Contains | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| Contains Any | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| Contains All | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| Levenshtein Distance | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between output and expected result. |
| LLM-as-a-judge | Text Generation / Chatbot | LLM-based | Sends outputs to an LLM for critique and evaluation. |
| RAG Faithfulness | RAG / Text Generation / Chatbot | LLM-based | Evaluates if the output is faithful to the retrieved documents in RAG workflows. |
| RAG Context Relevancy | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of retrieved documents to the given question in RAG. |
| Custom Code Evaluation | Custom Logic | Custom | Allows users to define their own evaluator in Python. |
| Webhook Evaluator | Custom Logic | Custom | Sends output to a webhook for external evaluation. |
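The webhook evaluator POSTs the generated output to an HTTP endpoint you control and reads a score back from the response. The sketch below uses FastAPI and assumes a JSON payload with `inputs`, `output`, and `correct_answer` fields and a `{"score": ...}` response; the actual request and response schemas are defined by Agenta, so confirm them on the Webhook Evaluator page before relying on this shape.

```python
from typing import Dict, Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Assumed request payload; the fields Agenta actually sends may differ,
# so check the Webhook Evaluator documentation for the real schema.
class EvaluationRequest(BaseModel):
    inputs: Dict[str, str]
    output: str
    correct_answer: Optional[str] = None

@app.post("/evaluate")
def evaluate(req: EvaluationRequest) -> Dict[str, float]:
    # Toy scoring rule: reward outputs that contain the expected answer.
    score = 1.0 if req.correct_answer and req.correct_answer in req.output else 0.0
    # Assumed response shape: a JSON object with a numeric "score".
    return {"score": score}
```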
Evaluators' playground
Each evaluator comes with its own playground. For instance, in the screenshot below, the LLM-as-a-judge evaluator requires you to specify the prompt to use for the evaluation. You'll find detailed information about these parameters on each evaluator's documentation page.
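As an illustration, an evaluation prompt for LLM-as-a-judge usually describes the grading criteria and references the test-case fields through template variables. The placeholder names below (`{inputs}`, `{correct_answer}`, `{output}`) are assumptions for this sketch; use the variable names shown in the evaluator's own configuration.

```python
# Illustrative LLM-as-a-judge prompt template (placeholder names are assumed;
# substitute the variables exposed in the evaluator's configuration).
JUDGE_PROMPT = """You are evaluating the answer produced by an LLM application.

Question / inputs:
{inputs}

Expected answer:
{correct_answer}

Application output:
{output}

Rate the output from 0 to 10 for correctness and completeness.
Reply with the numeric score only."""
```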
The evaluator playground lets you test your evaluator with sample input to make sure it's configured correctly.
To use it, follow these steps:
- Load a test case from a test set
- Select a prompt and run it
- Run the evaluator to see the result
You can adjust the configuration until you are happy with the result. When finished, commit your changes.
Next steps
Explore the different evaluator types: