Testing Using Golden Test Set

Overview

The Golden Test Set feature makes it simple to evaluate how well your models are performing.

With this tool, you can compare your bot's responses to a set of ideal responses for different user queries. This helps you see how accurate your bot is, especially when it comes to the information stored in your content sources.

By using this feature, you can assess your model's performance before you deploy it in your Conversational AI bot, and make any necessary improvements beforehand so your bot is as effective as possible.
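
Conceptually, a golden test set is simply a collection of user questions paired with the ideal answers you expect the bot to derive from its content sources. The sketch below is illustrative only; the questions, answers, and field names are hypothetical, and in practice you manage these pairs through the UI steps described in the next section.

    # Illustrative only: a golden test set is conceptually a list of user
    # questions paired with the ideal ("golden") answers you expect the bot
    # to produce from its content sources. The questions, answers, and field
    # names below are hypothetical, not the product's storage format.
    golden_test_set = [
        {
            "question": "What is the refund window for online orders?",
            "ideal_answer": "Online orders can be refunded within 30 days of delivery.",
        },
        {
            "question": "Which regions does standard shipping cover?",
            "ideal_answer": "Standard shipping covers the US, Canada, and the EU.",
        },
    ]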

To Set Up the Golden Test Set

  1. On the Content Sources window, select the Open Golden Test Set option (the test tube icon) at the top to open the Golden Test Set Classification Report window.

  2. On the Golden Test Set Classification Report window, click Manage Set at the top.

  3. On the Manage Set window, click Add Question & Answer in the top right corner.

  4. On the Add Golden Set Question & Answer window, input the question and its corresponding ideal answer. Click Save to store the information. You can add as many questions and answers as needed.

  5. Navigate back to the Golden Test Set Classification Report window and click Calculate Performance at the top. The Gen AI Answer column is populated with AI-generated responses based on the imported content. Hover over each response to read it and evaluate its quality and relevance against the ideal answer you provided.

  6. You can also access the following scores at the top of the window (an illustrative sketch of the first two appears after this list):

    Retrieval Precision: Measures how much of the information retrieved by the AI model is actually relevant to the question.

    Content Recall: Measures how much of the relevant information the AI model correctly retrieves and returns.

    Context Retention: Assesses the AI model's ability to maintain consistent flow and depth throughout a conversation.

    Engagement Capability: Evaluates how effectively the AI model engages users, considering factors such as grammar, NLP metrics, and explainability.

    Faithfulness: Measures how accurately the AI model produces correct responses and the extent to which it generates incorrect or "hallucinated" responses.

    Adversarial Guardrail Score: Evaluates the AI model's robustness against jailbreak attempts and how effectively it upholds its guardrails.
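
The platform calculates these scores internally, so the exact formulas are not exposed. Purely as an illustration of the idea behind the first two metrics, the sketch below shows how retrieval precision and content recall could be computed from the overlap between the passages a model retrieves and the passages known to be relevant; all names and example values are hypothetical and do not reflect the product's actual implementation.

    # Conceptual sketch only: illustrates the general idea behind
    # retrieval-precision and content-recall style metrics using set overlap.
    # This does not reflect the platform's actual implementation, and all
    # names and example values are hypothetical.

    def retrieval_precision(retrieved: set, relevant: set) -> float:
        """Fraction of retrieved passages that are actually relevant."""
        if not retrieved:
            return 0.0
        return len(retrieved & relevant) / len(retrieved)

    def content_recall(retrieved: set, relevant: set) -> float:
        """Fraction of relevant passages that were actually retrieved."""
        if not relevant:
            return 0.0
        return len(retrieved & relevant) / len(relevant)

    # Example: the model retrieved three passages, two of which are relevant.
    retrieved = {"refund-policy", "shipping-faq", "press-release"}
    relevant = {"refund-policy", "shipping-faq"}
    print(retrieval_precision(retrieved, relevant))  # ~0.67
    print(content_recall(retrieved, relevant))       # 1.0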

What's Next?

Once you've thoroughly tested the model to your satisfaction, you're ready to deploy it within a Conversational AI application.