docs/ai/evaluation/evaluate-ai-response.md: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
 ---
 title: Quickstart - Evaluate the quality of a model's response
 description: Learn how to create an MSTest app to evaluate the AI chat response of a language model.
-ms.date: 03/03/2026
+ms.date: 04/09/2026
 ms.topic: quickstart
 ai-usage: ai-assisted
 ---
@@ -53,7 +53,7 @@ Complete the following steps to create an MSTest project that connects to an AI
 dotnet user-secrets set AZURE_TENANT_ID <your-tenant-ID>
 ```
 
-(Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
+(Depending on your environment, you might not need the tenant ID. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
 
 1. Open the new app in your editor of choice.
 
@@ -84,7 +84,7 @@ Complete the following steps to create an MSTest project that connects to an AI
 
 This method does the following:
 
-- Invokes the <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> to evaluate the *coherence* of the response. The <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator.EvaluateAsync(System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.ChatMessage},Microsoft.Extensions.AI.ChatResponse,Microsoft.Extensions.AI.Evaluation.ChatConfiguration,System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.Evaluation.EvaluationContext},System.Threading.CancellationToken)> method returns an <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult> that contains a <xref:Microsoft.Extensions.AI.Evaluation.NumericMetric>. A `NumericMetric` contains a numeric value that's typically used to represent numeric scores that fall within a well-defined range.
+- Invokes the <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> to evaluate the *coherence* of the response. The <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator.EvaluateAsync(System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.ChatMessage},Microsoft.Extensions.AI.ChatResponse,Microsoft.Extensions.AI.Evaluation.ChatConfiguration,System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.Evaluation.EvaluationContext},System.Threading.CancellationToken)> method returns an <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult> that contains a <xref:Microsoft.Extensions.AI.Evaluation.NumericMetric>. A `NumericMetric` contains a numeric value that typically represents numeric scores that fall within a well-defined range.
 - Retrieves the coherence score from the <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult>.
 - Validates the *default interpretation* for the returned coherence metric. Evaluators can include a default interpretation for the metrics they return. You can also change the default interpretation to suit your specific requirements, if needed.
 - Validates that no diagnostics are present on the returned coherence metric. Evaluators can include diagnostics on the metrics they return to indicate errors, warnings, or other exceptional conditions encountered during evaluation.
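For reviewers of this change, the test that the edited bullets describe can be sketched roughly as follows. This is a hedged sketch, not code from the diff: the variable names (`messages`, `response`, `chatConfiguration`) and the exact assertions are assumptions, and only the types and members named in the bullets (`CoherenceEvaluator`, `EvaluateAsync`, `EvaluationResult`, `NumericMetric`) come from the text.

```csharp
// Sketch of the coherence-evaluation test described above; the inputs
// (messages, response, chatConfiguration) are assumed to exist already.
IEvaluator coherenceEvaluator = new CoherenceEvaluator();

// EvaluateAsync returns an EvaluationResult that contains a NumericMetric.
EvaluationResult result = await coherenceEvaluator.EvaluateAsync(
    messages,            // conversation history: IEnumerable<ChatMessage>
    response,            // the ChatResponse being evaluated
    chatConfiguration);  // identifies the LLM used to perform the evaluation

// Retrieve the coherence score.
NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);

// Validate the default interpretation and check that no diagnostics are present.
Assert.IsFalse(coherence.Interpretation!.Failed);
Assert.IsFalse(coherence.ContainsDiagnostics());
```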
docs/ai/evaluation/evaluate-safety.md: 18 additions & 18 deletions
@@ -1,7 +1,7 @@
 ---
 title: Tutorial - Evaluate response safety with caching and reporting
 description: Create an MSTest app that evaluates the content safety of a model's response using the evaluators in the Microsoft.Extensions.AI.Evaluation.Safety package and with caching and reporting.
-ms.date: 03/03/2026
+ms.date: 04/09/2026
 ms.topic: tutorial
 ai-usage: ai-assisted
 ---
@@ -20,13 +20,13 @@ In this tutorial, you create an MSTest app to evaluate the *content safety* of a
 To provision an Azure OpenAI service and model using the Azure portal, complete the steps in the [Create and deploy an Azure OpenAI Service resource](/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) article. In the "Deploy a model" step, select the `gpt-5` model.
 
 > [!TIP]
-> The previous configuration step is only required to fetch the response to be evaluated. To evaluate the safety of a response you already have in hand, you can skip this configuration.
+> You only need the previous configuration step to fetch the response to evaluate. To evaluate the safety of a response you already have, skip this configuration.
 
 The evaluators in this tutorial use the Foundry Evaluation service, which requires some additional setup:
 
 - [Create a resource group](/azure/azure-resource-manager/management/manage-resource-groups-portal#create-resource-groups) within one of the Azure [regions that support Foundry Evaluation service](/azure/ai-foundry/how-to/develop/evaluate-sdk#region-support).
 - [Create a Foundry hub](/azure/ai-foundry/how-to/create-azure-ai-resource?tabs=portal#create-a-hub-in-azure-ai-foundry-portal) in the resource group you just created.
-- Finally, [create a Foundry project](/azure/ai-foundry/how-to/create-projects?tabs=ai-studio#create-a-project) in the hub you just created.
+- [Create a Foundry project](/azure/ai-foundry/how-to/create-projects?tabs=ai-studio#create-a-project) in the hub you just created.
 
 ## Create the test app
 
@@ -63,7 +63,7 @@ Complete the following steps to create an MSTest project.
 dotnet user-secrets set AZURE_AI_PROJECT <your-Azure-AI-project>
 ```
 
-(Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
+(Depending on your environment, you might not need the tenant ID. If so, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
 
 1. Open the new app in your editor of choice.
 
@@ -85,9 +85,9 @@ Complete the following steps to create an MSTest project.
 The [scenario name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun.ScenarioName) is set to the fully qualified name of the current test method. However, you can set it to any string of your choice. Here are some considerations for choosing a scenario name:
 
 - When using disk-based storage, the scenario name is used as the name of the folder under which the corresponding evaluation results are stored.
-- By default, the generated evaluation report splits scenario names on `.` so that the results can be displayed in a hierarchical view with appropriate grouping, nesting, and aggregation.
+- By default, the generated evaluation report splits scenario names on `.` so the report displays results in a hierarchical view with appropriate grouping, nesting, and aggregation.
 
-The [execution name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration.ExecutionName) is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>, all evaluation runs will use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next.
+The [execution name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration.ExecutionName) is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>, all evaluation runs use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next.
 
 1. Add a method to gather the safety evaluators to use in the evaluation.
 
@@ -97,26 +97,26 @@ Complete the following steps to create an MSTest project.
-1. Add a method that creates an <xref:Microsoft.Extensions.AI.IChatClient> object, which will be used to get the chat response to evaluate from the LLM.
+1. Add a method that creates an <xref:Microsoft.Extensions.AI.IChatClient> object, which gets the chat response to evaluate from the LLM.
 1. Set up the reporting functionality. Convert the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration> to a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration>, and then pass that to the method that creates a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>.
-Response caching functionality is supported and works the same way regardless of whether the evaluators talk to an LLM or to the Foundry Evaluation service. The response will be reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, is changed.
+Response caching works the same way regardless of whether the evaluators talk to an LLM or to the Foundry Evaluation service. The response is reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, changes.
 
 > [!NOTE]
-> This code example passes the LLM <xref:Microsoft.Extensions.AI.IChatClient> as `originalChatClient` to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.IChatClient)>. The reason to include the LLM chat client here is to enable getting a chat response from the LLM, and notably, to enable response caching for it. (If you don't want to cache the LLM's response, you can create a separate, local <xref:Microsoft.Extensions.AI.IChatClient> to fetch the response from the LLM.) Instead of passing a <xref:Microsoft.Extensions.AI.IChatClient>, if you already have a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> for an LLM from another reporting configuration, you can pass that instead, using the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)> overload.
+> This code example passes the LLM <xref:Microsoft.Extensions.AI.IChatClient> as `originalChatClient` to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.IChatClient)>. Including the LLM chat client here enables getting a chat response from the LLM and enables response caching for the response. (To skip caching the LLM's response, create a separate, local <xref:Microsoft.Extensions.AI.IChatClient> to fetch the response from the LLM.) Instead of passing a <xref:Microsoft.Extensions.AI.IChatClient>, if you already have a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> for an LLM from another reporting configuration, you can pass that instead, using the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)> overload.
 >
-> Similarly, if you configure both [LLM-based evaluators](libraries.md#quality-evaluators) and [Foundry Evaluation service–based evaluators](libraries.md#safety-evaluators) in the reporting configuration, you also need to pass the LLM <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)>. Then it returns a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> that can talk to both types of evaluators.
+> Similarly, if you configure both [LLM-based evaluators](libraries.md#quality-evaluators) and [Foundry Evaluation service–based evaluators](libraries.md#safety-evaluators) in the reporting configuration, you also need to pass the LLM <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)>. The method then returns a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> that can talk to both types of evaluators.
 
 1. Add a method to define the [chat options](xref:Microsoft.Extensions.AI.ChatOptions) and ask the model for a response to a given question.
-The test in this tutorial evaluates the LLM's response to an astronomy question. Since the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration> has response caching enabled, and since the supplied <xref:Microsoft.Extensions.AI.IChatClient> is always fetched from the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun> created using this reporting configuration, the LLM response for the test is cached and reused.
+The test in this tutorial evaluates the LLM's response to an astronomy question. Because the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration> has response caching enabled, and because the supplied <xref:Microsoft.Extensions.AI.IChatClient> is always fetched from the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun> created using this reporting configuration, the LLM response for the test gets cached and reused.
 
 1. Add a method to validate the response.
 
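The reporting setup this hunk describes (convert the `ContentSafetyServiceConfiguration` to a `ChatConfiguration`, then create a `ReportingConfiguration`) could be sketched roughly as follows. This is a hedged sketch under stated assumptions: `contentSafetyConfig`, `GetAzureOpenAIChatClient()`, `GetSafetyEvaluators()`, the storage path, and the execution name are hypothetical placeholders, not code from the diff.

```csharp
// Assumes contentSafetyConfig (a ContentSafetyServiceConfiguration) already exists.
// Pass the LLM chat client as originalChatClient so that fetching the LLM
// response, and response caching for it, are both supported.
ChatConfiguration chatConfig =
    contentSafetyConfig.ToChatConfiguration(
        originalChatClient: GetAzureOpenAIChatClient()); // hypothetical helper

// Create a disk-based reporting configuration with response caching enabled.
ReportingConfiguration reportingConfig = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./test-reports",       // placeholder path
    evaluators: GetSafetyEvaluators(),       // hypothetical helper from the steps above
    chatConfiguration: chatConfig,
    enableResponseCaching: true,
    executionName: "MyExecution");           // placeholder execution name
```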
@@ -129,16 +129,16 @@ Complete the following steps to create an MSTest project.
-- Creates the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun>. The use of `await using` ensures that the `ScenarioRun` is correctly disposed and that the results of this evaluation are correctly persisted to the result store.
-- Gets the LLM's response to a specific astronomy question. The same <xref:Microsoft.Extensions.AI.IChatClient> that will be used for evaluation is passed to the `GetAstronomyConversationAsync` method in order to get *response caching* for the primary LLM response being evaluated. (In addition, this enables response caching for the responses that the evaluators fetch from the Foundry Evaluation service as part of performing their evaluations.)
-- Runs the evaluators against the response. Like the LLM response, on subsequent runs, the evaluation is fetched from the (disk-based) response cache that was configured in `s_safetyReportingConfig`.
+- Creates the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun>. `await using` ensures that `ScenarioRun` is correctly disposed and that the evaluation results are correctly persisted to the result store.
+- Gets the LLM's response to a specific astronomy question. The test passes the same <xref:Microsoft.Extensions.AI.IChatClient> used for evaluation to `GetAstronomyConversationAsync` to enable *response caching* for the primary LLM response being evaluated. (In addition, passing the same <xref:Microsoft.Extensions.AI.IChatClient> enables response caching for the evaluator responses from the Foundry Evaluation service.)
+- Runs the evaluators against the response. Like the LLM response, subsequent runs fetch the evaluation from the (disk-based) response cache configured in `s_safetyReportingConfig`.
 - Runs some safety validation on the evaluation result.
 
 ## Run the test/evaluation
 
-Run the test using your preferred test workflow, for example, by using the CLI command `dotnet test` or through [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer).
+Run the test using your preferred test workflow—for example, by using the CLI command `dotnet test` or [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer).
 
 ## Generate a report
 
@@ -148,6 +148,6 @@ To generate a report to view the evaluation results, see [Generate a report](eva
 
 This tutorial covers the basics of evaluating content safety. As you create your test suite, consider the following next steps:
 
-- Configure additional evaluators, such as the [quality evaluators](libraries.md#quality-evaluators). For an example, see the AI samples repo [quality and safety evaluation example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example10_RunningQualityAndSafetyEvaluatorsTogether.cs).
+- Configure more evaluators, such as the [quality evaluators](libraries.md#quality-evaluators). For an example, see the AI samples repo [quality and safety evaluation example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example10_RunningQualityAndSafetyEvaluatorsTogether.cs).
 - Evaluate the content safety of generated images. For an example, see the AI samples repo [image response example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example09_RunningSafetyEvaluatorsAgainstResponsesWithImages.cs).
-- In real-world evaluations, you might not want to validate individual results, since the LLM responses and evaluation scores can vary over time as your product (and the models used) evolve. You might not want individual evaluation tests to fail and block builds in your CI/CD pipelines when this happens. Instead, in such cases, it might be better to rely on the generated report and track the overall trends for evaluation scores across different scenarios over time (and only fail individual builds in your CI/CD pipelines when there's a significant drop in evaluation scores across multiple different tests).
+- In real-world evaluations, you might not want to validate individual results, because the LLM responses and evaluation scores can vary over time as your product (and the models used) evolve. You might not want individual evaluation tests to fail and block builds in your CI/CD pipelines when evaluation scores change. Instead, consider relying on the generated report and tracking the overall trends for evaluation scores across different scenarios over time (and only failing individual builds in your CI/CD pipelines when there's a significant drop in evaluation scores across multiple different tests).
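Taken together, the test-method bullets edited in the `@@ -129,16 +129,16 @@` hunk could look roughly like this. This is a hedged sketch, not the tutorial's actual listing: the question string, the scenario/execution name values, and the exact validation are assumptions; `GetAstronomyConversationAsync` and `s_safetyReportingConfig` are the names the diff itself mentions.

```csharp
[TestMethod]
public async Task SampleAndEvaluateResponse()
{
    // Create the ScenarioRun; `await using` ensures the run is disposed
    // and the evaluation results are persisted to the result store.
    await using ScenarioRun scenarioRun =
        await s_safetyReportingConfig.CreateScenarioRunAsync(
            "TestSafety.SampleAndEvaluateResponse");

    // Fetch the chat client from the scenario run so that the response
    // caching configured on the reporting configuration applies.
    IChatClient chatClient = scenarioRun.ChatConfiguration!.ChatClient;

    // Get the LLM's response to a specific astronomy question (cached on reruns).
    (IList<ChatMessage> messages, ChatResponse response) =
        await GetAstronomyConversationAsync(
            chatClient, "How far is the Moon from the Earth?"); // placeholder question

    // Run the configured safety evaluators against the response; on
    // subsequent runs this is served from the disk-based response cache.
    EvaluationResult result = await scenarioRun.EvaluateAsync(messages, response);

    // Sample validation: no evaluator reported error-level diagnostics.
    Assert.IsFalse(result.ContainsDiagnostics(
        d => d.Severity >= EvaluationDiagnosticSeverity.Error));
}
```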