Commit 7542bf6

AI freshness April: update evaluation and tokenizer docs (#52961)
1 parent 29cc987

6 files changed

Lines changed: 57 additions & 60 deletions

File tree

docs/ai/evaluation/evaluate-ai-response.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,7 +1,7 @@
 ---
 title: Quickstart - Evaluate the quality of a model's response
 description: Learn how to create an MSTest app to evaluate the AI chat response of a language model.
-ms.date: 03/03/2026
+ms.date: 04/09/2026
 ms.topic: quickstart
 ai-usage: ai-assisted
 ---
```
```diff
@@ -53,7 +53,7 @@ Complete the following steps to create an MSTest project that connects to an AI
    dotnet user-secrets set AZURE_TENANT_ID <your-tenant-ID>
    ```
 
-   (Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
+   (Depending on your environment, you might not need the tenant ID. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
 
 1. Open the new app in your editor of choice.
```
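As a hedged illustration of the tenant-ID guidance in the hunk above (this is not the tutorial's snippet; `MyTests` and the configuration plumbing are assumptions based on the `Azure.Identity` and user-secrets packages):

```csharp
using Azure.Identity;
using Microsoft.Extensions.Configuration;

// Sketch only: pass the tenant ID stored via user-secrets to the credential.
// If your environment doesn't need the tenant ID, the parameterless
// 'new DefaultAzureCredential()' suffices.
IConfigurationRoot config = new ConfigurationBuilder()
    .AddUserSecrets<MyTests>()
    .Build();

var credential = new DefaultAzureCredential(
    new DefaultAzureCredentialOptions
    {
        TenantId = config["AZURE_TENANT_ID"],
    });
```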
```diff
@@ -84,7 +84,7 @@ Complete the following steps to create an MSTest project that connects to an AI
 
 This method does the following:
 
-- Invokes the <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> to evaluate the *coherence* of the response. The <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator.EvaluateAsync(System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.ChatMessage},Microsoft.Extensions.AI.ChatResponse,Microsoft.Extensions.AI.Evaluation.ChatConfiguration,System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.Evaluation.EvaluationContext},System.Threading.CancellationToken)> method returns an <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult> that contains a <xref:Microsoft.Extensions.AI.Evaluation.NumericMetric>. A `NumericMetric` contains a numeric value that's typically used to represent numeric scores that fall within a well-defined range.
+- Invokes the <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> to evaluate the *coherence* of the response. The <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator.EvaluateAsync(System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.ChatMessage},Microsoft.Extensions.AI.ChatResponse,Microsoft.Extensions.AI.Evaluation.ChatConfiguration,System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.Evaluation.EvaluationContext},System.Threading.CancellationToken)> method returns an <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult> that contains a <xref:Microsoft.Extensions.AI.Evaluation.NumericMetric>. A `NumericMetric` contains a numeric value that typically represents numeric scores that fall within a well-defined range.
 - Retrieves the coherence score from the <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult>.
 - Validates the *default interpretation* for the returned coherence metric. Evaluators can include a default interpretation for the metrics they return. You can also change the default interpretation to suit your specific requirements, if needed.
 - Validates that no diagnostics are present on the returned coherence metric. Evaluators can include diagnostics on the metrics they return to indicate errors, warnings, or other exceptional conditions encountered during evaluation.
```
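The evaluator flow the bullets above describe might be sketched as follows (a sketch, not the article's snippet; `chatConfig`, `messages`, and `response` are assumed to already exist inside an async test method):

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Sketch: evaluate the coherence of a chat response.
IEvaluator coherenceEvaluator = new CoherenceEvaluator();

EvaluationResult result = await coherenceEvaluator.EvaluateAsync(
    messages, response, chatConfig);

// EvaluateAsync returns an EvaluationResult containing a NumericMetric
// whose value falls within a well-defined range.
NumericMetric coherence =
    result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);

// Per the last two bullets: check the default interpretation and
// confirm the evaluator reported no diagnostics.
bool failed = coherence.Interpretation?.Failed ?? false;
bool hasDiagnostics = coherence.ContainsDiagnostics();
```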

docs/ai/evaluation/evaluate-safety.md

Lines changed: 18 additions & 18 deletions
```diff
@@ -1,7 +1,7 @@
 ---
 title: Tutorial - Evaluate response safety with caching and reporting
 description: Create an MSTest app that evaluates the content safety of a model's response using the evaluators in the Microsoft.Extensions.AI.Evaluation.Safety package and with caching and reporting.
-ms.date: 03/03/2026
+ms.date: 04/09/2026
 ms.topic: tutorial
 ai-usage: ai-assisted
 ---
```
```diff
@@ -20,13 +20,13 @@ In this tutorial, you create an MSTest app to evaluate the *content safety* of a
 To provision an Azure OpenAI service and model using the Azure portal, complete the steps in the [Create and deploy an Azure OpenAI Service resource](/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) article. In the "Deploy a model" step, select the `gpt-5` model.
 
 > [!TIP]
-> The previous configuration step is only required to fetch the response to be evaluated. To evaluate the safety of a response you already have in hand, you can skip this configuration.
+> You only need the previous configuration step to fetch the response to evaluate. To evaluate the safety of a response you already have, skip this configuration.
 
 The evaluators in this tutorial use the Foundry Evaluation service, which requires some additional setup:
 
 - [Create a resource group](/azure/azure-resource-manager/management/manage-resource-groups-portal#create-resource-groups) within one of the Azure [regions that support Foundry Evaluation service](/azure/ai-foundry/how-to/develop/evaluate-sdk#region-support).
 - [Create a Foundry hub](/azure/ai-foundry/how-to/create-azure-ai-resource?tabs=portal#create-a-hub-in-azure-ai-foundry-portal) in the resource group you just created.
-- Finally, [create a Foundry project](/azure/ai-foundry/how-to/create-projects?tabs=ai-studio#create-a-project) in the hub you just created.
+- [Create a Foundry project](/azure/ai-foundry/how-to/create-projects?tabs=ai-studio#create-a-project) in the hub you just created.
 
 ## Create the test app
```
```diff
@@ -63,7 +63,7 @@ Complete the following steps to create an MSTest project.
    dotnet user-secrets set AZURE_AI_PROJECT <your-Azure-AI-project>
    ```
 
-   (Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
+   (Depending on your environment, you might not need the tenant ID. If so, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
 
 1. Open the new app in your editor of choice.
```
```diff
@@ -85,9 +85,9 @@ Complete the following steps to create an MSTest project.
    The [scenario name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun.ScenarioName) is set to the fully qualified name of the current test method. However, you can set it to any string of your choice. Here are some considerations for choosing a scenario name:
 
    - When using disk-based storage, the scenario name is used as the name of the folder under which the corresponding evaluation results are stored.
-   - By default, the generated evaluation report splits scenario names on `.` so that the results can be displayed in a hierarchical view with appropriate grouping, nesting, and aggregation.
+   - By default, the generated evaluation report splits scenario names on `.` so the report displays results in a hierarchical view with appropriate grouping, nesting, and aggregation.
 
-   The [execution name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration.ExecutionName) is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>, all evaluation runs will use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next.
+   The [execution name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration.ExecutionName) is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>, all evaluation runs use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next.
 
 1. Add a method to gather the safety evaluators to use in the evaluation.
```
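The scenario-name and execution-name considerations above could look like this in code (a hedged sketch, not the tutorial's snippet; the storage path, `GetSafetyEvaluators`, and `chatConfig` are assumptions, and the exact `Create` parameters may differ by package version):

```csharp
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

// Sketch: a unique execution name groups all results from one test run.
// Reusing the default name ("Default") would overwrite results on each run.
string executionName = $"{DateTime.Now:yyyyMMddTHHmmss}";

ReportingConfiguration reportingConfig = DiskBasedReportingConfiguration.Create(
    storageRootPath: "C:\\TestReports",  // scenario names become folder names here
    evaluators: GetSafetyEvaluators(),   // assumed helper from the tutorial's steps
    chatConfiguration: chatConfig,       // assumed to exist
    enableResponseCaching: true,
    executionName: executionName);
```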
```diff
@@ -97,26 +97,26 @@ Complete the following steps to create an MSTest project.
 
    :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ServiceConfig":::
 
-1. Add a method that creates an <xref:Microsoft.Extensions.AI.IChatClient> object, which will be used to get the chat response to evaluate from the LLM.
+1. Add a method that creates an <xref:Microsoft.Extensions.AI.IChatClient> object, which gets the chat response to evaluate from the LLM.
 
    :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ChatClient":::
 
 1. Set up the reporting functionality. Convert the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration> to a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration>, and then pass that to the method that creates a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>.
 
    :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ReportingConfig":::
 
-   Response caching functionality is supported and works the same way regardless of whether the evaluators talk to an LLM or to the Foundry Evaluation service. The response will be reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, is changed.
+   Response caching works the same way regardless of whether the evaluators talk to an LLM or to the Foundry Evaluation service. The response is reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, changes.
 
    > [!NOTE]
-   > This code example passes the LLM <xref:Microsoft.Extensions.AI.IChatClient> as `originalChatClient` to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.IChatClient)>. The reason to include the LLM chat client here is to enable getting a chat response from the LLM, and notably, to enable response caching for it. (If you don't want to cache the LLM's response, you can create a separate, local <xref:Microsoft.Extensions.AI.IChatClient> to fetch the response from the LLM.) Instead of passing a <xref:Microsoft.Extensions.AI.IChatClient>, if you already have a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> for an LLM from another reporting configuration, you can pass that instead, using the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)> overload.
+   > This code example passes the LLM <xref:Microsoft.Extensions.AI.IChatClient> as `originalChatClient` to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.IChatClient)>. Including the LLM chat client here enables getting a chat response from the LLM and enables response caching for the response. (To skip caching the LLM's response, create a separate, local <xref:Microsoft.Extensions.AI.IChatClient> to fetch the response from the LLM.) Instead of passing a <xref:Microsoft.Extensions.AI.IChatClient>, if you already have a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> for an LLM from another reporting configuration, you can pass that instead, using the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)> overload.
    >
-   > Similarly, if you configure both [LLM-based evaluators](libraries.md#quality-evaluators) and [Foundry Evaluation service&ndash;based evaluators](libraries.md#safety-evaluators) in the reporting configuration, you also need to pass the LLM <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)>. Then it returns a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> that can talk to both types of evaluators.
+   > Similarly, if you configure both [LLM-based evaluators](libraries.md#quality-evaluators) and [Foundry Evaluation service&ndash;based evaluators](libraries.md#safety-evaluators) in the reporting configuration, you also need to pass the LLM <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)>. The method then returns a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> that can talk to both types of evaluators.
 
 1. Add a method to define the [chat options](xref:Microsoft.Extensions.AI.ChatOptions) and ask the model for a response to a given question.
 
    :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="GetResponse":::
 
-   The test in this tutorial evaluates the LLM's response to an astronomy question. Since the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration> has response caching enabled, and since the supplied <xref:Microsoft.Extensions.AI.IChatClient> is always fetched from the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun> created using this reporting configuration, the LLM response for the test is cached and reused.
+   The test in this tutorial evaluates the LLM's response to an astronomy question. Because the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration> has response caching enabled, and because the supplied <xref:Microsoft.Extensions.AI.IChatClient> is always fetched from the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun> created using this reporting configuration, the LLM response for the test gets cached and reused.
 
 1. Add a method to validate the response.
```
```diff
@@ -129,16 +129,16 @@ Complete the following steps to create an MSTest project.
 
    :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="TestMethod":::
 
-   This test method:
+   The test method:
 
-   - Creates the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun>. The use of `await using` ensures that the `ScenarioRun` is correctly disposed and that the results of this evaluation are correctly persisted to the result store.
-   - Gets the LLM's response to a specific astronomy question. The same <xref:Microsoft.Extensions.AI.IChatClient> that will be used for evaluation is passed to the `GetAstronomyConversationAsync` method in order to get *response caching* for the primary LLM response being evaluated. (In addition, this enables response caching for the responses that the evaluators fetch from the Foundry Evaluation service as part of performing their evaluations.)
-   - Runs the evaluators against the response. Like the LLM response, on subsequent runs, the evaluation is fetched from the (disk-based) response cache that was configured in `s_safetyReportingConfig`.
+   - Creates the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun>. `await using` ensures that `ScenarioRun` is correctly disposed and that the evaluation results are correctly persisted to the result store.
+   - Gets the LLM's response to a specific astronomy question. The test passes the same <xref:Microsoft.Extensions.AI.IChatClient> used for evaluation to `GetAstronomyConversationAsync` to enable *response caching* for the primary LLM response being evaluated. (In addition, passing the same <xref:Microsoft.Extensions.AI.IChatClient> enables response caching for the evaluator responses from the Foundry Evaluation service.)
+   - Runs the evaluators against the response. Like the LLM response, subsequent runs fetch the evaluation from the (disk-based) response cache configured in `s_safetyReportingConfig`.
    - Runs some safety validation on the evaluation result.
 
 ## Run the test/evaluation
 
-Run the test using your preferred test workflow, for example, by using the CLI command `dotnet test` or through [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer).
+Run the test using your preferred test workflow, for example, by using the CLI command `dotnet test` or [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer).
 
 ## Generate a report
```
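The test-method shape those bullets describe might be sketched as follows (a hedged sketch, not the tutorial's `TestMethod` snippet; `s_safetyReportingConfig`, `GetAstronomyConversationAsync`, and `ValidateSafety` are names assumed from the surrounding steps):

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestMethod]
public async Task EvaluateResponseSafety()
{
    // 'await using' disposes the ScenarioRun and persists results to the store.
    await using ScenarioRun scenarioRun =
        await s_safetyReportingConfig.CreateScenarioRunAsync(
            "MyTests.EvaluateResponseSafety");

    // Use the scenario run's chat client so the LLM response is cached too.
    (IList<ChatMessage> messages, ChatResponse response) =
        await GetAstronomyConversationAsync(
            scenarioRun.ChatConfiguration!.ChatClient);

    // On later runs, this comes from the disk-based cache, not the service.
    EvaluationResult result = await scenarioRun.EvaluateAsync(messages, response);

    ValidateSafety(result);
}
```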
```diff
@@ -148,6 +148,6 @@ To generate a report to view the evaluation results, see [Generate a report](eva
 
 This tutorial covers the basics of evaluating content safety. As you create your test suite, consider the following next steps:
 
-- Configure additional evaluators, such as the [quality evaluators](libraries.md#quality-evaluators). For an example, see the AI samples repo [quality and safety evaluation example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example10_RunningQualityAndSafetyEvaluatorsTogether.cs).
+- Configure more evaluators, such as the [quality evaluators](libraries.md#quality-evaluators). For an example, see the AI samples repo [quality and safety evaluation example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example10_RunningQualityAndSafetyEvaluatorsTogether.cs).
 - Evaluate the content safety of generated images. For an example, see the AI samples repo [image response example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example09_RunningSafetyEvaluatorsAgainstResponsesWithImages.cs).
-- In real-world evaluations, you might not want to validate individual results, since the LLM responses and evaluation scores can vary over time as your product (and the models used) evolve. You might not want individual evaluation tests to fail and block builds in your CI/CD pipelines when this happens. Instead, in such cases, it might be better to rely on the generated report and track the overall trends for evaluation scores across different scenarios over time (and only fail individual builds in your CI/CD pipelines when there's a significant drop in evaluation scores across multiple different tests).
+- In real-world evaluations, you might not want to validate individual results, because the LLM responses and evaluation scores can vary over time as your product (and the models used) evolve. You might not want individual evaluation tests to fail and block builds in your CI/CD pipelines when evaluation scores change. Instead, consider relying on the generated report and tracking the overall trends for evaluation scores across different scenarios over time (and only failing individual builds in your CI/CD pipelines when there's a significant drop in evaluation scores across multiple different tests).
```
