Commit bb92c41
Add more evaluation examples
1 parent fc4de28 commit bb92c41

18 files changed

Lines changed: 1164 additions & 313 deletions

.devcontainer/docker-compose.yml

Lines changed: 2 additions & 2 deletions
@@ -22,13 +22,13 @@ services:
       POSTGRES_USER: admin
       POSTGRES_PASSWORD: LocalPasswordOnly
     ports:
-      - "5432:5432"
+      - "5433:5432"

   redis:
     image: redis/redis-stack-server:latest
     restart: unless-stopped
     ports:
-      - '6379:6379'
+      - '6380:6379'

   aspire-dashboard:
     image: mcr.microsoft.com/dotnet/aspire-dashboard:latest
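Since the host ports changed, local tooling must target the new mappings (5433 for Postgres, 6380 for Redis). A minimal connection check, a sketch assuming the `psycopg` and `redis` packages and the dev credentials from the compose file; the database name `postgres` is an assumption, not part of this commit:

```python
# Sanity check for the remapped dev-container ports (sketch, not part of
# this commit). Credentials come from docker-compose.yml above; the
# database name "postgres" is an assumption.
import psycopg
import redis

with psycopg.connect(
    host="localhost",
    port=5433,  # host port, maps to the container's 5432
    user="admin",
    password="LocalPasswordOnly",
    dbname="postgres",
) as conn:
    print(conn.execute("SELECT version()").fetchone())

r = redis.Redis(host="localhost", port=6380)  # maps to the container's 6379
print(r.ping())
```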

AGENTS.md

Lines changed: 58 additions & 0 deletions
@@ -14,6 +14,15 @@ MAF documentation is available on Microsoft Learn here:
 https://learn.microsoft.com/agent-framework/
 When available, the MS Learn MCP server can be used to explore the documentation, ask questions, and get code examples.

+## Package management
+
+This project uses [uv](https://docs.astral.sh/uv/) for dependency management. Use `uv` commands instead of `pip`:
+
+```bash
+uv add <package>
+uv sync
+```
+
 ## Spanish translations

 There are Spanish equivalents of each example in /examples/spanish.
@@ -32,3 +41,52 @@ Each example .py file should have a corresponding _spanish.py file that is the t
 Use informal (tuteo) LATAM Spanish, tu not usted, puedes not podes, etc. The content is technical so if a word is best kept in English, then do so.

 The /examples/spanish/README.md corresponds to the root README.md and should be kept in sync with it, but translated to Spanish.
+
+## Debugging Azure Python SDK HTTP requests
+
+When debugging HTTP interactions between Azure Python SDKs (like `azure-ai-evaluation`) and Azure services, there are three levels of logging you can enable:
+
+### 1. Azure SDK logger (request headers and URLs)
+
+Set the Azure SDK loggers to DEBUG level to see request URLs, headers, and status codes:
+
+```python
+import logging
+
+logging.basicConfig(level=logging.WARNING)
+logging.getLogger("azure").setLevel(logging.DEBUG)
+logging.getLogger("azure.core.pipeline.policies.http_logging_policy").setLevel(logging.DEBUG)
+```
+
+### 2. Raw HTTP wire data (request/response headers)
+
+Enable `http.client` debug logging to see the raw HTTP wire protocol, including request and response headers:
+
+```python
+import http.client
+
+http.client.HTTPConnection.debuglevel = 1
+```
+
+Note: Response bodies will typically not be visible at this level because Azure SDKs use gzip compression, and `http.client` logs the raw compressed bytes.
+
+### 3. Decompressed response bodies
+
+To see actual response bodies, monkey-patch the Azure SDK's `HttpLoggingPolicy.on_response` method. This works because `response.http_response.body()` returns the decompressed content:
+
+```python
+import logging
+
+from azure.core.pipeline.policies import HttpLoggingPolicy
+
+_original_on_response = HttpLoggingPolicy.on_response
+
+def _on_response_with_body(self, request, response):
+    # Run the SDK's normal response logging first, then append the body.
+    _original_on_response(self, request, response)
+    try:
+        body = response.http_response.body()
+        if body:
+            _logger = logging.getLogger("azure.core.pipeline.policies.http_logging_policy")
+            _logger.debug("Response body: %s", body[:4096].decode("utf-8", errors="replace"))
+    except Exception:
+        pass
+
+HttpLoggingPolicy.on_response = _on_response_with_body
+```
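For convenience, the three levels documented above can be wired up in a single call. This is a sketch, not part of the commit; the function name `enable_azure_http_debug` is hypothetical and the body just mirrors the AGENTS.md snippets:

```python
# Sketch: combine all three debug levels from AGENTS.md in one helper.
# The helper name is hypothetical, not part of this commit.
import http.client
import logging

from azure.core.pipeline.policies import HttpLoggingPolicy


def enable_azure_http_debug() -> None:
    # Level 1: Azure SDK loggers at DEBUG.
    logging.basicConfig(level=logging.WARNING)
    logging.getLogger("azure").setLevel(logging.DEBUG)
    logging.getLogger("azure.core.pipeline.policies.http_logging_policy").setLevel(logging.DEBUG)

    # Level 2: raw HTTP wire data (headers only; bodies stay gzip-compressed).
    http.client.HTTPConnection.debuglevel = 1

    # Level 3: decompressed response bodies via the monkey-patch above.
    original = HttpLoggingPolicy.on_response

    def on_response_with_body(self, request, response):
        original(self, request, response)
        try:
            body = response.http_response.body()
            if body:
                logging.getLogger("azure.core.pipeline.policies.http_logging_policy").debug(
                    "Response body: %s", body[:4096].decode("utf-8", errors="replace")
                )
        except Exception:
            pass

    HttpLoggingPolicy.on_response = on_response_with_body
```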

README.md

Lines changed: 6 additions & 0 deletions
@@ -139,6 +139,12 @@ This project includes infrastructure as code (IaC) to provision Azure OpenAI dep
    azd auth login --use-device-code
    ```

+   If you are using a tenant besides the default tenant, you may need to also log in to that tenant with the Azure CLI:
+
+   ```shell
+   az login --tenant your-tenant-id
+   ```
+
 3. Provision the OpenAI account:

    ```shell
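After logging in, the Python examples pick up the CLI credential through `azure-identity`. A quick sanity check, a sketch assuming the `azure-identity` package is installed; the Cognitive Services token scope shown is standard but not taken from this commit:

```python
# Sketch (not part of this commit): verify the CLI login is visible to
# azure-identity before running the examples.
from azure.identity import AzureCliCredential

credential = AzureCliCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")
print("Token acquired; expires at (epoch):", token.expires_on)
```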

examples/agent_evaluation.py

Lines changed: 30 additions & 96 deletions
@@ -2,10 +2,9 @@
 import json
 import logging
 import os
-import tempfile
 from typing import Annotated

-from agent_framework import ChatAgent, tool
+from agent_framework import Agent, tool
 from agent_framework.openai import OpenAIChatClient
 from azure.ai.evaluation import (
     AzureOpenAIModelConfiguration,
@@ -14,7 +13,6 @@
     ResponseCompletenessEvaluator,
     TaskAdherenceEvaluator,
     ToolCallAccuracyEvaluator,
-    evaluate,
 )
 from azure.identity.aio import DefaultAzureCredential, get_bearer_token_provider
 from dotenv import load_dotenv
@@ -68,10 +66,6 @@
     model=os.environ.get("OPENAI_MODEL", "gpt-5-mini"),
 )

-# Optional: Set AZURE_AI_PROJECT in .env to log results to Azure AI Foundry.
-# Example: https://your-account.services.ai.azure.com/api/projects/your-project
-AZURE_AI_PROJECT = os.getenv("AZURE_AI_PROJECT")
-

 @tool
 def get_weather(
@@ -183,9 +177,8 @@ def estimate_budget(
     "within the user's budget. Include weather information to help with packing."
 )

-agent = ChatAgent(
-    name="travel-planner",
-    chat_client=client,
+agent = Agent(
+    client=client,
     instructions=AGENT_INSTRUCTIONS,
     tools=tools,
 )
@@ -269,7 +262,7 @@ def display_evaluation_results(results: dict[str, dict]) -> None:


 async def main():
-    query = "Plan a 3-day trip from New York to Tokyo next month on a $2000 budget. I like hiking and museums."
+    query = "Plan a 3-day trip from New York (JFK) to Tokyo, departing March 15 and returning March 18, 2026. My budget is $2000 total. I like hiking and museums. Please search for flights, hotels under $150/night, check the weather, and suggest activities."

     logger.info("Running travel planner agent...")
     response = await agent.run(query)
@@ -298,94 +291,35 @@ async def main():
         "ToolCallAccuracy": "tool_call_accuracy",
     }

-    if AZURE_AI_PROJECT:
-        logger.info(f"Logging evaluation results to Azure AI project: {AZURE_AI_PROJECT}")
+    intent_evaluator = IntentResolutionEvaluator(**evaluator_kwargs)
+    completeness_evaluator = ResponseCompletenessEvaluator(**evaluator_kwargs)
+    adherence_evaluator = TaskAdherenceEvaluator(**evaluator_kwargs)
+    tool_accuracy_evaluator = ToolCallAccuracyEvaluator(**evaluator_kwargs)

-        eval_data_row = {
-            "query": eval_query,
-            "response": eval_response,
-            "response_text": response.text,
-            "ground_truth": ground_truth,
-            "tool_definitions": tool_definitions,
-        }
+    intent_result = intent_evaluator(query=eval_query, response=eval_response, tool_definitions=tool_definitions)
+    completeness_result = completeness_evaluator(response=response.text, ground_truth=ground_truth)
+    adherence_result = adherence_evaluator(
+        query=eval_query, response=eval_response, tool_definitions=tool_definitions
+    )
+    tool_accuracy_result = tool_accuracy_evaluator(
+        query=eval_query, response=eval_response, tool_definitions=tool_definitions
+    )

-        with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False, encoding="utf-8") as f:
-            f.write(json.dumps(eval_data_row) + "\n")
-            eval_data_file = f.name
-
-        try:
-            eval_result = evaluate(
-                data=eval_data_file,
-                evaluation_name="travel-planner-agent-eval",
-                evaluators={
-                    "intent_resolution": IntentResolutionEvaluator(**evaluator_kwargs),
-                    "response_completeness": ResponseCompletenessEvaluator(**evaluator_kwargs),
-                    "task_adherence": TaskAdherenceEvaluator(**evaluator_kwargs),
-                    "tool_call_accuracy": ToolCallAccuracyEvaluator(**evaluator_kwargs),
-                },
-                # ResponseCompletenessEvaluator expects a plain text response, not a message list,
-                # so we override its column mapping to use response_text and ground_truth.
-                # Other evaluators auto-map correctly since data keys match param names.
-                evaluator_config={
-                    "response_completeness": {
-                        "column_mapping": {
-                            "response": "${data.response_text}",
-                            "ground_truth": "${data.ground_truth}",
-                        }
-                    },
-                },
-                azure_ai_project=AZURE_AI_PROJECT,
-            )
-
-            # Parse results from the batch evaluate() output
-            evaluation_results = {}
-            rows = eval_result.get("rows", [])
-            row = rows[0] if rows else {}
-
-            for display_name, key in result_keys.items():
-                evaluation_results[display_name] = {
-                    "score": row.get(f"outputs.{key}.{key}", "N/A"),
-                    "result": row.get(f"outputs.{key}.{key}_result", "N/A"),
-                    "reason": row.get(f"outputs.{key}.{key}_reason", "N/A"),
-                }
-
-            display_evaluation_results(evaluation_results)
-
-            studio_url = eval_result.get("studio_url")
-            if studio_url:
-                print(f"\n[bold blue]View results in Azure AI Foundry:[/bold blue] {studio_url}")
-        finally:
-            os.unlink(eval_data_file)
-    else:
-        intent_evaluator = IntentResolutionEvaluator(**evaluator_kwargs)
-        completeness_evaluator = ResponseCompletenessEvaluator(**evaluator_kwargs)
-        adherence_evaluator = TaskAdherenceEvaluator(**evaluator_kwargs)
-        tool_accuracy_evaluator = ToolCallAccuracyEvaluator(**evaluator_kwargs)
-
-        intent_result = intent_evaluator(query=eval_query, response=eval_response, tool_definitions=tool_definitions)
-        completeness_result = completeness_evaluator(response=response.text, ground_truth=ground_truth)
-        adherence_result = adherence_evaluator(
-            query=eval_query, response=eval_response, tool_definitions=tool_definitions
-        )
-        tool_accuracy_result = tool_accuracy_evaluator(
-            query=eval_query, response=eval_response, tool_definitions=tool_definitions
-        )
+    evaluation_results = {}
+    for name, result in [
+        ("IntentResolution", intent_result),
+        ("ResponseCompleteness", completeness_result),
+        ("TaskAdherence", adherence_result),
+        ("ToolCallAccuracy", tool_accuracy_result),
+    ]:
+        key = result_keys[name]
+        evaluation_results[name] = {
+            "score": result.get(key, "N/A"),
+            "result": result.get(f"{key}_result", "N/A"),
+            "reason": result.get(f"{key}_reason", result.get("error_message", "N/A")),
+        }

-        evaluation_results = {}
-        for name, result in [
-            ("IntentResolution", intent_result),
-            ("ResponseCompleteness", completeness_result),
-            ("TaskAdherence", adherence_result),
-            ("ToolCallAccuracy", tool_accuracy_result),
-        ]:
-            key = result_keys[name]
-            evaluation_results[name] = {
-                "score": result.get(key, "N/A"),
-                "result": result.get(f"{key}_result", "N/A"),
-                "reason": result.get(f"{key}_reason", result.get("error_message", "N/A")),
-            }
-
-        display_evaluation_results(evaluation_results)
+    display_evaluation_results(evaluation_results)

     if async_credential:
         await async_credential.close()
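The simplified flow above calls each evaluator directly with `**evaluator_kwargs`, whose definition sits outside the changed hunks. For orientation, a typical construction with `azure-ai-evaluation` looks like this sketch; the environment variable names and API version are assumptions, not taken from this repo:

```python
# Sketch (not from this commit): one plausible shape for evaluator_kwargs.
# AzureOpenAIModelConfiguration and the model_config parameter are part of
# azure-ai-evaluation; the env var names here are assumptions.
import os

from azure.ai.evaluation import AzureOpenAIModelConfiguration, IntentResolutionEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    api_version="2024-12-01-preview",
)
evaluator_kwargs = {"model_config": model_config}

evaluator = IntentResolutionEvaluator(**evaluator_kwargs)
```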
