Pull requests: strands-agents/evals

Pull requests list

fix(redteam): separate errored attacks from breaches in ASR
Labels: area-redteam (Red teaming: adversarial generation, attack strategies, attack success evaluation); bug (Something isn't working)
#296 opened Jul 2, 2026 by kevmyung (Contributor)
7 of 9 tasks
fix(redteam): rename structured-output models off leading underscore
Labels: area-redteam; bug
#294 opened Jul 1, 2026 by kevmyung (Contributor)
8 of 9 tasks
fix(redteam): reset target session between PAIR/SequentialBreak iterations
Labels: area-redteam; bug
#292 opened Jul 1, 2026 by kevmyung (Contributor)
9 tasks done
feat(redteam): redesign risk categories for agent-centric evaluation
Labels: area-redteam; enhancement (New feature or request)
#290 opened Jul 1, 2026 by kevmyung (Contributor)
9 tasks done
ci: update opentelemetry-instrumentation-langchain requirement from <0.62.0,>=0.40.0 to >=0.40.0,<0.63.0
Labels: area-community (Repo health, governance, contributor process, release process, and CI dependency bumps); chore (Maintenance tasks, dependency updates, CI changes, refactoring with no user-facing impact); dependencies (Pull requests that update a dependency file); python (Pull requests that update python code)
#288 opened Jun 29, 2026 by dependabot (Bot)
feat: map type labels to native issue type
Labels: area-community; enhancement
#287 opened Jun 26, 2026 by yonib05 (Member)
fix(ci): sha-pin third-party GitHub Actions
Labels: area-community; chore
#285 opened Jun 25, 2026 by max-rattray-aws
feat: add P1 model-output corruption effects
Labels: area-chaos (Chaos/fault injection: experiments, recovery strategy, partial completion, failure communication); enhancement
#284 opened Jun 24, 2026 by venkatkrish543re
1 of 9 tasks
ci: update langfuse requirement from <4,>=2.0.0 to >=2.0.0,<5
Labels: area-community; chore; dependencies; python
#283 opened Jun 24, 2026 by dependabot (Bot)
ci: update mypy requirement from <2.0.0,>=1.15.0 to >=1.15.0,<3.0.0
Labels: area-community; chore; dependencies; python
#282 opened Jun 24, 2026 by dependabot (Bot)
chore(detectors): added async detectors execution (WIP)
Labels: area-detectors (Failure detection and root cause analysis of agent sessions); chore
#277 opened Jun 17, 2026 by poshinchen (Contributor)
7 of 9 tasks
feat(experiment): add verbosity-aware stack traces to evaluation error reasons
Labels: area-core (Core eval framework: Case, Experiment, task handler, evaluation data stores); area-devx (Developer experience: papercuts, confusing public APIs, error messages, ergonomics, usability); enhancement
#268 opened Jun 15, 2026 by AndyMc629
9 tasks done
Feature/skills aggregator: add SkillEvalAggregator for batch evaluation comparisons
Labels: area-evaluators (Evaluators: output, trajectory, tool use, interactions, and LLM-as-judge quality metrics); enhancement
#229 opened May 14, 2026 by venkatkrish543re Draft
feat: Add EvaluationPlugin for agent invocation evaluation and retry
Labels: area-core; area-evaluators; enhancement
#166 opened Mar 18, 2026 by afarntrog (Contributor)
5 of 7 tasks
feat: add OTel test semantic convention attributes to Experiment spans
Labels: area-core; area-tracing (Trace/session ingestion: providers, session mappers, extractors, telemetry/OTEL); enhancement
#131 opened Feb 10, 2026 by anirudha Draft
feat: Optional Case specific Goal for GoalSuccessRateEvaluator
Labels: area-evaluators; enhancement
#75 opened Dec 17, 2025 by dbermuehler Draft
7 tasks
feat: add ContextualFaithfulnessEvaluator
Labels: area-core; area-evaluators; enhancement
#64 opened Dec 7, 2025 by stefanoamorelli
7 tasks done
Mapper for parsing langfuse traces to standard format
Labels: area-tracing; enhancement
#49 opened Nov 25, 2025 by deepakdalakoti (Collaborator)
6 tasks done