Skip to content

fix(chat): guard tool-output-denied clobber + harden nightly e2e jobs#1818

Merged
threepointone merged 5 commits into
mainfrom
fix/tool-output-denied-clobber-approval
Jun 26, 2026
Merged

fix(chat): guard tool-output-denied clobber + harden nightly e2e jobs#1818
threepointone merged 5 commits into
mainfrom
fix/tool-output-denied-clobber-approval

Conversation

@threepointone

@threepointone threepointone commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR fixes one real SDK bug surfaced by the nightly e2e suite and hardens three nightly e2e jobs that were failing or hanging in CI. The work came out of debugging three distinct nightly failures.

1. fix(chat): don't let tool-output-denied clobber an approved/terminal tool part

applyChunkToParts in packages/agents/src/chat/message-builder.ts unconditionally flipped a tool part to output-denied on a tool-output-denied chunk. During an auto-continuation that re-validates the transcript, the AI SDK can legitimately emit tool-output-denied for an approval it now deems unneeded (e.g. a tool without needsApproval). That silently turned a granted approval (approval-responded) — or an already-settled part — into a denial in the persisted message.

The handler is now first-write-wins: a part already in output-available / output-error / output-denied / approval-responded is left untouched. This matches the guards already on the tool-input-* handlers and benefits both @cloudflare/ai-chat and @cloudflare/think. Ships as a patch changeset (agents).

To exercise a realistic approval flow, the ai-chat e2e gains a dedicated ClientToolApprovalAgent (with needsApproval: true) so the approval test is isolated from the plain getUserLocation client-tool test. New unit tests cover the guard for both approval-responded and terminal output-available parts.

2. ci: stop the browser-connector nightly from looping on a corrupted Chrome cache

The E2E: agents (browser connector) job kept hitting its 30-minute cap. Root cause: wrangler dev's Browser Rendering simulator repeatedly hit a corrupted Chrome cache caused by an extract-zip bug on Node 24.16+. Wrangler would clear + re-download, the Node bug re-corrupted it, and the loop exhausted the job.

  • Pin the job to Node 24.15.0 (via a new node-version input on the shared install action).
  • Cache ~/.cache/.wrangler/chrome to avoid repeated downloads.
  • bail: 1 in the browser-tests vitest config so the suite stops after the first full failure (incl. retries).
  • Wrap the run in timeout --kill-after=30s 20m as a wall-clock guard below the job cap.

3. ci(codemode): split e2e into offline + Workers-AI jobs

A wrangler bump (4.105.0) changed remote-binding init from lazy to eager, so wrangler dev now opens a remote proxy session at startup. The codemode e2e's ai: { remote: true } binding therefore needed Cloudflare credentials the deterministic CI job didn't have, failing at boot.

Mirrors the ai-chat split:

  • wrangler.mock.jsonc (no ai binding) + playwright.config.ts runs the deterministic executor.spec.ts — no credentials needed (e2e-codemode).
  • playwright.llm.config.ts (remote ai binding, higher timeouts, CI retries) runs codemode.spec.ts + llm-codemode.spec.ts against real Workers AI in a new credentialed e2e-codemode-llm job.

Test plan

  • pnpm run typecheck — all 113 projects pass
  • ai-chat guard unit tests (client-tool-duplicate-message.test.ts) — 37/37 pass
  • LLM codemode e2e (test:e2e:llm) — 8/8 pass against real Workers AI locally
  • Nightly: e2e-agents-browser completes without looping on Chrome cache
  • Nightly: e2e-codemode (deterministic) runs without credentials
  • Nightly: e2e-codemode-llm authenticates and passes
  • Nightly: ai-chat tool-approval e2e passes against ClientToolApprovalAgent

Made with Cursor


Open in Devin Review

threepointone and others added 3 commits June 26, 2026 07:19
…tool part

An autoContinue continuation re-validates the transcript through the AI SDK,
which denies an approval for a tool that doesn't declare needsApproval and emits
a tool-output-denied chunk. applyChunkToParts then unconditionally flipped the
persisted part to output-denied, silently turning a granted approval into a
denial.

Make the tool-output-denied handler first-write-wins: preserve parts already in
output-available / output-error / output-denied, and never regress an
approval-responded (user-approved) part. Matches the guards already on the
tool-input-* handlers; benefits both @cloudflare/ai-chat and @cloudflare/think.

Also harden the ai-chat e2e: add a dedicated ClientToolApprovalAgent whose tool
declares needsApproval so the approval auto-continuation test exercises a real
approval flow (and assert the SDK-issued approvalId), leaving the existing
tool-result flow tests on the unchanged ClientToolAgent untouched.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rome cache

The e2e-agents-browser job was cancelled at the 30m cap (no test failures
reported) because Wrangler's local Browser Rendering simulator downloads its own
Chrome via @puppeteer/browsers, whose extract-zip dependency silently aborts
mid-extraction on Node 24.16+ (puppeteer/puppeteer#14957). The partial cache
made Wrangler "clear and retry" the install on every test, and with retry:3 + a
120s test timeout the suite re-corrupted and re-downloaded in a loop until the
job timed out.

Fix it in layers:
- Pin the job to Node 24.15.0 (pre-bug) via a new node-version input on the
  shared install action — the root cause. Other jobs keep the latest 24.x.
- Cache Wrangler's ~/.cache/.wrangler/chrome so the one-time download is reused
  and stays out of the timed test window.
- bail:1 in the browser vitest config + a 20m wall-clock guard on the step, so a
  future environmental stall fails fast WITH logs instead of a silent 30m cancel
  (mirrors the Playwright suites' globalTimeout convention).

Co-authored-by: Cursor <cursoragent@cursor.com>
The e2e-codemode nightly job ran with no Cloudflare credentials, but the e2e
wrangler.jsonc declares a remote `ai` binding. Wrangler 4.105.0 (recent bump)
eagerly opens a remote proxy session at `wrangler dev` startup for any
remote: true binding, which needs credentials — so the dev server exited 1 and
the whole Playwright run failed before any test, with no actionable report.
Older Wrangler connected the binding lazily, so the offline executor specs that
never call AI used to boot fine.

Finish the split that ai-chat already has:
- wrangler.mock.jsonc: AI-free config for the offline executor.spec.ts.
- playwright.config.ts: deterministic suite (executor.spec.ts) on the mock
  config — stays the credential-free gating e2e-codemode job.
- playwright.llm.config.ts + test:e2e:llm: the Workers-AI specs (codemode.spec.ts
  via /run, llm-codemode.spec.ts via /run-multi) on wrangler.jsonc, with a
  globalTimeout and CI retries, run by a new e2e-codemode-llm job that gets
  CLOUDFLARE_* credentials.

Verified: the deterministic suite passes 21/21 offline with CI=1 and no creds.
Co-authored-by: Cursor <cursoragent@cursor.com>
@changeset-bot

changeset-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 2a27101

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
agents Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

Open in Devin Review

Comment thread packages/agents/src/chat/message-builder.ts
Comment thread packages/agents/src/chat/message-builder.ts
Comment thread packages/ai-chat/e2e/client-tools.spec.ts Outdated
@pkg-pr-new

pkg-pr-new Bot commented Jun 26, 2026

Copy link
Copy Markdown

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1818

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1818

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1818

create-think

npm i https://pkg.pr.new/create-think@1818

hono-agents

npm i https://pkg.pr.new/hono-agents@1818

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1818

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1818

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1818

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1818

commit: 2a27101

threepointone and others added 2 commits June 26, 2026 08:01
…tool-approval-request

The server-side applyChunkToParts guard only protected the persisted message;
a replayed tool-output-denied still flowed through _storeStreamChunk +
_broadcastChatMessage, where AI SDK v6's in-place updateToolPart would regress
an approval-responded/terminal tool part to output-denied on the client (and
replay it on reconnect). Extend isReplayChunk to drop tool-output-denied for
parts already settled or approval-responded, mirroring the apply-side guard.

Also add the missing first-write-wins guard to the tool-approval-request
handler: a continuation replaying a prior tool round-trip could regress an
already-approval-responded (or settled) part back to approval-requested,
discarding the user's decision.

Harden the approval e2e to send CF_AGENT_TOOL_APPROVAL on tool-approval-request
(carrying the real approvalId) rather than racing it on tool-input-available.

Adds unit coverage for both guards.

Co-authored-by: Cursor <cursoragent@cursor.com>
…y buffer

Same broadcast-path gap as tool-output-denied: the applyChunkToParts guard only
protects the persisted message, so a replayed tool-approval-request still flowed
through _storeStreamChunk + _broadcastChatMessage and reverted an approved tool
back to approval-requested on the client (re-showing Approve/Reject), replaying
on reconnect. Extend isReplayChunk to drop tool-approval-request for parts
already approval-responded or settled, mirroring the tool-output-denied branch.

Adds unit coverage for the new isReplayChunk branch.

Co-authored-by: Cursor <cursoragent@cursor.com>
@threepointone threepointone merged commit fa7bfec into main Jun 26, 2026
7 checks passed
@threepointone threepointone deleted the fix/tool-output-denied-clobber-approval branch June 26, 2026 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant