fix(chat): guard tool-output-denied clobber + harden nightly e2e jobs#1818
Merged
Conversation
…tool part An autoContinue continuation re-validates the transcript through the AI SDK, which denies an approval for a tool that doesn't declare needsApproval and emits a tool-output-denied chunk. applyChunkToParts then unconditionally flipped the persisted part to output-denied, silently turning a granted approval into a denial. Make the tool-output-denied handler first-write-wins: preserve parts already in output-available / output-error / output-denied, and never regress an approval-responded (user-approved) part. Matches the guards already on the tool-input-* handlers; benefits both @cloudflare/ai-chat and @cloudflare/think. Also harden the ai-chat e2e: add a dedicated ClientToolApprovalAgent whose tool declares needsApproval so the approval auto-continuation test exercises a real approval flow (and assert the SDK-issued approvalId), leaving the existing tool-result flow tests on the unchanged ClientToolAgent untouched. Co-authored-by: Cursor <cursoragent@cursor.com>
…rome cache The e2e-agents-browser job was cancelled at the 30m cap (no test failures reported) because Wrangler's local Browser Rendering simulator downloads its own Chrome via @puppeteer/browsers, whose extract-zip dependency silently aborts mid-extraction on Node 24.16+ (puppeteer/puppeteer#14957). The partial cache made Wrangler "clear and retry" the install on every test, and with retry:3 + a 120s test timeout the suite re-corrupted and re-downloaded in a loop until the job timed out. Fix it in layers: - Pin the job to Node 24.15.0 (pre-bug) via a new node-version input on the shared install action — the root cause. Other jobs keep the latest 24.x. - Cache Wrangler's ~/.cache/.wrangler/chrome so the one-time download is reused and stays out of the timed test window. - bail:1 in the browser vitest config + a 20m wall-clock guard on the step, so a future environmental stall fails fast WITH logs instead of a silent 30m cancel (mirrors the Playwright suites' globalTimeout convention). Co-authored-by: Cursor <cursoragent@cursor.com>
The e2e-codemode nightly job ran with no Cloudflare credentials, but the e2e wrangler.jsonc declares a remote `ai` binding. Wrangler 4.105.0 (recent bump) eagerly opens a remote proxy session at `wrangler dev` startup for any remote: true binding, which needs credentials — so the dev server exited 1 and the whole Playwright run failed before any test, with no actionable report. Older Wrangler connected the binding lazily, so the offline executor specs that never call AI used to boot fine. Finish the split that ai-chat already has: - wrangler.mock.jsonc: AI-free config for the offline executor.spec.ts. - playwright.config.ts: deterministic suite (executor.spec.ts) on the mock config — stays the credential-free gating e2e-codemode job. - playwright.llm.config.ts + test:e2e:llm: the Workers-AI specs (codemode.spec.ts via /run, llm-codemode.spec.ts via /run-multi) on wrangler.jsonc, with a globalTimeout and CI retries, run by a new e2e-codemode-llm job that gets CLOUDFLARE_* credentials. Verified: the deterministic suite passes 21/21 offline with CI=1 and no creds. Co-authored-by: Cursor <cursoragent@cursor.com>
🦋 Changeset detectedLatest commit: 2a27101 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
…tool-approval-request The server-side applyChunkToParts guard only protected the persisted message; a replayed tool-output-denied still flowed through _storeStreamChunk + _broadcastChatMessage, where AI SDK v6's in-place updateToolPart would regress an approval-responded/terminal tool part to output-denied on the client (and replay it on reconnect). Extend isReplayChunk to drop tool-output-denied for parts already settled or approval-responded, mirroring the apply-side guard. Also add the missing first-write-wins guard to the tool-approval-request handler: a continuation replaying a prior tool round-trip could regress an already-approval-responded (or settled) part back to approval-requested, discarding the user's decision. Harden the approval e2e to send CF_AGENT_TOOL_APPROVAL on tool-approval-request (carrying the real approvalId) rather than racing it on tool-input-available. Adds unit coverage for both guards. Co-authored-by: Cursor <cursoragent@cursor.com>
…y buffer Same broadcast-path gap as tool-output-denied: the applyChunkToParts guard only protects the persisted message, so a replayed tool-approval-request still flowed through _storeStreamChunk + _broadcastChatMessage and reverted an approved tool back to approval-requested on the client (re-showing Approve/Reject), replaying on reconnect. Extend isReplayChunk to drop tool-approval-request for parts already approval-responded or settled, mirroring the tool-output-denied branch. Adds unit coverage for the new isReplayChunk branch. Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes one real SDK bug surfaced by the nightly e2e suite and hardens three nightly e2e jobs that were failing or hanging in CI. The work came out of debugging three distinct nightly failures.
1.
fix(chat): don't lettool-output-deniedclobber an approved/terminal tool partapplyChunkToPartsinpackages/agents/src/chat/message-builder.tsunconditionally flipped a tool part tooutput-deniedon atool-output-deniedchunk. During an auto-continuation that re-validates the transcript, the AI SDK can legitimately emittool-output-deniedfor an approval it now deems unneeded (e.g. a tool withoutneedsApproval). That silently turned a granted approval (approval-responded) — or an already-settled part — into a denial in the persisted message.The handler is now first-write-wins: a part already in
output-available/output-error/output-denied/approval-respondedis left untouched. This matches the guards already on thetool-input-*handlers and benefits both@cloudflare/ai-chatand@cloudflare/think. Ships as a patch changeset (agents).To exercise a realistic approval flow, the ai-chat e2e gains a dedicated
ClientToolApprovalAgent(withneedsApproval: true) so the approval test is isolated from the plaingetUserLocationclient-tool test. New unit tests cover the guard for bothapproval-respondedand terminaloutput-availableparts.2.
ci: stop the browser-connector nightly from looping on a corrupted Chrome cacheThe
E2E: agents (browser connector)job kept hitting its 30-minute cap. Root cause:wrangler dev's Browser Rendering simulator repeatedly hit a corrupted Chrome cache caused by anextract-zipbug on Node 24.16+. Wrangler would clear + re-download, the Node bug re-corrupted it, and the loop exhausted the job.24.15.0(via a newnode-versioninput on the shared install action).~/.cache/.wrangler/chrometo avoid repeated downloads.bail: 1in the browser-tests vitest config so the suite stops after the first full failure (incl. retries).timeout --kill-after=30s 20mas a wall-clock guard below the job cap.3.
ci(codemode): split e2e into offline + Workers-AI jobsA wrangler bump (4.105.0) changed remote-binding init from lazy to eager, so
wrangler devnow opens a remote proxy session at startup. The codemode e2e'sai: { remote: true }binding therefore needed Cloudflare credentials the deterministic CI job didn't have, failing at boot.Mirrors the ai-chat split:
wrangler.mock.jsonc(noaibinding) +playwright.config.tsruns the deterministicexecutor.spec.ts— no credentials needed (e2e-codemode).playwright.llm.config.ts(remoteaibinding, higher timeouts, CI retries) runscodemode.spec.ts+llm-codemode.spec.tsagainst real Workers AI in a new credentialede2e-codemode-llmjob.Test plan
pnpm run typecheck— all 113 projects passclient-tool-duplicate-message.test.ts) — 37/37 passtest:e2e:llm) — 8/8 pass against real Workers AI locallye2e-agents-browsercompletes without looping on Chrome cachee2e-codemode(deterministic) runs without credentialse2e-codemode-llmauthenticates and passesClientToolApprovalAgentMade with Cursor