fix(voice): rebuild playback bridge after idle to stop slow new turns #1816
Assistant playback routes through a MediaStreamAudioDestinationNode -> HTMLAudioElement bridge so a selected output device can be honored. That bridge is a live playout path, and reusing the element for a new turn after it sat idle between turns made the new turn resume at the wrong rate (audible as slow-motion that re-converges over the turn). Tear the bridge down and rebuild it once it has fully drained and been idle past a short threshold, so each turn plays through a fresh element. The scheduledSources.size === 0 guard keeps this from firing mid-turn. Adds regression tests covering within-turn reuse, after-idle rebuild, and sub-threshold reuse.
🦋 Changeset detected. Latest commit: 5ee11a8. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
this.#isPlaying = false;
this.#isScheduling = false;
this.#playbackCursor = 0;
this.#lastPlaybackEnd = null;
🚩 Bridge rebuild is skipped after playback interrupts, which may reproduce the same slow-playback symptom
After a playback_interrupt or user-transcript interrupt, #stopPlayback() resets #lastPlaybackEnd to null (voice-client.ts:1003). This means the idle rebuild condition at voice-client.ts:928 can never fire for the next turn, because the third guard (this.#lastPlaybackEnd !== null) is false. The bridge stays alive.
If the HTMLAudioElement slow-playback issue also manifests when the bridge goes idle after an abrupt source.stop() (interrupt) rather than a natural drain, the same audible symptom could reappear in the sequence interrupt → long pause → new assistant turn.
This may be intentional — interrupts typically lead to new audio quickly, and the abrupt stop may not trigger the same internal buffering drift. But it's worth verifying empirically that the fix covers the interrupt-then-long-pause scenario.
Good catch, and it's a real gap rather than intentional. An interrupt leaves the bridge idle the same way a natural drain does, so interrupt -> long pause -> new turn could reproduce the slow playback. Resetting #lastPlaybackEnd to null in #stopPlayback defeated the rebuild guard for that next turn.
Fixed in 5ee11a8: instead of nulling it, #stopPlayback now records the audio-clock time at the interrupt as the idle start (#lastPlaybackEnd = this.#audioContext?.currentTime ?? null), so the next turn's rebuild check still fires. Using the interrupt time rather than the cut chunk's scheduled end matters too, since the chunk was stopped early. Added a regression test covering interrupt -> idle gap -> new turn.
#stopPlayback reset #lastPlaybackEnd to null, which disabled the post-idle bridge rebuild for the next turn (the guard requires a non-null value). An interrupt leaves the bridge idle just like a natural drain, so interrupt -> long pause -> new turn could reproduce the slow playback. Record the audio clock time at interrupt as the idle start instead, so the next turn still rebuilds. Adds a regression test for the interrupt-then-idle path.
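The interrupt gap can be reproduced with a small simulation. This is a sketch under stated assumptions, not VoiceClient's code: PlaybackModel, playChunk, advance, stop and recordIdleOnStop are hypothetical names, and a plain number stands in for AudioContext.currentTime. Passing recordIdleOnStop = false models the behavior before 5ee11a8 (nulling on stop); true models recording the interrupt time as the idle start.

```typescript
// Hypothetical model of VoiceClient's playback state; not the real class.
const IDLE_REBUILD_THRESHOLD_S = 0.3; // idle threshold from the PR, in seconds

class PlaybackModel {
  currentTime = 0; // fake audio clock in seconds (stands in for AudioContext.currentTime)
  bridgesBuilt = 0; // how many fresh output bridges have been created
  lastPlaybackEnd: number | null = null;
  private hasBridge = false;
  private cursor = 0;
  private scheduledEnds: number[] = []; // end times of sources still scheduled
  private readonly recordIdleOnStop: boolean;

  constructor(recordIdleOnStop: boolean) {
    this.recordIdleOnStop = recordIdleOnStop;
  }

  /** Schedule one chunk; rebuild the bridge first if it drained and sat idle. */
  playChunk(durationS: number): void {
    if (
      this.hasBridge &&
      this.scheduledEnds.length === 0 && // fully drained
      this.lastPlaybackEnd !== null &&
      this.currentTime - this.lastPlaybackEnd > IDLE_REBUILD_THRESHOLD_S
    ) {
      this.hasBridge = false; // tear down the idle bridge
    }
    if (!this.hasBridge) {
      this.hasBridge = true; // build a fresh bridge for this turn
      this.bridgesBuilt++;
    }
    const start = Math.max(this.cursor, this.currentTime);
    this.cursor = start + durationS;
    this.scheduledEnds.push(this.cursor);
    this.lastPlaybackEnd = this.cursor; // end time of the last scheduled chunk
  }

  /** Advance the clock; sources whose end time has passed have finished. */
  advance(seconds: number): void {
    this.currentTime += seconds;
    this.scheduledEnds = this.scheduledEnds.filter((end) => end > this.currentTime);
  }

  /** Interrupt: stop every scheduled source early. */
  stop(): void {
    this.scheduledEnds = [];
    this.cursor = 0;
    // Nulling here (the pre-fix behavior) disables the next turn's rebuild check.
    this.lastPlaybackEnd = this.recordIdleOnStop ? this.currentTime : null;
  }
}

function interruptThenIdle(recordIdleOnStop: boolean): number {
  const model = new PlaybackModel(recordIdleOnStop);
  model.playChunk(1.0); // turn 1 starts on the first bridge
  model.advance(0.5); // user interrupts halfway through the chunk
  model.stop();
  model.advance(5.0); // long pause
  model.playChunk(1.0); // turn 2
  return model.bridgesBuilt;
}

console.log(interruptThenIdle(false), interruptThenIdle(true)); // 1 2
```

With the old behavior, turn 2 reuses the bridge that sat idle for five seconds; recording the interrupt time lets the same guard fire. The interrupt time is used rather than the cut chunk's scheduled end (1.0 here) because that chunk never actually played to its end.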
threepointone left a comment:
nice catch! land when you feel ready, want this to go out today.
This PR fixes assistant speech playing back slowly on a new turn after an idle gap in @cloudflare/voice. It was reported against a booth demo built on the voice agent: ask two separate questions with a pause between them, and the second reply plays in slow motion, then re-converges to normal speed over the turn.

Why
VoiceClient routes assistant playback through a MediaStreamAudioDestinationNode -> HTMLAudioElement bridge so a selected output device can be honored via HTMLMediaElement.setSinkId. That bridge is a live playout path with its own clock.

Routing playback directly to ctx.destination (skipping the bridge) was also considered and rejected here: the existing design intentionally routes all playback through the bridge so runtime setOutputDevice() keeps working, and the per-turn rebuild fixes every sink on its own. That lower-latency routing can be a separate change.

Code Changes
packages/voice/src/voice-client.ts: track the end time of the last scheduled chunk in #lastPlaybackEnd. In #playAudio, before resolving the destination, tear the bridge down if it exists, has fully drained (#scheduledSources.size === 0), and has been idle past 0.3s; #getPlaybackDestination then builds a fresh element for the turn. The drained guard means this never fires mid-turn, since chunks within a turn keep a source scheduled on the cursor. #stopPlayback resets #lastPlaybackEnd to the audio-clock time of the interrupt rather than null, so a turn that follows an interrupt and an idle gap still rebuilds the bridge.
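The rebuild guard reduces to a single predicate over the playback state. The sketch below is illustrative only: BridgeState, shouldRebuildBridge and its field names are hypothetical stand-ins for VoiceClient's private fields, and the strict > comparison is an assumption about what "idle past 0.3s" means.

```typescript
// Minimal sketch of the post-idle rebuild check; names are hypothetical.
const IDLE_REBUILD_THRESHOLD_S = 0.3; // idle threshold from the PR, in seconds

interface BridgeState {
  hasBridge: boolean; // a MediaStreamAudioDestinationNode -> HTMLAudioElement bridge exists
  scheduledSourceCount: number; // stands in for #scheduledSources.size
  lastPlaybackEnd: number | null; // audio-clock time (s) when playback last went idle
}

/** True when the existing bridge should be torn down before the next chunk. */
function shouldRebuildBridge(state: BridgeState, now: number): boolean {
  return (
    state.hasBridge &&
    state.scheduledSourceCount === 0 && // fully drained, so never mid-turn
    state.lastPlaybackEnd !== null && // a null here disables the rebuild entirely
    now - state.lastPlaybackEnd > IDLE_REBUILD_THRESHOLD_S // assumed strict comparison
  );
}

// The three cases the regression tests cover:
const withinTurn = shouldRebuildBridge(
  { hasBridge: true, scheduledSourceCount: 1, lastPlaybackEnd: 2.0 },
  1.5, // a chunk is still scheduled ahead of the clock
);
const afterIdle = shouldRebuildBridge(
  { hasBridge: true, scheduledSourceCount: 0, lastPlaybackEnd: 2.0 },
  5.0, // drained 3s ago
);
const subThreshold = shouldRebuildBridge(
  { hasBridge: true, scheduledSourceCount: 0, lastPlaybackEnd: 2.0 },
  2.1, // drained only ~0.1s ago
);
console.log(withinTurn, afterIdle, subThreshold); // false true false
```

Checking the drained guard first is what keeps a long turn from ever tearing down its own bridge, regardless of how the idle time is measured.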