### Summary
`VoxtralRealtimeRunner` was outputting excessive duplicate
tokens/gibberish on stream flush. For an audio file where I say "The
weather is clear today", running:
```
voxtral_realtime_runner \
--model_path model.pte \
--tokenizer_path tekken.json \
--preprocessor_path preprocessor.pte \
--streaming \
--audio_path audio.wav
```
I would get the output: `The weather is clear todayoday.</s>`
I also saw this with periods and in many other cases, with tokens
repeating at the end of the stream.
Investigating vLLM (Mistral's recommended inference runner for [Voxtral
Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602#vllm-recommended)),
I observed that vLLM finishes the stream by closing the streaming input
and draining model-defined right-padded audio, whereas the ExecuTorch
`flush()` finished by switching into post-audio text-only decoding after
the audio ended. See the vLLM reference:
https://github.com/vllm-project/vllm/blob/2f9f946/vllm/model_executor/models/voxtral_realtime.py#L239-L270.
This change therefore applies similar logic here by converting the
model-defined transcription delay into a finite number of trailing
silent streaming steps to properly conclude the stream.
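The idea can be sketched as follows. This is not the actual ExecuTorch C++ implementation; the constant values, function names, and chunk size below are illustrative assumptions, shown in Python only to mirror the vLLM reference linked above:

```python
import math

# Assumed constants for illustration (not the real runner's values):
SAMPLE_RATE = 16000   # Hz, typical for speech models
CHUNK_SAMPLES = 1600  # samples per streaming step (100 ms at 16 kHz)

def trailing_silent_steps(transcription_delay_s: float) -> int:
    """Convert a model-defined transcription delay (seconds) into a
    finite number of silent streaming steps needed to drain it."""
    return math.ceil(transcription_delay_s * SAMPLE_RATE / CHUNK_SAMPLES)

def flush(step_fn, transcription_delay_s: float) -> None:
    """On flush, feed zero-filled (silent) audio chunks instead of
    switching to text-only decoding, so the model naturally emits its
    delayed tail text as the padded audio drains."""
    silence = [0.0] * CHUNK_SAMPLES
    for _ in range(trailing_silent_steps(transcription_delay_s)):
        step_fn(silence)  # run one normal streaming step on silence
```

With these assumed numbers, a 0.5 s transcription delay becomes five 100 ms silent steps, after which the stream ends without any special text-only decoding phase.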
After this, the same command outputs:
```
The weather is clear today.
```
as expected. There is no `</s>` because, like vLLM, the stream ends by
naturally draining the padded audio tail and letting the model emit
whatever final delayed text it wants.
### Test plan
Tested with the example above and a few other audio files to confirm
the improved behavior and the absence of gibberish or incorrect
end-of-stream output.
Co-authored-by: Matt Clayton <156335168+mattjcly@users.noreply.github.com>