Support messages with images in prepare_multimodal_messages#5474
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 414deb9. Configure here.
sergiopaniego
left a comment
There was a problem hiding this comment.
Thanks for the update @albertvillanova!
I think that once the merge conflicts are resolved and the problem raised by Cursor in grpo_trainer.py is solved, we're good to go 😄
…th-images-prepare_multimodal_messages
…th-images-prepare_multimodal_messages
…th-images-prepare_multimodal_messages
|
@codex review |
|
Codex Review: Didn't find any major issues. Chef's kiss. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
…th-images-prepare_multimodal_messages
f083467 to
24bcdd1
Compare

Support messages with images in
prepare_multimodal_messages.This PR enhances the handling of multimodal messages by ensuring that existing image payloads are preserved and only unfilled placeholders are populated, preventing accidental overwrites. Additionally, the test suite is expanded to cover this behavior, and prompt processing is streamlined in the trainer.
See related comment in:
environment_factoryfor VLM training #5323 (comment)CC: @sergiopaniego
Changes
Multimodal message handling improvements:
prepare_multimodal_messagesto preserve existing image payloads in image blocks and only fill placeholders without an"image"key.Testing enhancements:
test_prepared_image_blocks_without_new_images, to verify that existing image payloads are not overwritten when no new images are provided.Trainer integration:
prepare_multimodal_messages, ensuring consistent handling of multimodal content during tokenization.Note
Low Risk
Small, localized change to multimodal message preparation and a trainer call-site refactor; main risk is subtle behavioral differences in placeholder counting for edge-case message formats.
Overview
prepare_multimodal_messagesnow preserves existing image blocks that already carry an"image"payload, and only counts/fills unfilled{"type": "image"}placeholders from theimagesargument (avoiding accidental overwrites and placeholder-count mismatches).Adds a regression test ensuring prepared image payloads remain intact when
imagesis omitted, and simplifies GRPO trainer VLM prompt normalization by delegating toprepare_multimodal_messagesinstead of ad-hoc string wrapping.Reviewed by Cursor Bugbot for commit 24bcdd1. Bugbot is set up for automated code reviews on this repo. Configure here.