FIX: Use sequence=0 for both pieces in multimodal dataset loaders #1756
Open
romanlutz wants to merge 1 commit into
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Four multimodal remote dataset loaders were assigning `sequence=0` to one piece (image or text) and `sequence=1` to the other while sharing the same `prompt_group_id`. Per `SeedPrompt.sequence` (`pyrit/models/seeds/seed_prompt.py:43-44`), prompts are only grouped into a single multimodal user message when they share both `prompt_group_id` and `sequence`. With mismatched sequences, the image and text were being delivered as two separate turns rather than as a single multimodal message, which defeats the purpose of these datasets (the model is supposed to reason over image + text together).

This PR brings the four affected loaders in line with the correct pattern already used by `harmbench_multimodal_dataset.py` and the recently added `msts_dataset.py`: both pieces share `prompt_group_id` and `sequence=0`.

Loader changes (`pyrit/datasets/seed_datasets/remote/`):

- `vlguard_dataset.py` - image `sequence=1` -> `0`.
- `vlsu_multimodal_dataset.py` - image `sequence=1` -> `0`.
- `visual_leak_bench_dataset.py` - text `sequence=1` -> `0`. Reworded class and `fetch_dataset_async` docstrings that described the old behavior.
- `comic_jailbreak_dataset.py` - text `sequence=1` -> `0`. Reworded `fetch_dataset_async` and `_build_seed_group` docstrings. The `SeedObjective` in the group is unchanged - only the image+text pair needs to share `sequence=0`.

Tests and Documentation

Updated the four corresponding unit tests under `tests/unit/datasets/` to assert the new shared `sequence == 0` for both pieces (one assertion change per file).

- `uv run ruff format pyrit tests` - clean
- `uv run ruff check pyrit tests` - clean
- `uv run -m ty check pyrit/datasets/seed_datasets/remote` - clean
- `uv run pytest tests/unit/datasets -q` - 422 passed

No JupyText/doc changes needed (no docs reference these sequence numbers).
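For reviewers unfamiliar with the grouping rule, here is a minimal, self-contained sketch of why the mismatched `sequence` values produced two turns. It does not use PyRIT's classes: `Piece` and `group_into_messages` are hypothetical stand-ins written for illustration, and the only behavior assumed is the rule described in this PR (pieces form one message only when both `prompt_group_id` and `sequence` match).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for a seed prompt; NOT the PyRIT SeedPrompt class."""

    value: str
    data_type: str
    prompt_group_id: str
    sequence: int


def group_into_messages(pieces):
    """Group pieces that share both prompt_group_id and sequence (illustrative only)."""
    messages = defaultdict(list)
    for piece in pieces:
        messages[(piece.prompt_group_id, piece.sequence)].append(piece)
    # Each (group id, sequence) key corresponds to one turn, ordered by sequence.
    return [messages[key] for key in sorted(messages)]


# Old loader behavior: same group id, but different sequence values.
before = [
    Piece("image.png", "image_path", "g1", 0),
    Piece("What is shown here?", "text", "g1", 1),
]
# New loader behavior: same group id and sequence=0 for both pieces.
after = [
    Piece("image.png", "image_path", "g1", 0),
    Piece("What is shown here?", "text", "g1", 0),
]

turns_before = group_into_messages(before)
turns_after = group_into_messages(after)
print(len(turns_before))  # 2: image and text land in separate turns
print(len(turns_after))   # 1: image and text form one multimodal message
```

The updated unit tests check the same property on the real loader output by asserting `sequence == 0` for both pieces.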