FEAT: Add SGXSTest dataset loader #1754
Open
romanlutz wants to merge 3 commits into
Adds a loader for the walledai/SGXSTest dataset (200 prompts, Singaporean exaggerated-safety pairs from WalledEval), wires it into the remote dataset package, registers the WalledEval citation, and covers it with unit tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Live fetch revealed 10 distinct categories (e.g. 'Homonym', 'Privacy (fiction)', 'Real discrimination, nonsense group') rather than the 9 approximate names in the original spec. Updates the class-level harm_categories list, docstring, and test fixture to mirror the real data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds an SGXSTestLabel enum (UNSAFE, SAFE, ALL) and a label constructor parameter to _SGXSTestDataset, defaulting to UNSAFE so red-teaming consumers get just the 100 truly-harmful prompts. Filtering happens during seed construction; an empty result raises ValueError. The upstream dataset only publishes a 'train' split, so the split parameter is retained but documented as such. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Adds a remote seed dataset loader for SGXSTest (Singapore eXaggerated Safety Test), a 200-prompt benchmark of safe/unsafe prompt pairs that probes over-refusal behavior of LLMs in a Singaporean cultural context. It adapts the 10 hazard categories of Roettger et al.'s XSTest, covering homonyms, figurative language, safe targets/contexts, definitions, discrimination, historical events, and privacy variants.
The dataset is HuggingFace-gated, so the loader mirrors the existing `_SorryBenchDataset` token pattern: the constructor accepts `token: str | None = None` and falls back to the `HUGGINGFACE_TOKEN` environment variable. The class docstring documents the gating requirement.

Because consumers typically only want one side of the pair, the loader exposes a `SGXSTestLabel` enum (`UNSAFE`, `SAFE`, `ALL`) on the constructor and defaults to `UNSAFE` so red-teaming flows don't have to post-filter. The enum is validated via the base class's `_validate_enum` helper, matching the `VLGuardSubset` pattern. An empty result after filtering raises `ValueError`, matching `_SorryBenchDataset`. Per-prompt `metadata["label"]` and `metadata["category"]` are preserved so users can still slice the data after loading.

Verified live against the HuggingFace dataset: default returns 100 unsafe prompts, `SAFE` returns 100 safe, `ALL` returns the full 200. The class-level `harm_categories` list mirrors the 10 actual category values from the live data (lower-cased to match PyRIT's tag normalization).

Tests and Documentation

`tests/unit/datasets/test_sgxstest_dataset.py` covers the three filter modes, empty-after-filter raise, invalid-label raise, token + split forwarding, env-var fallback, explicit-token override, and `dataset_name`. All 9 tests pass; the broader `tests/unit/datasets` suite (431 tests) is green.

`doc/references.bib` and `doc/bibliography.md` get an entry for the WalledEval paper (`@gupta2024walledeval`) used in the loader's attribution.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
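
For reviewers who want the filtering and token behavior at a glance, here is a minimal standalone sketch. It is not the PyRIT implementation: `Label`, `resolve_token`, and `filter_rows` are hypothetical stand-ins, and the row keys (`prompt`, `label`, `category`) and label values (`"safe"`, `"unsafe"`) are assumptions about the data layout rather than confirmed column names. It only illustrates the three behaviors described above: explicit token wins over `HUGGINGFACE_TOKEN`, the default label is `UNSAFE`, and an empty result raises `ValueError`.

```python
from __future__ import annotations

import os
from enum import Enum


class Label(Enum):
    """Hypothetical stand-in for SGXSTestLabel."""

    UNSAFE = "unsafe"
    SAFE = "safe"
    ALL = "all"


def resolve_token(token: str | None = None) -> str | None:
    """Return the explicit token if given, else fall back to HUGGINGFACE_TOKEN."""
    if token is not None:
        return token
    return os.environ.get("HUGGINGFACE_TOKEN")


def filter_rows(rows: list[dict], label: Label = Label.UNSAFE) -> list[dict]:
    """Keep rows matching `label`; ALL keeps everything. Empty results raise."""
    if not isinstance(label, Label):
        raise ValueError(f"label must be a Label member, got {label!r}")
    kept = [r for r in rows if label is Label.ALL or r["label"] == label.value]
    if not kept:
        raise ValueError(f"no rows left after filtering for {label.name}")
    return kept


# Placeholder rows; the assumed keys mirror the metadata fields named above.
rows = [
    {"prompt": "placeholder unsafe prompt", "label": "unsafe", "category": "homonym"},
    {"prompt": "placeholder safe prompt", "label": "safe", "category": "homonym"},
]

unsafe_only = filter_rows(rows)           # default: UNSAFE side of each pair
everything = filter_rows(rows, Label.ALL)  # both sides
```

Defaulting to the unsafe side keeps red-teaming callers from post-filtering, while `label` and `category` stay on each row for later slicing.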