FEAT: Add SGXSTest dataset loader#1754

Open
romanlutz wants to merge 3 commits into
microsoft:mainfrom
romanlutz:romanlutz/add-sgxstest-dataset

@romanlutz
Contributor

Description

Adds a remote seed dataset loader for SGXSTest (Singapore eXaggerated Safety Test), a 200-prompt benchmark of safe/unsafe prompt pairs that probes over-refusal behavior of LLMs in a Singaporean cultural context. It adapts the 10 hazard categories of Roettger et al.'s XSTest (homonyms, figurative language, safe targets/contexts, definitions, discrimination, historical events, and privacy variants) to that context.

The dataset is HuggingFace-gated, so the loader mirrors the existing _SorryBenchDataset token pattern: the constructor accepts token: str | None = None and falls back to the HUGGINGFACE_TOKEN environment variable. The class docstring documents the gating requirement.
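The token fallback described above can be sketched roughly as follows. This is a minimal, standalone illustration, not the loader's actual code: `resolve_hf_token` is a hypothetical helper name, and only the `token` parameter and the `HUGGINGFACE_TOKEN` variable come from this description.

```python
from __future__ import annotations

import os


def resolve_hf_token(token: str | None = None) -> str | None:
    """Hypothetical helper: prefer an explicit token, else read the env var."""
    if token is not None:
        # An explicitly passed token always wins over the environment.
        return token
    # Returns None when the variable is unset; a gated dataset would then
    # presumably fail at download time.
    return os.environ.get("HUGGINGFACE_TOKEN")
```

Checking `is not None` rather than truthiness means an explicitly passed empty string is not silently replaced by the environment value; whether the real loader makes the same choice is an assumption.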

Because consumers typically only want one side of the pair, the loader exposes an SGXSTestLabel enum (UNSAFE, SAFE, ALL) on the constructor and defaults to UNSAFE so red-teaming flows don't have to post-filter. The enum is validated via the base class's _validate_enum helper, matching the VLGuardSubset pattern. An empty result after filtering raises ValueError, matching _SorryBenchDataset. Per-prompt metadata["label"] and metadata["category"] are preserved so users can still slice the data after loading.
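As a rough sketch of the filtering behavior (not the PR's implementation), the snippet below shows an enum-driven filter that defaults to UNSAFE and raises on an empty result. The enum member names come from this description; the string values, the row layout, and the `filter_by_label` helper are assumptions made for illustration, and the `isinstance` check merely stands in for the base class's `_validate_enum`.

```python
from __future__ import annotations

from enum import Enum


class SGXSTestLabel(Enum):
    # Member names come from the PR description; the string values are assumed.
    UNSAFE = "unsafe"
    SAFE = "safe"
    ALL = "all"


def filter_by_label(
    rows: list[dict], label: SGXSTestLabel = SGXSTestLabel.UNSAFE
) -> list[dict]:
    """Hypothetical helper: keep rows matching the label, or all rows for ALL."""
    if not isinstance(label, SGXSTestLabel):
        raise ValueError(f"Expected an SGXSTestLabel, got {label!r}")
    if label is SGXSTestLabel.ALL:
        kept = list(rows)
    else:
        kept = [row for row in rows if row["label"] == label.value]
    if not kept:
        # Mirrors the "empty result raises ValueError" behavior described above.
        raise ValueError(f"No prompts left after filtering for {label.name}")
    return kept


# Placeholder rows; the real dataset's column names may differ.
rows = [
    {"prompt": "placeholder unsafe prompt", "label": "unsafe", "category": "homonym"},
    {"prompt": "placeholder safe prompt", "label": "safe", "category": "homonym"},
]
default_rows = filter_by_label(rows)
```

Each kept row still carries its `label` and `category`, which is what lets callers slice further after loading.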

Verified live against the HuggingFace dataset: default returns 100 unsafe prompts, SAFE returns 100 safe, ALL returns the full 200. The class-level harm_categories list mirrors the 10 actual category values from the live data (lower-cased to match PyRIT's tag normalization).
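For illustration, lower-casing raw category values into a de-duplicated list might look like the sketch below. The three sample names are the ones quoted in the commit notes; `normalize_categories` is a hypothetical helper, and PyRIT's actual normalization may do more than lower-casing.

```python
def normalize_categories(raw_categories):
    """Hypothetical helper: lower-case names and drop duplicates, keeping order."""
    seen = {}
    for name in raw_categories:
        # dict preserves insertion order, so the first occurrence fixes position.
        seen.setdefault(name.lower(), None)
    return list(seen)


raw = [
    "Homonym",
    "Privacy (fiction)",
    "Real discrimination, nonsense group",
    "Homonym",  # duplicate, as would occur across many rows
]
harm_categories = normalize_categories(raw)
```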

Tests and Documentation

  • New unit tests in tests/unit/datasets/test_sgxstest_dataset.py cover the three filter modes, empty-after-filter raise, invalid-label raise, token + split forwarding, env-var fallback, explicit-token override, and dataset_name. All 9 tests pass; the broader tests/unit/datasets suite (431 tests) is green.
  • doc/references.bib and doc/bibliography.md get an entry for the WalledEval paper (@gupta2024walledeval) used in the loader's attribution.
  • No notebook changes, so no JupyText run needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Roman Lutz and others added 3 commits May 18, 2026 07:08
Adds a loader for the walledai/SGXSTest dataset (200 prompts, Singaporean exaggerated-safety pairs from WalledEval), wires it into the remote dataset package, registers the WalledEval citation, and covers it with unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Live fetch revealed 10 distinct categories (e.g. 'Homonym', 'Privacy (fiction)', 'Real discrimination, nonsense group') rather than the 9 approximate names in the original spec. Updates the class-level harm_categories list, docstring, and test fixture to mirror the real data.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds an SGXSTestLabel enum (UNSAFE, SAFE, ALL) and a label constructor parameter to _SGXSTestDataset, defaulting to UNSAFE so red-teaming consumers get just the 100 truly harmful prompts. Filtering happens during seed construction; an empty result raises ValueError. The upstream dataset only publishes a 'train' split, so the split parameter is retained, with the documentation noting that 'train' is the only available split.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>