[ICML 2026] AutoControl Arena: Frontier AI Risk Auto-Discovery Platform
[CVPR'26] SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
[ICLR26] Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning
[EMNLP 2024] Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
[ICML 2025] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models
Official Implementation of "PSAlign: Personalized Safety Alignment for Text-to-Image Diffusion Models"
This is the official implementation (code, data) of the paper "MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?"
SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
Code for the paper: "Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution".
Adversarial testing of LLMs on constraint satisfaction deadlocks
[ICML 2026] Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
A PID-Lagrangian-based Reinforcement Unlearning framework (a multiplier-update sketch follows this list)
Risk-Aware Introspective RAG (RAI-RAG) is a safety-aligned RAG framework integrating introspective reasoning, risk-aware retrieval gating, and secure evidence filtering to build trustworthy, robust LLM and agentic AI systems (a retrieval-gating sketch follows this list).
Adversarial Distillation at the Frontier: Technical Mechanisms, Documented Campaigns, and a Policy Framework for Protecting American AI Capabilities (arXiv cs.AI preprint)
Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis (a difference-of-means sketch follows this list).
Quantifying the "Safety Half-Life" of LLMs: A framework to measure how safety alignment degrades and susceptibility to jailbreaks increases as context length grows (a curve-fitting sketch follows this list)
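
For the PID-Lagrangian unlearning entry above, here is a minimal sketch of the core multiplier update, assuming the framework follows the standard PID-Lagrangian scheme of Stooke et al. (2020); the gains, names, and cost convention are illustrative, not the repo's API.

class PIDLagrangian:
    """Maintain a Lagrange multiplier via PID control on the
    constraint violation (current cost minus the allowed limit)."""

    def __init__(self, kp=0.1, ki=0.01, kd=0.01, cost_limit=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0   # accumulated violation (I term)
        self.prev_cost = 0.0  # previous cost estimate (for the D term)

    def update(self, cost):
        """Return a nonnegative multiplier for the current cost estimate."""
        error = cost - self.cost_limit
        self.integral = max(0.0, self.integral + error)  # anti-windup clamp
        derivative = max(0.0, cost - self.prev_cost)     # penalize rising cost only
        self.prev_cost = cost
        return max(0.0, self.kp * error
                        + self.ki * self.integral
                        + self.kd * derivative)

The multiplier would then weight the unlearning constraint in the training objective, e.g. loss = retain_loss + lam * forget_violation.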
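For the RAI-RAG entry, a purely illustrative risk-aware retrieval gate; retriever, risk_score, and the 0.5 threshold are stand-ins for whatever components and calibration the repo actually uses.

def gated_context(query, retriever, risk_score, threshold=0.5):
    """Retrieve candidate passages, then keep only those whose estimated
    risk falls below the threshold; an empty result can signal the
    generator to fall back to parametric knowledge or refuse."""
    return [p for p in retriever(query) if risk_score(p) < threshold]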
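For the refusal-direction study, a minimal difference-of-means sketch in the style of Arditi et al. (2024). The activation tensors are assumed to be residual-stream states already collected at a fixed layer and token position, with shape [n_prompts, d_model]; every name here is illustrative.

import torch

def extract_refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless activation mean to the harmful mean."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(acts, direction):
    """Orthogonally project out the component of each activation
    along `direction`, leaving the rest of the state intact."""
    return acts - (acts @ direction)[:, None] * direction

def transplant(acts, direction, scale=1.0):
    """Add a direction extracted from one model variant into another
    variant's activations at the matched layer and position."""
    return acts + scale * direction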
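For the safety half-life entry, one plausible reading of the metric, assuming the refusal rate decays roughly exponentially with context length; the functional form and every name are assumptions, not the repo's definition.

import numpy as np

def safety_half_life(context_lengths, refusal_rates):
    """Fit refusal_rate ~ r0 * 2**(-L / h) by least squares in log2 space
    and return h, the added context length over which the refusal rate
    halves. Assumes strictly positive, broadly decreasing rates."""
    L = np.asarray(context_lengths, dtype=float)
    y = np.log2(np.asarray(refusal_rates, dtype=float))
    slope, _ = np.polyfit(L, y, 1)  # log2 r = log2 r0 - L / h
    return -1.0 / slope

# e.g. safety_half_life([1_000, 8_000, 32_000, 128_000], [0.98, 0.90, 0.70, 0.40])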