[ICML 2026] AutoControl Arena: Frontier AI Risk Auto-Discovery Platform
[CVPR'26] SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
[ICLR26] Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning
[EMNLP 2024] Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
[ICML 2025] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models
Official Implementation of "PSAlign: Personalized Safety Alignment for Text-to-Image Diffusion Models"
This is the official implementation (code, data) of the paper "MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?"
SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
Code for the paper: "Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution".
Adversarial testing of LLMs on constraint satisfaction deadlocks
[ICML 2026] Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
A PID-Lagrangian-based Reinforcement Unlearning framework (a multiplier-update sketch follows this list)
Risk-Aware Introspective RAG (RAI-RAG) is a safety-aligned RAG framework integrating introspective reasoning, risk-aware retrieval gating, and secure evidence filtering to build trustworthy, robust LLM and agentic AI systems (a retrieval-gating sketch follows this list).
Adversarial Distillation at the Frontier: Technical Mechanisms, Documented Campaigns, and a Policy Framework for Protecting American AI Capabilities (arXiv cs.AI preprint)
Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis (a difference-of-means sketch follows this list).
Quantifying the "Safety Half-Life" of LLMs: A framework to measure how safety alignment degrades and susceptibility to jailbreaks increases as context length grows (a curve-fitting sketch follows this list)
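
For the PID-Lagrangian unlearning entry above, here is a minimal sketch of the core multiplier update, assuming the framework follows the standard PID-Lagrangian scheme of Stooke et al. (2020); the gains, names, and cost convention are illustrative, not the repo's API.

class PIDLagrangian:
    """Maintain a Lagrange multiplier via PID control on the
    constraint violation (current cost minus the allowed limit)."""

    def __init__(self, kp=0.1, ki=0.01, kd=0.01, cost_limit=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0   # accumulated violation (I term)
        self.prev_cost = 0.0  # previous cost estimate (for the D term)

    def update(self, cost):
        """Return a nonnegative multiplier for the current cost estimate."""
        error = cost - self.cost_limit
        self.integral = max(0.0, self.integral + error)  # anti-windup clamp
        derivative = max(0.0, cost - self.prev_cost)     # penalize rising cost only
        self.prev_cost = cost
        return max(0.0, self.kp * error
                        + self.ki * self.integral
                        + self.kd * derivative)

The multiplier would then weight the unlearning constraint in the training objective, e.g. loss = retain_loss + lam * forget_violation.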
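For the RAI-RAG entry, a purely illustrative risk-aware retrieval gate; retriever, risk_score, and the 0.5 threshold are stand-ins for whatever components and calibration the repo actually uses.

def gated_context(query, retriever, risk_score, threshold=0.5):
    """Retrieve candidate passages, then keep only those whose estimated
    risk falls below the threshold; an empty result can signal the
    generator to fall back to parametric knowledge or refuse."""
    return [p for p in retriever(query) if risk_score(p) < threshold]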
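For the refusal-direction study, a minimal difference-of-means sketch in the style of Arditi et al. (2024). The activation tensors are assumed to be residual-stream states already collected at a fixed layer and token position, with shape [n_prompts, d_model]; every name here is illustrative.

import torch

def extract_refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless activation mean to the harmful mean."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(acts, direction):
    """Orthogonally project out the component of each activation
    along `direction`, leaving the rest of the state intact."""
    return acts - (acts @ direction)[:, None] * direction

def transplant(acts, direction, scale=1.0):
    """Add a direction extracted from one model variant into another
    variant's activations at the matched layer and position."""
    return acts + scale * direction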
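For the safety half-life entry, one plausible reading of the metric, assuming the refusal rate decays roughly exponentially with context length; the functional form and every name are assumptions, not the repo's definition.

import numpy as np

def safety_half_life(context_lengths, refusal_rates):
    """Fit refusal_rate ~ r0 * 2**(-L / h) by least squares in log2 space
    and return h, the added context length over which the refusal rate
    halves. Assumes strictly positive, broadly decreasing rates."""
    L = np.asarray(context_lengths, dtype=float)
    y = np.log2(np.asarray(refusal_rates, dtype=float))
    slope, _ = np.polyfit(L, y, 1)  # log2 r = log2 r0 - L / h
    return -1.0 / slope

# e.g. safety_half_life([1_000, 8_000, 32_000, 128_000], [0.98, 0.90, 0.70, 0.40])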