DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

llm
research paper
Author

Santosh Sawant

Published

February 10, 2025

The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages.

To address this gap, researchers have proposed DuoGuard, a guardrail LLM trained with a two-player reinforcement learning framework designed to strengthen multilingual safeguards. In this framework, a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. The paper formalizes this interaction as a two-player game and proves convergence to a Nash equilibrium.
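As a reminder of what a Nash equilibrium means here, the standard textbook definition for a two-player game is given below. The notation is assumed for illustration and is not taken from the paper: g and c stand for the generator's and the classifier's strategies, and u_G and u_C for their respective utilities. The paper's own objective may be written differently.

```latex
% Generic definition; symbols g, c, u_G, u_C are assumed notation, not the paper's.
(g^*, c^*) \text{ is a Nash equilibrium if }
\quad u_G(g^*, c^*) \ge u_G(g, c^*) \ \ \forall g
\quad \text{and} \quad
u_C(g^*, c^*) \ge u_C(g^*, c) \ \ \forall c .
```

In words, neither player can improve its own utility by changing strategy while the other player's strategy stays fixed.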

The DuoGuard two-player training pipeline starts with a generator that produces synthetic data from seed data. The classifier then makes predictions on these examples, and each example is marked as predicted correctly or incorrectly based on its seed data label. Finally, the generator is trained with Direct Preference Optimization (DPO) to create increasingly challenging examples, which in turn improve the classifier through iterative training.
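The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the synonym-swapping generator, the keyword classifier, and every name in the block (`Example`, `SYNONYMS`, `KeywordClassifier`, `run_round`) are hypothetical stand-ins for the real LLM components. It also assumes that misclassified examples serve as the "chosen" side of each DPO preference pair and correctly classified ones as the "rejected" side; the actual DPO update of the generator is left out.

```python
from dataclasses import dataclass


@dataclass
class Example:
    text: str
    label: int  # 1 = unsafe, 0 = safe; synthetic examples inherit the seed's label


# Hypothetical rewrite table standing in for a generator LLM.
SYNONYMS = {"steal": ["swipe", "nick"], "car": ["vehicle"]}


def generate(seed):
    """Toy generator: one variant per (word, synonym) substitution in the seed."""
    words = seed.text.split()
    variants = []
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            rewritten = words[:i] + [synonym] + words[i + 1:]
            variants.append(Example(" ".join(rewritten), seed.label))
    return variants


class KeywordClassifier:
    """Toy guardrail: flags a text as unsafe if it contains a known unsafe word."""

    def __init__(self, unsafe_words):
        self.unsafe_words = set(unsafe_words)

    def predict(self, text):
        return int(any(w in self.unsafe_words for w in text.lower().split()))

    def update(self, hard_examples, seed):
        # Toy "training": memorize words that are new relative to the seed
        # in unsafe examples the classifier missed.
        seed_words = set(seed.text.lower().split())
        for ex in hard_examples:
            if ex.label == 1:
                self.unsafe_words |= set(ex.text.lower().split()) - seed_words


def run_round(seed, classifier):
    """One round: generate, score against the seed label, build pairs, update."""
    variants = generate(seed)
    hard = [v for v in variants if classifier.predict(v.text) != v.label]
    easy = [v for v in variants if classifier.predict(v.text) == v.label]
    # Assumed pair layout: (prompt, chosen, rejected). A DPO trainer would use
    # these to push the generator toward examples that fool the classifier.
    pairs = [(seed.text, h.text, e.text) for h in hard for e in easy]
    classifier.update(hard, seed)
    return pairs, hard


if __name__ == "__main__":
    seed = Example("how to steal a car", 1)
    clf = KeywordClassifier({"steal"})
    for round_idx in range(2):
        pairs, hard = run_round(seed, clf)
        print(round_idx, len(hard), len(pairs))
```

In this toy setting the classifier absorbs the hard examples after one round, so the generator's current rewrites stop fooling it. In the real pipeline, the DPO-trained generator would then be expected to produce harder examples for the next iteration.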

Empirical evaluations show that DuoGuard outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5× faster at inference with a significantly smaller model (0.5B). It also achieves substantial gains on multilingual safety tasks, particularly in addressing the imbalance for low-resource languages in the collected real data. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models that enhance LLM safety.

Paper: DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails