Style over Substance: failure modes of LLM judges in alignment benchmarking.

llm
research paper
Author

Santosh Sawant

Published

September 27, 2024

Recently, LLM-judge benchmarks such as MT-Bench, AlpacaEval, and Arena-Hard-Auto have become go-to tools for automating LLM evaluation while also aligning with human preferences. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences.

So one may ask: do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not?

To answer this, researchers have introduced SOS-BENCH, the largest standardized, reproducible LLM meta-benchmark to date. SOS-BENCH is a new alignment benchmark with ground truth, designed to gauge progress on alignment with helpful, honest, and harmless (HHH) principles. It combines 19 existing world-knowledge, instruction-following, and safety benchmarks for a holistic view of model performance.
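To make that aggregation concrete, here is a minimal sketch of how a meta-benchmark in this spirit might roll sub-benchmark scores up into per-pillar and overall numbers. The pillar grouping, benchmark names, and scores below are illustrative assumptions, not SOS-BENCH's exact recipe.

```python
# Minimal sketch of meta-benchmark aggregation: average within each
# HHH-style pillar first, then across pillars, so pillars with many
# sub-benchmarks do not dominate. All names and numbers are hypothetical.
from statistics import mean

scores = {
    "world_knowledge": {"mmlu": 0.62, "arc_challenge": 0.55},
    "instruction_following": {"ifeval": 0.48},
    "safety": {"advbench_refusal": 0.91},
}

pillar_means = {pillar: mean(s.values()) for pillar, s in scores.items()}
overall = mean(pillar_means.values())

print(pillar_means)               # per-pillar averages
print(f"overall: {overall:.3f}")  # unweighted mean across pillars
```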

The LLM-judge pipeline is more complex than that of standard benchmarks: rather than relying on an objective ground truth, preference benchmarks substitute the preferences of a judge. This introduces new potential confounds: (1) the choice of judge, (2) the instructions given to the judge, and (3) implicit biases that can affect a judge's stated preferences independent of any instructions.
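To see where those confounds enter, here is a minimal sketch of a pairwise judge call, assuming the OpenAI chat-completions API. The system prompt and the one-letter verdict format are illustrative, not the actual Arena-Hard-Auto judge template; confound (2) lives entirely in the judge instructions, and confound (3) in whatever the judge model does with them.

```python
# Sketch of a pairwise LLM-judge call (assumed OpenAI API; illustrative
# prompt, not the real Arena-Hard-Auto template).
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM = (
    "You are an impartial judge. Compare the two assistant answers to the "
    "user question and reply with exactly 'A' or 'B', naming the better answer."
)

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return the judge's stated preference, 'A' or 'B'."""
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",  # the judge model named in this post
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},  # confound (2)
            {"role": "user", "content": (
                f"Question:\n{question}\n\n"
                f"Answer A:\n{answer_a}\n\n"
                f"Answer B:\n{answer_b}"
            )},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```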

Experiments were carried out on a series of post-trained LLAMA-3-8B models, the LLAMA-3 base model without post-training, OPT-125M, and several GPT checkpoints. The LLM-judge benchmark was Arena-Hard-Auto with standard settings, which uses GPT-4-0314 as the baseline model and GPT-4-1106-preview as the judge.
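Using the hypothetical judge_pair helper sketched above, a win rate against the fixed baseline could be computed as follows. This is only the basic idea: the real Arena-Hard-Auto pipeline elicits finer-grained verdicts and aggregates them statistically rather than counting raw wins.

```python
# Sketch of a win rate against a fixed baseline (e.g. GPT-4-0314 answers).
# Each pair is judged twice with positions swapped to dampen position bias.
def win_rate(questions, model_answers, baseline_answers):
    wins = 0.0
    for q, m, b in zip(questions, model_answers, baseline_answers):
        first = judge_pair(q, m, b)   # model shown as answer A
        second = judge_pair(q, b, m)  # model shown as answer B
        wins += 0.5 * (first == "A") + 0.5 * (second == "B")
    return wins / len(questions)
```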

Finally, the researchers summarize their findings as follows: (1) LLM judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, not the preference optimization (PO) stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors.

Paper: https://arxiv.org/pdf/2409.15268