sDPO: Don’t Use Your Data All at Once

Categories: llm, research paper

Author: Santosh Sawant

Published: April 3, 2024

As large language models (LLMs) continue to advance, aligning them with human preferences has become increasingly important for ensuring their safety and usefulness. Reinforcement learning techniques such as proximal policy optimization (PPO) are therefore central to this alignment phase, despite their complexity.

To avoid this complexity during LLM training, direct preference optimization (DPO) has been widely adopted, mainly for its simplicity and effectiveness. DPO involves curating preference datasets using a human or strong AI (e.g., GPT-4) judge to select chosen and rejected responses to prompts. These datasets are then used to train LLMs by comparing the log probabilities of chosen versus rejected answers under the target model and a reference model. However, because log probabilities are unavailable for proprietary models like GPT-4, the reference model is simply set to the base SFT model, a much weaker alternative with potentially misaligned preferences.
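As a rough illustration, here is a minimal PyTorch-style sketch of the standard DPO loss computed from per-response log probabilities. The tensor names, the example values, and the beta setting are assumptions for the sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from summed per-response log-probabilities (1-D tensors)."""
    # Implicit rewards: log-ratio of target (policy) model vs. reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen reward above the rejected reward
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two comparisons
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -10.5]))
```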

To address this, the researchers propose stepwise DPO, named sDPO, in which the preference datasets are divided into chunks and used over multiple steps. At each step, the aligned model from the previous step serves both as the reference model and as the initialization of the target model: the reference model supplies the log probabilities, and the target model is trained with the DPO preference loss on that step's data.
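The stepwise procedure can be sketched as a simple loop over data chunks. This is a sketch under assumptions: the `train_dpo` helper below is hypothetical and stands in for an ordinary DPO training run (such as the loss above) on one chunk.

```python
import copy

def train_sdpo(sft_model, preference_chunks, beta=0.1):
    """Stepwise DPO sketch: one chunk of preference data per alignment step."""
    target = sft_model                    # step 1 target starts from the SFT model
    reference = copy.deepcopy(sft_model)  # step 1 reference is also the SFT model
    for chunk in preference_chunks:
        # Ordinary DPO on this chunk; the reference is frozen, the target is updated.
        target = train_dpo(target, reference, chunk, beta=beta)  # hypothetical helper
        # The newly aligned model becomes the (stronger) reference for the next step.
        reference = copy.deepcopy(target)
    return target
```

The key design choice is that each step's reference model is already partially aligned, so it provides a tighter lower bound than the original SFT model would.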

In the reported evaluation, instruction tuning the pretrained-only SOLAR 10.7B into SOLAR 10.7B + SFT raises the H4 score (the average over four Open LLM Leaderboard benchmarks) by +5.24. Applying sDPO on top of SOLAR 10.7B + SFT further increases H4 to 74.31, an improvement of +4.80. Notably, SOLAR 10.7B + SFT + sDPO outperforms larger models such as Mixtral 8x7B-Instruct-v0.1 despite having fewer parameters, suggesting that effective alignment tuning could be the key to unlocking next-level performance for smaller LLMs.

Paper: https://arxiv.org/pdf/2403.19270.pdf