DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning

llm
research paper
Author

Santosh Sawant

Published

January 28, 2025

A typical training process for LLMs consists of three phases: (1) Pre-training, in which the model is trained on vast amounts of text and code to learn general-purpose knowledge; (2) Supervised Fine-tuning, in which the model is fine-tuned on an instruction dataset; and finally (3) Reinforcement Learning from Human Feedback (RLHF), in which the model is further trained using human preference feedback.

So what’s so different about DeepSeek-R1? To train DeepSeek-R1-Zero, the supervised fine-tuning stage is omitted entirely. To run reinforcement learning at scale, a reinforcement learning method called Group Relative Policy Optimization (GRPO) is employed with rule-based rewards. Given a model to train and an input problem, the input is fed into the model and a group of outputs is sampled, each consisting of a reasoning process and an answer. GRPO then scores these sampled outputs and trains the model to favor the better ones, calculating a reward for each output using predefined rules: (1) Accuracy: one set of rules checks whether the final answer is correct. (2) Format: another set of rules rewards outputs that present the reasoning process and answer in the expected format.
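As a rough illustration of what such rule-based rewards can look like, here is a minimal sketch that scores a sampled output for correctness and format. The specific tag names, regex patterns, and scoring weights below are illustrative assumptions, not the paper's exact implementation.

```python
import re

# Minimal sketch of rule-based rewards in the spirit of DeepSeek-R1-Zero's GRPO
# training. Tag names, matching rules, and weights are illustrative assumptions.

THINK_ANSWER_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)

def format_reward(output: str) -> float:
    """Reward outputs that wrap reasoning and answer in the expected tags."""
    return 1.0 if THINK_ANSWER_PATTERN.match(output) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """Reward outputs whose extracted final answer matches the known ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str) -> float:
    """Combine the rule-based rewards; no neural reward model is involved."""
    return accuracy_reward(output, ground_truth) + format_reward(output)

# Example: score a group of sampled outputs for one math problem.
outputs = [
    "<think>2 + 2 = 4</think><answer>4</answer>",
    "The answer is 4.",  # no tags, so both rewards are 0.0
]
print([total_reward(o, "4") for o in outputs])  # [2.0, 0.0]
```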

This rule-based mechanism, which does not use a neural reward model, simplifies the training process and reduces its cost, making it feasible at large scale. Through reinforcement learning alone, the model naturally learns to allocate more thinking time when solving reasoning tasks. Remarkably, this emerges without any external adjustment, and yes, this is the “aha moment” the AI community is talking about.
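The “group relative” part of GRPO is what removes the need for a separate critic network: each output’s reward is judged against the other outputs sampled for the same prompt, following the advantage formula in the paper, A_i = (r_i − mean(r)) / std(r). Below is a minimal sketch of that normalization only; the clipped policy-gradient objective and KL term around it are omitted, and the choice of sample versus population standard deviation is an assumption here.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO advantage sketch: standardize each reward against its own group,
    A_i = (r_i - mean(r)) / std(r), so no learned value/critic model is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All outputs scored the same; this group carries no preference signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: advantages for a group of 4 sampled outputs for one prompt.
print(group_relative_advantages([2.0, 0.0, 1.0, 1.0]))
```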

However, DeepSeek-R1-Zero suffers from poor readability and language mixing. To overcome this, the researchers introduced DeepSeek-R1, which is trained with a four-phase pipeline.

Cold Start (Phase 1): Incorporating a supervised fine-tuning phase on a small, high-quality dataset helps DeepSeek-R1 mitigate the readability issues observed in DeepSeek-R1-Zero. Reasoning Reinforcement Learning (Phase 2): This phase applies the same large-scale reinforcement learning used for DeepSeek-R1-Zero to further enhance the model’s reasoning capabilities. Rejection Sampling and Supervised Fine-Tuning (Phase 3): In this phase, samples are generated from the Phase 2 checkpoint and only correct, readable samples are retained for another round of supervised fine-tuning. Diverse Reinforcement Learning (Phase 4): In this final phase, rule-based rewards are used for tasks that allow them, such as math, while for other tasks an LLM provides feedback to align the model with human preferences.
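To make Phase 3 more concrete, here is a minimal sketch of a rejection-sampling filter: sample several completions per prompt from the Phase 2 checkpoint and keep only those that are both correct and readable. The generator and the correctness/readability checks are passed in as placeholder callables; they are assumptions for illustration, not the paper’s actual filtering criteria.

```python
from typing import Callable

def rejection_sample_sft_data(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],   # (prompt, n) -> n sampled completions
    is_correct: Callable[[str, str], bool],      # (prompt, completion) -> answer is right
    is_readable: Callable[[str], bool],          # e.g. single language, no garbled mixing
    samples_per_prompt: int = 16,
) -> list[tuple[str, str]]:
    """Phase 3 sketch: keep only correct, readable samples for supervised fine-tuning."""
    sft_pairs: list[tuple[str, str]] = []
    for prompt in prompts:
        for completion in generate(prompt, samples_per_prompt):
            if is_correct(prompt, completion) and is_readable(completion):
                sft_pairs.append((prompt, completion))
                break  # one good sample per prompt is enough for this sketch
    return sft_pairs
```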

In conclusion, a 32-billion-parameter distilled model has demonstrated impressive performance, making it a viable smaller alternative with strong reasoning capabilities, and yes, also triggering a 108 billion USD stock selloff :)

paper: DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning