DeepSeek is a Chinese AI company “dedicated to making AGI a reality” and open-sourcing all its models. They started in 2023, but have been making waves over the past month or so, and especially this past week with the release of their two latest reasoning models: DeepSeek-R1-Zero and the more advanced DeepSeek-R1, also known as DeepSeek Reasoner.
They’ve released not only the models but also the code and evaluation prompts for public use, along with a detailed paper outlining their approach.
Aside from introducing two highly performant models that are on par with OpenAI’s o1 model, the paper contains a lot of valuable information about reinforcement learning, chain-of-thought reasoning, prompt engineering with reasoning models, and more.
We’ll start by focusing on the training process of DeepSeek-R1-Zero, which uniquely relied solely on reinforcement learning, instead of traditional supervised learning. We’ll then move on to DeepSeek-R1, how its reasoning works, and some prompt engineering best practices for reasoning models.
Training DeepSeek-R1-Zero: A reinforcement learning-only approach
DeepSeek-R1-Zero stands out from most other state-of-the-art models because it was trained using only reinforcement learning (RL), no supervised fine-tuning (SFT). This challenges the current conventional approach and opens up new opportunities to train reasoning models with less human intervention and effort.
DeepSeek-R1-Zero is the first open-source model to validate that advanced reasoning capabilities can be developed purely through RL.
Without pre-labeled datasets, the model learns through trial and error, refining its behavior, parameters, and weights based solely on feedback from the solutions it generates.
DeepSeek-R1-Zero served as the precursor to DeepSeek-R1.
The RL process for DeepSeek-R1-Zero
The training process for DeepSeek-R1-Zero involved presenting the model with various reasoning tasks, ranging from math problems to abstract logic challenges. The model generated outputs and was evaluated based on its performance.
DeepSeek-R1-Zero received feedback through a reward system that helped guide its learning process:
- Accuracy rewards: Evaluated whether the output was correct; used for tasks with deterministic results (e.g., math problems)
- Format rewards: Encouraged the model to structure its reasoning within <think> and </think> tags (a minimal sketch of both reward signals follows)
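To make the reward design concrete, here is a minimal Python sketch of the two rule-based reward signals. This is an illustration under our own assumptions, not DeepSeek’s actual reward code; the answer parsing and the equal weighting of the two signals are simplifications.

```python
import re

def format_reward(output: str) -> float:
    """Reward outputs that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", output, re.DOTALL) else 0.0

def extract_final_answer(output: str) -> str:
    """Pull the text inside <answer>...</answer> tags (simplified parsing)."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return match.group(1).strip() if match else ""

def accuracy_reward(output: str, reference_answer: str) -> float:
    """Reward correct final answers on tasks with deterministic results."""
    return 1.0 if extract_final_answer(output) == reference_answer else 0.0

def total_reward(output: str, reference_answer: str) -> float:
    # Equal weighting is an illustrative assumption, not the paper's exact scheme
    return accuracy_reward(output, reference_answer) + format_reward(output)
```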
Training prompt template
To train DeepSeek-R1-Zero to generate structured chain-of-thought sequences, the researchers used the following training prompt template, replacing {{prompt}} with the reasoning question. You can access it in PromptHub here.
This template prompted the model to explicitly outline its thought process within <think> tags before delivering the final answer in <answer> tags.
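As a concrete illustration, here is a small Python sketch of filling in the template. The template text is paraphrased from the paper rather than copied verbatim (see the PromptHub link above for the exact wording), and build_training_prompt is just a hypothetical helper.

```python
# Paraphrase of the R1-Zero training template; the exact wording lives in the paper/PromptHub
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and "
    "the Assistant solves it. The Assistant first thinks about the reasoning "
    "process in the mind and then provides the user with the answer. The "
    "reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively. "
    "User: {prompt} Assistant:"
)

def build_training_prompt(question: str) -> str:
    """Replace the {{prompt}} placeholder with a reasoning question."""
    return R1_ZERO_TEMPLATE.format(prompt=question)

print(build_training_prompt("What is the sum of the first 100 positive integers?"))
```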
The power of RL in reasoning
With this training process, DeepSeek-R1-Zero began to produce sophisticated reasoning chains.
Through thousands of training steps, DeepSeek-R1-Zero evolved to solve increasingly complex problems. It learned to:
- Generate long reasoning chains that enabled deeper and more structured problem-solving
- Perform self-verification to cross-check its own answers (more on this later)
- Correct its own mistakes, showcasing emergent self-reflective behaviors
DeepSeek-R1-Zero performance
While DeepSeek-R1-Zero is primarily a precursor to DeepSeek-R1, it still achieved high performance on several benchmarks. Let’s dive into some of the experiments that were run.
Accuracy improvements during training
- Pass@1 accuracy began at 15.6% and improved to 71.0% by the end of training, comparable to OpenAI’s o1-0912 model
- With majority voting (similar to ensembling and self-consistency techniques, shown as the red solid line in the paper’s figure), accuracy increased further to 86.7%, surpassing o1-0912; a quick sketch of majority voting follows
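For reference, here is a quick sketch of majority voting over sampled answers. This is an illustration, not DeepSeek’s evaluation code; sample_answer stands in for any function that returns the model’s final answer for a single sampled generation.

```python
from collections import Counter

def majority_vote(question: str, sample_answer, n: int = 64) -> str:
    """Return the most common final answer across n sampled generations."""
    answers = [sample_answer(question) for _ in range(n)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Hypothetical usage: majority_vote("Solve ...", sample_answer=my_model_call, n=64)
```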
Next we’ll look at a table comparing DeepSeek-R1-Zero’s performance across multiple reasoning datasets against OpenAI’s reasoning models.
- AIME 2024: 71.0% Pass@1, slightly below o1-0912 but above o1-mini. 86.7% cons@64, beating both o1 and o1-mini
- MATH-500: Achieved 95.9%, beating both o1-0912 and o1-mini
- GPQA Diamond: Outperformed o1-mini with a score of 73.3%
- Performed much worse on coding tasks (Codeforces and LiveCodeBench)
Next we'll look at how the response length increased throughout the RL training process.
This graph shows the length of responses from the model as the training process progresses. Each “step” represents one cycle of the model’s learning process, where feedback is provided based on the output's performance, evaluated using the prompt template discussed earlier.
For each question (corresponding to one step), 16 responses were sampled, and the average accuracy was calculated to ensure stable evaluation.
As training progresses, the model generates longer reasoning chains, allowing it to solve increasingly complex reasoning tasks by leveraging more test-time compute.
While longer chains don’t always guarantee better results, they generally correlate with improved performance, a trend also observed in the MedPrompt paper (read more about it here) and in the original o1 paper from OpenAI.
Aha moment and self-verification
One of the coolest aspects of DeepSeek-R1-Zero’s development (which also applies to the flagship R1 model) is just how good the model became at reasoning. Sophisticated reasoning behaviors that were never explicitly programmed arose through the reinforcement learning process.
Over thousands of training steps, the model began to self-correct, reevaluate flawed logic, and validate its own solutions, all within its chain of thought.
An example of this, noted in the paper and referred to as the “Aha moment,” is shown below in red text.
In this instance, the model literally said, “That’s an aha moment.” In DeepSeek’s chat feature (their version of ChatGPT), this type of reasoning usually emerges with phrases like “Wait a minute” or “Wait, but...”
Limitations and challenges in DeepSeek-R1-Zero
While DeepSeek-R1-Zero was able to perform at a high level, there were some drawbacks with the model.
- Language mixing and coherence issues: The model occasionally produced responses that mixed languages (Chinese and English)
- Reinforcement learning trade-offs: The absence of supervised fine-tuning (SFT) meant that the model lacked the refinement required for fully polished, human-aligned outputs.
DeepSeek-R1 was developed to address these issues!
DeepSeek-R1
To tackle the readability and coherence issues of R1-Zero, the researchers incorporated a cold-start fine-tuning phase and a multi-stage training pipeline when building DeepSeek-R1:
Cold-Start Fine-Tuning:
- Researchers prepared a high-quality dataset of long chain-of-thought examples for initial supervised fine-tuning (SFT). This data was collected using:
- Few-shot prompting with detailed CoT examples
- Post-processed outputs from DeepSeek-R1-Zero, refined by human annotators
Reinforcement Learning:
- DeepSeek-R1 underwent the same RL process as DeepSeek-R1-Zero to refine its reasoning capabilities further
Human Preference Alignment:
- A secondary RL stage improved the model’s helpfulness and harmlessness, ensuring better alignment with user needs
Distillation to Smaller Models:
- DeepSeek-R1’s reasoning capabilities were distilled into smaller, more efficient models built on Qwen and Llama, such as Llama-3.1-8B and Llama-3.3-70B-Instruct (a conceptual sketch of this step follows)
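To make the distillation step concrete, here is a conceptual sketch under our own assumptions: the large reasoning model generates chain-of-thought traces, which then become supervised fine-tuning targets for a smaller student model. teacher_generate and finetune_student are hypothetical placeholders, not DeepSeek’s actual pipeline.

```python
def build_distillation_dataset(questions, teacher_generate):
    """Collect teacher (DeepSeek-R1) reasoning traces as SFT examples."""
    dataset = []
    for question in questions:
        trace = teacher_generate(question)  # reasoning chain + final answer from the teacher
        dataset.append({"prompt": question, "completion": trace})
    return dataset

# Hypothetical usage:
# sft_data = build_distillation_dataset(reasoning_questions, teacher_generate)
# finetune_student("Llama-3.1-8B", sft_data)  # plain SFT on the student
```

Notably, the paper applies only SFT to the distilled models; no additional RL stage is run on the students.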
DeepSeek-R1 performance
The researchers tested DeepSeek-R1 across a variety of benchmarks against top models: o1, o1-mini, GPT-4o, and Claude 3.5 Sonnet.
The benchmarks were broken down into several categories, shown below in the table: English, Code, Math, and Chinese.
Setup
The following parameters were applied across all models:
- Maximum generation length: 32,768 tokens
- Sampling configuration:
  - Temperature: 0.6
  - Top-p: 0.95
- Pass@1 estimation: 64 responses generated per query (a sketch of this setup follows the list)
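As an illustration of this setup, here is a small Python sketch; generate and is_correct are hypothetical placeholders, and this is not the authors’ evaluation code.

```python
# Sampling parameters from the setup above
SAMPLING_CONFIG = {
    "max_tokens": 32768,  # maximum generation length
    "temperature": 0.6,
    "top_p": 0.95,
}

def estimate_pass_at_1(question, reference, generate, is_correct, k: int = 64):
    """Average correctness over k sampled responses per query."""
    responses = [generate(question, **SAMPLING_CONFIG) for _ in range(k)]
    return sum(is_correct(r, reference) for r in responses) / k
```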
- DeepSeek-R1 outperformed o1, Claude 3.5 Sonnet, and the other models in the majority of reasoning benchmarks
- o1 was the best-performing model in four out of the five coding-related benchmarks
- DeepSeek-R1 performed well on creative and long-context tasks, like AlpacaEval 2.0 and ArenaHard, outperforming all other models
Prompt Engineering with Reasoning Models
My favorite part of the article was the researchers’ observation about DeepSeek-R1’s sensitivity to prompts:
Prompting Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results.
This is another datapoint that aligns with insights from our Prompt Engineering with Reasoning Models Guide, which references Microsoft’s research on their MedPrompt framework. In their study with OpenAI’s o1-preview model, they found that overwhelming reasoning models with few-shot context degraded performance—a sharp contrast to non-reasoning models.
The key takeaway? Zero-shot prompting with clear and concise instructions seems to work best with reasoning models. For more best practices, here’s another resource: How prompt engineering differs with reasoning models
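To make that concrete, here is an illustrative zero-shot prompt in the recommended style; the wording is our own example, not taken from the paper.

```python
# Describe the problem directly and specify the output format, with no few-shot examples
ZERO_SHOT_PROMPT = (
    "Solve the following problem. Give your final answer on the last line "
    "in the form 'Answer: <value>'.\n\n"
    "Problem: A train travels 120 km in 90 minutes. What is its average speed in km/h?"
)
# Avoid prepending several worked examples (few-shot); the paper found this
# consistently degraded DeepSeek-R1's performance.
```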
Conclusion
The recent release from DeepSeek is very important for several reasons:
- Reinforcement learning-only training: R1-Zero demonstrates the feasibility of RL-alone approaches for building high-performing reasoning models
- Reasoning model competition: DeepSeek-R1 matches or beats OpenAI’s o1 model across several benchmarks, though it is notably weaker on code-related tasks
- Open-source: Since everything is open-source, I’m sure many more learnings will come out of this release