DeepSeek is a Chinese AI company “dedicated to making AGI a reality” and open-sourcing all its models. They started in 2023, but have been making waves over the past month or so, and especially this past week with the release of their two latest reasoning models: DeepSeek-R1-Zero and the more advanced DeepSeek-R1, also known as DeepSeek Reasoner.

They’ve released not only the models but also the code and evaluation prompts for public use, along with a detailed paper outlining their approach.

Aside from introducing two highly performant models that are on par with OpenAI’s o1 model, the paper has a lot of valuable information around reinforcement learning, chain of thought reasoning, prompt engineering with reasoning models, and more.

We’ll start by focusing on the training process of DeepSeek-R1-Zero, which uniquely relied solely on reinforcement learning, instead of traditional supervised learning. We’ll then move on to DeepSeek-R1, how its reasoning works, and some prompt engineering best practices for reasoning models.

Hey everyone, Dan here, co-founder of PromptHub. Today, we’re diving into DeepSeek's latest model release and comparing it with OpenAI’s reasoning models, specifically the o1 and o1-mini models. We'll explore their training process, reasoning capabilities, and some key insights into prompt engineering for reasoning models.

What is DeepSeek?

DeepSeek is a Chinese-based AI company committed to open-source development. Their recent release, the R1 reasoning model, is groundbreaking due to its open-source nature and innovative training methods. This includes open access to the models, prompts, and research papers.

Released on January 20th, DeepSeek’s R1 achieved impressive performance on various benchmarks, rivaling OpenAI’s o1 models. Notably, they also launched a precursor model, R1-Zero, which serves as the foundation for R1.

Training Process: R1-Zero to R1

R1-Zero: This model was trained exclusively using reinforcement learning without supervised fine-tuning, making it the first open-source model to achieve high performance through this approach. Training involved:

  • Rewarding correct answers in deterministic tasks (e.g., math problems).
  • Encouraging structured reasoning outputs using templates with “<think>” and “<answer>” tags.

Through thousands of iterations, R1-Zero developed longer reasoning chains, self-verification, and even reflective behaviors. For example, during training, the model demonstrated "aha" moments and self-correction behaviors, which are rare in traditional LLMs.

R1: Building on R1-Zero, R1 added several enhancements:

  • Curated datasets with long Chain of Thought examples.
  • Incorporation of R1-Zero-generated reasoning chains.
  • Human preference alignment for polished responses.
  • Distillation into smaller models (Qwen and LLaMA 3.1/3.3 at various sizes).

Performance Benchmarks

DeepSeek’s R1 model performs on par with OpenAI’s o1 models across many reasoning benchmarks:

  • Reasoning and Math Tasks: R1 rivals or outperforms o1 models in accuracy and depth of reasoning.
  • Coding Tasks: o1 models generally perform better in LiveCodeBench and CodeForces tasks.
  • Simple QA: o1 still leads R1 on factual QA (SimpleQA), at roughly 47% accuracy vs. 30%.

One notable finding is that longer reasoning chains generally improve performance. This aligns with insights from Microsoft’s Med-Prompt framework and OpenAI’s observations on test-time compute and reasoning depth.

Challenges and Observations

Despite its strengths, R1-Zero has some limitations:

  • Mixing English and Chinese responses due to a lack of supervised fine-tuning.
  • Less polished responses compared to chat models like OpenAI’s GPT.

These issues were addressed during R1’s refinement process, including supervised fine-tuning and human feedback.

Prompt Engineering Insights

A fascinating takeaway from DeepSeek’s research is how few-shot prompting degraded R1’s performance compared to zero-shot or concise tailored prompts. This aligns with findings from the Med-Prompt paper and OpenAI’s recommendations to limit context in reasoning models. Overcomplicating the input can overwhelm the model and reduce accuracy.

Conclusion

DeepSeek’s R1 is a significant step forward for open-source reasoning models, demonstrating capabilities that rival OpenAI’s o1. It’s an exciting time to experiment with these models and their chat interface, which is free to use.

If you have questions or want to learn more, check out the resources linked below. See you next time!

Training DeepSeek-R1-Zero: A reinforcement learning-only approach

DeepSeek-R1-Zero stands out from most other state-of-the-art models because it was trained using only reinforcement learning (RL), no supervised fine-tuning (SFT). This challenges the current conventional approach and opens up new opportunities to train reasoning models with less human intervention and effort.

DeepSeek-R1-Zero is the first open-source model to validate that advanced reasoning capabilities can be developed purely through RL.

Without pre-labeled datasets, the model learns through trial and error, refining its behavior, parameters, and weights based solely on feedback from the solutions it generates.

DeepSeek-R1-Zero is the base model for DeepSeek-R1.

The RL process for DeepSeek-R1-Zero

The training process for DeepSeek-R1-Zero involved presenting the model with various reasoning tasks, ranging from math problems to abstract logic challenges. The model generated outputs and was evaluated based on its performance.

DeepSeek-R1-Zero received feedback through a reward system that helped guide its learning process:

  • Accuracy rewards: Evaluate whether the output is correct. Used when there are deterministic results (e.g., math problems).
  • Format rewards: Encourage the model to structure its reasoning within <think> and </think> tags.
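
To make that reward setup concrete, here is a minimal sketch of what rule-based rewards like these could look like in code. The function names, the regex, and the equal weighting of the two rewards are illustrative assumptions, not details from the paper.

```python
import re

# Illustrative rule-based rewards; names and weights are assumptions, not from the paper.
FORMAT_PATTERN = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>\s*$", re.DOTALL)

def extract_answer(completion: str) -> str:
    """Pull the text between <answer> tags (empty string if the tags are missing)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else ""

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the deterministic reference, else 0.0."""
    return 1.0 if extract_answer(completion) == reference.strip() else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps reasoning in <think> tags and the result in <answer> tags."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is an assumption; the RL trainer consumes this scalar signal.
    return accuracy_reward(completion, reference) + format_reward(completion)

print(total_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # 2.0
```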

Training prompt template

To train DeepSeek-R1-Zero to generate structured chain of thought sequences, the researchers used the following training prompt template, replacing {{prompt}} with the reasoning question. You can access it in PromptHub here.

DeepSeek R1 training template in PromptHub

This template prompted the model to explicitly outline its thought process within <think> tags before delivering the final answer in <answer> tags.
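
If you want to experiment with the same structure yourself, here is a minimal sketch in Python. The template text is a close paraphrase of the paper’s template, so double-check the exact wording via the PromptHub link above; the helper function is purely illustrative.

```python
# Close paraphrase of the R1-Zero training template; verify the exact wording in PromptHub or the paper.
TRAINING_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant "
    "solves it. The Assistant first thinks about the reasoning process in the mind and then "
    "provides the user with the answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> <answer> answer here </answer>. "
    "User: {prompt}. Assistant:"
)

def build_training_prompt(question: str) -> str:
    """Drop the reasoning question into the {prompt} slot."""
    return TRAINING_TEMPLATE.format(prompt=question)

print(build_training_prompt("What is 17 * 24?"))
```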

The power of RL in reasoning

With this training process, DeepSeek-R1-Zero began to produce sophisticated reasoning chains.

Through thousands of training steps, DeepSeek-R1-Zero evolved to solve increasingly complex problems. It learned to:

  • Generate long reasoning chains that enabled deeper and more structured problem-solving
  • Perform self-verification to cross-check its own answers (more on this later)
  • Correct its own mistakes, showcasing emergent self-reflective behaviors

DeepSeek R1-Zero performance

While DeepSeek-R1-Zero is primarily a precursor to DeepSeek-R1, it still achieved high performance on several benchmarks. Let’s dive into some of the experiments that were run.

Accuracy improvements during training

A chart showing accuracy versus number of training steps for R1-Zero and OpenAI o1

  • Pass@1 accuracy began at 15.6% and by the end of the training it improved to 71.0%, comparable to OpenAI’s o1-0912 model
  • The red solid line represents performance with majority voting (similar to ensembling and self-consistency techniques), which increased accuracy further to 86.7%, surpassing o1-0912
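
Majority voting here is essentially self-consistency: sample many completions for the same question and keep the most common final answer. Below is a minimal sketch, where sample_answer is a placeholder for a single non-greedy model call that returns only the extracted final answer.

```python
from collections import Counter
from typing import Callable

def majority_vote(question: str, sample_answer: Callable[[str], str], k: int = 64) -> str:
    """Sample k answers for one question and return the most common one (cons@k style)."""
    answers = [sample_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# sample_answer is assumed to call the model once at a non-zero temperature and return
# just the final answer (e.g., the text inside <answer> tags), so identical answers collide.
```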

Next we’ll look at a table comparing DeepSeek-R1-Zero’s performance across multiple reasoning datasets against OpenAI’s reasoning models.

  • AIME 2024: 71.0% Pass@1, slightly below o1-0912 but above o1-mini. 86.7% cons@64, beating both o1 and o1-mini
  • MATH-500: Achieved 95.9%, beating both o1-0912 and o1-mini
  • GPQA Diamond: Outperformed o1-mini with a score of 73.3%
  • Performed much worse on coding tasks (CodeForces and LiveCodeBench)

Next we'll look at how the response length increased throughout the RL training process.

Average length per response versus training step

This graph shows the length of responses from the model as the training process progresses. Each “step” represents one cycle of the model’s learning process, where feedback is provided based on the output's performance, evaluated using the prompt template discussed earlier.

For each question (corresponding to one step), 16 responses were sampled, and the average accuracy was calculated to ensure stable evaluation.

As training progresses, the model generates longer reasoning chains, allowing it to solve increasingly complex reasoning tasks by leveraging more test-time compute.

While longer chains don’t always guarantee better results, they generally correlate with improved performance—a trend also observed in the MEDPROMPT paper (read more about it here) and in the original o1 paper from OpenAI.

A series of lines on a graph comparing accuracy versus shorter reasoning chains and longer ones
Graph from Microsoft's MEDPROMPT paper. The extended reasoning prompt led to more tokens generated in o1's chain-of-thought and higher accuracy

Two graphs related to OpenAI's o1 model, showing accuracy versus training and accuracy versus test-time compute
On the right graph you'll see that accuracy increases as test-time compute (reasoning tokens generated) increases

Aha moment and self-verification

One of the coolest aspects of DeepSeek-R1-Zero’s development (which also applies to the flagship R1 model) is just how good the model became at reasoning. Sophisticated reasoning behaviors were not explicitly programmed but arose through its reinforcement learning process.

Over thousands of training steps, the model began to self-correct, reevaluate flawed logic, and validate its own solutions, all within its chain of thought.

An example of this, noted in the paper and referred to as the “aha moment,” is shown below in red text.

A chain of thought example with a self-reflection moment highlighted in red in the middle

In this instance, the model literally said, “That’s an aha moment.” In DeepSeek’s chat feature (their version of ChatGPT), this type of reasoning usually emerges with phrases like “Wait a minute” or “Wait, but...”

Limitations and challenges in DeepSeek-R1-Zero

While DeepSeek-R1-Zero was able to perform at a high level, there were some drawbacks with the model.

  1. Language mixing and coherence issues: The model occasionally produced responses that mixed languages (Chinese and English)
  2. Reinforcement learning trade-offs: The absence of supervised fine-tuning (SFT) meant that the model lacked the refinement required for fully polished, human-aligned outputs.

DeepSeek-R1 was developed to address these issues!

DeepSeek-R1

To tackle the readability and coherence issues of R1-Zero, the researchers incorporated a cold-start fine-tuning phase and a multi-stage training pipeline when building DeepSeek-R1:

Cold-Start Fine-Tuning:

  • Researchers prepared a high-quality dataset of long chains of thought examples for initial supervised fine-tuning (SFT). This data was collected using:
    • Few-shot prompting with detailed CoT examples
    • Post-processed outputs from DeepSeek-R1-Zero, refined by human annotators

Reinforcement Learning:

  • DeepSeek-R1 underwent the same RL process as DeepSeek-R1-Zero to refine its reasoning capabilities further

Human Preference Alignment:

  • A secondary RL stage improved the model’s helpfulness and harmlessness, ensuring better alignment with user needs

Distillation to Smaller Models:

  • The reasoning capabilities of DeepSeek-R1 were distilled into smaller dense models (Qwen and LLaMA variants at various sizes) via supervised fine-tuning on data generated by DeepSeek-R1

DeepSeek R1 performance

The researchers tested DeepSeek R1 across a variety of benchmarks and against top models: o1, o1-mini, GPT-4o, and Claude 3.5 Sonnet.

The benchmarks were broken down into several categories, shown below in the table: English, Code, Math, and Chinese.

Setup

The following parameters were applied across all models:

  • Maximum generation length: 32,768 tokens.
  • Sampling configuration:
    • Temperature: 0.6.
    • Top-p value: 0.95.
  • Pass@1 estimation: Generated 64 responses per query.
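
Given that setup, pass@1 is just the average correctness over the sampled responses for a question, then averaged across the benchmark. Here is a minimal sketch for a single question; the correctness judging itself (exact match, a verifier, etc.) isn’t shown and the example numbers are made up.

```python
def pass_at_1(correct_flags: list[bool]) -> float:
    """Estimate pass@1 as the fraction of the k sampled responses judged correct (k = 64 here)."""
    return sum(correct_flags) / len(correct_flags)

# Hypothetical example: 45 of the 64 samples for one question were judged correct.
flags = [True] * 45 + [False] * 19
print(f"pass@1 ≈ {pass_at_1(flags):.3f}")  # ≈ 0.703
```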

A table of results comparing DeepSeek R1 against a variety of other top performing models

  • DeepSeek R1 outperformed o1, Claude 3.5 Sonnet and other models in the majority of reasoning benchmarks
  • o1 was the best-performing model in four out of the five coding-related benchmarks
  • DeepSeek R1 performed well on creative and long-context tasks, like AlpacaEval 2.0 and ArenaHard, outperforming all other models

Prompt Engineering with Reasoning Models

My favorite part of the article was the researchers’ observation about DeepSeek-R1’s sensitivity to prompts:

Prompting Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results.

This is another datapoint that aligns with insights from our Prompt Engineering with Reasoning Models Guide, which references Microsoft’s research on their MedPrompt framework. In their study with OpenAI’s o1-preview model, they found that overwhelming reasoning models with few-shot context degraded performance—a sharp contrast to non-reasoning models.

The key takeaway? Zero-shot prompting with clear and concise instructions seems to be best when using reasoning models. For more best practices with reasoning models, here’s another resource: How prompt engineering differs with reasoning models
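
As a concrete example, a zero-shot call to DeepSeek’s reasoning model might look like the sketch below. It assumes DeepSeek’s OpenAI-compatible API with the deepseek-reasoner model name, so verify the base URL, model name, and response fields against their current docs. The prompt simply describes the problem and the desired output format, with no few-shot examples.

```python
from openai import OpenAI

# Assumes DeepSeek's OpenAI-compatible endpoint; confirm base_url and model name in their docs.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {
            "role": "user",
            # Zero-shot: describe the problem and the desired output format, no examples.
            "content": (
                "A train travels 120 km in 1.5 hours. What is its average speed? "
                "Answer with a single number in km/h."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```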

Conclusion

The recent release from DeepSeek is very important for several reasons:

  • Reinforcement learning-only training: R1-Zero demonstrates the feasibility of RL-only approaches for building high-performing reasoning models
  • Real competition for o1: DeepSeek-R1 matches or beats OpenAI’s o1 model across more than a few benchmarks, but is notably much worse on code-related tasks
  • Open-source: Since everything is open-source, I’m sure many more learnings will come out of this release
Dan Cleary
Founder