DeepSeek, the Chinese AI lab behind the reasoning model DeepSeek-R1, which launched last week, isn’t slowing down. They’ve just released a new multimodal model: Janus-Pro-7B.

Janus-Pro-7B is a multimodal model: think DALL-E 3, but one that can both accept images as input and generate images from a text prompt.

Building on the success of R1, which outperformed OpenAI’s o1 on several reasoning benchmarks (more on that here: DeepSeek R-1 Model Overview and How it Ranks Against OpenAI's o1), Janus-Pro-7B also crushed it on benchmarks.

On the GenEval text-to-image benchmark, Janus-Pro achieved 80% overall accuracy, compared to 67% for DALL-E 3 and 74% for Stable Diffusion 3 Medium, including 99% single-object accuracy and 90% positional alignment.

Benchmarks aside, our side-by-side comparisons showed Janus-Pro-7B is better at creating realistic images, though it struggles significantly with generating humans.

In addition to delivering a high-performing model, the researchers introduced a decoupled architecture for visual understanding and generation, effectively leveraged synthetic data to enhance training, and achieved state-of-the-art performance through data scaling techniques.

We’ll start by diving into how Janus-Pro-7B’s decoupled design sets it apart from other multimodal models.

You can check out the full technical report here if you'd like.

Hey everyone, Dan here from PromptHub. We're back again to talk about DeepSeek, which is making waves once more with a brand-new open-source multimodal model. If you want to stay up to date with developments like this, check out our Substack or our blog, where we break these topics down weekly.

What is DeepSeek?

For those unfamiliar, DeepSeek is a Chinese AI company similar to OpenAI. They’ve recently gained attention for their reasoning model, R1, which has outperformed OpenAI’s o1 in some cases. We covered this last week, and you can find more details in our previous write-up. Additionally, their reinforcement learning training template for R1 is available in PromptHub for you to explore.

DeepSeek’s New Multimodal Model

Just days after releasing R1, DeepSeek launched a multimodal model that, in some ways, performs on par with OpenAI’s DALL·E 3. While it tops several benchmarks, what’s more important is side-by-side comparisons, which we’ll explore in this post.

Architecture and Training

One standout feature of DeepSeek’s multimodal model is its dual-encoder system:

  • One encoder handles visual understanding (for image inputs).
  • Another encoder handles text-to-image generation.

This differs from OpenAI’s models:

  • DALL·E 3 only generates images.
  • GPT-4o accepts images as input but doesn’t generate them.

Instead of using a single encoder for both tasks, DeepSeek routes each function through specialized encoders before passing them to the Transformer model.

Synthetic Data in Training

DeepSeek heavily relied on synthetic data for training. The first iteration of their model used mostly real-world data, which led to poor results. The latest 7B parameter version leveraged:

  • 72 million high-quality synthetic training examples.
  • A 1:1 ratio of synthetic to real data.

This approach improved training speed, output quality, and alignment.

Benchmark Performance

We analyzed two key benchmarks for DeepSeek’s multimodal model:

1. GenEval (Text-to-Image Generation)

This benchmark tests image generation accuracy across three categories:

  • Simple Objects: e.g., “A red apple” or “A white plane.”
  • Positional Alignment: e.g., “A cat sitting on the left side of the sofa.”
  • Attribute Matching: e.g., “A Golden Retriever wearing a blue collar.”

Overall results:

  • DALL·E 3: 67.0%
  • Janus Pro 7B (DeepSeek): 80.0%

Janus Pro performed better in color and positional alignment tasks, while DALL·E 3 had an edge in object counting.

2. DPG-Bench (Dense Prompt Graph Benchmark)

This test involves generating complex images from highly descriptive prompts, such as:

“A black dog sitting between a bush and a pair of green pants standing up with nobody inside them.”

Results:

  • DALL·E 3: 83.5%
  • Janus Pro 7B (DeepSeek): 84.2%

The two models performed closely, with Janus Pro excelling in certain detailed prompts.

Side-by-Side Comparisons

Benchmarks only tell part of the story, so let’s compare actual image generations.

Example 1: A Photo of a Potted Plant and a Backpack

Observations:

  • DALL·E 3: Deviates from the prompt by generating two potted plants instead of one.
  • Janus Pro: Produces a more natural-looking image, resembling a real-world photo.

Example 2: A Black Dog Sitting Between a Bush and a Pair of Green Pants

Observations:

  • DALL·E 3: Better adherence to the prompt but retains a distinct AI-generated aesthetic.
  • Janus Pro: More realistic pants rendering, though the model struggles with fine details.

Example 3: Human Figures

Observations:

  • DALL·E 3: Clearly superior, rendering correct hand structure and facial features.
  • Janus Pro: Struggles with human anatomy, producing distorted hands.

Example 4: A Spaceship That Looks Like the Sydney Opera House

Observations:

  • DALL·E 3: Places the spaceship in space, making it clearly recognizable as a sci-fi concept.
  • Janus Pro: Resembles a high-end yacht rather than a spaceship, producing a more photorealistic aesthetic.

Key Takeaways

  • Realism: Janus Pro generates images that often look more like real-world photographs.
  • Human Figures: DALL·E 3 outperforms Janus Pro in rendering hands and facial structures.
  • Prompt Adherence: Each model has strengths—Janus Pro excels in color and position accuracy, while DALL·E 3 better follows detailed prompts.

Final Thoughts

DeepSeek’s Janus Pro 7B is an exciting new open-source alternative to DALL·E 3. While it surpasses DALL·E 3 in some tasks, it struggles with human figures and fine details. As with any model, testing it in real-world applications is key to determining its strengths and weaknesses.

If you're interested in more deep dives like this, be sure to follow our Substack and check out our blog. Happy prompting!

Decoupling in Janus-Pro-7B

One of the most interesting aspects of this model, beyond its benchmark performance, is its decoupled architecture for handling multimodal tasks. Essentially, they separated visual understanding (when the model is sent an image to process) and text-to-image generation (when the model needs to generate an image).

In other multimodal models, there is a single encoder that handles both understanding and generating images. It’s a simpler architecture, but it has its drawbacks.

How it works

The researchers designed two separate encoders within the model, each dedicated to a specific multimodal task:

Understanding encoder

  • Specialized in analyzing and understanding images (identifying objects, interpreting relationships, understanding scenes, etc.).

Generation encoder

  • Optimized for generating images from text prompts, with an emphasis on creativity, aesthetics, and compositional accuracy.

Each encoder feeds its output into the same shared transformer, which produces the final output.

A graphic depicting the encoding process for understanding and image generation for DeepSeek's Janus-Pro-7B
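To make the decoupled design concrete, here’s a minimal PyTorch sketch of the idea. This is not DeepSeek’s implementation: the modules and dimensions are illustrative stand-ins (Janus-Pro pairs a SigLIP-style vision encoder with a VQ image tokenizer), but the control flow of two task-specific encoders feeding one shared transformer is the core pattern.

```python
import torch
import torch.nn as nn

class DecoupledMultimodalModel(nn.Module):
    """Toy sketch of Janus-style decoupling: two task-specific encoders
    feed a single shared transformer. All dimensions are illustrative."""

    def __init__(self, d_model=512):
        super().__init__()
        # Understanding path: image pixels -> patch embeddings.
        # (Janus-Pro uses a SigLIP vision encoder; a conv stem stands in.)
        self.understanding_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),  # (B, d_model, num_patches)
        )
        # Generation path: discrete image-token ids -> embeddings.
        # (Janus-Pro uses a VQ tokenizer's codebook; nn.Embedding stands in.)
        self.generation_encoder = nn.Embedding(16384, d_model)
        # Shared transformer that both paths flow into.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, task, inputs):
        if task == "understand":
            # inputs: (B, 3, H, W) image tensor
            tokens = self.understanding_encoder(inputs).transpose(1, 2)
        elif task == "generate":
            # inputs: (B, T) image-token ids decoded autoregressively
            tokens = self.generation_encoder(inputs)
        else:
            raise ValueError(f"unknown task: {task}")
        return self.transformer(tokens)

model = DecoupledMultimodalModel()
img_features = model("understand", torch.randn(1, 3, 256, 256))
gen_features = model("generate", torch.randint(0, 16384, (1, 64)))
```

Because the two encoders never share weights, each can be trained on data suited to its own task without degrading the other, which is exactly the benefit described below.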

Benefits of decoupling

  • Performance gains: Janus-Pro topped benchmarks for both multimodal accuracy and text-to-image generation.
  • More efficient: Using separate encoders for different tasks allows each encoder to be fine-tuned and trained on narrower, task-specific datasets.

Scaling up Janus-Pro-7B and its training data

The previous version of Janus had only 1.5B parameters, and scaling up to 7B is a major reason for the performance increase. The added capacity made the model better at both visual reasoning and image generation.

A table showing the hyperparameters for Janus-Pro-1B and Janus-Pro-7B
Hyperparameters for Janus-Pro-1B and Janus-Pro-7B
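For a rough sense of where the extra capacity comes from in the jump from ~1.5B to 7B parameters, here’s a back-of-the-envelope parameter count for a decoder-only transformer. The layer counts, widths, and vocabulary size below are illustrative placeholders, not Janus-Pro’s published hyperparameters (those are in the table above):

```python
def transformer_params(n_layers, d_model, vocab_size, d_ff_mult=4):
    """Rough decoder-only parameter count: attention weights (~4 * d^2
    per layer), MLP weights (~2 * d_ff_mult * d^2 per layer), plus the
    token-embedding table."""
    per_layer = 4 * d_model**2 + 2 * d_ff_mult * d_model**2
    return n_layers * per_layer + vocab_size * d_model

# Illustrative configs only, chosen to land near the 1.5B and 7B scales:
small = transformer_params(n_layers=24, d_model=2048, vocab_size=100_000)
large = transformer_params(n_layers=30, d_model=4096, vocab_size=100_000)
print(f"small: {small / 1e9:.1f}B, large: {large / 1e9:.1f}B")
# small: 1.4B, large: 6.4B
```

Widening the hidden dimension dominates the growth, since every layer’s weight count scales with the square of d_model.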

In addition to increasing the model size, the researchers scaled up the training data for both multimodal understanding and text-to-image generation. This included a mix of real-world data and synthetic data to boost performance over previous model variants.

The researchers added around 90 million samples, including image captioning datasets, structured data with tables and charts, and specialized datasets for meme understanding and Chinese conversational contexts.

The role of synthetic data

What I found most interesting was the use of synthetic data and how well it worked.

The previous Janus variant relied heavily on real-world visual data, which was often inconsistent, low quality, or visually underwhelming (see the example below).

When training Janus-Pro-7B, the researchers used synthetic data to enhance the quality and stability of image generation. They added 72 million high-quality synthetic examples, bringing the dataset to a 1:1 ratio of real-world to synthetic data (a sketch of this mixing strategy follows the list below).

Leveraging synthetic data had a few major benefits:

  • Faster training convergence: The model learned more efficiently with clean, high-quality synthetic data.
  • Stability and aesthetics: Even for complex prompts, outputs became more stable and visually appealing.
  • Enhanced alignment: Improved semantic alignment between text prompts and generated images.
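To replicate this kind of 1:1 mixing in your own pipeline, one simple approach is to sample from the two pools with equal probability. Here’s a minimal sketch with hypothetical stand-in data; this is not DeepSeek’s actual data pipeline:

```python
import random
from torch.utils.data import Dataset

class MixedDataset(Dataset):
    """Draws from a real pool and a synthetic pool at a fixed ratio.
    synthetic_ratio=0.5 gives the 1:1 mix described above."""

    def __init__(self, real_data, synthetic_data, synthetic_ratio=0.5, seed=0):
        self.real = real_data
        self.synthetic = synthetic_data
        self.ratio = synthetic_ratio
        self.rng = random.Random(seed)

    def __len__(self):
        # One "epoch" = as many draws as both pools combined.
        return len(self.real) + len(self.synthetic)

    def __getitem__(self, idx):
        # Flip a weighted coin per sample instead of concatenating pools,
        # so the ratio holds even when the pools are different sizes.
        pool = self.synthetic if self.rng.random() < self.ratio else self.real
        return pool[self.rng.randrange(len(pool))]

# Toy stand-in pools of (image_path, caption) pairs:
real = [(f"real_{i}.jpg", "caption") for i in range(1000)]
synthetic = [(f"synth_{i}.jpg", "caption") for i in range(1000)]
mixed = MixedDataset(real, synthetic, synthetic_ratio=0.5)
print(len(mixed), mixed[0])
```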

A side by side comparison of image generation of a coffee mug for Janus and Janus-Pro-7B
Prompt: A steaming cup of coffee on a wooden table

Janus-Pro-7B’s performance

Janus-Pro-7B was tested across a variety of tasks like text-to-image generation, handling dense prompts, and multimodal understanding.

We’ll dive into the metrics that matter below, specifically how Janus-Pro compares to DALL-E 3.

A table of performance information for text to image experiments and datasets for Janus-Pro-7B and other models
Text-to-image performance on the GenEval Benchmark

Text-to-Image generation

  • Overall accuracy (GenEval Benchmark): This score reflects the model’s ability to generate outputs that match the input prompts across various tasks.
    • Janus-Pro-7B: 80%
    • DALL-E 3: 67%
    • Stable Diffusion 3 Medium: 74%

Specific metrics (a toy sketch of how this style of scoring works follows the list):

  • Single-object accuracy: Measures how well the model generates individual objects from simple prompts like “a red apple on a white plate.”
    • Janus-Pro-7B: 99%
    • DALL-E 3: 96%
  • Positional alignment: Evaluates the model’s ability to place objects in specific locations based on a prompt like “a cat sitting on the left side of a sofa.”
    • Janus-Pro-7B: 90%
    • DALL-E 3: 83%
  • Color and attribute alignment: Measures how accurately the model matches visual details, like generating “a golden retriever with a blue collar.”
    • Janus-Pro-7B: 79% (color), 66% (attributes).
    • DALL-E 3: 43% (color), 45% (attributes).
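For intuition on how these categories are measured: GenEval checks each generated image against the structured requirements of its prompt using an object detector, and an image passes only if every requirement holds. Below is a toy sketch of that scoring logic operating on already-extracted detections. The detection format and helper functions are hypothetical; the real benchmark derives detections from a vision model rather than taking them as input.

```python
# Toy GenEval-style checks. Each detection is a dict produced upstream
# by an object detector (hypothetical format, for illustration only).

def score_single_object(detections, target):
    # e.g. target="apple": pass if the object appears at all
    return any(d["name"] == target for d in detections)

def score_position(detections, obj_a, relation, obj_b):
    # e.g. "cat" left of "sofa": compare bounding-box centers
    a = next((d for d in detections if d["name"] == obj_a), None)
    b = next((d for d in detections if d["name"] == obj_b), None)
    if a is None or b is None:
        return False
    if relation == "left of":
        return a["center_x"] < b["center_x"]
    if relation == "right of":
        return a["center_x"] > b["center_x"]
    return False

def score_color(detections, target, color):
    # e.g. a "blue" collar: the detector also labels each object's color
    return any(d["name"] == target and d.get("color") == color
               for d in detections)

# One hypothetical image generated for "a cat on the left side of a sofa":
detections = [
    {"name": "cat", "center_x": 0.2, "color": "gray"},
    {"name": "sofa", "center_x": 0.6, "color": "red"},
]
print(score_single_object(detections, "cat"))                # True
print(score_position(detections, "cat", "left of", "sofa"))  # True
print(score_color(detections, "sofa", "red"))                # True
```

A model’s score for each category is then just the fraction of its generated images that pass these checks, which is what the percentages above report.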

A table of results on DPG-Bench for Janus-Pro-7B against other models, including DALL-E 3
Performance on DPG-Bench

Handling dense prompts: DPG-Bench

Janus-Pro-7B was also tested on DPG-Bench, a benchmark that evaluates a model's ability to handle dense image generation prompts, for example: “a black dog sitting between a bush and a pair of green pants standing up with nobody inside of them.”

  • Overall Score:
    • Janus-Pro-7B: 84.19
    • DALL-E 3: 83.50

Key Metrics:

  • Attribute Alignment:
    • Janus-Pro-7B: 89.4%
    • DALL-E 3: 88.39%
  • Relation Handling:
    • Janus-Pro-7B: 89.32%
    • DALL-E 3: 90.58% (slight edge).

Side-by-side comparison of DALL-E 3 and Janus-Pro-7B

Benchmarks only tell part of the story. Let's look at a few side-by-side examples of generated images from DALL-E 3 and Janus-Pro-7B.

Prompt: A photo of a potted plant and a backpack

A side-by-side comparison of AI-generated images based on the prompt 'A photo of a potted plant and a backpack.' The left image, generated by DALL-E 3, features two potted plants and a gray backpack. The right image, generated by Janus-Pro-7B, shows a single potted plant next to a black backpack.

  • Technically, the Janus-Pro-7B output adheres more closely to the prompt since the DALL-E 3 image has two potted plants
  • The Janus-Pro-7B image looks more realistic
  • The DALL-E 3 image is more aesthetically appealing

Prompt: A black dog sitting between a bush and a pair of green pants standing up with nobody inside of them

A side-by-side comparison of AI-generated images based on the prompt 'A black dog sitting between a bush and a pair of green pants standing up with nobody inside of them.' The left image, generated by DALL-E 3, shows a black dog near a bush and green pants with shoes. The right image, generated by Janus-Pro-7B, depicts a black dog sitting between two pairs of green pants, providing a more realistic interpretation of the prompt.

  • DALL-E 3 adheres to the instructions better since it correctly places the dog between a bush and a pair of pants
  • The Janus-Pro-7B image looks more realistic

Prompt: A man eating spaghetti with a hat

A side-by-side comparison of AI-generated images based on the prompt 'A man eating spaghetti with a hat.' The left image, generated by DALL-E 3, features a well-composed scene with a man in a hat eating spaghetti. The right image, generated by Janus-Pro-7B, portrays a man eating spaghetti with a hat but struggles with realistic facial features and hands.

  • Janus-Pro-7B really struggles with fingers and facial features
  • DALL-E 3 nails all the human-related features

Prompt: A spaceship that looks like the Sydney Opera House

A side-by-side comparison of AI-generated images based on the prompt 'A spaceship that looks like the Sydney Opera House.' The left image, generated by DALL-E 3, depicts a spaceship with architectural elements resembling the Sydney Opera House. The right image, generated by Janus-Pro-7B, shows a similar concept but with a more realistic rendering of a ship-like structure.

  • DALL-E 3 adheres to the prompt better since Janus-Pro-7B’s image doesn’t have qualities we often associate with spaceships
  • The Janus-Pro-7B image looks more realistic

Takeaways

Janus-Pro-7B images are more realistic but struggle heavily with human figures. DALL-E 3, while often less realistic, is very good at adhering to prompts and generating human-related features. DALL-E's outputs tend to have a polished "AI-generated" aesthetic.

Conclusion

Janus-Pro-7B is the real deal. The side-by-side comparisons back up the benchmark performance, though it struggles greatly when generating images of humans. You could make the argument that Janus-Pro-7B is much better than DALL-E 3 at producing images that look like the real world. All in all, another highly performant model from DeepSeek.

Dan Cleary
Founder