DeepSeek, the Chinese AI lab behind the reasoning model DeepSeek-R1, which launched last week, isn’t slowing down. They’ve just released a new multimodal model variant: Janus-Pro-7B.
Janus-Pro-7B is a multimodal model; think DALL-E, but one that can both take images as input and generate images from a prompt.
Building on the success of R-1, which outperformed OpenAI’s o1 on several reasoning benchmarks (more on that here: DeepSeek R-1 Model Overview and How it Ranks Against OpenAI's o1), Janus-Pro-7B also crushed it on benchmarks.
Janus-Pro achieved 80% overall accuracy on text-to-image tasks, compared to 67% for DALL-E 3 and 74% for Stable Diffusion 3 Medium, and set new highs with 99% single-object accuracy and 90% positional alignment.
Benchmarks aside, our side-by-side comparisons showed Janus-Pro-7B is better at creating realistic images, though it struggles significantly with generating humans.
In addition to delivering a high-performing model, the researchers introduced a decoupled architecture for visual understanding and generation, effectively leveraged synthetic data to enhance training, and achieved state-of-the-art performance through data scaling techniques.
We’ll start by diving into how Janus-Pro-7B’s decoupled design sets it apart from other multimodal models.
You can check out the full technical report here if you'd like.
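If you want to poke at the model yourself, the snippet below sketches the image-understanding flow, adapted from the usage example in DeepSeek's Janus GitHub repo. Treat the class names, chat roles, and placeholder tokens as assumptions to verify against the current repo, not a guaranteed API:

```python
# Sketch of image-understanding inference with Janus-Pro-7B, adapted from the
# usage example in DeepSeek's Janus repo (github.com/deepseek-ai/Janus).
# Class names, chat roles, and special tokens are assumptions; verify against
# the repo before running.
import torch
from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor
from janus.utils.io import load_pil_images

model_path = "deepseek-ai/Janus-Pro-7B"
processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

# One user turn that attaches a local image for the model to describe.
conversation = [
    {
        "role": "<|User|>",
        "content": "<image_placeholder>\nDescribe this image.",
        "images": ["./example.jpg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
inputs = processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(model.device)

# The understanding encoder turns the image into embeddings the LM can read.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```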
Decoupling in Janus-Pro-7B
One of the most interesting aspects of this model, beyond its benchmark performance, is its decoupled architecture for handling multimodal tasks. Essentially, they separated visual understanding (when the model is sent an image to process) and text-to-image generation (when the model needs to generate an image).
In other multimodal models, there is a single encoder that handles both understanding and generating images. It’s a simpler architecture, but it has its drawbacks.
How it works
The researchers designed two separate encoders within the model, each dedicated to a specific multimodal task:
Understanding encoder
- Specialized in analyzing and understanding images (identifying objects, interpreting relationships, understanding scenes, etc).
Generation encoder
- Optimized for generating images from text prompts, with an emphasis on creativity, aesthetics, and compositional accuracy.
Each encoder feeds its output into the same underlying transformer, which produces the final output.
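To make that concrete, here is a toy PyTorch sketch of the routing. Every module name below is made up; it illustrates the decoupled design, not Janus-Pro's actual implementation:

```python
# Toy illustration of decoupled encoders feeding one shared transformer.
# All module names are hypothetical; this mirrors the idea, not Janus-Pro's code.
import torch
import torch.nn as nn

class DecoupledMultimodalModel(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        # Understanding path: maps input images to embeddings the LM can read.
        self.understanding_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=16, stride=16),  # crude patchifier
            nn.Flatten(2),                                # (B, 64, num_patches)
        )
        self.und_proj = nn.Linear(64, d_model)
        # Generation path: a separate codebook of discrete image tokens.
        self.image_token_embed = nn.Embedding(16384, d_model)
        # Shared autoregressive backbone consumes both kinds of embeddings.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.shared_transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, images=None, image_token_ids=None):
        if images is not None:  # understanding: image in, text out
            feats = self.understanding_encoder(images).transpose(1, 2)
            embeds = self.und_proj(feats)
        else:                   # generation: predict discrete image tokens
            embeds = self.image_token_embed(image_token_ids)
        return self.shared_transformer(embeds)

model = DecoupledMultimodalModel()
print(model(images=torch.randn(1, 3, 224, 224)).shape)               # understanding path
print(model(image_token_ids=torch.randint(0, 16384, (1, 32))).shape)  # generation path
```

The shared transformer never knows which path produced its input embeddings, which is what lets each encoder be specialized independently.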
Benefits of decoupling
- Performance gains: Janus-Pro topped benchmarks for both multimodal accuracy and text-to-image generation.
- More efficient: Using separate encoders for different tasks allows each encoder to be fine-tuned and trained on narrower, task-specific datasets.
Scaling up Janus-Pro-7B and its training data
The previous version of Janus had only 1.5B parameters, and scaling up to 7B is a major reason for the performance increase. The larger model is better at both visual reasoning and image generation.
In addition to increasing the model size, the researchers scaled up the training data for both multimodal understanding and text-to-image generation. This included a mix of real-world data and synthetic data to boost performance over previous model variants.
90 million samples were added, including image-captioning datasets, structured data with tables and charts, and specialized datasets for meme understanding and Chinese conversational contexts.
The role of synthetic data
What I found most interesting was the use of synthetic data and how well it worked.
The previous Janus variant relied heavily on real-world visual data, which was often inconsistent, of poor quality, or visually underwhelming.
When training Janus-Pro-7B, the researchers used synthetic data to enhance the quality and stability of image generation. They added 72 million high-quality synthetic examples, bringing the dataset to a 1:1 ratio of real-world to synthetic data.
Leveraging synthetic data had a few major benefits:
- Faster training convergence: The model learned more efficiently with clean, high-quality synthetic data.
- Stability and aesthetics: Even for complex prompts, outputs became more stable and visually appealing.
- Enhanced alignment: Improved semantic alignment between text prompts and generated images.
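For intuition, here is a minimal sketch of how a 1:1 real-to-synthetic mix might be wired into a training loader. The datasets below are random stand-ins with arbitrary sizes, not DeepSeek's actual pipeline:

```python
# Minimal sketch of a 1:1 real/synthetic training mix.
# The datasets are random stand-ins for image-caption pairs.
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

real_ds = TensorDataset(torch.randn(100_000, 8))      # stand-in: real-world examples
synthetic_ds = TensorDataset(torch.randn(72_000, 8))  # stand-in: synthetic examples
combined = ConcatDataset([real_ds, synthetic_ds])

# Weight each example so the two sources contribute 50/50 per batch,
# even when the underlying datasets differ in size.
weights = torch.cat([
    torch.full((len(real_ds),), 0.5 / len(real_ds)),
    torch.full((len(synthetic_ds),), 0.5 / len(synthetic_ds)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=64, sampler=sampler)

batch, = next(iter(loader))
print(batch.shape)  # torch.Size([64, 8]); ~half real, ~half synthetic
```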
Janus-Pro-7B’s performance
Janus-Pro-7B was tested across a variety of tasks like text-to-image generation, handling dense prompts, and multimodal understanding.
We’ll dive into the metrics that matter below, and specifically how Janus compares to DALL-E 3.
Text-to-Image generation
- Overall accuracy (GenEval benchmark): This score reflects the model’s ability to generate outputs that match the input prompts across various tasks.
  - Janus-Pro-7B: 80%
  - DALL-E 3: 67%
  - Stable Diffusion 3 Medium: 74%
Specific metrics (a rough scoring sketch follows this list):
- Single-object accuracy: Measures how well the model generates individual objects from simple prompts like “a red apple on a white plate.”
  - Janus-Pro-7B: 99%
  - DALL-E 3: 96%
- Positional alignment: Evaluates the model’s ability to place objects in specific locations based on a prompt like “a cat sitting on the left side of a sofa.”
  - Janus-Pro-7B: 90%
  - DALL-E 3: 83%
- Color and attribute alignment: Measures how accurately the model matches visual details, like generating “a golden retriever with a blue collar.”
  - Janus-Pro-7B: 79% (color), 66% (attributes)
  - DALL-E 3: 43% (color), 45% (attributes)
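To give a feel for how these metrics are scored, here is a simplified sketch of GenEval-style checks: an object detector runs on each generated image, and its detections are tested against what the prompt asked for. The detection format below is made up for illustration; the real benchmark uses an off-the-shelf detector and stricter matching rules:

```python
# Simplified illustration of GenEval-style rule checks on detector output.
# The Detection format is invented for this example.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    color: str
    x_center: float  # normalized 0..1, left to right

def single_object_ok(detections, wanted_label):
    """Single-object accuracy: is the requested object present at all?"""
    return any(d.label == wanted_label for d in detections)

def position_ok(detections, label_a, relation, label_b):
    """Positional alignment: e.g. is the cat left of the sofa?"""
    a = next((d for d in detections if d.label == label_a), None)
    b = next((d for d in detections if d.label == label_b), None)
    if a is None or b is None:
        return False
    return a.x_center < b.x_center if relation == "left of" else a.x_center > b.x_center

def color_ok(detections, wanted_label, wanted_color):
    """Color alignment: does the object have the requested color?"""
    return any(d.label == wanted_label and d.color == wanted_color for d in detections)

# Detector output for a hypothetical image generated from
# "a cat sitting on the left side of a sofa":
dets = [Detection("cat", "gray", 0.25), Detection("sofa", "beige", 0.6)]
print(single_object_ok(dets, "cat"))             # True
print(position_ok(dets, "cat", "left of", "sofa"))  # True
```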
Handling dense prompts: DPG-Bench
Janus-Pro-7B was also tested on DPG-Bench, a benchmark that evaluates a model's ability to handle dense image-generation prompts, such as “a black dog sitting between a bush and a pair of green pants standing up with nobody inside of them.”
- Overall score:
  - Janus-Pro-7B: 84.19
  - DALL-E 3: 83.50

Key metrics:
- Attribute alignment:
  - Janus-Pro-7B: 89.4%
  - DALL-E 3: 88.39%
- Relation handling:
  - Janus-Pro-7B: 89.32%
  - DALL-E 3: 90.58% (a slight edge for DALL-E 3)
Side-by-side comparison: DALL-E 3 vs. Janus-Pro-7B
Benchmarks only tell part of the story. Let's look at a few side-by-side examples of generated images from DALL-E 3 and Janus-Pro-7B.
Prompt: A photo of a potted plant and a backpack
- Technically, the Janus-Pro-7B output adheres more closely to the prompt since the DALL-E 3 image has two potted plants
- The Janus-Pro-7B image looks more realistic
- The DALL-E 3 image is more aesthetically appealing
Prompt: A black dog sitting between a bush and a pair of green pants standing up with nobody inside of them
- DALL-E 3 adheres to the instructions better since it correctly places the dog between a bush and a pair of pants
- The Janus-Pro-7B image looks more realistic
Prompt: A man eating spaghetti with a hat
- Janus-Pro-7B really struggles with fingers and facial features
- DALL-E 3 nails all the human-related features
Prompt: A spaceship that looks like the Sydney Opera House
- DALL-E 3 adheres to the prompt better since Janus-Pro-7B’s image doesn’t have qualities that we often associate with spaceships
- The Janus-Pro-7B image looks more realistic
Takeaways
Janus-Pro-7B’s images are more realistic but struggle heavily with human figures. DALL-E 3, while often less realistic, is very good at adhering to prompts and rendering humans, and its outputs tend to have a polished "AI-generated" aesthetic.
Conclusion
Janus-Pro-7B is the real deal. The side-by-side comparisons back up the benchmark performance, though the model struggles badly when generating images of humans. You could argue that Janus-Pro-7B is much better than DALL-E 3 at producing images that look like the real world. All in all, another highly performant model from DeepSeek.