JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

llm
research paper
Author: Santosh Sawant

Published: November 13, 2024

Recently, there has been a growing trend toward developing sophisticated LLMs that handle both image comprehension and text-to-image generation. This is typically achieved by incorporating either diffusion models or vector-quantized autoregressive models. Another approach builds on recent breakthroughs in rectified flow models, which provide a simple framework for generative modeling while delivering strong empirical performance.

Now researchers have proposed JanusFlow, a unified multimodal model that integrates rectified flow with an LLM architecture. Architecturally, JanusFlow requires only a lightweight encoder and decoder to adapt the LLM for rectified flow operations. For visual understanding, the LLM performs autoregressive next-token prediction to generate responses. For image generation, the LLM produces images with rectified flow: starting from Gaussian noise at 𝑑 = 0, it iteratively updates the latent 𝑧_𝑑 by predicting velocity vectors until reaching 𝑑 = 1. For simplicity, this description omits the VAE encoder, the skip connection used in generation, and the linear layer after 𝑓_enc.
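The generation loop described above can be illustrated with a minimal sketch. It assumes a plain Euler integrator with a fixed number of steps, which the text does not specify, and it replaces the LLM's velocity prediction with a hypothetical stand-in, `toy_velocity`, that simply points from the current state toward a known target. The names `euler_sample`, `toy_velocity`, and `target` are illustrative only and are not taken from JanusFlow.

```python
import random


def toy_velocity(z, d, target):
    """Hypothetical stand-in for the model's velocity prediction.

    For a straight path that ends at `target` when d = 1, the velocity at
    state z and step d is (target - z) / (1 - d).
    """
    return [(t - zi) / (1.0 - d) for zi, t in zip(z, target)]


def euler_sample(velocity_fn, dim, num_steps, seed=0):
    """Integrate from Gaussian noise at d = 0 up to d = 1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise, as in the description above.
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    step = 1.0 / num_steps
    for k in range(num_steps):
        d = k * step  # current position in [0, 1)
        v = velocity_fn(z, d)  # predicted velocity at (z, d)
        z = [zi + step * vi for zi, vi in zip(z, v)]
    return z


# Assumed toy target; a real model would be conditioned on a text prompt.
target = [0.5, -1.0, 2.0]
sample = euler_sample(lambda z, d: toy_velocity(z, d, target), dim=3, num_steps=30)
print(sample)
```

With this idealized velocity field the final Euler step lands on the target up to floating-point error. A trained model only approximates the field, so sample quality in practice depends on the number of steps and on the solver.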

To further improve JanusFlow's performance, two key strategies are used. First, separate vision encoders are maintained for the understanding and generation tasks, which prevents task interference and thus enhances comprehension capabilities. Second, the intermediate representations of the generation and understanding modules are aligned during training, which strengthens semantic coherence in the generation process.
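As a rough illustration of the second strategy, the sketch below computes one possible alignment term: the negative mean cosine similarity between per-token features from the generation path and from the understanding encoder. The choice of cosine similarity, the function names, and the assumption that both feature sets already share the same shape are illustrative; the text above does not state the exact loss used.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(gen_feats, und_feats):
    """Negative mean cosine similarity over paired token features.

    gen_feats: hypothetical intermediate features from the generation path.
    und_feats: hypothetical features from the understanding encoder.
    Lower values mean the two sets of features point in more similar directions.
    """
    sims = [cosine_similarity(g, u) for g, u in zip(gen_feats, und_feats)]
    return -sum(sims) / len(sims)


gen = [[1.0, 2.0], [3.0, 4.0]]
und = [[1.0, 2.0], [3.0, 4.0]]
print(alignment_loss(gen, und))
```

In a training loop, a term like this would be added to the generation objective with some weight, so that minimizing the total loss also pulls the two representations together.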

Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. Specifically, on the text-to-image generation benchmarks MJHQ FID-30k, GenEval, and DPG-Bench, JanusFlow achieves scores of 9.51, 0.63, and 80.09%, respectively. On the multimodal comprehension benchmarks MMBench, SeedBench, and GQA, JanusFlow attains scores of 74.9, 70.5, and 60.3, respectively, exceeding specialized models such as LLaVA-v1.5 and Qwen-VL-Chat.

Please refer to the research paper for more details.