Over the past few years, there has been significant progress in the two key pillars of multimodal intelligence: understanding and generation. Recent works have tried to build a unified system that handles both. However, existing attempts largely treat the two domains independently, often relying on separate models for understanding and for generation.
So can a single transformer handle both multimodal understanding and generation? And can a single transformer combine both autoregressive and diffusion modeling?
To address this, researchers introduced Show-o, a single transformer that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o combines autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities.
First, the input data, regardless of its modality, is tokenized and assembled into a formatted input sequence. Show-o then processes text tokens autoregressively with causal attention and image tokens with (discrete) denoising diffusion modeling via full attention, and generates the desired output; both pieces are sketched below. In this way, Show-o handles image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation.
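To make the mixed attention pattern concrete, here is a minimal PyTorch sketch (not code from the paper) of how such a mask could be built for a sequence laid out as text tokens followed by image tokens: text positions attend causally, while image positions attend to each other bidirectionally and to the preceding text. The function name and the boolean-mask convention are assumptions for illustration.

```python
import torch

def build_mixed_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask for one mixed text/image sequence.

    is_image: (seq_len,) bool tensor, True where the token is an image token.
    Returns:  (seq_len, seq_len) bool tensor, True where attention is allowed.

    Text tokens attend causally (to themselves and earlier positions);
    image tokens additionally attend to every other image token, giving
    full attention within the image block for denoising diffusion.
    """
    seq_len = is_image.shape[0]
    # Causal mask: each position may attend to itself and earlier positions.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Full attention among image tokens (assumes a single image block).
    mask |= is_image.unsqueeze(0) & is_image.unsqueeze(1)
    return mask

# Example: 4 text tokens followed by 3 image tokens.
is_image = torch.tensor([False, False, False, False, True, True, True])
print(build_mixed_attention_mask(is_image).int())
```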
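And here is an equally hypothetical sketch of how the discrete denoising side could produce image tokens: start from all-[MASK] image slots and, over a few steps, commit the most confident predictions while leaving the rest masked for later steps, in the spirit of mask-predict discrete diffusion. The `model` interface, `mask_id`, and the cosine unmasking schedule are assumptions, not the paper's exact recipe.

```python
import math
import torch

@torch.no_grad()
def sample_image_tokens(model, text_ids, num_image_tokens=256,
                        mask_id=8191, num_steps=16):
    """Illustrative mask-predict sampler for the image slots of a sequence.

    Assumed interface: `model` takes a 1-D tensor of token ids and returns
    per-position logits of shape (seq_len, vocab_size); `mask_id` is the id
    of the special [MASK] image token.
    """
    image_ids = torch.full((num_image_tokens,), mask_id, dtype=torch.long)
    for step in range(num_steps):
        seq = torch.cat([text_ids, image_ids])
        logits = model(seq)[len(text_ids):]        # logits for image positions
        conf, pred = logits.softmax(-1).max(-1)    # confidence and best token

        # Cosine schedule: how many image tokens stay masked after this step.
        keep_masked = int(num_image_tokens *
                          math.cos(math.pi / 2 * (step + 1) / num_steps))

        still_masked = image_ids == mask_id
        num_to_unmask = max(int(still_masked.sum()) - keep_masked, 0)

        # Unmask the most confident predictions among still-masked positions.
        conf = conf.masked_fill(~still_masked, -float("inf"))
        unmask_idx = conf.topk(num_to_unmask).indices
        image_ids[unmask_idx] = pred[unmask_idx]
    return image_ids
```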
Across various benchmarks, Show-o demonstrates performance comparable or superior to that of existing individual models tailored for understanding or generation with an equivalent or larger number of parameters. This highlights its potential as a next-generation foundation model.
Paper: https://arxiv.org/pdf/2408.12528