xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Tags: llm, research paper

Author: Santosh Sawant

Published: August 19, 2024

Large Multimodal Models (LMMs) have attracted significant attention for their potential applications and emergent capabilities. However, recent work has demonstrated that large-scale, high-quality data are essential for training robust LMMs. The scarcity of such data, together with limited access to open weights, training recipes, and curated datasets, has widened the gap between open-source models and proprietary ones.

In response to these challenges, researchers have introduced xGen-MultiModal, or xGen-MM (BLIP-3), a new framework designed to scale up LMM training by utilizing an ensemble of multimodal interleaved datasets, curated caption datasets, and other publicly available datasets. xGen-MM (BLIP-3) uses a scalable vision token sampler (a perceiver resampler) and simplifies the training objectives to focus solely on the auto-regressive loss of text tokens in a multimodal context. xGen-MM trains on free-form interleaved images and text from the ensembled interleaved and caption datasets; each modality undergoes its own tokenization process, and the resulting tokens are fed into the pre-trained LLM in their natural order. A standard auto-regressive loss is then applied to the text tokens. The Vision Transformer is kept frozen during training, while all other parameters, including the token sampler and the pre-trained LLM, are trained.
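
The description above maps to a fairly simple training step. The sketch below is not the official implementation; it is a minimal PyTorch illustration, with assumed module names and dimensions, of the ingredients the paper describes: a frozen vision encoder, a trainable perceiver-resampler-style vision token sampler, and a pre-trained LLM optimized with an auto-regressive loss on text tokens only (vision positions are masked out of the loss).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value excluded from the loss (used for vision positions)


class VisionTokenSampler(nn.Module):
    """Perceiver-resampler-style sampler: a fixed set of learnable queries
    cross-attends to the image patch features and returns a small, fixed
    number of vision tokens for the LLM. Dimensions are illustrative."""

    def __init__(self, num_queries=128, vis_dim=1024, llm_dim=3072, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.proj = nn.Linear(vis_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_feats):                      # (B, P, vis_dim)
        kv = self.proj(patch_feats)                      # (B, P, llm_dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        vis_tokens, _ = self.attn(q, kv, kv)             # (B, num_queries, llm_dim)
        return vis_tokens


def training_step(vit, sampler, llm, pixels, text_embeds, text_labels):
    """One auto-regressive step on a single image-text pair.

    pixels:      (B, 3, H, W) raw image
    text_embeds: (B, T, llm_dim) embeddings of the surrounding text tokens
    text_labels: (B, T) text token ids
    """
    with torch.no_grad():                    # the ViT stays frozen during training
        patch_feats = vit(pixels)            # (B, P, vis_dim)
    vis_tokens = sampler(patch_feats)        # trainable vision token sampler

    # Interleave vision tokens with text embeddings in their natural order
    # (image first, then its text, in this simplified single-image case).
    inputs = torch.cat([vis_tokens, text_embeds], dim=1)
    vis_labels = torch.full(vis_tokens.shape[:2], IGNORE_INDEX,
                            dtype=torch.long, device=text_labels.device)
    labels = torch.cat([vis_labels, text_labels], dim=1)

    # Standard next-token prediction; the loss is computed on text tokens only.
    logits = llm(inputs_embeds=inputs).logits            # (B, L, vocab)
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           labels[:, 1:].reshape(-1),
                           ignore_index=IGNORE_INDEX)
    return loss
```

In practice a sample may interleave several images with surrounding text; the same masking idea applies, with the sampled vision tokens inserted at their natural positions in the sequence.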

Further, to scale up the training data, the researchers have also released MINT-1T, a trillion-token-scale interleaved dataset; BLIP3-KALE, a knowledge-augmented, high-quality dense caption dataset; BLIP3-OCR-200M, a large-scale dataset with dense OCR annotations; and BLIP3-GROUNDING-50M, a large-scale visual grounding dataset.
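
For readers who want to experiment with these corpora, the snippet below sketches how they could be streamed with the Hugging Face `datasets` library. The repository identifiers are assumptions inferred from the dataset names above and should be checked against the actual releases.

```python
# Sketch only: streaming the released corpora with Hugging Face `datasets`.
# The repository identifiers below are assumptions, not verified names.
from datasets import load_dataset

# Dense captions (BLIP3-KALE); streaming avoids a full download.
kale = load_dataset("Salesforce/blip3-kale", split="train", streaming=True)

# Trillion-token-scale interleaved data (MINT-1T), also streamed.
mint = load_dataset("mlfoundations/MINT-1T-HTML", split="train", streaming=True)

# Peek at a couple of records to inspect the schema.
for example in kale.take(2):
    print(example.keys())
```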

Paper: https://arxiv.org/pdf/2408.08872