Large language models, especially the LLaMA family, have sparked great interest in the research community around multimodal applications, where many methods rely heavily on LLaMA for text processing and on CLIP-style vision transformers for visual perception.
Can the same transformer be used to process both text and 2D images?
That is what the researchers address by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA. VisionLLaMA follows the ViT pipeline while retaining the architecture design of LLaMA as closely as possible. An image of size H × W is first split and flattened into N = HW/P² non-overlapping patches X ∈ ℝ^(N×C). A class token is then prepended to the sequence, and the whole sequence is processed by L VisionLLaMA blocks. The basic block differs from the standard ViT block in two components: self-attention with rotary positional encoding (RoPE) and SwiGLU activation. The researchers also introduce AS2DRoPE (auto-scaled 2D RoPE), which expands rotary positional encoding from 1D to 2D and uses interpolation scaling to accommodate arbitrary resolutions. For the pyramid form, VisionLLaMA is applied to window-based transformers such as Twins, which utilize additive relative position encoding as in Swin.
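Below is a minimal PyTorch sketch of one VisionLLaMA-style block, combining self-attention with a 2D rotary position embedding and a SwiGLU feed-forward layer. It is an illustration under stated assumptions rather than the authors' implementation: the class token is omitted, the 2D rotary embedding here uses fixed integer grid coordinates (the paper's AS2DRoPE additionally rescales coordinates to handle resolutions different from training), and the helper names such as `rope_2d` and `VisionLLaMABlock` are our own.

```python
# Sketch of a VisionLLaMA-style block: self-attention with 2D rotary position
# embedding (a simplified stand-in for AS2DRoPE) and a SwiGLU MLP.
# Shapes, hyperparameters, and helper names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope_2d(x, h, w, base=10000.0):
    """Apply a 2D rotary embedding to x of shape (B, heads, N, D), N = h*w.

    Half of the head dimension encodes the row index, the other half the
    column index; each half is rotated exactly like 1D RoPE.
    """
    B, H, N, D = x.shape
    assert N == h * w and D % 4 == 0
    d = D // 2                                                 # dims per axis
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2)
    ang = torch.cat([pos[:, :1] * freqs, pos[:, 1:] * freqs], dim=-1)  # (N, D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                        # interleaved pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class SwiGLU(nn.Module):
    """SwiGLU feed-forward as used in LLaMA: W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class VisionLLaMABlock(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.mlp = SwiGLU(dim, hidden=int(dim * 8 / 3))

    def forward(self, x, h, w):
        B, N, C = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        shape = (B, N, self.heads, self.dh)
        q, k, v = (t.reshape(shape).transpose(1, 2) for t in (q, k, v))
        q, k = rope_2d(q, h, w), rope_2d(k, h, w)   # rotate queries and keys
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn.transpose(1, 2).reshape(B, N, C))
        return x + self.mlp(self.norm2(x))


# Toy usage: a 14x14 grid of patch tokens with embedding dim 384.
tokens = torch.randn(2, 14 * 14, 384)
print(VisionLLaMABlock()(tokens, 14, 14).shape)  # torch.Size([2, 196, 384])
```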
In the experiments, VisionLLaMA was trained under both supervised and self-supervised schemes to validate its strength on a range of downstream vision tasks such as image classification, detection, and segmentation. In particular, its image-generation capacity was explored under the diffusion frameworks DiT and SiT. VisionLLaMA converges much faster than ViT across all model sizes: SiT-LLaMA trained for 300k iterations even outperforms the baseline trained for 400k steps, and VisionLLaMA also converges faster than DeiT3-L. In conclusion, VisionLLaMA shows strong potential to serve as a new vision backbone for a broad range of downstream applications.