Multimodal Large Language Models (MLLMs) have achieved promising OCR-free document understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory consumption and slower inference, particularly in multi-page document comprehension.
To address these challenges, the researchers propose DocOwl2, built around a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. DocOwl2 strengthens multi-page document comprehension while balancing token efficiency and question-answering performance.
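The key idea behind the compression is cross-attention: each low-resolution global feature acts as a query that attends to the high-resolution features covering the same spatial region, distilling that region into a single token. Below is a minimal PyTorch sketch of this idea; the class name `DocCompressor`, the single attention layer, and all dimensions (324 queries, 9 high-resolution tokens per region) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DocCompressor(nn.Module):
    """Cross-attention compressor sketch: each low-resolution global
    feature (query) attends to the high-resolution features aligned to
    the same spatial region (keys/values), yielding one token per query."""
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, global_feats: torch.Tensor, local_feats: torch.Tensor):
        # global_feats: (B, Q, D)    -- e.g. Q = 324 low-res query tokens
        # local_feats:  (B, Q, K, D) -- K high-res tokens per query region
        B, Q, K, D = local_feats.shape
        q = global_feats.reshape(B * Q, 1, D)   # one query per region
        kv = local_feats.reshape(B * Q, K, D)   # its high-res region
        out, _ = self.attn(q, kv, kv)           # (B*Q, 1, D)
        return out.reshape(B, Q, D)             # Q compressed tokens per image

# Toy usage: 324 queries, 9 high-res tokens per region (illustrative sizes)
comp = DocCompressor()
g = torch.randn(2, 324, 1024)
l = torch.randn(2, 324, 9, 1024)
print(comp(g, l).shape)  # torch.Size([2, 324, 1024])
```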
DocOwl2 leverages a Shape-adaptive Cropping Module and a low-resolution vision encoder to encode high-resolution document images. A vision-to-text module, H-Reducer, then merges horizontally adjacent visual features and aligns their dimension with the Large Language Model's (see the sketch after this paragraph). Next, the High-resolution DocCompressor greatly reduces the number of visual tokens while preserving most of the visual information. Finally, the compressed visual tokens of multiple images/pages are concatenated with the text instruction and fed to the Large Language Model for multimodal understanding.
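To make the H-Reducer step concrete, here is a hedged PyTorch sketch: a convolution with a 1x4 kernel merges every four horizontally adjacent features (a 4x token reduction, following the H-Reducer design from the earlier DocOwl 1.5 work), and a linear layer projects the result to the LLM hidden size. The feature-grid and hidden-size values are placeholder assumptions.

```python
import torch
import torch.nn as nn

class HReducer(nn.Module):
    """Vision-to-text sketch: a 1x4 convolution merges every 4 horizontally
    adjacent visual features (cutting tokens 4x), then a linear layer maps
    them to the LLM hidden size. Kernel shape follows DocOwl 1.5; the
    dimensions here are assumed for illustration."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.conv = nn.Conv2d(vis_dim, vis_dim, kernel_size=(1, 4), stride=(1, 4))
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, D) grid of visual features from the vision encoder
        x = x.permute(0, 3, 1, 2)          # (B, D, H, W)
        x = self.conv(x)                   # (B, D, H, W/4): horizontal merge
        x = x.flatten(2).transpose(1, 2)   # (B, H*W/4, D) token sequence
        return self.proj(x)                # align with the LLM dimension

red = HReducer()
feats = torch.randn(1, 32, 32, 1024)       # e.g. a 32x32 ViT feature grid
print(red(feats).shape)                    # torch.Size([1, 256, 4096])
```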
DocOwl2 sets a new state of the art on multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, DocOwl2 achieves comparable single-page understanding performance with fewer than 20% of the visual tokens.
Paper: https://arxiv.org/pdf/2409.03420