VideoMamba: State Space Model for Efficient Video Understanding

llm
research paper
Author

Santosh Sawant

Published

March 12, 2024

Mastering spatiotemporal representation is one of the key areas in any video understanding task. However there usually are two challenges associated with it: (1) the large spatiotemporal redundancy within short video clips, and (2) the complex spatiotemporal dependencies among long contexts. Models such as 3D-CNN + Video transformer, S4, RMKV and RetNet tried to resolve above challenges associated with spatio-temporal but none has been successful so far.

So can Mamba work well for video understanding?

That’s what researchers have tried to address with VideoMamba, a purely SSM-based model tailored for video understanding. VideoMamba harmoniously merges the strengths of convolution and attention in vanilla ViT style. It offers a linear-complexity method for dynamic spatiotemporal context modeling, ideal for high-resolution long videos.

Framework of VideoMamba strictly follow the architecture of vanilla ViT and adapt the bidirectional mamba block (B-Mamba) for 3D video sequences. Bidirectional Mamba (B-Mamba) block, adapts bidirectional sequence modeling for vision-specific applications. This block processes flattened visual sequences through simultaneous forward and backward SSMs, enhancing its capacity for spatially-aware processing. To apply the B-Mamba layer for spatiotemporal input, VideoMamba extends the original 2D scan into different Spatial-First bidirectional 3D scan, organizing spatial tokens by location then stacking them frame by frame.

Extensive evaluations reveal VideoMamba’s four core abilities: (1) Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) Sensitivity for recognizing short-term actions even with fine-grained motion differences; (3) Superiority in long-term video understanding, showcasing significant advancements over traditional feature-based models; and (4) Compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark for video understanding, offering a scalable and efficient solution for comprehensive video understanding.

Paper : https://lnkd.in/g8quHTqR

Code : https://lnkd.in/gFN3sbZ5

Model : https://lnkd.in/gH85xRkz