MonoViM: Enhancing Self-Supervised Monocular Depth Estimation via Mamba

EasyChair Preprint 14171 • 15 pages • Date: July 25, 2024

Abstract

In recent years, self-supervised monocular depth estimation has been widely applied in fields such as autonomous driving and robotics. While Convolutional Neural Networks (CNNs) and Transformers are predominant in this area, they struggle to handle long-range dependencies efficiently while keeping computational complexity low. To address this problem, we propose MonoViM, the first model to integrate Mamba into self-supervised monocular depth estimation. Inspired by recent advances in State Space Models (SSMs), MonoViM incorporates the SSM-based Mamba architecture into its encoder and employs a 2D selective scanning mechanism. This allows each image patch to acquire contextual knowledge through a compressed hidden state while maintaining a large receptive field, reducing computational complexity from quadratic to linear. Comprehensive evaluations on the KITTI dataset, along with fine-tuning and zero-shot experiments on Cityscapes and Make3D, show that MonoViM outperforms current CNN-based and Transformer-based methods, achieving state-of-the-art performance and excellent generalization. Additionally, MonoViM achieves higher inference speed and better GPU utilization than Transformer-based methods, particularly with high-resolution inputs. The code is available at https://github.com/aifeixingdelv/MonoViM

Keyphrases: Monocular Depth Estimation, self-supervised learning, state-space models
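To make the abstract's central claim concrete, the sketch below illustrates how a selective-scan SSM achieves linear-time context aggregation: each step folds the input into a compressed hidden state, so every position sees all earlier positions without quadratic attention. This is a minimal illustrative toy, not the authors' released implementation; the function names (`selective_scan`, `cross_scan_2d`), the four-direction scan layout, and the constant `A`, `B`, `C` parameters are all simplifying assumptions (in real Mamba they are learned and input-dependent).

```python
import numpy as np

def selective_scan(x, A, B, C):
    """Toy 1-D selective-scan recurrence (illustrative, not the paper's code).
    For a length-L sequence this runs in O(L):
        h_t = A_t * h_{t-1} + B_t * x_t   (compressed hidden state)
        y_t = C_t . h_t
    A, B, C are per-step parameters; in Mamba they are input-dependent
    ("selective"), here they are just arrays passed in.
    """
    L, d = x.shape
    h = np.zeros(d)
    y = np.empty(L)
    for t in range(L):
        h = A[t] * h + B[t] * x[t]  # hidden state carries context forward
        y[t] = C[t] @ h
    return y

def cross_scan_2d(patches):
    """Hypothetical 2-D selective scan: flatten an H x W grid of patch
    features along four directions (row/column order, forward/backward),
    run the 1-D scan on each, then merge, so every patch gathers context
    from the whole image at linear cost."""
    H, W, d = patches.shape
    seqs = [
        patches.reshape(H * W, d),                            # rows, forward
        patches.reshape(H * W, d)[::-1],                      # rows, backward
        patches.transpose(1, 0, 2).reshape(H * W, d),         # cols, forward
        patches.transpose(1, 0, 2).reshape(H * W, d)[::-1],   # cols, backward
    ]
    L = H * W
    A = np.full((L, d), 0.9)  # fixed decay, purely for illustration
    B = np.ones((L, d))
    C = np.ones((L, d))
    outs = [selective_scan(s, A, B, C) for s in seqs]
    outs[1] = outs[1][::-1]  # undo the reversals
    outs[3] = outs[3][::-1]
    # map each directional output back to the H x W grid and average
    merged = (outs[0].reshape(H, W) + outs[1].reshape(H, W)
              + outs[2].reshape(W, H).T + outs[3].reshape(W, H).T) / 4
    return merged
```

The key design point the abstract refers to is visible in `selective_scan`: the loop touches each position once, so cost grows linearly with the number of patches, whereas Transformer self-attention compares every pair of patches and grows quadratically.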