2022 BEVFusion - A Simple and Robust LiDAR-Camera Fusion Framework
# BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework
# Metadata
- CiteKey:: liangBEVFusionSimpleRobust2022
- Type:: preprint
- Author:: Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, Zhi Tang
- Editor::
- Publisher:: arXiv
- Series::
- Series Number::
- Journal::
- Volume::
- Issue::
- Pages::
- Year:: 2022
- DOI:: 10.48550/arXiv.2205.13790
- ISSN::
- ISBN::
- Format:: PDF
# Abstract
Fusing the camera and LiDAR information has become a de-facto standard for 3D object detection tasks. Current methods rely on point clouds from the LiDAR sensor as queries to leverage the feature from the image space. However, people discovered that this underlying assumption makes the current fusion framework infeasible to produce any prediction when there is a LiDAR malfunction, regardless of minor or major. This fundamentally limits the deployment capability to realistic autonomous driving scenarios. In contrast, we propose a surprisingly simple yet novel fusion framework, dubbed BEVFusion, whose camera stream does not depend on the input of LiDAR data, thus addressing the downside of previous methods. We empirically show that our framework surpasses the state-of-the-art methods under the normal training settings. Under the robustness training settings that simulate various LiDAR malfunctions, our framework significantly surpasses the state-of-the-art methods by 15.7% to 28.9% mAP. To the best of our knowledge, we are the first to handle realistic LiDAR malfunction and can be deployed to realistic scenarios without any post-processing procedure. The code is available at https://github.com/ADLab-AutoDrive/BEVFusion.
# Files and Links
- Url:: http://arxiv.org/abs/2205.13790
- Uri:: http://zotero.org/users/5055703/items/8JYRU79S
- File:: liang_et_al_2022_bevfusion.pdf
- Local Library:: liang_et_al_2022_bevfusion.pdf
# Tags and Collections
- Keywords:: Computer Science - Computer Vision and Pattern Recognition, ⭐⭐⭐
- Collections:: CCAM
# Zotero Notes
Comment: Accepted at NeurIPS 2022
# Annotations
# Imported: 2022-12-17 4:06 pm
- ["] Current methods rely on point clouds from the LiDAR sensor as queries to leverage the feature from the image space. However, people discovered that this underlying assumption makes the current fusion framework infeasible to produce any prediction when there is a LiDAR malfunction, regardless of minor or major. This fundamentally limits the deployment capability to realistic autonomous driving scenarios. Page 1
- ["] To the best of our knowledge, we are the first to handle realistic LiDAR malfunction and can be deployed to realistic scenarios without any post-processing procedure. Page 1
- ["] it is often difficult to regress 3D bounding boxes on pure image inputs due to the lack of depth information, and similarly, it is difficult to classify objects on point clouds when LiDAR does not receive enough points. Page 1
- ["] As one needs to generate image queries from LiDAR points, the current LiDAR-camera fusion methods intrinsically depend on the raw point cloud of the LiDAR sensor Page 2
- ["] We argue the ideal framework for LiDAR-camera fusion should be, that each model for a single modality should not fail regardless of the existence of the other modality, yet having both modalities will further boost the perception accuracy. Page 2
- ["] As our framework is a general approach, we can incorporate current single modality BEV models for camera and LiDAR into our framework. Page 2
- ["] An overlooked assumption of the current fusion mechanism is they heavily rely on the LiDAR point clouds, in fact, if the LiDAR input is missing, these methods will inevitably fail. This will hinder the deployment of such algorithms in realistic settings. Page 3
- ["] Similarly, our framework can incorporate any network that transforms LiDAR points into BEV features Page 5
- ["] We conduct comprehensive experiments on a large-scale autonomous-driving dataset for 3D detection, nuScenes Page 6
- ["] In this paper, we introduce BEVFusion, a surprisingly simple yet unique LiDAR-camera fusion framework that disentangles the LiDAR-camera fusion dependency of previous methods. Our framework comprises two separate streams that encode raw camera and LiDAR sensor inputs into features in the same BEV space, followed by a simple module to fuse these features such that they can be passed into modern task prediction head architectures. The extensive experiments demonstrate the strong robustness and generalization ability of our framework against the various camera and LiDAR malfunctions. Page 10