Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

1UQMM Lab, University of Queensland, Brisbane, Australia

2The University of Tokyo, Tokyo, Japan

ECCV 2024
*Indicates Equal Contribution

Abstract

In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high costs associated with annotating new object classes. Our exploration of open-vocabulary (OV) learning in urban environments aims to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark a set of four potential solutions as baselines, categorizing them into either top-down or bottom-up approaches based on their input data strategies.

While effective, these methods exhibit certain limitations, such as missing novel objects in 3D box estimation or applying rigid priors that bias detection towards objects near the camera or of rectangular geometries. To overcome these limitations, we introduce a universal Find n' Propagate approach for 3D OV tasks, which maximizes the recall of novel objects and propagates this detection capability to more distant areas, thereby progressively capturing more novel instances.

In particular, we utilize a greedy box seeker to search for novel 3D boxes of varying orientations and depths in each generated frustum, and ensure the reliability of newly identified boxes through multi-view alignment and density ranking criteria. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances during self-training, combined with the fusion of base samples from a memory bank.

Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is made available at github.com/djamahl99/findnpropagate.

Baselines

In this work, we investigate the potential of leveraging OV learning for 3D object detection by employing high-resolution LiDAR data (Top) and multi-view imagery (Bottom). As illustrated in Fig. 1, four baseline solutions are designed: (1) Top-down Projection, (2) Top-down Self-train, (3) Top-down Clustering, and (4) Bottom-up Weakly-supervised 3D detection approaches to facilitate novel object discovery in point clouds.

The foundation of our Top-down strategies is inspired by advances in 2D OV learning: one can regress class-agnostic bounding boxes from base box annotations and subsequently leverage VLMs for open-vocabulary classification. Building on this, Top-down Self-train is a variant that further improves open-vocabulary performance through self-training. Beyond mere 2D projections, our third Top-down baseline explores the feasibility of applying open-vocabulary 3D segmentation directly to 3D detection, utilizing clustering techniques for 3D bounding box estimation.
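Below is a minimal sketch of the projection-and-classification step shared by the Top-down baselines: class-agnostic 3D proposals are projected into an image, cropped, and scored against text prompts with a VLM (CLIP here). The `lidar2img` matrix, the proposal source, and the prompt template are illustrative assumptions rather than the exact pipeline used in the paper.

```python
# Sketch: classify a class-agnostic 3D proposal by projecting it into an image
# and scoring the crop against open-vocabulary text prompts with CLIP.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def project_corners(corners_3d, lidar2img):
    """Project (8, 3) box corners into the image with a 4x4 LiDAR-to-image matrix."""
    pts = np.concatenate([corners_3d, np.ones((8, 1))], axis=1) @ lidar2img.T
    pts = pts[pts[:, 2] > 0.1]            # keep corners in front of the camera
    return pts[:, :2] / pts[:, 2:3]       # perspective division -> pixel coords

def classify_proposal(image: Image.Image, corners_3d, lidar2img, class_names):
    """Crop the 2D footprint of a 3D proposal and pick the best-matching class."""
    uv = project_corners(corners_3d, lidar2img)
    if len(uv) == 0:
        return None
    x1, y1 = np.maximum(uv.min(axis=0), 0.0)
    x2, y2 = np.minimum(uv.max(axis=0), image.size)
    if x2 <= x1 or y2 <= y1:
        return None
    crop = image.crop((x1, y1, x2, y2))
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        txt_feat = model.encode_text(
            clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        logits = img_feat @ txt_feat.T
    return class_names[logits.argmax().item()]
```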

Nevertheless, Top-down methods can easily overfit to the known classes and overlook novel objects with varying sizes and shapes. As visualized in Fig. 1, unseen objects of vastly different shapes, such as long vehicles like buses or small traffic cones, often go undetected in class-agnostic 3D proposals and are obscured in 2D crops due to occlusion.

The Bottom-up approach presents a cost-effective alternative akin to weakly-supervised 3D object detection, lifting 2D annotations to construct 3D bounding boxes. Unlike its Top-down counterparts, this approach is training-free and does not rely on any base annotations, potentially making it more generalizable and better able to find objects with diverse shapes and densities. In Baseline IV, we study FGR (Wei et al., 2021) as an exemplar of the Bottom-up weakly-supervised approach and evaluate its effectiveness in generating novel proposals. FGR first removes background points such as the ground plane, then incorporates human priors into key-vertex localization to refine box regression. However, its study was limited to regressing car objects: the vertex localization assumes rectangular geometry, which does not hold for other classes (e.g., pedestrians).
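For intuition, the sketch below illustrates the Bottom-up lifting idea in its simplest form: strip the ground plane from the points inside a detection frustum and fit a bird's-eye-view box to what remains. FGR's actual key-vertex localization is considerably more involved; the RANSAC plane fit and minimum-area-rectangle fit here are simplified stand-ins, not FGR itself.

```python
# Sketch: lift the points inside one 2D-detection frustum to a 3D box by
# removing the ground plane (RANSAC) and fitting a BEV minimum-area rectangle.
import numpy as np
import open3d as o3d
import cv2

def lift_frustum_points(points: np.ndarray):
    """points: (N, 3) LiDAR points falling inside one 2D-detection frustum."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    # RANSAC plane fit to find (and drop) the ground plane.
    _, ground_idx = pcd.segment_plane(distance_threshold=0.2,
                                      ransac_n=3, num_iterations=1000)
    mask = np.ones(len(points), dtype=bool)
    mask[ground_idx] = False
    obj = points[mask]
    if len(obj) < 5:
        return None
    # Fit a minimum-area rectangle in the x-y (bird's-eye-view) plane.
    (cx, cy), (l, w), yaw_deg = cv2.minAreaRect(obj[:, :2].astype(np.float32))
    z_min, z_max = obj[:, 2].min(), obj[:, 2].max()
    # Return the box as (x, y, z, l, w, h, yaw).
    return np.array([cx, cy, (z_min + z_max) / 2.0, l, w,
                     z_max - z_min, np.deg2rad(yaw_deg)])
```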

Our Method

To address the baseline limitations, we propose a novel Find n' Propagate approach that maximizes the recall rate of novel objects and then progressively propagates this knowledge to regions distant from the camera. We identify that most detection failures on novel objects stem from uncertainty in 3D object orientation and depth. This observation motivates a Greedy Box Seeker strategy that begins by generating an instance frustum for each unique 2D box prediction, obtained from region VLMs such as GLIP (Li et al., 2022) or pre-trained OV 2D detectors like OWL-ViT (Minderer et al., 2022).
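A minimal sketch of lifting a 2D detection into an instance frustum is given below, assuming a pinhole camera model; the depth range, intrinsic matrix, and camera-to-LiDAR transform are placeholder inputs rather than values from the paper.

```python
# Sketch: back-project the four corners of a 2D box at a near and far depth
# to obtain the eight corners of an instance frustum in the LiDAR frame.
import numpy as np

def box2d_to_frustum(box2d, cam_intrinsic, cam2lidar, d_near=1.0, d_far=60.0):
    """box2d: (x1, y1, x2, y2) in pixels; returns (8, 3) frustum corners."""
    x1, y1, x2, y2 = box2d
    corners_uv = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=np.float64)
    K_inv = np.linalg.inv(cam_intrinsic)
    frustum = []
    for depth in (d_near, d_far):
        # Back-project the four image corners at this depth (camera frame).
        uv1 = np.concatenate([corners_uv, np.ones((4, 1))], axis=1)
        pts_cam = (K_inv @ uv1.T).T * depth
        # Transform camera-frame points into the LiDAR frame.
        pts_cam_h = np.concatenate([pts_cam, np.ones((4, 1))], axis=1)
        frustum.append((cam2lidar @ pts_cam_h.T).T[:, :3])
    return np.concatenate(frustum, axis=0)
```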

These frustums are segmented into subspaces across different angles and depth levels to facilitate an exhaustive greedy search for the most apt 3D proposal, accommodating a wide variety of shapes and sizes. To control the quality of newly generated boxes, we implement a Greedy Box Oracle that employs two key criteria, multi-view alignment and density ranking, to select the most probable proposal. The rationale is that 2D predictions predominantly originate from objects near the camera, which are characterized by dense point clouds and substantial overlap with the 2D box upon re-projection.
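The sketch below illustrates how the two criteria could be combined to rank candidate boxes inside a frustum: re-projection overlap with the source 2D box (alignment) and the fraction of frustum points the candidate captures (density). The weighting `alpha` and the helper `project_box_fn` are hypothetical; the paper's exact scoring may differ.

```python
# Sketch: rank candidate 3D boxes in a frustum by 2D re-projection overlap
# and by how many of the frustum's points they enclose.
import numpy as np

def iou_2d(a, b):
    """IoU between two axis-aligned 2D boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def rank_candidates(candidates, points, box2d, project_box_fn, alpha=0.5):
    """
    candidates: list of (box3d, point_mask) pairs, where point_mask marks the
                frustum points falling inside that candidate box.
    project_box_fn: maps a 3D box to its image-plane footprint (x1, y1, x2, y2).
    Returns the candidate maximizing a weighted sum of alignment and density.
    """
    best, best_score = None, -np.inf
    for box3d, mask in candidates:
        alignment = iou_2d(project_box_fn(box3d), box2d)   # cross-view alignment
        density = mask.sum() / (len(points) + 1e-6)        # fraction of points captured
        score = alpha * alignment + (1 - alpha) * density
        if score > best_score:
            best, best_score = box3d, score
    return best
```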

Relying solely on pseudo labels generated from these 2D predictions could bias the detector towards objects near the camera and overlook those that are distant or obscured; we therefore propose a Remote Propagator to mitigate this bias. To augment novel pseudo labels with distant object geometries, geometry and density simulators perturb pseudo-labeled boxes to farther distances from the camera and mimic sparser point structures. The refined 3D proposals are then integrated into a memory bank, facilitating iterative training of the detection model.
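A simplified sketch of the remote/density simulation idea follows: a pseudo-labeled box and its interior points are pushed farther along the ray from the sensor, and points are randomly dropped to mimic the sparser returns of distant objects. The shift range and inverse-square drop rate below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: translate a pseudo-labeled box (and its points) away from the sensor
# along its viewing ray, then subsample points to imitate sparsity at range.
import numpy as np

def simulate_remote(box3d, points, max_shift=30.0, rng=None):
    """
    box3d:  (x, y, z, l, w, h, yaw) pseudo-labeled box in the LiDAR frame.
    points: (N, 3) points inside the box.
    """
    rng = rng or np.random.default_rng()
    center = box3d[:3]
    direction = center / (np.linalg.norm(center) + 1e-6)   # ray from sensor origin
    shift = rng.uniform(0.0, max_shift) * direction
    # Move the box and its interior points farther from the sensor.
    new_box = box3d.copy()
    new_box[:3] += shift
    new_points = points + shift
    # Drop points roughly in proportion to the squared increase in range.
    old_r, new_r = np.linalg.norm(center), np.linalg.norm(new_box[:3])
    keep_prob = min(1.0, (old_r / new_r) ** 2)
    keep = rng.random(len(new_points)) < keep_prob
    return new_box, new_points[keep]
```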

Qualitative Results

Green boxes represent ground-truth known classes, pink boxes denote ground-truth novel classes, and blue boxes indicate model predictions.

Poster

BibTeX

@inproceedings{DBLP:conf/eccv/Etche24,
  author    = {Djamahl Etchegaray and
               Zi Huang and
               Tatsuya Harada and
               Yadan Luo},
  title     = {Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments},
  booktitle = {Computer Vision - {ECCV} 2024 - 18th European Conference on Computer Vision},
  year      = {2024},
  url       = {https://doi.org/10.48550/arXiv.2403.13556}
}