Publications

LiDAR-CS dataset: LiDAR point cloud dataset with cross-sensors for 3D object detection

Published in ICRA 2024, 2024

LiDAR devices are widely used in autonomous driving, and research on 3D point clouds has made remarkable progress over the past years. However, deep learning-based methods rely heavily on annotated data and often face the domain generalization problem. Unlike 2D images, whose domains are usually related to texture information, features extracted from 3D point clouds are affected by the distribution of the points. Due to the lack of a 3D domain adaptation benchmark, the common practice is to train a model on one benchmark (e.g., Waymo) and evaluate it on another dataset (e.g., KITTI). In this setting, however, two types of domain gaps are entangled, the scenario domain and the sensor domain, which makes evaluation and analysis complicated. To address this, we propose the LiDAR Dataset with Cross-Sensors (LiDAR-CS Dataset), which contains large-scale annotated LiDAR point clouds captured under six different sensor configurations over the same corresponding scenarios, generated with a hybrid realistic LiDAR simulator. To the best of our knowledge, LiDAR-CS is the first dataset focused on sensor-level (i.e., point distribution) domain gaps for 3D object detection in real traffic. Furthermore, we evaluate and analyze several baseline detectors on the LiDAR-CS benchmark and demonstrate its applications.

Recommended citation:

Fang, Jin, Dingfu Zhou, Jingjing Zhao, Chulin Tang, Cheng-Zhong Xu, and Liangjun Zhang. "LiDAR-CS dataset: LiDAR point cloud dataset with cross-sensors for 3D object detection." arXiv preprint arXiv:2301.12515 (2023). 
https://arxiv.org/pdf/2301.12515

Multi-sem fusion: multimodal semantic fusion for 3D object detection

Published in IEEE Transactions on Geoscience and Remote Sensing, 2024

LiDAR and camera fusion techniques are promising for achieving 3D object detection in autonomous driving. Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds to enhance detection accuracy. Nevertheless, the restricted resolution of 2D feature maps impedes accurate re-projection and often induces a pronounced boundary-blurring effect, which is primarily attributed to erroneous semantic segmentation. To handle this limitation, we propose a general multi-modal fusion framework, Multi-Sem Fusion (MSF), which fuses semantic information from both 2D image and 3D point cloud scene parsing results. Specifically, we employ 2D/3D semantic segmentation methods to generate parsing results for the 2D images and 3D point clouds. The 2D semantic information is then re-projected onto the 3D point clouds using the calibration parameters. To handle the misalignment between the 2D and 3D parsing results, we propose an Adaptive Attention-based Fusion (AAF) module that fuses them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then sent to the following 3D object detector. Furthermore, we propose a Deep Feature Fusion (DFF) module to aggregate deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks against different baselines. The experimental results show that the proposed fusion strategies significantly improve detection performance compared to methods using only point clouds or only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets state-of-the-art results on the nuScenes testing benchmark.
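
As a concrete illustration of the adaptive fusion idea, here is a minimal sketch (not the released MSF/AAF implementation; module and parameter names are assumptions) of learning a point-wise score that blends 2D and 3D semantic predictions:

```python
# Minimal sketch (assumed names, not the paper's code): fuse per-point 2D and 3D
# semantic score vectors with a learned, point-wise fusion weight.
import torch
import torch.nn as nn

class AdaptiveSemanticFusion(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        # Small MLP that predicts a fusion weight in [0, 1] per point.
        self.gate = nn.Sequential(
            nn.Linear(2 * num_classes, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, sem_2d: torch.Tensor, sem_3d: torch.Tensor) -> torch.Tensor:
        # sem_2d, sem_3d: (N, num_classes) class scores for N points, where
        # sem_2d comes from image parsing re-projected onto the points.
        w = self.gate(torch.cat([sem_2d, sem_3d], dim=-1))  # (N, 1)
        return w * sem_2d + (1.0 - w) * sem_3d              # fused scores (N, C)

fused = AdaptiveSemanticFusion(num_classes=10)(torch.rand(4096, 10), torch.rand(4096, 10))
```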

Recommended citation:

Xu, Shaoqing, Fang Li, Ziying Song, Jin Fang, Sifen Wang, and Zhi-Xin Yang. "Multi-sem fusion: multimodal semantic fusion for 3D object detection." IEEE Transactions on Geoscience and Remote Sensing (2024). 
https://arxiv.org/pdf/2212.05265

Semi-supervised 3D object detection with proficient teachers

Published in European Conference on Computer Vision, 2022

Dominant point cloud-based 3D object detectors in autonomous driving scenarios rely heavily on large amounts of accurately labeled samples; however, 3D annotation of point clouds is extremely tedious, expensive, and time-consuming. To reduce the dependence on large-scale supervision, semi-supervised learning (SSL) based approaches have been proposed. Pseudo-labeling is commonly used in SSL frameworks, but low-quality predictions from the teacher model seriously limit its performance. In this work, we propose a new pseudo-labeling framework for semi-supervised 3D object detection that enhances the teacher model into a proficient one with several necessary designs. First, to improve the recall of pseudo labels, a Spatial-temporal Ensemble (STE) module is proposed to generate sufficient seed boxes. Second, to improve the precision of the recalled boxes, a Clustering-based Box Voting (CBV) module is designed to obtain aggregated votes from the clustered seed boxes. This also eliminates the need for sophisticated thresholds to select pseudo labels. Furthermore, to reduce the negative influence of wrongly pseudo-labeled samples during training, a soft supervision signal is proposed based on Box-wise Contrastive Learning (BCL). The effectiveness of our model is verified on both the ONCE and Waymo datasets. For example, on ONCE, our approach significantly improves the baseline by 9.51 mAP. Moreover, with only half of the annotations, our model outperforms the oracle model trained with full annotations on Waymo.
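
For intuition, the following is a toy sketch of clustering-style box voting under assumed inputs: it greedily groups seed boxes by BEV center distance and averages them with score weights. This is a simplification of the paper's CBV module, and the naive yaw averaging here is only for illustration:

```python
# Illustrative sketch only (not the paper's CBV module): cluster seed boxes by
# BEV center distance and produce one score-weighted "voted" box per cluster.
import numpy as np

def cluster_box_voting(boxes: np.ndarray, scores: np.ndarray, radius: float = 1.0):
    """boxes: (N, 7) [x, y, z, dx, dy, dz, yaw]; scores: (N,)."""
    order = np.argsort(-scores)
    unassigned = set(order.tolist())
    voted = []
    for i in order:
        if i not in unassigned:
            continue
        # Greedily group every remaining seed box whose BEV center is within `radius`.
        members = [j for j in unassigned
                   if np.linalg.norm(boxes[j, :2] - boxes[i, :2]) < radius]
        unassigned -= set(members)
        w = scores[members] / scores[members].sum()
        # Note: averaging yaw directly is a simplification for this sketch.
        voted.append(np.average(boxes[members], axis=0, weights=w))
    return np.stack(voted)

print(cluster_box_voting(np.random.rand(20, 7) * 5, np.random.rand(20)))
```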

Recommended citation:

Yin, Junbo, Jin Fang, Dingfu Zhou, Liangjun Zhang, Cheng-Zhong Xu, Jianbing Shen, and Wenguan Wang. "Semi-supervised 3D object detection with proficient teachers." In European Conference on Computer Vision, pp. 727-743. Cham: Springer Nature Switzerland, 2022. 
https://arxiv.org/pdf/2207.12655

ProposalContrast: Unsupervised Pre-training for LiDAR-Based 3D Object Detection

Published in European conference on computer vision, 2022

Existing approaches for unsupervised point cloud pre-training are constrained to either scene-level or point/voxel-level instance discrimination. Scene-level methods tend to lose the local details that are crucial for recognizing road objects, while point/voxel-level methods inherently suffer from limited receptive fields that are incapable of perceiving large objects or contextual environments. Considering that region-level representations are more suitable for 3D object detection, we devise a new unsupervised point cloud pre-training framework, called ProposalContrast, that learns robust 3D representations by contrasting region proposals. Specifically, with an exhaustive set of region proposals sampled from each point cloud, geometric point relations within each proposal are modeled to create expressive proposal representations. To better accommodate 3D detection properties, ProposalContrast optimizes with both inter-cluster and inter-proposal separation, i.e., sharpening the discriminativeness of proposal representations across semantic classes and object instances. The generalizability and transferability of ProposalContrast are verified on various 3D detectors (i.e., PV-RCNN, CenterPoint, PointPillars and PointRCNN) and datasets (i.e., KITTI, Waymo and ONCE).
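
The contrastive objective can be pictured with a standard InfoNCE loss over paired proposal embeddings from two augmented views; this sketch omits the inter-cluster (class-separation) term and is not the authors' implementation:

```python
# Minimal InfoNCE-style sketch: contrast proposal embeddings from two augmented
# views of the same point cloud; matching proposals are positives.
import torch
import torch.nn.functional as F

def proposal_info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    # z1, z2: (P, D) embeddings of the same P region proposals under two views.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau             # (P, P) cross-view similarity matrix
    targets = torch.arange(z1.size(0))     # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

loss = proposal_info_nce(torch.randn(128, 256), torch.randn(128, 256))
```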

Recommended citation:

Yin, Junbo, Dingfu Zhou, Liangjun Zhang, Jin Fang, Cheng-Zhong Xu, Jianbing Shen, and Wenguan Wang. "Proposalcontrast: Unsupervised pre-training for lidar-based 3d object detection." In European conference on computer vision, pp. 17-33. Cham: Springer Nature Switzerland, 2022. 
https://arxiv.org/pdf/2207.12654

Context-aware 3D object detection from a single image in autonomous driving

Published in IEEE Transactions on Intelligent Transportation Systems, 2022

Camera sensors have been widely used in Driver-Assistance and Autonomous Driving Systems due to their rich texture information. Recently, with the development of deep learning techniques, many approaches have been proposed to detect 3D objects from a single frame; however, there is still much room for improvement. In this paper, we first review the recently proposed state-of-the-art monocular 3D object detection approaches. Based on an analysis of the disadvantages of previous center-based frameworks, a novel feature aggregation strategy is proposed to boost 3D object detection by exploiting context information. Specifically, an Instance-Guided Spatial Attention (IGSA) module is proposed to collect local instance information, and a Channel-Wise Feature Attention (CWFA) module is employed to aggregate global context information. In addition, an instance-guided object regression strategy is proposed to alleviate the influence of center location prediction uncertainty in the inference process. Finally, the proposed approach has been verified on the public 3D object detection benchmark. The experimental results show that the proposed approach significantly boosts the performance of the baseline method on both 3D detection and 2D Bird's-Eye View detection across all three categories. Furthermore, our method outperforms all monocular-based methods (even those trained with depth as auxiliary inputs) and achieves state-of-the-art performance on the KITTI benchmark.

Recommended citation:

Zhou, Dingfu, Xibin Song, Jin Fang, Yuchao Dai, Hongdong Li, and Liangjun Zhang. "Context-aware 3D object detection from a single image in autonomous driving." IEEE Transactions on Intelligent Transportation Systems 23, no. 10 (2022): 18568-18580. 
https://ieeexplore.ieee.org/abstract/document/9729810

Lidar-aug: A general rendering-based augmentation framework for 3d object detection

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

Annotating the LiDAR point cloud is crucial for deep learning-based 3D object detection tasks. Due to expensive labeling costs, data augmentation has been taken as a necessary module and plays an important role in training the neural network. "Copy" and "paste" (i.e., GT-Aug) is the most commonly used data augmentation strategy; however, the occlusion between objects has not been taken into consideration. To handle the above limitation, we propose a rendering-based LiDAR augmentation framework (i.e., LiDAR-Aug) to enrich the training data and boost the performance of LiDAR-based 3D object detectors. The proposed LiDAR-Aug is a plug-and-play module that can be easily integrated into different types of 3D object detection frameworks. Compared to the traditional object augmentation methods, LiDAR-Aug is more realistic and effective. Finally, we verify the proposed framework on the public KITTI dataset with different 3D object detectors. The experimental results show the superiority of our method compared to other data augmentation strategies. We plan to make our data and code public to help other researchers reproduce our results.
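
For reference, the conventional GT-Aug baseline that LiDAR-Aug improves on amounts to pasting sampled ground-truth object points into a scene without modeling occlusion; a rough sketch with illustrative field names might look like this:

```python
# Sketch of the plain "copy-and-paste" (GT-Aug) baseline, not LiDAR-Aug itself:
# paste sampled ground-truth object points into a scene with no occlusion model.
import numpy as np

def gt_aug(scene_points: np.ndarray, object_bank: list, num_paste: int = 5,
           rng: np.random.Generator = np.random.default_rng()):
    """scene_points: (N, 4) LiDAR points [x, y, z, intensity];
    object_bank: list of dicts with 'points' (M, 4) and 'box' (7,) entries."""
    pasted_boxes = []
    for sample in rng.choice(object_bank, size=min(num_paste, len(object_bank)),
                             replace=False):
        scene_points = np.concatenate([scene_points, sample["points"]], axis=0)
        pasted_boxes.append(sample["box"])
    return scene_points, np.stack(pasted_boxes) if pasted_boxes else np.zeros((0, 7))

bank = [{"points": np.random.rand(50, 4), "box": np.random.rand(7)} for _ in range(10)]
scene, boxes = gt_aug(np.random.rand(20000, 4), bank)
```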

Recommended citation:

Fang, Jin, Xinxin Zuo, Dingfu Zhou, Shengze Jin, Sen Wang, and Liangjun Zhang. "Lidar-aug: A general rendering-based augmentation framework for 3d object detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4710-4720. 2021. 
https://openaccess.thecvf.com/content/CVPR2021/papers/Fang_LiDAR-Aug_A_General_Rendering-Based_Augmentation_Framework_for_3D_Object_Detection_CVPR_2021_paper.pdf

Autoshape: Real-time shape-aware monocular 3d object detection

Published in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021

Existing deep learning-based approaches for monocular 3D object detection in autonomous driving often model the object as a rotated 3D cuboid while ignoring the object's geometric shape. In this work, we propose an approach for incorporating shape-aware 2D/3D constraints into the 3D detection framework. Specifically, we first employ a deep neural network to learn distinguished 2D keypoints in the 2D image domain and regress their corresponding 3D coordinates in the local object coordinate system. Then the 2D/3D geometric constraints built from these correspondences are used for each object to boost the detection performance. For generating the ground truth of the 2D/3D keypoints, an automatic model-fitting approach is proposed that fits a deformed 3D object model to the object mask in the 2D image. The proposed framework has been verified on the public KITTI dataset, and the experimental results demonstrate that using the additional geometric constraints significantly improves detection performance compared to the baseline method. More importantly, the proposed framework achieves state-of-the-art performance in real time. Data and code are available at https://github.com/zongdai/AutoShape
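
The 2D/3D constraint described above boils down to a keypoint reprojection residual; a hedged sketch under a simple pinhole model (variable names are assumptions, and the intrinsics below are only example values) is:

```python
# Sketch of a keypoint reprojection residual: project regressed object-frame 3D
# keypoints into the image and compare with the detected 2D keypoints.
import numpy as np

def reprojection_residual(kps_3d_obj, kps_2d, K, R, t):
    """kps_3d_obj: (M, 3) keypoints in the object frame; kps_2d: (M, 2) pixels;
    K: (3, 3) intrinsics; R, t: object-to-camera rotation (3, 3) and translation (3,)."""
    cam = kps_3d_obj @ R.T + t                    # object frame -> camera frame
    proj = cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]             # perspective division
    return np.linalg.norm(proj - kps_2d, axis=1)  # per-keypoint pixel error

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])  # example intrinsics
err = reprojection_residual(np.random.rand(16, 3), np.random.rand(16, 2) * 100,
                            K, np.eye(3), np.array([0.0, 0.0, 10.0]))
```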

Recommended citation:

Liu, Zongdai, Dingfu Zhou, Feixiang Lu, Jin Fang, and Liangjun Zhang. "Autoshape: Real-time shape-aware monocular 3d object detection." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15641-15650. 2021. 
https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_AutoShape_Real-Time_Shape-Aware_Monocular_3D_Object_Detection_ICCV_2021_paper.pdf

Fusionpainting: Multimodal fusion with adaptive attention for 3d object detection

Published in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021

Accurate detection of obstacles in 3D is an essential task for autonomous driving and intelligent transportation. In this work, we propose a general multimodal fusion framework, FusionPainting, to fuse 2D RGB images and 3D point clouds at a semantic level for boosting the 3D object detection task. Specifically, the FusionPainting framework consists of three main modules: a multi-modal semantic segmentation module, an adaptive attention-based semantic fusion module, and a 3D object detector. First, semantic information is obtained for the 2D images and 3D LiDAR point clouds based on 2D and 3D segmentation approaches. Then the segmentation results from the different sensors are adaptively fused by the proposed attention-based semantic fusion module. Finally, the point clouds painted with the fused semantic labels are sent to the 3D detector to obtain the 3D detection results. The effectiveness of the proposed framework has been verified on the large-scale nuScenes detection benchmark by comparing it with three different baselines. The experimental results show that the fusion strategy can significantly improve detection performance compared to methods using only point clouds and methods using point clouds painted only with 2D segmentation information. Furthermore, the proposed approach outperforms other state-of-the-art methods on the nuScenes testing benchmark.
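
The "painting" step itself is conceptually simple: the fused per-point semantic scores are appended to the raw LiDAR points before they enter the 3D detector. A minimal sketch with assumed shapes:

```python
# Sketch of the painting step (illustrative, not the released implementation):
# append fused per-point semantic scores to the raw LiDAR points.
import numpy as np

def paint_points(points: np.ndarray, fused_scores: np.ndarray) -> np.ndarray:
    """points: (N, 4) [x, y, z, intensity]; fused_scores: (N, C) class scores.
    Returns (N, 4 + C) painted points consumed by the downstream detector."""
    assert points.shape[0] == fused_scores.shape[0]
    return np.concatenate([points, fused_scores], axis=1)

painted = paint_points(np.random.rand(1000, 4), np.random.rand(1000, 10))
```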

Recommended citation:

Xu, Shaoqing, Dingfu Zhou, Jin Fang, Junbo Yin, Zhou Bin, and Liangjun Zhang. "Fusionpainting: Multimodal fusion with adaptive attention for 3d object detection." In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pp. 3047-3054. IEEE, 2021. 
https://arxiv.org/pdf/2106.12449

Invisible for both camera and lidar: Security of multi-sensor fusion based perception in autonomous driving under physical-world attacks

Published in 2021 IEEE symposium on security and privacy (SP), 2021

In Autonomous Driving (AD) systems, perception is both security and safety critical. Despite various prior studies on its security issues, all of them only consider attacks on camera- or LiDAR-based AD perception alone. However, production AD systems today predominantly adopt a Multi-Sensor Fusion (MSF) based design, which in principle can be more robust against these attacks under the assumption that not all fusion sources are (or can be) attacked at the same time. In this paper, we present the first study of security issues of MSF-based perception in AD systems. We directly challenge the basic MSF design assumption above by exploring the possibility of attacking all fusion sources simultaneously. This allows us for the first time to understand how much security guarantee MSF can fundamentally provide as a general defense strategy for AD perception. We formulate the attack as an optimization problem to generate a physically-realizable, adversarial 3D-printed object that misleads an AD system into failing to detect it and thus crashing into it. We propose a novel attack pipeline that addresses two main design challenges: (1) non-differentiable target camera and LiDAR sensing systems, and (2) non-differentiable cell-level aggregated features popularly used in LiDAR-based AD perception. We evaluate our attack on MSF algorithms included in representative open-source industry-grade AD systems in real-world driving scenarios. Our results show that the attack achieves over 90% success rate across different object types and MSF algorithms. Our attack is also found to be stealthy, robust to victim positions, transferable across MSF algorithms, and physical-world realizable after being 3D-printed and captured by LiDAR and camera devices. To concretely assess the end-to-end safety impact, we further perform simulation evaluation and show that it can cause a 100% vehicle collision rate for an industry-grade AD system.

Recommended citation:

Cao, Yulong, Ningfei Wang, Chaowei Xiao, Dawei Yang, Jin Fang, Ruigang Yang, Qi Alfred Chen, Mingyan Liu, and Bo Li. "Invisible for both camera and lidar: Security of multi-sensor fusion based perception in autonomous driving under physical-world attacks." In 2021 IEEE symposium on security and privacy (SP), pp. 176-194. IEEE, 2021. 
https://arxiv.org/pdf/2106.09249

MLDA-Net: Multi-level dual attention-based network for self-supervised monocular depth estimation

Published in IEEE Transactions on Image Processing, 2021

The success of supervised learning-based single image depth estimation methods critically depends on the availability of large-scale dense per-pixel depth annotations, which requires a laborious and expensive annotation process. Therefore, self-supervised methods are highly desirable and have attracted significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry, with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with sharper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy that can learn rich hierarchical representations. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to conduct effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.
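
In the spirit of the global/structure attention described above (not the actual MLDA-Net layers; channel sizes and the squeeze-and-excitation-style gating are assumptions), a dual-attention block could look like:

```python
# Hedged sketch of a "global + structure" dual-attention block on a 2D feature map.
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Global (channel-wise) attention via squeeze-and-excitation style gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Structure (spatial) attention from pooled channel statistics.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * self.spatial_gate(pooled)

out = DualAttention(64)(torch.randn(2, 64, 48, 160))
```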

Recommended citation:

Song, Xibin, Wei Li, Dingfu Zhou, Yuchao Dai, Jin Fang, Hongdong Li, and Liangjun Zhang. "MLDA-Net: Multi-level dual attention-based network for self-supervised monocular depth estimation." IEEE Transactions on Image Processing 30 (2021): 4691-4705. 
https://ieeexplore.ieee.org/abstract/document/9416235

Large Scale Autonomous Driving Scenarios Clustering with Self-supervised Feature Extraction

Published in IV 2021, 2021

Clustering autonomous driving scenario data can substantially benefit autonomous driving validation and simulation systems by improving the completeness and fidelity of simulation tests. This article proposes a comprehensive data clustering framework for a large set of vehicle driving data. Existing algorithms utilize handcrafted features whose quality relies on the judgment of human experts, and the related feature compression methods are not scalable to large datasets. Our approach thoroughly considers the traffic elements, including both in-traffic agent objects and map information. Meanwhile, we propose a self-supervised deep learning approach for spatial and temporal feature extraction to avoid biased data representation. With newly designed, data-augmentation-based evaluation metrics for driving data clustering, the accuracy assessment does not require a human-labeled dataset, which would be subject to human bias. Using these unbiased evaluation metrics, we show that our approach surpasses existing methods that rely on handcrafted feature extraction.
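
One way to picture the augmentation-based evaluation is a consistency check: a scenario embedding and its augmented copy should land in the same cluster. The sketch below is only illustrative, uses placeholder features, and is not the paper's metric:

```python
# Toy augmentation-consistency check: cluster scenario embeddings, then measure
# how often an embedding and its augmented copy receive the same cluster id.
import numpy as np
from sklearn.cluster import KMeans

def augmentation_consistency(embeddings: np.ndarray, augmented: np.ndarray,
                             num_clusters: int = 8) -> float:
    """embeddings, augmented: (N, D) features of scenarios and their augmented copies."""
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(embeddings)
    return float(np.mean(km.predict(embeddings) == km.predict(augmented)))

emb = np.random.rand(200, 32)
rate = augmentation_consistency(emb, emb + 0.01 * np.random.randn(200, 32))
```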

Recommended citation:

Zhao, Jinxin, Jin Fang, Zhixian Ye, and Liangjun Zhang. "Large scale autonomous driving scenarios clustering with self-supervised feature extraction." In 2021 IEEE Intelligent Vehicles Symposium (IV), pp. 473-480. IEEE, 2021. 
https://arxiv.org/pdf/2103.16101

MapFusion: A General Framework for 3D Object Detection with HDMaps

Published in IROS 2021, 2021

3D object detection is a key perception component in autonomous driving. Most recent approaches are based on LiDAR sensors only or fused with cameras. Maps (e.g., High Definition Maps), a basic infrastructure for intelligent vehicles, have not been well exploited for boosting object detection tasks. In this paper, we propose a simple but effective framework, MapFusion, to integrate map information into modern 3D object detector pipelines. In particular, we design a FeatureAgg module for HD map feature extraction and fusion, and a MapSeg module as an auxiliary segmentation head for the detection backbone. Our proposed MapFusion is detector independent and can be easily integrated into different detectors. The experimental results of three different baselines on a large public autonomous driving dataset demonstrate the superiority of the proposed framework. By fusing the map information, we achieve improvements of 1.27 to 2.79 mean Average Precision (mAP) points on three strong 3D object detection baselines.
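
A hedged sketch of the fusion step (assumed names and channel sizes, not the released MapFusion code): concatenate a rasterized HD-map BEV tensor with the detector's BEV feature map and mix them with a small convolutional block:

```python
# Illustrative BEV map-feature fusion: concat detector and HD-map BEV tensors,
# then apply a conv block; both inputs share the same BEV resolution.
import torch
import torch.nn as nn

class FeatureAggSketch(nn.Module):
    def __init__(self, det_channels: int, map_channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(det_channels + map_channels, det_channels, 3, padding=1),
            nn.BatchNorm2d(det_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, bev_feat: torch.Tensor, map_feat: torch.Tensor) -> torch.Tensor:
        # bev_feat: (B, C_det, H, W) detector features; map_feat: (B, C_map, H, W).
        return self.fuse(torch.cat([bev_feat, map_feat], dim=1))

fused = FeatureAggSketch(256, 16)(torch.randn(1, 256, 200, 176), torch.randn(1, 16, 200, 176))
```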

Recommended citation:

Fang, Jin, Dingfu Zhou, Xibin Song, and Liangjun Zhang. "Mapfusion: A general framework for 3d object detection with hdmaps." In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3406-3413. IEEE, 2021. 
https://arxiv.org/pdf/2103.05929

Rotpredictor: Unsupervised canonical viewpoint learning for point cloud classification

Published in 2020 international conference on 3D vision (3DV), 2020

Recently, significant progress has been achieved in analyzing 3D point clouds with deep learning techniques. However, existing networks suffer from poor generalization and robustness to arbitrary rotations applied to the input point cloud. Different from traditional strategies that improve rotation robustness with data augmentation or specifically designed spherical representations or harmonics-based kernels, we propose to rotate the point cloud into a canonical viewpoint for boosting the downstream target task, e.g., object classification and part segmentation. Specifically, the canonical viewpoint is predicted by the network RotPredictor in an unsupervised way, and the loss function is built only on the target task. Our RotPredictor approximately satisfies the rotation equivariance property in SO(3), and the predicted output has a linear relationship with the applied rotation transformation. In addition, RotPredictor is an independent plug-and-play module that can be employed by any point-based deep learning framework without extra burden. Experimental results on the public model classification dataset ModelNet40 show that the performance of all baselines can be boosted by integrating the proposed module. In addition, by adding our proposed module, we achieve state-of-the-art classification accuracy of 90.2% on the rotation-augmented ModelNet40 benchmark.
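
To make the canonicalization idea concrete, here is a rough sketch that predicts a rotation from a point cloud via a 6D rotation parameterization and applies it to the input; the architecture and names are assumptions, not the paper's RotPredictor:

```python
# Sketch: predict a canonicalizing rotation with a tiny point network using a
# 6D rotation parameterization (Gram-Schmidt), then rotate the input points.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rot6d_to_matrix(r6: torch.Tensor) -> torch.Tensor:
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)          # (B, 3, 3)

class RotationHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))
        self.head = nn.Linear(128, 6)

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: (B, N, 3). Global max-pool over per-point features, then predict
        # a rotation that maps the cloud toward its canonical viewpoint.
        R = rot6d_to_matrix(self.head(self.mlp(pts).max(dim=1).values))
        return pts @ R.transpose(-1, -2)              # canonicalized points

canon = RotationHead()(torch.randn(8, 1024, 3))
```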

Recommended citation:

Fang, Jin, Dingfu Zhou, Xibin Song, Shengze Jin, Ruigang Yang, and Liangjun Zhang. "Rotpredictor: Unsupervised canonical viewpoint learning for point cloud classification." In 2020 international conference on 3D vision (3DV), pp. 987-996. IEEE, 2020. 
https://ieeexplore.ieee.org/abstract/document/9320323

Joint 3d instance segmentation and object detection for autonomous driving

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Currently, in Autonomous Driving (AD), most 3D object detection frameworks (either anchor-based or anchor-free) treat detection as a Bounding Box (BBox) regression problem. However, this compact representation is not sufficient to capture all the information about the objects. To tackle this problem, we propose a simple but practical detection framework that jointly predicts the 3D BBox and instance segmentation. For instance segmentation, we propose a Spatial Embeddings (SEs) strategy to assemble all foreground points around their corresponding object centers. Based on the SE results, object proposals can be generated with a simple clustering strategy. For each cluster, only one proposal is generated, so the Non-Maximum Suppression (NMS) process is no longer needed. Finally, with our proposed instance-aware ROI pooling, the BBox is refined by a second-stage network. Experimental results on the public KITTI dataset show that the proposed SEs can significantly improve the instance segmentation results compared with other feature embedding-based methods. Meanwhile, it also outperforms most of the 3D object detectors on the KITTI testing benchmark.
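
The spatial-embedding clustering can be sketched as shifting each foreground point by a predicted offset toward its object center and then grouping the shifted points by a simple radius rule (illustrative values and grouping, not the paper's exact clustering):

```python
# Toy sketch: group points whose predicted center votes land close together;
# one group = one proposal, so no NMS over overlapping proposals is needed.
import numpy as np

def group_by_shifted_centers(points: np.ndarray, offsets: np.ndarray,
                             radius: float = 0.5) -> np.ndarray:
    """points, offsets: (N, 3). Returns an integer instance id per point."""
    centers = points + offsets                # spatial embeddings (center votes)
    ids = np.full(len(points), -1, dtype=int)
    next_id = 0
    for i in range(len(points)):
        if ids[i] != -1:
            continue
        members = np.linalg.norm(centers - centers[i], axis=1) < radius
        members &= ids == -1
        ids[members] = next_id
        next_id += 1
    return ids

labels = group_by_shifted_centers(np.random.rand(2000, 3) * 10,
                                  np.random.randn(2000, 3) * 0.1)
```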

Recommended citation:

Zhou, Dingfu, Jin Fang, Xibin Song, Liu Liu, Junbo Yin, Yuchao Dai, Hongdong Li, and Ruigang Yang. "Joint 3d instance segmentation and object detection for autonomous driving." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1839-1849. 2020. 
https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhou_Joint_3D_Instance_Segmentation_and_Object_Detection_for_Autonomous_Driving_CVPR_2020_paper.pdf

Iafa: Instance-aware feature aggregation for 3d object detection from a single image

Published in Proceedings of the Asian Conference on Computer Vision, 2020

3D object detection from a single image is an important task in Autonomous Driving (AD), and various approaches have been proposed. However, the task is intrinsically ambiguous and challenging, as single image depth estimation is already an ill-posed problem. In this paper, we propose an instance-aware approach to aggregate useful information for improving the accuracy of 3D object detection, with the following contributions. First, an instance-aware feature aggregation (IAFA) module is proposed to collect local and global features for 3D bounding box regression. Second, we empirically find that the spatial attention module can be well learned by taking coarse-level instance annotations as a supervision signal. The proposed module significantly boosts the performance of the baseline method on both 3D detection and 2D bird's-eye-view detection across all three categories. Third, our proposed method outperforms all single-image-based approaches (even those trained with depth as auxiliary inputs) and achieves state-of-the-art 3D detection performance on the KITTI benchmark.

Recommended citation:

Zhou, Dingfu, Xibin Song, Yuchao Dai, Junbo Yin, Feixiang Lu, Miao Liao, Jin Fang, and Liangjun Zhang. "Iafa: Instance-aware feature aggregation for 3d object detection from a single image." In Proceedings of the Asian Conference on Computer Vision. 2020. 
https://openaccess.thecvf.com/content/ACCV2020/papers/Zhou_IAFA_Instance-Aware_Feature_Aggregation_for_3D_Object_Detection_from_a_ACCV_2020_paper.pdf

Deep fusionnet for point cloud semantic segmentation

Published in ECCV 2020, 2020

Many point cloud segmentation methods rely on transferring irregular points into a voxel-based regular representation. Although voxel-based convolutions are useful for feature aggregation, they produce ambiguous or wrong predictions if a voxel contains points from different classes. Other approaches (such as PointNets and point-wise convolutions) can take irregular points for feature learning, but their high memory and computational costs (such as for neighborhood search and ball-querying) limit their ability and accuracy for large-scale point cloud processing. To address these issues, we propose a deep fusion network architecture (FusionNet) with a unique voxel-based "mini-PointNet" point cloud representation and a new feature aggregation module (fusion module) for large-scale 3D semantic segmentation. Our FusionNet can learn more accurate point-wise predictions when compared to voxel-based convolutional networks. It can realize more effective feature aggregation with lower memory and computational complexity for large-scale point cloud segmentation when compared to the popular point-wise convolutions. Our experimental results show that FusionNet can take more than one million points on one GPU for training to achieve state-of-the-art accuracy on the large-scale SemanticKITTI benchmark. The code is available at https://github.com/feihuzhang/LiDARSeg.
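
A voxel-wise "mini-PointNet" can be sketched as a shared point MLP followed by a per-voxel max-pool; the version below uses scatter_reduce (PyTorch >= 1.12) and assumed shapes, and is not the released FusionNet code:

```python
# Sketch of a per-voxel mini-PointNet: shared MLP on points, then a max-pool
# over the points falling into each voxel.
import torch
import torch.nn as nn

class VoxelMiniPointNet(nn.Module):
    def __init__(self, in_dim: int = 4, out_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(inplace=True))

    def forward(self, points: torch.Tensor, voxel_ids: torch.Tensor, num_voxels: int):
        # points: (N, in_dim); voxel_ids: (N,) index of the voxel each point falls in.
        feats = self.mlp(points)                                    # (N, out_dim)
        out = feats.new_full((num_voxels, feats.size(1)), float("-inf"))
        out.scatter_reduce_(0, voxel_ids[:, None].expand_as(feats), feats,
                            reduce="amax", include_self=True)
        return torch.where(out.isinf(), torch.zeros_like(out), out)  # empty voxels -> 0

vox = VoxelMiniPointNet()(torch.randn(5000, 4), torch.randint(0, 1000, (5000,)), 1000)
```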

Recommended citation:

Zhang, Feihu, Jin Fang, Benjamin Wah, and Philip Torr. "Deep fusionnet for point cloud semantic segmentation." In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pp. 644-663. Springer International Publishing, 2020.
https://ora.ox.ac.uk/objects/uuid:80c17ed9-01ef-486e-bea5-962cc4b56528/download_file?safe_filename=Deep%2520FusionNet.pdf&type_of_work=Conference+item

Instance segmentation of lidar point clouds

Published in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020

We propose a robust baseline method for instance segmentation that is specially designed for large-scale outdoor LiDAR point clouds. Our method includes a novel dense feature encoding technique allowing the localization and segmentation of small, far-away objects, a simple but effective solution for single-shot instance prediction, and effective strategies for handling severe class imbalances. Since there is no public dataset for the study of LiDAR instance segmentation, we also build a new publicly available LiDAR point cloud dataset that includes both precise 3D bounding boxes and point-wise labels for instance segmentation, while being about 3 to 20 times as large as other existing LiDAR datasets. The dataset will be published at https://github.com/feihuzhang/LiDARSeg.

Recommended citation:

Zhang, Feihu, Chenye Guan, Jin Fang, Song Bai, Ruigang Yang, Philip HS Torr, and Victor Prisacariu. "Instance segmentation of lidar point clouds." In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9448-9455. IEEE, 2020. 
https://ora.ox.ac.uk/objects/uuid:0756a5a1-c855-4a99-afda-76f91c30906f/files/sh989r349k

Autoremover: Automatic object removal for autonomous driving videos

Published in Proceedings of the AAAI Conference on Artificial Intelligence, 2020

Motivated by the need for photo-realistic simulation in autonomous driving, in this paper we present a video inpainting algorithm, AutoRemover, designed specifically for generating street-view videos without any moving objects. In our setup, there are two challenges: the first is shadows, which are usually unlabeled but tightly coupled with the moving objects; the second is the large ego-motion in the videos. To deal with shadows, we build an autonomous driving shadow dataset and design a deep neural network to detect shadows automatically. To deal with large ego-motion, we take advantage of the multi-source data available in autonomous driving, in particular the 3D data. More specifically, the geometric relationship between frames is incorporated into an inpainting deep neural network to produce high-quality, structurally consistent video output. Experiments show that our method outperforms other state-of-the-art (SOTA) object removal algorithms, reducing the RMSE by over 19%.

Recommended citation:

Zhang, Rong, Wei Li, Peng Wang, Chenye Guan, Jin Fang, Yuhang Song, Jinhui Yu, Baoquan Chen, Weiwei Xu, and Ruigang Yang. "Autoremover: Automatic object removal for autonomous driving videos." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 12853-12861. 2020. 
https://aaai.org/ojs/index.php/AAAI/article/view/6982/6836

Augmented LiDAR simulator for autonomous driving

Published in IEEE Robotics and Automation Letters, 2020

In Autonomous Driving (AD), detection and tracking of obstacles on the roads is a critical task. Deep learning-based methods using annotated LiDAR data have been the most widely adopted approach for this. Unfortunately, annotating 3D point clouds is a very challenging, time-consuming, and costly task. In this paper, we propose a novel LiDAR simulator that augments real point clouds with synthetic obstacles (e.g., cars, pedestrians, and other movable objects). Unlike previous simulators that rely entirely on CG models and game engines, our augmented simulator bypasses the requirement to create high-fidelity background CAD models. Instead, we can simply deploy a vehicle with a LiDAR scanner to sweep the streets of interest to obtain the background point cloud, based on which annotated point clouds can be automatically generated. This unique "scan-and-simulate" capability makes our approach scalable and practical, ready for large-scale industrial applications. In this paper, we describe our simulator in detail, in particular the placement of obstacles, which is critical for performance enhancement. We show that detectors trained with our simulated LiDAR point clouds alone can perform comparably (within two percentage points) to those trained with real data. Mixing real and simulated data can achieve over 95% accuracy.

Recommended citation:

Fang, Jin, Dingfu Zhou, Feilong Yan, Tongtong Zhao, Feihu Zhang, Yu Ma, Liang Wang, and Ruigang Yang. "Augmented LiDAR simulator for autonomous driving." IEEE Robotics and Automation Letters 5, no. 2 (2020): 1931-1938. 
https://arxiv.org/pdf/1811.07112

Iou loss for 2d/3d object detection

Published in international conference on 3D vision (3DV), 2019

In 2D/3D object detection tasks, Intersection-over-Union (IoU) has been widely employed as an evaluation metric to evaluate the performance of different detectors in the testing stage. However, during the training stage, a common distance loss (e.g., L1 or L2) is often adopted as the loss function to minimize the discrepancy between the predicted and ground truth Bounding Box (Bbox). To eliminate the performance gap between training and testing, the IoU loss has been introduced for 2D object detection in prior work (Yu et al., 2016; Rezatofighi et al., 2019). Unfortunately, these approaches only work for axis-aligned 2D Bboxes and cannot be applied to more general object detection tasks with rotated Bboxes. To resolve this issue, we first investigate the IoU computation for two rotated Bboxes and then implement a unified IoU loss layer for both 2D and 3D object detection tasks. By integrating the implemented IoU loss into several state-of-the-art 3D object detectors, consistent improvements have been achieved for both bird's-eye-view 2D detection and point cloud 3D detection on the public KITTI benchmark.
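
For reference, the quantity at the heart of the loss, IoU between two rotated BEV boxes, can be computed with polygon clipping; the sketch below uses shapely for clarity and is not differentiable, so it stands in for the evaluation metric rather than the training loss layer itself:

```python
# Rotated BEV IoU via polygon intersection (evaluation-style sketch, not a
# differentiable loss layer).
import numpy as np
from shapely.geometry import Polygon

def rotated_bev_iou(box_a, box_b):
    """Each box: (cx, cy, dx, dy, yaw) in bird's-eye view."""
    def to_polygon(cx, cy, dx, dy, yaw):
        corners = np.array([[dx, dy], [dx, -dy], [-dx, -dy], [-dx, dy]]) / 2.0
        rot = np.array([[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]])
        return Polygon(corners @ rot.T + np.array([cx, cy]))
    pa, pb = to_polygon(*box_a), to_polygon(*box_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)

print(rotated_bev_iou((0, 0, 4, 2, 0.0), (0.5, 0.2, 4, 2, 0.3)))
```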

Recommended citation:

Zhou, Dingfu, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. "Iou loss for 2d/3d object detection." In 2019 international conference on 3D vision (3DV), pp. 85-94. IEEE, 2019. 
https://arxiv.org/pdf/1908.03851

AADS: Augmented autonomous driving simulation using data-driven algorithms

Published in Science robotics 4 (28), eaaw0863, 2019

Simulation systems have become an essential component in the development and validation of autonomous driving technologies. The prevailing state-of-the-art approach for simulation is to use game engines or high-fidelity computer graphics (CG) models to create driving scenarios. However, creating CG models and vehicle movements (e.g., the assets for simulation) remains a manual task that can be costly and time-consuming. In addition, the fidelity of CG images still lacks the richness and authenticity of real-world images and using these images for training leads to degraded performance. In this paper we present a novel approach to address these issues: Augmented Autonomous Driving Simulation (AADS). Our formulation augments real-world pictures with a simulated traffic flow to create photo-realistic simulation images and renderings. More specifically, we use LiDAR and cameras to scan street scenes. From the acquired trajectory data, we generate highly plausible traffic flows for cars and pedestrians and compose them into the background. The composite images can be re-synthesized with different viewpoints and sensor models. The resulting images are photo-realistic, fully annotated, and ready for end-to-end training and testing of autonomous driving systems from perception to planning. We explain our system design and validate our algorithms with a number of autonomous driving tasks from detection to segmentation and predictions. Compared to traditional approaches, our method offers unmatched scalability and realism. Scalability is particularly important for AD simulation and we believe the complexity and diversity of the real world cannot be realistically captured in a virtual environment. Our augmented approach combines the flexibility in a virtual environment (e.g., vehicle movements) with the richness of the real world to allow effective simulation of anywhere on earth.

Recommended citation:

Li, Wei, C. W. Pan, Rong Zhang, J. P. Ren, Y. X. Ma, Jin Fang, F. L. Yan et al. "AADS: Augmented autonomous driving simulation using data-driven algorithms." Science robotics 4, no. 28 (2019): eaaw0863. 
https://arxiv.org/pdf/1901.07849