Introduction
This article surveys the most important publicly available datasets for 3D object detection, together with several popular open-source projects on GitHub, ranging from the widely used KITTI dataset to newer datasets aimed at current research directions such as multimodal and temporal fusion. The datasets are first divided into indoor and outdoor datasets according to the scene type; the commonly used datasets in each group are then introduced, followed by several research-friendly codebases and a brief summary of their similarities and differences.
Datasets
Indoor Datasets
Indoor 3D object detection is a relatively young research task. The main datasets currently in use are ScanNetV2 and SUN RGB-D.
ScanNetV2
Official website: http://www.scan-net.org/
Paper link: https://arxiv.org/abs/1702.04405
Benchmark: http://kaldir.vc.in.tum.de/scannet_benchmark/
ScanNetV2 is an indoor scene dataset from Stanford University, Princeton University, and the Technical University of Munich, presented at CVPR 2017. ScanNet is an RGB-D video dataset that can be used for semantic segmentation and object detection, with 1513 scanned scenes covering 21 object categories; 1201 scenes are used for training and 312 for testing. The number of points per scene varies, so end-to-end pipelines usually resample each scene (for example with farthest point sampling, FPS) to a fixed size. The dataset provides both 2D and 3D data. The 2D data consists of the frames of each scene (to limit overlap between consecutive frames, typically only every 50th frame is kept): instance and semantic labels are provided as .png images, color images as 8-bit RGB .jpg files, and depth images as 16-bit .png files, and each frame carries color, depth, instance-label, label, and camera pose information. The 3D data is a set of .ply files.
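As a rough illustration of such resampling, below is a minimal NumPy sketch of farthest point sampling; the function name and target point count are arbitrary choices for illustration and are not part of the official ScanNet tooling.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Subsample an (N, 3) point cloud to n_samples points with FPS."""
    n_points = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    min_dist = np.full(n_points, np.inf)          # distance to the nearest selected point
    selected[0] = np.random.randint(n_points)     # arbitrary seed point
    for i in range(1, n_samples):
        last = points[selected[i - 1]]
        dist = np.linalg.norm(points - last, axis=1)
        min_dist = np.minimum(min_dist, dist)
        selected[i] = np.argmax(min_dist)         # pick the point farthest from the current set
    return points[selected]

# Example: reduce a scene with a varying number of points to a fixed 2048 points.
scene = np.random.rand(50_000, 3).astype(np.float32)  # placeholder point cloud
fixed = farthest_point_sampling(scene, 2048)
```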

SUN RGB-D
Official website: http://rgbd.cs.princeton.edu/
Paper link: http://rgbd.cs.princeton.edu/paper.pdf
SUN RGB-D is an indoor dataset from Princeton University that can be used for segmentation and detection tasks. It contains 10,335 RGB-D images, a scale similar to PASCAL VOC. The entire dataset is densely annotated, with 146,617 2D polygon annotations and 64,595 3D bounding boxes with accurate object orientation, as well as the 3D room layout and scene category of each image. It is built as the union of three datasets: NYU Depth v2, Berkeley B3DO, and SUN3D.
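For readers unfamiliar with oriented 3D box annotations, the sketch below shows one common way to expand a box given as center, size, and heading angle into its eight corners; the axis and angle conventions are illustrative and do not necessarily match the SUN RGB-D toolbox.

```python
import numpy as np

def box_to_corners(center, size, heading):
    """Return the 8 corners (8, 3) of an upright 3D box.

    center: (cx, cy, cz); size: (l, w, h); heading: rotation about the up axis in radians.
    """
    l, w, h = size
    # Corners in the box's local frame, centered at the origin.
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    z = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
    corners = np.stack([x, y, z])                       # (3, 8)
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # rotation about the z (up) axis
    return (rot @ corners).T + np.asarray(center)

corners = box_to_corners(center=(1.0, 2.0, 0.5), size=(1.2, 0.6, 0.9), heading=0.3)
```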

Outdoor Datasets
KITTI
Official website for 3D object detection: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d
Paper link: http://www.cvlibs.net/publications/Geiger2012CVPR.pdf
The KITTI dataset was created jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago, and is one of the most widely used benchmarks for evaluating computer vision algorithms in autonomous driving. It covers tasks such as stereo matching, optical flow, visual odometry, 3D object detection, and 3D tracking in on-road environments. KITTI contains real image data collected from urban, rural, and highway scenes, with up to 15 cars and 30 pedestrians per image and varying degrees of occlusion and truncation. The full dataset consists of 389 stereo and optical flow image pairs, 39.2 km of visual odometry sequences, and more than 200k annotated 3D objects, captured and synchronized at 10 Hz. The raw data is organized into the categories 'Road', 'City', 'Residential', 'Campus', and 'Person'. For 3D object detection, the labels are subdivided into Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, and Misc.
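Each training frame also comes with a plain-text label file whose per-line fields are documented in the KITTI object development kit; the following is a minimal parser sketch (the class and field names here are my own).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KittiObject:
    cls: str                 # Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc, DontCare
    truncated: float         # 0 (not truncated) .. 1 (fully truncated)
    occluded: int            # 0 = fully visible .. 3 = unknown
    alpha: float             # observation angle in radians
    bbox: List[float]        # 2D image box: left, top, right, bottom (pixels)
    dimensions: List[float]  # 3D size: height, width, length (meters)
    location: List[float]    # x, y, z of the box bottom center in the camera frame (meters)
    rotation_y: float        # yaw around the camera Y axis (radians)

def read_kitti_label(path: str) -> List[KittiObject]:
    objects = []
    with open(path) as f:
        for line in f:
            v = line.split()
            objects.append(KittiObject(
                cls=v[0], truncated=float(v[1]), occluded=int(float(v[2])), alpha=float(v[3]),
                bbox=[float(x) for x in v[4:8]],
                dimensions=[float(x) for x in v[8:11]],
                location=[float(x) for x in v[11:14]],
                rotation_y=float(v[14]),
            ))
    return objects
```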

nuScenes
Official website: https://www.nuscenes.org/object-detection?externalData=all&mapData=all&modalities=Any
The dataset consists of 1000 scenes, each 20 seconds long and covering a variety of driving scenarios. Each scene contains 40 keyframes (two per second); the remaining frames are sweeps. Keyframes are manually annotated with bounding boxes that include size, extent, category, visibility, and other attributes. A teaser version with 100 scenes was released first, the full version with 1000 scenes followed in 2019, and a second version was announced for 2020.
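The data is usually accessed through the official nuscenes-devkit. Below is a minimal sketch assuming the mini split has been downloaded to a placeholder path; the exact API may differ slightly between devkit versions.

```python
# pip install nuscenes-devkit
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/nuscenes', verbose=True)

scene = nusc.scene[0]                                     # one 20-second scene
sample = nusc.get('sample', scene['first_sample_token'])  # first annotated keyframe

# Each keyframe links to the synchronized sensor data and to its box annotations.
lidar_token = sample['data']['LIDAR_TOP']
lidar_path, boxes, _ = nusc.get_sample_data(lidar_token)
print(lidar_path, len(boxes), len(sample['anns']))
```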

Waymo
Official website: https://waymo.com/open
Official download link: https://waymo.com/open/download/
Official data format analysis: https://waymo.com/open/data/
Code link: https://gitee.com/cmfighting/waymo_read
Waymo, the self-driving company under Google's parent company Alphabet, announced the Waymo Open Dataset on its blog on August 21, 2019. Unlike earlier academic benchmarks, the associated challenges come with prize money. In terms of data, Waymo provides 3000 driving segments with a total duration of 16.7 hours and an average segment length of about 20 seconds, amounting to 600,000 frames with roughly 25 million 3D bounding boxes and 22 million 2D bounding boxes across diverse autonomous driving scenarios.
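The data is distributed as TFRecord files of protobuf Frame messages. The sketch below follows the pattern shown in the official tutorial and assumes the waymo-open-dataset package and TensorFlow are installed; the file name is a placeholder.

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

dataset = tf.data.TFRecordDataset('segment-XXXX.tfrecord', compression_type='')
for data in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    # Each frame carries synchronized camera images, LiDAR returns, and 2D/3D labels.
    print(frame.context.name, len(frame.laser_labels), len(frame.camera_labels))
    break
```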

PandaSet
Official website: https://scale.com/open-datasets/pandaset
This autonomous driving dataset was collected in San Francisco. It contains over 100 scenes of 8 seconds each, with 48,000 camera images and 16,000 LiDAR sweeps, annotated with 28 object classes and 37 semantic segmentation labels. It is an autonomous driving object detection dataset built jointly by industry and academia.

Oxford Robotcar
Official website: https://robotcar-dataset.robots.ox.ac.uk/
Paper link: https://robotcar-dataset.robots.ox.ac.uk/images/robotcar_ijrr.pdf
This dataset comes from the Oxford Robotics Institute at the University of Oxford. Its radar is a Navtech CTS350-X millimeter-wave frequency-modulated continuous-wave (FMCW) scanning radar; in the configuration used, it provides a range resolution of 4.38 cm and a rotational resolution of 0.9 degrees, with a maximum range of 163 m.
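To make those numbers concrete, the sketch below converts a polar radar scan (azimuth bins by range bins) into 2D Cartesian points using the stated resolutions; it is an illustration only, not the official SDK.

```python
import numpy as np

RANGE_RESOLUTION_M = 0.0438   # 4.38 cm per range bin
AZIMUTH_RESOLUTION_DEG = 0.9  # 0.9 degrees per azimuth bin

def polar_scan_to_points(scan: np.ndarray, power_threshold: float) -> np.ndarray:
    """Convert a (num_azimuths, num_range_bins) power scan to (N, 2) Cartesian points."""
    num_az, num_bins = scan.shape
    azimuths = np.deg2rad(np.arange(num_az) * AZIMUTH_RESOLUTION_DEG)
    ranges = np.arange(num_bins) * RANGE_RESOLUTION_M
    az_idx, rng_idx = np.nonzero(scan > power_threshold)  # keep strong returns only
    r, a = ranges[rng_idx], azimuths[az_idx]
    return np.stack([r * np.cos(a), r * np.sin(a)], axis=1)

# Placeholder scan: 400 azimuth bins x 3721 range bins (163 m / 4.38 cm is roughly 3721).
scan = np.random.rand(400, 3721)
points = polar_scan_to_points(scan, power_threshold=0.99)
```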

A*3D
Official website link: https://github.com/I2RDL2/ASTAR-3D#Dataset
Dataset download link: https://github.com/I2RDL2/ASTAR-3D#Download
This dataset is still being updated and is compared against KITTI in its paper. It contains 230K manually labeled 3D object annotations across 39,179 LiDAR point cloud frames with corresponding front-facing RGB images, collected in Singapore. The experiments in the paper show that models trained on A*3D transfer well to KITTI, especially on the moderate and hard difficulty levels.

SemanticKITTI
Official website link: http://semantic-kitti.org/
Paper link: https://arxiv.org/abs/1904.01416
Benchmark: https://competitions.codalab.org/competitions/24025#learn_the_details-overview
This dataset targets semantic segmentation in autonomous driving scenarios rather than 3D object detection. It is a large dataset that provides point-wise labels for the LiDAR data of the KITTI Vision Benchmark; it is built on the odometry task data and provides annotations for 28 classes.
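For reference, the point clouds are the KITTI odometry .bin files (float32 x, y, z, remission) and, as described by the semantic-kitti-api, each .label file stores one uint32 per point whose lower 16 bits encode the semantic class and upper 16 bits the instance id. A minimal loader sketch (worth double-checking against the official tools):

```python
import numpy as np

def load_semantic_kitti_frame(bin_path: str, label_path: str):
    """Load one LiDAR frame and its per-point labels in the SemanticKITTI layout."""
    points = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)  # x, y, z, remission
    raw = np.fromfile(label_path, dtype=np.uint32)
    semantic = raw & 0xFFFF   # lower 16 bits: semantic class id
    instance = raw >> 16      # upper 16 bits: instance id
    assert points.shape[0] == semantic.shape[0]
    return points, semantic, instance
```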

Lyft Level 5
Official website: https://level5.lyft.com/dataset/?source=post_page
This dataset is frequently used in papers. It was collected with 64-beam LiDAR and multiple cameras, similar to KITTI, and includes a high-definition semantic map with over 4,000 manually annotated elements such as lane segments, crosswalks, parking signs, parking areas, speed bumps, and stop signs.

H3D
Dataset official website link: https://usa.honda-ri.com/H3D
Paper link: https://arxiv.org/pdf/1903.01568
This LiDAR object detection dataset for autonomous driving is provided by Honda. It is derived from the HDD dataset, a large-scale naturalistic driving dataset collected in the San Francisco Bay Area. H3D provides full 360-degree LiDAR coverage (dense point clouds from a Velodyne-64) and 1,071,302 3D bounding box labels. It also carries temporal information: annotations are created manually at 2 Hz and propagated linearly to 10 Hz.
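As a toy illustration of what linear propagation between 2 Hz keyframes and 10 Hz LiDAR frames means (this is not Honda's actual annotation pipeline), intermediate boxes can be obtained by interpolating the neighbouring keyframe annotations:

```python
import numpy as np

def propagate_box(t, t0, box0, t1, box1):
    """Linearly interpolate a box (x, y, z, yaw) between two keyframe annotations."""
    w = (t - t0) / (t1 - t0)
    out = (1.0 - w) * box0 + w * box1
    # Interpolate yaw on the circle to avoid jumps across +/- pi.
    dyaw = np.arctan2(np.sin(box1[3] - box0[3]), np.cos(box1[3] - box0[3]))
    out[3] = box0[3] + w * dyaw
    return out

# Keyframes at 2 Hz (every 0.5 s); LiDAR frames at 10 Hz (every 0.1 s).
box_a = np.array([10.0, 2.0, 0.0, 0.10])  # x, y, z, yaw at t = 0.0 s
box_b = np.array([12.0, 2.2, 0.0, 0.25])  # x, y, z, yaw at t = 0.5 s
intermediate = [propagate_box(t, 0.0, box_a, 0.5, box_b) for t in (0.1, 0.2, 0.3, 0.4)]
```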

BLVD
Dataset link: https://github.com/VCCIV/BLVD
Paper link: https://arxiv.org/pdf/1903.06405
This dataset focuses on meaningful dynamic changes of the objects around a vehicle; it is a large-scale 5D semantic benchmark that is not limited to static detection or semantic/instance segmentation tasks. BLVD aims to provide a platform for dynamic 4D (3D + time) tracking, 5D (4D + interaction) event recognition, and intention prediction. The dataset includes 654 high-resolution video clips collected in Changshu, Jiangsu Province, China, totaling 120k frames, and is fully annotated with 249,129 3D annotations for tracking and detection.

PreSIL
Official link: https://uwaterloo.ca/waterloo-intelligent-systems-engineering-lab/projects/precise-synthetic-image-and-lidar-presil-dataset-autonomous
Paper link: https://arxiv.org/abs/1905.00160
This synthetic autonomous driving dataset is provided by the University of Waterloo. PreSIL contains more than 50,000 instances, including high-definition images with full depth information, semantic segmentation (images), point-wise segmentation (point clouds), ground-truth labels (point clouds), and detailed annotations for all vehicles and pedestrians. According to the official page, pre-training a state-of-the-art 3D object detection network on PreSIL yields a 5% improvement in average precision on the KITTI 3D object detection benchmark. The dataset has not yet been released.

PASCAL3D+
Official website link: https://cvgl.stanford.edu/projects/pascal3d.html
This dataset augments the PASCAL dataset with 3D models and is closer to a classification and pose estimation dataset than a detection one.

The Stanford Track Collection
Official link: https://cs.stanford.edu/people/teichman/stc/
The dataset is relatively small and was released quite early (2011).

IQmulus & TerraMobilita Contest
Official website link: http://data.ign.fr/benchmarks/UrbanAnalysis/#
This is a very dense outdoor dataset that covers several tasks, including segmentation and detection, with an emphasis on semantic segmentation. It uses mobile laser scanning (MLS) to cover more than 100 scenes in Paris. The dataset is intended to encourage researchers from different fields (computer vision, computer graphics, geographic information science, and remote sensing) to work together on processing, benchmarking, and classifying 3D MLS data.

Projects
The following are several important and currently popular codebases for 3D point cloud object detection:
second.pytorch
Project link: https://github.com/traveller59/second.pytorch

Main advantages:
(1) Implements two point cloud 3D object detection datasets, KITTI and nuScenes.
(2) Includes a kitti_viewer web-based visualization tool.
(3) Implements three point cloud object detection algorithms: SECOND, VoxelNet, and PointPillars.
Disadvantages:
As a pioneering codebase in the field, it can be difficult to read and is not friendly to multimodal extensions.
Det3D
Project Link: https://github.com/poodarchu/Det3D
Main Advantages:
(1) The code is clear and easy to read after being refactored from second.pytorch.
(2) Implements multiple datasets, including KITTI, nuScenes, Lyft, and Waymo (in progress).
(3) Currently implemented algorithms include VoxelNet, SECOND, PointPillars, and CBGS (second_multihead).
(4) Supports Apex for mixed-precision training acceleration.
Disadvantages:
Does not support multimodal fusion or visualization, and has not been updated for a long time.
OpenPCDet
Project Link: https://github.com/open-mmlab/OpenPCDet
Advantages:
(1) Integrates the nuScenes and KITTI datasets, with Waymo support in progress.
(2) Includes algorithms such as PointRCNN, Part-A2, VoxelNet, PointPillars, SECOND, PV-RCNN, and SECOND-MultiHead (CBGS).
(3) Offers more feature extraction structures, with both point-based and voxel-based feature extractors.
(4) Has a visualization demo
(5) Code style is clear and easy to read
Disadvantages:
No multimodal support
mmdetection3d
Project Link: https://github.com/open-mmlab/mmdetection3d
Main Advantages:
(1) Supports multimodal fusion, combining 2D and 3D networks.
(2) Supports a wide range of algorithms and models, with over 40 algorithms and 300 models, including VoteNet, PointPillars, SECOND, and Part-A2 (PV-RCNN is not included).
(3) Faster training than the codebases introduced above.
(4) Supports a variety of datasets, both indoor and outdoor (a minimal inference sketch follows this list).
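A minimal single-frame inference sketch using mmdetection3d's high-level Python API; the config, checkpoint, and point cloud paths are placeholders, and the exact function names and result structure vary between releases, so treat this as an illustration rather than the definitive interface.

```python
from mmdet3d.apis import init_model, inference_detector

config_file = 'configs/second/hv_second_secfpn_6x8_80e_kitti-3d-car.py'  # example config path
checkpoint_file = 'checkpoints/second_kitti.pth'                         # placeholder checkpoint

model = init_model(config_file, checkpoint_file, device='cuda:0')
# Run detection on a single KITTI-format LiDAR frame (.bin file path is a placeholder).
result, data = inference_detector(model, 'demo/data/kitti/000008.bin')
print(result[0]['boxes_3d'])  # predicted 3D boxes; key names depend on the version
```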
Summary
This blog post has introduced the main datasets and currently popular codebases for 3D point cloud object detection in autonomous driving scenarios. On the project side, the author initially looked at second.pytorch but found the code difficult to read, and instead recommends OpenPCDet and mmdetection3d, each with its own strengths and weaknesses. OpenPCDet focuses on outdoor point cloud scenes and includes multiple point cloud feature extraction models, while mmdetection3d has a newer code structure in which modules are easy to assemble and supports multimodal fusion, which leaves more room for research.