Top 5 LiDAR-Based Object Detection Models


1. LiDAR-Based Object Detection

1.1 Inputs and Outputs of Object Detection

Inputs:

  • Points with X, Y, Z coordinates and reflection intensity R
  • Point clouds: collections of multiple points (unordered, unstructured data)

Outputs:

  • Object class and confidence
  • Object bounding box: 3D center, dimensions (length, width, height), rotation angle
  • Additional object information (velocity, acceleration, etc.)

Algorithms:

  • Point cloud representation: point view, bird’s eye view, frontal view
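
To make these inputs and outputs concrete, here is a minimal sketch in Python (the class and field names such as `center`, `size`, and `yaw` are illustrative, not from any particular library):

```python
from dataclasses import dataclass
import numpy as np

# Input: an (N, 4) array of points, one row per lidar return: x, y, z, intensity.
points = np.random.rand(1000, 4).astype(np.float32)

@dataclass
class Detection3D:
    """One detected object (field names are illustrative, not from a specific API)."""
    label: str            # object class, e.g. "car"
    score: float          # detection confidence in [0, 1]
    center: np.ndarray    # (3,) box center x, y, z in the lidar frame
    size: np.ndarray      # (3,) length, width, height in meters
    yaw: float            # rotation around the vertical axis, in radians

det = Detection3D("car", 0.87, np.array([12.3, -1.5, 0.2]),
                  np.array([4.5, 1.8, 1.6]), yaw=0.05)
```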

1.2 Point Cloud Dataset

Common datasets include KITTI, nuScenes, and the Waymo Open Dataset (WOD).

1.3 LiDAR Object Detection Algorithms

For clarity, here is a list of some common LiDAR object detection algorithms:

| Algorithm Type | Algorithms |
| --- | --- |
| Point View | PointNet/PointNet++, Point-RCNN, 3D-SSD |
| Bird’s Eye View | VoxelNet, SECOND, PIXOR, AFDet |
| Frontal View | LaserNet, RangeDet |
| Multiview Fusion (Bird’s Eye View + Point View) | PointPillars, SIENet, PV-CNN |
| Multiview Fusion (Bird’s Eye View + Frontal View) | MV3D, RSN |

 

2 Point View

2.1 PointNet

PointNet-Segmentation Network

Qi et al., “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” 2017. Link

  • Uses:
    1. Recognition/Classification: determine the class of objects within a point cloud
    2. Segmentation: divide a point cloud into distinct regions with unique properties
  • Core idea: point cloud feature extraction
    1. A shared MLP (fully connected layers applied to each point independently) extracts point features, lifting the dimension from 3 to 1024
    2. Max pooling over all points yields a 1024-dimensional global feature

End-to-end learning for classification/segmentation

Object detection: clustering to generate candidates + PointNet for classification
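
A minimal NumPy sketch of the core idea above, with random weights standing in for learned ones: a shared MLP lifts each point to a high-dimensional feature, and max pooling produces an order-independent global feature.

```python
import numpy as np

def shared_mlp(points, dims=(64, 128, 1024), seed=0):
    """Apply the same small MLP to every point independently (weights here are
    random placeholders; a real network learns them)."""
    rng = np.random.default_rng(seed)
    x = points                      # (N, 3)
    for d in dims:
        w = rng.standard_normal((x.shape[1], d)) * 0.1
        x = np.maximum(x @ w, 0.0)  # linear layer + ReLU, applied point-wise
    return x                        # (N, 1024) per-point features

pts = np.random.rand(500, 3)
point_feats = shared_mlp(pts)
global_feat = point_feats.max(axis=0)   # (1024,) max pooling over all points
# The max is order-independent, so shuffling the points gives the same global feature.
assert np.allclose(global_feat, shared_mlp(np.random.permutation(pts)).max(axis=0))
```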

2.2 PointNet++

Illustration of hierarchical feature learning architecture

Qi et al., “PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space,” 2017. Link

  • Extends PointNet: clustering + PointNet, applied hierarchically
    1. Clustering generates multiple point sets; PointNet then extracts features for each set
    2. The process is repeated several times, with each layer’s point sets becoming the input points of the next layer (Set Abstraction, SA); farthest point sampling picks the group centers (see the sketch after this list)
    3. The resulting point features have large receptive fields and include contextual information
  • PointNet and PointNet++ issues:
    1. They cannot directly reuse mature 2D detection frameworks such as Faster R-CNN and YOLO
    2. The clustering step is computationally expensive and hard to parallelize
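
The group centers in each set-abstraction layer are chosen by farthest point sampling (FPS). A direct, unoptimized reference sketch:

```python
import numpy as np

def farthest_point_sampling(points, m, seed=0):
    """Pick m well-spread seed points, used to center the local groups in a
    set-abstraction layer (an O(N*m) reference implementation)."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]               # start from a random point
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        # distance of every point to its nearest already-chosen point
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, d)
        chosen.append(int(dist.argmax()))         # take the farthest remaining point
    return np.array(chosen)

pts = np.random.rand(2048, 3)
centers = pts[farthest_point_sampling(pts, 128)]  # 128 group centers for one SA layer
```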

2.3 Point-RCNN

The PointRCNN architecture for 3D object detection from point cloud

Shi et al., “PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud,” 2018. Link

Two-stage detection network: the first stage segments foreground points to identify object points; the second stage regresses precise bounding boxes from those foreground points.

  • Point handling + Faster RCNN
  • PointNet++ for point feature extraction and foreground segmentation
  • Each foreground point generates a 3D candidate box (similar to PointNet++ with clustering)
  • Pooling for points within each candidate box, output class and refine box position/size

2.4 3D-SSD

Comparison between representative points after fusion sampling (top) and D-FPS only (bottom)

Yang et al., “3DSSD: Point-based 3D single stage object detector,” 2020. Link

  • Improves clustering quality
    1. Considers both geometric and feature-space similarity between points (see the fusion-sampling sketch after this list)
    2. Clustering output can directly generate object candidates
  • Avoids redundant computation
    1. The clustering algorithm outputs the center and neighbor points of each cluster
    2. Avoids a global search for matching object candidates to points
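
A simplified reading of the fusion-sampling idea: run FPS under a metric that mixes geometric and feature distance, so the sampled points are also suitable as object candidates. The weighting `lam` and the random stand-in features are assumptions for illustration.

```python
import numpy as np

def fusion_fps(xyz, feats, m, lam=1.0, seed=0):
    """FPS under a fused metric: geometric distance plus feature distance
    (a simplified reading of 3DSSD's F-FPS; lam weights the two terms)."""
    rng = np.random.default_rng(seed)
    n = xyz.shape[0]
    chosen = [int(rng.integers(n))]
    best = np.full(n, np.inf)
    for _ in range(m - 1):
        d_geo = np.linalg.norm(xyz - xyz[chosen[-1]], axis=1)
        d_feat = np.linalg.norm(feats - feats[chosen[-1]], axis=1)
        best = np.minimum(best, lam * d_geo + d_feat)
        chosen.append(int(best.argmax()))
    return np.array(chosen)

xyz = np.random.rand(4096, 3)
feats = np.random.rand(4096, 32)          # per-point features from earlier layers
keep = fusion_fps(xyz, feats, 512)        # points kept for candidate generation
```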

2.5 Summary and Comparison

  • The main problem with PointNet++ is that it runs slowly
  • The speed bottleneck is the need to map point-set features back onto the original point cloud during clustering
  • Point-RCNN and 3D-SSD mainly aim to increase running speed

3 Bird’s Eye View

3.1 VoxelNet

VoxelNet architecture

Zhou and Tuzel, “VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection,” 2018. Link

  • Core components:
    1. Feature Learning Network
    2. 3D Convolutional Middle Layers
    3. Region Proposal Network
  • Issues with VoxelNet
    1. Inefficient data representation: many voxels are empty (see the sketch after this list)
    2. Large computational cost of 3D convolution
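
The "many empty voxels" issue is easy to see with a quick voxelization sketch (grid extents and voxel size below are illustrative):

```python
import numpy as np

# Quantize a point cloud into a coarse 3D voxel grid (illustrative sizes only).
points = np.random.rand(20000, 3) * [70.0, 80.0, 4.0]   # x, y, z extents in meters
voxel_size = np.array([0.2, 0.2, 0.4])
grid_dims = np.array([350, 400, 10])

idx = np.floor(points / voxel_size).astype(int)
idx = np.clip(idx, 0, grid_dims - 1)
occupied = {tuple(v) for v in idx}                        # set of non-empty voxels

total = int(np.prod(grid_dims))
print(f"non-empty voxels: {len(occupied)} of {total} "
      f"({100 * len(occupied) / total:.2f}%)")            # typically a few percent
```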

3.2 SECOND

Results of 3D detection on the KITTI test set

Yan et al., “SECOND: Sparsely embedded convolutional detection,” 2018. Link

  • Uses sparse convolution to avoid empty voxel computations
  • Otherwise similar to VoxelNet

3.3 PIXOR

Overview of the proposed 3D object detector from Bird’s Eye View (BEV) of LIDAR point cloud.

Yang et al., “PIXOR: Real-Time 3D Object Detection from Point Clouds,” 2018. Link

PIXOR (Oriented 3D object detection from pixel-wise neural network predictions)

  • Hand-designed height-based features
  • Compresses 3D grid to 2D: height dimension becomes feature channel
  • Can use 2D convolution for feature extraction
    1. Occupancy: L x W x H (H dimension as feature channel)
    2. Intensity: L x W x 1 (H direction compressed to 1 dimension)
    3. In total: L x W x (H+1) (see the sketch below)
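
A sketch of building the PIXOR-style input tensor described above (grid resolution and extents are illustrative; ties within a cell are resolved by the last point written):

```python
import numpy as np

# Encode a point cloud as a PIXOR-style BEV tensor: H occupancy channels plus one
# intensity channel per cell.
points = np.random.rand(20000, 4) * [70.0, 80.0, 4.0, 1.0]   # x, y, z, intensity
res, dims = 0.1, (700, 800, 10)                               # L, W, H cells

xi = np.clip((points[:, 0] / res).astype(int), 0, dims[0] - 1)
yi = np.clip((points[:, 1] / res).astype(int), 0, dims[1] - 1)
zi = np.clip((points[:, 2] / 0.4).astype(int), 0, dims[2] - 1)

occupancy = np.zeros(dims, dtype=np.float32)
occupancy[xi, yi, zi] = 1.0                                   # L x W x H

intensity = np.zeros(dims[:2], dtype=np.float32)
intensity[xi, yi] = points[:, 3]                              # L x W x 1

bev = np.concatenate([occupancy, intensity[..., None]], axis=-1)   # L x W x (H+1)
print(bev.shape)   # (700, 800, 11)
```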

3.4 AFDet

The framework of anchor free one stage 3D detection (AFDet) system and detailed structure of anchor free detector

Ge et al., “Real-Time Anchor-Free Single-Stage 3D Detection with IoU-Awareness,” 2021. Link

  • Anchor-free, single-stage
  • Won the real-time 3D detection track of the 2021 Waymo Open Dataset Challenge
  • Algorithm improvements:
    1. Lightweight point cloud feature extraction
    2. Enlarged neural network receptive field
    3. Additional prediction branch
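
Anchor-free heads of this kind are usually supervised with per-class object-center heatmaps on the BEV grid. The sketch below uses a CenterNet-style Gaussian target; AFDet's exact target definition differs in detail.

```python
import numpy as np

def center_heatmap(centers, grid_hw, sigma=2.0):
    """Gaussian 'object center' targets on a BEV grid, a common form of supervision
    for an anchor-free center head (CenterNet-style sketch)."""
    h, w = grid_hw
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for cx, cy in centers:                      # centers already in grid coordinates
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)              # keep the strongest peak per cell
    return heat

target = center_heatmap([(40.0, 25.0), (90.5, 60.2)], grid_hw=(128, 128))
```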

3.5 Summary and Comparison

  • Bird’s Eye View
    1. Input structured data, simple network structure
    2. Sensitive to quantization parameters: coarse grids lead to information loss, fine grids have high computational cost and memory use
  • Point View
    1. No quantization loss, compact data representation
    2. Input unstructured data, complex network structure, hard to parallelize, difficult to extract local features

4 Frontal View

Advantages:

  • Compact representation, no quantization loss
  • Dense representation: there is data for every pixel

Challenges:

  • Large size variation for objects at different distances
  • 2D features don’t align perfectly with 3D object information
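
Frontal-view methods operate on a spherical projection of the point cloud, often called a range image. A minimal sketch, assuming a 64-beam-like vertical field of view:

```python
import numpy as np

def to_range_image(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """Project points onto a spherical 'range image' (rows = elevation, cols = azimuth).
    The vertical field of view here is only an illustrative assumption."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    azimuth = np.arctan2(y, x)                                   # [-pi, pi]
    elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-6), -1.0, 1.0))
    u = ((azimuth + np.pi) / (2 * np.pi) * w).astype(int) % w
    fov = np.radians(fov_up) - np.radians(fov_down)
    v = np.clip(((np.radians(fov_up) - elevation) / fov * h).astype(int), 0, h - 1)
    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = r                                                # store range per pixel
    return img

pts = np.random.randn(30000, 3) * [20, 20, 1.5]
range_img = to_range_image(pts)
```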

4.1 LaserNet

An overview of our approach to 3D object detection

Meyer et al., “LaserNet: An Efficient Probabilistic 3D Object Detector for Autonomous Driving,” 2019. Link

  • Input: multiple-channel frontal view images
  • Convolutional and downsampling layers extract multi-scale features
  • Every pixel predicts a distribution (mean and variance) for object bounding boxes
  • MeanShift clustering + NMS for final output
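
The final suppression step can be sketched as ordinary greedy NMS. Real BEV boxes are rotated, so this axis-aligned version is only illustrative.

```python
import numpy as np

def nms_bev(boxes, scores, iou_thr=0.5):
    """Greedy NMS on axis-aligned BEV boxes [x1, y1, x2, y2] (a simplified sketch;
    production detectors use rotated-box IoU)."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = int(order[0])
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_thr]
    return keep

boxes = np.array([[0, 0, 4, 2], [0.5, 0, 4.5, 2], [10, 10, 14, 12]], dtype=float)
scores = np.array([0.9, 0.6, 0.8])
print(nms_bev(boxes, scores))   # keeps boxes 0 and 2, suppresses overlapping box 1
```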

4.2 RangeDet

The overall architecture of RangeDet.

Fan et al., “RangeDet: In Defense of Range View for LiDAR-based 3D Object Detection,” 2021. Link

Key components:

  • Meta-Kernel Convolution
  • Range Conditioned Pyramid

5 Multiview Fusion (Bird’s Eye View + Point View)

  • Basic strategy:
    1. Extract local features or generate object candidates on low-resolution voxels (bird’s eye view)
    2. Extract point features on original point cloud (point view)
    3. Combine voxel and point features
  • Representative methods: PointPillars, PV-CNN, SIENet

5.1 PointPillars

PointPillars: network overview

Lang et al., “PointPillars: Fast Encoders for Object Detection from Point Clouds,” 2019. Link

  • Points are grouped into vertical pillars; a simplified PointNet extracts features per pillar, which are scattered onto a bird’s eye view grid (see the sketch after this list)
  • Feature Pyramid Network–style 2D backbone
  • SSD detection head
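
A minimal sketch of the pillar-encoding idea, with height and intensity standing in for the learned per-point features (grid size and resolution are illustrative):

```python
import numpy as np

# Bucket points into vertical columns ("pillars"), max-pool a per-point feature inside
# each pillar, and scatter the results onto a BEV grid that a 2D CNN can consume.
points = np.random.rand(20000, 4) * [70.0, 80.0, 4.0, 1.0]   # x, y, z, intensity
res, grid = 0.16, (440, 500)

xi = np.clip((points[:, 0] / res).astype(int), 0, grid[0] - 1)
yi = np.clip((points[:, 1] / res).astype(int), 0, grid[1] - 1)

# Stand-in for the learned per-point PointNet feature: here, just height and intensity.
feat = points[:, 2:4]                                         # (N, 2)

bev = np.zeros(grid + (feat.shape[1],), dtype=np.float32)
np.maximum.at(bev, (xi, yi), feat)                            # max-pool per pillar
print(bev.shape)                                              # (440, 500, 2) pseudo-image
```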

5.2 SIENet

The pipeline of our proposed Spatial Information Enhancement Network (SIENet). The whole framework consists of two stages: stage-1 uses the hybrid-paradigm RPN (HP-RPN) to extract features and generate accurate proposals, while stage-2 is designed to produce high-quality 3D bounding boxes via spatial information enhancement (SIE) module.

Li et al., “SIENet: Spatial Information Enhancement Network for 3D Object Detection from Point Cloud,” 2021. Link

  • Similar fusion strategy to PV-CNN
  • Addresses sparsity of point clouds for distant objects
  • Additional branch to expand point sets within object candidates

5.3 PV-CNN

Point-Based Feature Transformation (Fine-Grained)

Liu et al., “Point-Voxel CNN for Efficient 3D Deep Learning,” 2019. Link

  • Voxel branch: extract local features on low-resolution voxels, map back to points
  • Point branch: MLP for point feature extraction, no quantization loss, avoids empty voxel computations
  • Combine voxel and point features for subsequent detection (see the sketch below)
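
A sketch of the fusion step: voxel-branch features are mapped back to each point (nearest-voxel lookup here, whereas the paper interpolates) and concatenated with point-branch features. Both feature tensors below are random stand-ins.

```python
import numpy as np

xyz = np.random.rand(8192, 3) * 50.0
point_feats = np.random.rand(8192, 16)             # stand-in for the point-branch MLP output

voxel_size = 2.0
idx = (xyz / voxel_size).astype(int)               # (N, 3) voxel index of each point
vox_feats = np.random.rand(25, 25, 25, 32)         # stand-in for the voxel-branch CNN output

per_point_voxel = vox_feats[idx[:, 0], idx[:, 1], idx[:, 2]]   # (N, 32) mapped back to points
fused = np.concatenate([point_feats, per_point_voxel], axis=1) # (N, 48) fused features
```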

6 Multiview Fusion (Bird’s Eye View + Frontal View)

 

  • Basic strategy:
    1. Fuse features from the bird’s eye view and the frontal view
    2. Avoid wasted computation on empty regions
  • Representative methods: MV3D, RSN

6.1 MV3D

Multi-View 3D object detection network (MV3D): the network takes the bird’s eye view and front view of the LIDAR point cloud as well as an image as input. It first generates 3D object proposals from the bird’s eye view map and projects them to three views. A deep fusion network combines region-wise features obtained via ROI pooling for each view. The fused features are used to jointly predict object class and perform oriented 3D box regression.

Chen et al., “Multi-view 3d object detection network for autonomous driving,” 2017. Link

  • Generate 3D object candidates on bird’s eye view grid, transform to other views
  • ROI-Pooling within candidates on different views
  • Fuse features on candidate level across views

6.2 RSN

Example pedestrian and vehicle detection results of CarS 3f and PedS 3f on the Waymo Open Dataset validation set

Sun et al., “RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection,” 2021. Link

  • Two-stage detector to improve detection range
    1. Stage 1: foreground segmentation on frontal view to filter background points
    2. Stage 2: voxelization of foreground points, sparse convolution for feature extraction, grid-based detection
    3. Dense frontal view + sparse bird’s eye view
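
A sketch of this two-stage flow with random stand-in segmentation scores: background points are dropped first, and only the surviving foreground points are voxelized for the sparse second stage.

```python
import numpy as np

points = np.random.rand(30000, 4) * [70.0, 80.0, 4.0, 1.0]
fg_score = np.random.rand(points.shape[0])           # per-point foreground probability
                                                      # (stand-in for the stage-1 output)

foreground = points[fg_score > 0.5]                   # keep only likely object points
idx = np.floor(foreground[:, :3] / 0.2).astype(int)   # voxelize the foreground only
coords = np.unique(idx, axis=0)                       # sparse voxels for stage 2
print(f"{len(points)} points -> {len(foreground)} foreground -> {len(coords)} voxels")
```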
