1. Lidar-Based Object Detection
1.1 Inputs and Outputs of Object Detection
Inputs:
- Points with X, Y, Z coordinates and reflection intensity R
- Point clouds: collections of multiple points (unordered, unstructured data)
Outputs:
- Object class and confidence
- Object bounding box: 3D center, dimensions (length, width, height), rotation angle
- Additional object information (velocity, acceleration, etc.)
Algorithms:
- Categorized by point cloud representation: point view, bird’s eye view, frontal view
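As a concrete illustration, here is a minimal sketch of these inputs and outputs as Python structures; the class and field names are illustrative, not taken from any specific framework:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

# Input: a point cloud is an unordered set of N points, each (x, y, z, reflection intensity r).
points = np.random.rand(10000, 4).astype(np.float32)  # shape (N, 4)

@dataclass
class Detection3D:
    """Output: one detected object in the lidar frame."""
    label: str                  # object class, e.g. "car"
    score: float                # confidence in [0, 1]
    center: np.ndarray          # (x, y, z) of the 3D box center
    size: np.ndarray            # (length, width, height)
    yaw: float                  # rotation angle around the vertical axis, in radians
    velocity: Optional[np.ndarray] = None  # optional additional state, e.g. (vx, vy)

det = Detection3D(label="car", score=0.87,
                  center=np.array([12.3, -1.5, 0.8]),
                  size=np.array([4.5, 1.9, 1.6]),
                  yaw=0.1)
```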
1.2 Point Cloud Dataset
Common datasets include KITTI, nuScenes, and the Waymo Open Dataset (WOD)
1.3 Lidar Object Detection Algorithms
For clarity, here is a list of some common lidar object detection algorithms:
| Algorithm Type | Algorithms |
| --- | --- |
| Point View | PointNet/PointNet++, Point-RCNN, 3D-SSD |
| Bird’s Eye View | VoxelNet, SECOND, PIXOR, AFDet |
| Frontal View | LaserNet, RangeDet |
| Multiview Fusion (Bird’s Eye View + Point View) | PointPillars, SIENet, PV-CNN |
| Multiview Fusion (Bird’s Eye View + Frontal View) | MV3D, RSN |
2 Point View
2.1 PointNet

Qi et al., “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” 2017. Link
- Uses:
- Recognition/Classification: determine the class of objects within a point cloud
- Segmentation: divide a point cloud into distinct regions with unique properties
- Core idea: point cloud feature extraction
- An MLP (several shared fully connected layers) extracts per-point features, raising the feature dimension from 3 to 1024
- Max pooling over all points produces a 1024-dimensional global feature
End-to-end learning for classification/segmentation
Object detection: clustering to generate candidates + PointNet for classification
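A minimal PyTorch sketch of the shared-MLP + max-pooling idea described above; the layer widths loosely follow the 3 → 1024 growth, and the classification head is an illustrative placeholder rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Per-point shared MLP followed by max pooling over all points."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # 1x1 Conv1d layers act as an MLP shared across all points: 3 -> 64 -> 128 -> 1024.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3) unordered points -> (B, 3, N) for Conv1d.
        feats = self.point_mlp(xyz.transpose(1, 2))   # (B, 1024, N) per-point features
        global_feat = feats.max(dim=2).values         # (B, 1024) order-invariant global feature
        return self.classifier(global_feat)           # (B, num_classes) class logits

logits = TinyPointNet()(torch.rand(2, 2048, 3))  # two clouds of 2048 points each
```

Max pooling is what makes the network invariant to the ordering of the input points.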
2.2 PointNet++

Qi et al., “PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space,” 2017. Link
- Extends PointNet for object detection: clustering + PointNet (sketched below)
- Clustering to generate multiple point sets, then use PointNet to extract features for each set
- Repeat process multiple times: previous layer’s point sets become input points for the next layer (Set Abstraction, SA)
- Point features have large receptive fields and include contextual information
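A simplified NumPy sketch of one Set Abstraction step under these assumptions: farthest point sampling picks the cluster centers, a ball query gathers neighbors, and a plain max over each group stands in for the per-group PointNet (real implementations use learned MLPs and CUDA kernels):

```python
import numpy as np

def farthest_point_sampling(xyz: np.ndarray, m: int) -> np.ndarray:
    """Pick m cluster centers that are spread out over the cloud."""
    n = xyz.shape[0]
    chosen = np.zeros(m, dtype=np.int64)
    dist = np.full(n, np.inf)
    chosen[0] = np.random.randint(n)
    for i in range(1, m):
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[chosen[i - 1]], axis=1))
        chosen[i] = np.argmax(dist)
    return chosen

def set_abstraction(xyz, feats, m=128, radius=1.0, k=32):
    """One SA layer: sample centers, group neighbors, pool features per group."""
    centers = xyz[farthest_point_sampling(xyz, m)]                        # (m, 3)
    new_feats = []
    for c in centers:
        idx = np.where(np.linalg.norm(xyz - c, axis=1) < radius)[0][:k]   # ball query
        group = np.concatenate([xyz[idx] - c, feats[idx]], axis=1)        # local coords + features
        new_feats.append(group.max(axis=0))   # stand-in for the per-group PointNet (MLP + max pool)
    # The centers become the input points of the next SA layer.
    return centers, np.stack(new_feats)

xyz = np.random.rand(1024, 3).astype(np.float32)
feats = np.random.rand(1024, 8).astype(np.float32)
centers, pooled = set_abstraction(xyz, feats)
```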
- PointNet and PointNet++ issues:
- Cannot directly reuse mature detection frameworks from the image domain, such as Faster R-CNN and YOLO
- The clustering step is computationally expensive and hard to parallelize
2.3 Point-RCNN

Shi et al., “PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud,” 2018. Link
Two-stage detection network: the first stage segments foreground points to identify points belonging to objects; the second stage uses the foreground points to regress precise bounding boxes
- Point-based processing + a Faster R-CNN-style two-stage design
- PointNet++ for point feature extraction and foreground segmentation
- Each foreground point generates a 3D candidate box (similar to PointNet++ with clustering)
- Pooling for points within each candidate box, output class and refine box position/size
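A hedged sketch of the first-stage heads, assuming a backbone (e.g. PointNet++) has already produced per-point features; the box parameterization here is a simplified placeholder, not the paper's exact bin-based encoding:

```python
import torch
import torch.nn as nn

class Stage1Heads(nn.Module):
    """Foreground segmentation + per-point 3D proposal regression."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.seg_head = nn.Linear(feat_dim, 1)   # foreground / background score per point
        self.box_head = nn.Linear(feat_dim, 7)   # (dx, dy, dz, l, w, h, yaw) per point

    def forward(self, point_feats: torch.Tensor):
        # point_feats: (B, N, C) from a PointNet++-style backbone
        fg_logits = self.seg_head(point_feats).squeeze(-1)   # (B, N)
        boxes = self.box_head(point_feats)                   # (B, N, 7), one proposal per point
        return fg_logits, boxes

feats = torch.rand(1, 4096, 128)
fg_logits, boxes = Stage1Heads()(feats)
# Only proposals from points with high foreground probability are kept for stage 2 refinement.
keep = torch.sigmoid(fg_logits) > 0.5
```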
2.4 3D-SSD

Yang et al., “3DSSD: Point-Based 3D Single Stage Object Detector,” 2020. Link
- Improves clustering quality
- Considers both geometric and feature space similarity between points
- Clustering output can directly generate object candidates
- Avoids redundant computation
- The clustering algorithm outputs the center and neighbor points of each cluster
- Avoids a global search for matching relationships between object candidates and points
2.5 Summary and Comparison
- The main problem of PointNet++ is that it runs too slowly
- The speed bottleneck lies in the need to map point set features back to the original point cloud during the clustering process
- The improvements in Point-RCNN and 3D-SSD mainly target running speed
3 Bird’s Eye View
3.1 VoxelNet

Zhou and Tuzel, “VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection,” 2018. Link
- Core components:
- Feature Learning Network
- 3D Convolutional Middle Layers
- Region Proposal Network
- Issues with VoxelNet
- Inefficient data representation: many empty voxels
- Large computational cost of 3D convolution
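A minimal sketch of the voxelization step that feeds the Feature Learning Network; the grid resolution and range are arbitrary, and VoxelNet additionally applies a learned Voxel Feature Encoding inside each voxel:

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=(0, -40, -3, 70.4, 40, 1)):
    """Assign each point to a voxel index; return only non-empty voxels."""
    xyz = points[:, :3]
    mins = np.array(pc_range[:3])
    maxs = np.array(pc_range[3:])
    mask = np.all((xyz >= mins) & (xyz < maxs), axis=1)
    points = points[mask]
    idx = ((points[:, :3] - mins) / np.array(voxel_size)).astype(np.int64)  # (N, 3) voxel coords
    # Group points by voxel; most cells of the full 3D grid stay empty,
    # which is the inefficiency noted above.
    voxels = {}
    for coord, p in zip(map(tuple, idx), points):
        voxels.setdefault(coord, []).append(p)
    return voxels  # {voxel coordinate: list of points}

pts = np.random.rand(20000, 4) * [70, 80, 4, 1] + [0, -40, -3, 0]
voxels = voxelize(pts)
print(len(voxels), "non-empty voxels")
```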
3.2 SECOND

Yan et al., “SECOND: Sparsely Embedded Convolutional Detection,” 2018. Link
- Uses sparse convolution to avoid empty voxel computations
- Otherwise similar to VoxelNet
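The key point is that computation should scale with the number of non-empty voxels rather than the full grid. The sketch below illustrates this with a plain per-voxel layer over only the occupied cells; actual sparse convolution (e.g. in libraries such as spconv) additionally builds neighborhood kernels over the sparse coordinates:

```python
import torch
import torch.nn as nn

# Suppose voxelization produced K non-empty voxels out of a very large 3D grid.
coords = torch.randint(0, 400, (5000, 3))   # (K, 3) integer voxel coordinates
feats = torch.rand(5000, 64)                # (K, C) one feature vector per non-empty voxel

# A dense approach would allocate a 400x400x400 grid (64M cells) and convolve over all of it;
# the sparse formulation touches only the K occupied cells.
layer = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
out_feats = layer(feats)                    # (K, 128): cost scales with K, not with grid size
sparse_tensor = (coords, out_feats)         # coordinates + features is the usual sparse format
```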
3.3 PIXOR

Yang et al., “PIXOR: Real-Time 3D Object Detection from Point Clouds,” 2018. Link
PIXOR (Oriented 3D object detection from pixel-wise neural network predictions)
- Hand-designed height-based features
- Compresses 3D grid to 2D: height dimension becomes feature channel
- Can use 2D convolution for feature extraction
- Occupancy: L × W × H (the height dimension serves as the feature channel)
- Intensity: L × W × 1 (the height direction is compressed to a single channel)
- In total: L × W × (H + 1)
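A small sketch of building this L × W × (H + 1) input tensor from a point cloud; the grid range and resolution are illustrative, and the intensity handling is simplified:

```python
import numpy as np

def pixor_bev(points, pc_range=(0, -40, -2.5, 70, 40, 1.5), res=0.1, h_bins=35):
    """Occupancy per height slice + intensity per BEV cell -> (L, W, H + 1)."""
    x0, y0, z0, x1, y1, z1 = pc_range
    L, W = round((x1 - x0) / res), round((y1 - y0) / res)
    dz = (z1 - z0) / h_bins
    bev = np.zeros((L, W, h_bins + 1), dtype=np.float32)
    mask = ((points[:, 0] >= x0) & (points[:, 0] < x1) &
            (points[:, 1] >= y0) & (points[:, 1] < y1) &
            (points[:, 2] >= z0) & (points[:, 2] < z1))
    for x, y, z, r in points[mask]:
        i = int((x - x0) / res)
        j = int((y - y0) / res)
        k = min(int((z - z0) / dz), h_bins - 1)
        bev[i, j, k] = 1.0       # occupancy in height slice k
        bev[i, j, -1] = r        # intensity channel (last write wins in this simplified version)
    return bev  # ready for a 2D CNN: height acts as the channel dimension

pts = np.random.rand(10000, 4) * [70, 80, 4, 1] + [0, -40, -2.5, 0]
print(pixor_bev(pts).shape)   # (700, 800, 36)
```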
3.4 AFDet

Ge et al., “Real-Time Anchor-Free Single-Stage 3D Detection with IoU-Awareness,” 2021. Link
- Anchor-free, single-stage
- Won the 2021 Waymo Open Dataset real-time 3D detection challenge
- Algorithm improvements:
- Lightweight point cloud feature extraction
- Enlarged neural network receptive field
- Additional prediction branch
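A hedged sketch of an anchor-free, single-stage head operating on BEV features, in the spirit of AFDet: a per-class center heatmap plus regression branches. The channel counts and branch layout are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Predict an object-center heatmap and per-pixel box attributes on the BEV feature map."""
    def __init__(self, in_ch: int = 128, num_classes: int = 3):
        super().__init__()
        def branch(out_ch):
            return nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, out_ch, 1))
        self.heatmap = branch(num_classes)   # one center heatmap per class
        self.offset = branch(2)              # sub-pixel center offset
        self.z_size_yaw = branch(1 + 3 + 2)  # z, (l, w, h), (sin yaw, cos yaw)

    def forward(self, bev_feats: torch.Tensor):
        return (torch.sigmoid(self.heatmap(bev_feats)),
                self.offset(bev_feats),
                self.z_size_yaw(bev_feats))

heat, offset, rest = AnchorFreeHead()(torch.rand(1, 128, 200, 200))
# Local peaks in the heatmap are taken as object centers; no anchor boxes are needed.
```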
3.5 Summary and Comparison
Bird’s Eye View:
- Structured input data, simple network structure
- Sensitive to quantization parameters: coarse grids lose information, fine grids are costly in computation and memory
Point View:
- No quantization loss, compact data representation
- Unstructured input data, more complex network structure, hard to parallelize, and local features are difficult to extract
4 Frontal View
Advantages:
- Compact representation, no quantization loss
- Data for every pixel
Challenges:
- Large size variation for objects at different distances
- 2D features don’t align perfectly with 3D object information
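Before the individual methods, here is a small sketch of how a point cloud is projected into a frontal (range) view image by binning azimuth and elevation; the image size and vertical field of view are illustrative assumptions:

```python
import numpy as np

def to_range_image(points, h=64, w=1024, fov_up=np.deg2rad(3.0), fov_down=np.deg2rad(-25.0)):
    """Project (x, y, z, r) points into an h x w image with (range, intensity) channels."""
    x, y, z, r = points.T
    rng = np.linalg.norm(points[:, :3], axis=1)
    azimuth = np.arctan2(y, x)                         # left/right angle
    elevation = np.arcsin(z / np.maximum(rng, 1e-6))   # up/down angle
    col = ((np.pi - azimuth) / (2 * np.pi) * w).astype(np.int64) % w
    row = ((fov_up - elevation) / (fov_up - fov_down) * h).astype(np.int64).clip(0, h - 1)
    img = np.zeros((h, w, 2), dtype=np.float32)
    img[row, col, 0] = rng      # range channel
    img[row, col, 1] = r        # intensity channel
    return img

pts = np.random.randn(30000, 4).astype(np.float32)
print(to_range_image(pts).shape)   # (64, 1024, 2)
```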
4.1 LaserNet

Meyer et al., “LaserNet: An Efficient Probabilistic 3D Object Detector for Autonomous Driving,” 2019. Link
- Input: multi-channel frontal-view (range) images
- Convolutional and downsampling layers extract multi-scale features
- Every pixel predicts a distribution (mean and variance) for object bounding boxes
- MeanShift clustering + NMS for final output
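A hedged sketch of the per-pixel probabilistic output: each range-image pixel predicts box parameters together with a log standard deviation. The exact parameterization and mixture modeling in the paper differ:

```python
import torch
import torch.nn as nn

class PerPixelBoxHead(nn.Module):
    """Each frontal-view pixel predicts box parameters plus an uncertainty estimate."""
    def __init__(self, in_ch: int = 64, box_dim: int = 6):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, box_dim + 1, kernel_size=1)  # box params + log std

    def forward(self, feats: torch.Tensor):
        out = self.conv(feats)                  # (B, box_dim + 1, H, W)
        mean, log_std = out[:, :-1], out[:, -1:]
        return mean, log_std                    # per-pixel distribution over box parameters

mean, log_std = PerPixelBoxHead()(torch.rand(1, 64, 64, 1024))
# The pixel-wise predictions are then merged by mean-shift clustering and NMS, as noted above.
```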
4.2 RangeDet

Fan et al., “RangeDet: In Defense of Range View for LiDAR-based 3D Object Detection,” 2021. Link
Key components:
- Meta-Kernel Convolution
- Range Conditioned Pyramid
5 Multiview Fusion (Bird’s Eye View + Point View)
- Basic strategy:
- Extract local features or generate object candidates on low-resolution voxels (bird’s eye view)
- Extract point features on original point cloud (point view)
- Combine voxel and point features
- Representative methods: PointPillars, SIENet, PV-CNN
5.1 PointPillars

Lang et al., “PointPillars: Fast Encoders for Object Detection from Point Clouds,” 2019. Link
- A simplified PointNet extracts features for the points in each pillar (a vertical voxel column), and the pillar features are scattered into a bird’s-eye-view pseudo-image
- Feature Pyramid Network-style 2D convolutional backbone
- SSD detection head
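A compact sketch of the pillar pipeline under simplifying assumptions: a per-pillar PointNet (here a single linear layer plus max pooling) followed by scattering the pillar features into a dense BEV pseudo-image for the 2D backbone and SSD head. Shapes are illustrative, and the real pipeline also augments each point with offsets to its pillar center:

```python
import torch
import torch.nn as nn

P, N, C_in, C_out, H, W = 1200, 32, 4, 64, 200, 200   # pillars, points per pillar, channels, grid

pillar_points = torch.rand(P, N, C_in)                # grouped points for each non-empty pillar
pillar_xy = torch.randint(0, 200, (P, 2))             # BEV grid cell of each pillar

point_net = nn.Sequential(nn.Linear(C_in, C_out), nn.ReLU())
pillar_feats = point_net(pillar_points).max(dim=1).values   # (P, C_out): one vector per pillar

# Scatter pillar features back into a dense BEV pseudo-image for the 2D CNN (FPN + SSD head).
bev = torch.zeros(C_out, H, W)
bev[:, pillar_xy[:, 0], pillar_xy[:, 1]] = pillar_feats.T   # (C_out, H, W)
```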
5.2 SIENet

Li et al., “SIENet: Spatial Information Enhancement Network for 3D Object Detection from Point Cloud,” 2021. Link
- Similar fusion strategy to PV-CNN
- Addresses sparsity of point clouds for distant objects
- Additional branch to expand point sets within object candidates
5.3 PV-CNN

Liu et al., “Point-Voxel CNN for Efficient 3D Deep Learning,” 2019. Link
- Voxel branch: extract local features on low-resolution voxels, map back to points
- Point branch: MLP for point feature extraction, no quantization loss, avoids empty voxel computations
- Combine voxel and point features for subsequent detection
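A sketch of the point-voxel fusion under simplifying assumptions: voxel-branch features are looked up for each point by its voxel index (PV-CNN uses trilinear interpolation instead of this nearest-voxel lookup) and concatenated with point-branch MLP features:

```python
import torch
import torch.nn as nn

N, C_pt, C_vox, G = 8192, 32, 64, 40           # points, channel sizes, voxel grid resolution

xyz = torch.rand(N, 3)                          # normalized coordinates in [0, 1)
point_feats = nn.Sequential(nn.Linear(3, C_pt), nn.ReLU())(xyz)   # point branch: no quantization loss

voxel_grid = torch.rand(C_vox, G, G, G)         # output of the low-resolution voxel branch
idx = (xyz * G).long().clamp(max=G - 1)         # which voxel each point falls into
voxel_feats_per_point = voxel_grid[:, idx[:, 0], idx[:, 1], idx[:, 2]].T   # (N, C_vox)

# Fused per-point features feed the subsequent detection stages.
fused = torch.cat([point_feats, voxel_feats_per_point], dim=1)    # (N, C_pt + C_vox)
```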
6 Multiview Fusion (Bird’s Eye View + Frontal View)
- Basic strategy:
- Fusion of features from top view and front view
- Try to avoid invalid calculations in blank areas
- Representative methods: MV3D, RSN
6.1 MV3D

Chen et al., “Multi-View 3D Object Detection Network for Autonomous Driving,” 2017. Link
- Generate 3D object candidates on bird’s eye view grid, transform to other views
- ROI-Pooling within candidates on different views
- Fuse features on candidate level across views
6.2 RSN

Sun et al., “RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection,” 2021. Link
- Two-stage detector to improve detection range
- Stage 1: foreground segmentation on frontal view to filter background points
- Stage 2: voxelization of foreground points, sparse convolution for feature extraction, grid-based detection
- Dense frontal view + sparse bird’s eye view
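A small sketch of the glue between the two stages, with random stand-ins for the stage-1 scores: background points are dropped before voxelization, so stage 2 operates on only a small fraction of the cloud:

```python
import numpy as np

points = np.random.rand(100000, 4).astype(np.float32)   # (N, 4) lidar points
fg_prob = np.random.rand(100000)                         # stage-1 per-point foreground scores

foreground = points[fg_prob > 0.5]                       # background points are dropped early
print(f"{len(foreground)} / {len(points)} points kept for stage 2")

# Stage 2 voxelizes only the foreground points, so the sparse convolutions and the
# detection head run over far fewer non-empty voxels than a full-scene voxelization.
voxel_coords = np.unique((foreground[:, :3] / 0.2).astype(np.int64), axis=0)
print(f"{len(voxel_coords)} non-empty voxels")
```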