3D object detection based on Light Detection and Ranging (LiDAR) point clouds has received a lot of attention in recent years due to its wide applications in smart cities and autonomous driving. Cascaded frameworks have made progress in 2D object detection but are less studied in 3D. The traditional cascaded structure uses multiple independent sub-networks for successive refinement. However, this scheme transfers poorly to 3D space, where it is difficult to achieve the desired performance improvement. In this paper, we propose a novel cascaded framework called Cascaded Attention (CasA) for 3D object detection from LiDAR point clouds. CasA consists of a Region Proposal Network (RPN) and a Cascaded Refinement Network (CRN). In the CRN, a new cascaded attention module (CAM) is designed, which uses multiple sub-networks and attention modules to aggregate object features from different stages and gradually refine region proposals. CasA can be integrated into various two-stage 3D detectors to improve their performance. Extensive experiments on the KITTI and Waymo datasets demonstrate the versatility and superiority of CasA. In particular, a variant based on voxel region-based convolutional neural networks (RCNN) achieves state-of-the-art results on the KITTI dataset. On the KITTI online 3D object detection leaderboard, average precision (AP) of 83.06%, 47.09%, and 73.47% is obtained on the moderate car, pedestrian, and cyclist categories, respectively.
A novel cascade framework, CasA, is proposed for object detection from LiDAR point clouds, which progressively refines and complements predictions through multiple sub-networks to obtain high-quality detections. CasA can significantly improve the performance of 3D object detection.
A CAM is proposed to aggregate object features at different stages. CAM comprehensively considers the quality of the previous stages, significantly improving the accuracy of proposal refinement.
- Algorithm process

Fig. 1: Overall architecture of CasA (RPN + CRN)
CasA is a multi-stage detection framework that can be integrated into various two-stage 3D detectors. Current multi-stage methods and cascaded structures use a series of independent sub-networks to improve results. In general, these methods can learn object features under various conditions. However, in independent sub-networks, later stages have limited ability to improve predictions over all previous stages.
Our idea is to aggregate features from all stages in a cascaded attention manner. As shown in Figure 1, CasA consists of an RPN and a CRN. The RPN first uses a 3D backbone network and a 2D detection head to generate region proposals. The CRN consists of multiple sub-networks that progressively refine the proposals. In the CRN, a novel cascaded attention scheme is developed, which aggregates proposal features from different stages for more comprehensive bounding-box prediction.
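The refinement loop described above can be sketched as follows. This is a toy illustration, not the paper's network: the linear head, the feature dimensions, and the stand-in for re-extracted RoI features are all assumptions. The key point it shows is that each stage consumes the features of all earlier stages, unlike a vanilla cascade that only sees the last stage.

```python
import numpy as np

def refine_stage(box, stage_feats, weight):
    """One hypothetical refinement stage: aggregate the features collected
    from all previous stages, predict a residual with a toy linear head,
    and apply it to the current box (x, y, z, l, w, h, theta)."""
    aggregated = np.mean(stage_feats, axis=0)  # (D,) aggregated stage feature
    residual = weight @ aggregated             # (7,) toy box residual
    return box + residual

def cascaded_refinement(proposal_box, roi_feature, stage_weights):
    """Progressively refine one proposal: every stage sees the feature bank
    of ALL earlier stages, not just the output of the previous one."""
    box = np.asarray(proposal_box, dtype=float)
    feats = [np.asarray(roi_feature, dtype=float)]  # per-stage feature bank
    for w in stage_weights:
        box = refine_stage(box, np.stack(feats), w)
        feats.append(feats[-1] * 0.9)  # toy stand-in for re-extracted RoI features
    return box
```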
A.Cascade Attention for Proposal Refinement
1.Vanilla Cascade Structure: The cascade detection framework has been well studied on 2D images. Cascade R-CNN uses a common cascaded structure, which employs a series of separate sub-networks and increases the IoU threshold stage by stage to refine region proposals.
2.Feature Aggregation Through Cascade Attention
Features from different stages are aggregated to enrich object appearance cues for more accurate detection of distant and difficult objects. In the first refinement stage, our module actually performs a self-attention operation. In the other stages, a cross-attention operation is performed to aggregate features from different stages. By adopting this cascaded attention design, CasA can better assess the quality of all stages, which helps improve the accuracy of proposal refinement.
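The self-/cross-attention switch described above can be written compactly. This is a minimal single-head sketch with assumed shapes (N proposals × D feature channels); the paper's module has learned projections and more structure, which are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention over proposal features."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)  # (N_q, N_kv) similarity
    return softmax(scores) @ value       # (N_q, D) aggregated features

def cascade_attention(current_feat, previous_feats):
    """First refinement stage (no earlier stages): self-attention.
    Later stages: cross-attention, with the current stage's features as
    queries and all earlier stages' features concatenated as keys/values."""
    if not previous_feats:
        return attention(current_feat, current_feat, current_feat)
    memory = np.concatenate(previous_feats, axis=0)
    return attention(current_feat, memory, memory)
```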
3.Box Regression and Part-Aided Scoring
For box regression, we follow [10], [19], which regress the residuals relative to the input 3D box size, position, and orientation. A part-aided scoring scheme is also designed to enhance the confidence prediction (see Figure 2). This is inspired by part-sensitive warping [15], which averages object scores over part-score maps; such a design helps improve confidence estimates.
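As a rough illustration of the score-averaging idea, a part-aided confidence could blend the whole-box score with the mean of per-part scores. The equal-weight blend below is purely an assumption; the paper's exact formulation is not reproduced here:

```python
import numpy as np

def part_aided_score(center_score, part_scores):
    """Toy part-aided confidence: average the per-part scores (as in
    part-sensitive warping) and blend with the whole-box confidence.
    The 0.5/0.5 weighting is an assumption for illustration only."""
    return 0.5 * (center_score + float(np.mean(part_scores)))
```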

Figure 2 Part-aided scoring
4.Boxes Voting
3D detection is more challenging due to the need for object height and non-axis-aligned orientation estimation. Errors tend to propagate downstream in multi-stage frameworks. To further address this issue, during testing we propose box voting to create more connections between stages. This is driven by the intuition that each stage outputs weak and strong predictions that can be combined to generate more accurate predictions. With this in mind, we explore ways to merge the detection boxes of all refiners. A simple approach is to directly perform non-maximum suppression (NMS) on all boxes and combine the results by selecting the box with the highest confidence. However, this ignores many boxes with low confidence that have the potential to recover lost objects. To address this, we employ weighted box voting, which averages the detection confidences and combines the confidence-weighted boxes as
C = (1/S) Σ_{s=1}^{S} c_s,    B = (Σ_{s=1}^{S} c_s · b_s) / (Σ_{s=1}^{S} c_s),

with c_s and b_s the confidence and box output by the s-th of the S refinement stages,
where C and B are the merged confidence and box, respectively. After box voting, we obtain boxes with higher accuracy. Still, many redundant boxes remain, since each object is covered by multiple refined proposals. To remove redundant boxes, we finally perform NMS on the voting results to produce the detection outputs. Through this voting mechanism, the various predictions (with lower confidence and from different perspectives/scales) produced by the different refiners can be combined in a complementary manner into a more accurate and reliable final prediction.
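The weighted voting step above is simple to sketch: mean of confidences, confidence-weighted average of boxes. Box layout and names are assumptions; the 2D boxes in the example stand in for full 7-parameter 3D boxes:

```python
import numpy as np

def weighted_box_voting(boxes, confidences):
    """Merge the overlapping boxes produced by different refinement stages:
    the merged confidence C is the mean of the stage confidences, and the
    merged box B is the confidence-weighted average of the stage boxes."""
    c = np.asarray(confidences, dtype=float)
    b = np.asarray(boxes, dtype=float)
    merged_conf = c.mean()                           # C = (1/S) sum c_s
    merged_box = (c[:, None] * b).sum(axis=0) / c.sum()  # B = sum(c_s b_s) / sum(c_s)
    return merged_conf, merged_box
```

Merging a strong box (confidence 0.9) with a weak one (0.3) pulls the result toward the strong box while still letting the weak prediction contribute.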
B.Backbone Network
Many recent methods [4], [19] use 3D sparse convolutions as the backbone network to improve accuracy and efficiency, and we also adopt this setting. We first partition the raw point cloud P into small voxels. For each voxel, we compute its initial feature as the mean of the features of all interior points. We then employ 3D sparse convolutions to encode the point cloud into feature volumes. Here, the backbone consists of a series of 3×3×3 3D sparse convolution kernels, which progressively downsample the spatial features to 1×, 2×, 4×, and finally 8× resolution. The 3D features of the last layer are compressed along the height dimension into bird's-eye-view (BEV) features for generating object proposals.
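The voxelization step (mean of interior points per voxel) can be sketched without the sparse-convolution backbone, which is omitted here. Point layout (x, y, z followed by extra features) and function names are assumptions:

```python
import numpy as np

def voxelize_mean(points, voxel_size):
    """Partition points (x, y, z, feature...) into voxels and take the mean
    of each voxel's interior points as that voxel's initial feature; the
    sparse 3D convolutions that follow are not shown in this sketch."""
    points = np.asarray(points, dtype=float)
    coords = np.floor(points[:, :3] / voxel_size).astype(int)  # integer voxel indices
    voxels = {}
    for coord, point in zip(map(tuple, coords), points):
        voxels.setdefault(coord, []).append(point)
    return {c: np.mean(p, axis=0) for c, p in voxels.items()}
```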
C. Region Proposal Network
Recent work [10], [19] generates object proposals by applying a series of 2D convolutions on the BEV feature map. Specifically, we first predefine Np object templates, called anchors, on the last layer of the BEV map. Object proposals are generated by classifying the anchors and regressing the residuals of object size, location, and azimuth relative to the ground-truth box. Similar to [10], [19], ground-truth bounding boxes are assigned to anchors via IoU-based matching. The proposal network loss is defined as
L_RPN = L_cls + L_reg,

where L_cls is the classification loss over the anchors and L_reg is the regression loss over the residuals of the positive anchors.
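The residual targets regressed by the RPN can be illustrated with the standard anchor encoding used in [10], [19]. The exact parameterization below (diagonal/height normalization, log-ratio sizes, plain angle difference) is a common convention and is shown here as an assumption:

```python
import numpy as np

def encode_residuals(anchor, gt):
    """Anchor-to-ground-truth residual encoding: center offsets normalized
    by the anchor's BEV diagonal (x, y) and height (z), sizes as log-ratios,
    and orientation as a plain difference."""
    xa, ya, za, la, wa, ha, ra = anchor
    xg, yg, zg, lg, wg, hg, rg = gt
    d = np.hypot(la, wa)  # BEV diagonal of the anchor
    return np.array([(xg - xa) / d, (yg - ya) / d, (zg - za) / ha,
                     np.log(lg / la), np.log(wg / wa), np.log(hg / ha),
                     rg - ra])
```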
D.Overall Training Loss
CasA can be trained end to end with the RPN loss L_RPN and the CRN loss L_CRN. The two losses are combined with equal weights as L = L_RPN + L_CRN. The CRN loss is the sum of the refinement losses over all stages. At each refinement stage, a box regression loss L_reg and a score loss L_score are adopted, as in [10], [19]. For the i-th proposal in the j-th refinement stage, let s_i^j and ŝ_i^j denote the score prediction and target, and r_i^j and r̂_i^j the residual prediction and target; the CRN loss is then defined as

L_CRN = Σ_j (1/N_j) Σ_i [ L_score(s_i^j, ŝ_i^j) + L_reg(r_i^j, r̂_i^j) ],

where N_j is the number of proposals at stage j.
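Putting the pieces together, the overall objective is a plain sum. This sketch treats the per-stage losses as precomputed scalars; the function name and argument layout are assumptions:

```python
def total_loss(rpn_loss, stage_losses):
    """End-to-end objective: the RPN loss plus the sum over refinement
    stages of the per-stage score and box-regression losses, combined
    with equal weights (L = L_RPN + L_CRN)."""
    crn_loss = sum(score + reg for score, reg in stage_losses)
    return rpn_loss + crn_loss
```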
Experimental results

