RT-DETR

DETRs Beat YOLOs on Real-time Object Detection[1]

The authors are Yian Zhao, Wenyu Lv, and others from Peking University and Baidu. Citation [1]: Lv, Wenyu et al. “DETRs Beat YOLOs on Real-time Object Detection.” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024): 16965-16974.

Time

  • 2024.Apr

Key Words

  • hybrid encoder to process multi-scale features
  • uncertainty-minimal query selection to provide high-quality initial queries to the decoder
  • decoupling of intra-scale interaction and cross-scale fusion

Motivation

The YOLO series suffers from NMS, which slows down inference, and the NMS thresholds must be chosen carefully for different scenarios. DETR needs no hand-crafted components and no NMS, but its computational cost is high. Exploring how to make DETR real-time is therefore an important direction.

Summary

  1. The YOLO series has become the most popular framework for real-time object detection because of its trade-off between speed and accuracy. However, both the speed and the accuracy of YOLO are negatively affected by NMS: it not only slows down inference but also introduces hyperparameters that make speed and accuracy unstable. The end-to-end DETR needs no NMS and removes the hand-crafted components, but its computational cost is still high, which makes real-time operation difficult.
  2. The interaction of multi-scale features is computationally expensive, so a real-time DETR requires a redesigned encoder; object queries are also hard to optimize, which hinders the performance of DETRs. Query selection schemes have been proposed to replace the vanilla learnable embeddings with encoder features. However, current query selection directly uses the classification score for selection, ignoring that the detector must model both the category and the location of an object, and both determine the quality of the features. This inevitably lets encoder features with low localization confidence be selected as initial queries and thus introduces considerable uncertainty. The authors therefore treat query initialization as a breakthrough point for further improving performance.

  3. This paper proposes RT-DETR. To accelerate the processing of multi-scale features, an efficient hybrid encoder replaces the vanilla Transformer encoder: it improves speed by decoupling the intra-scale interaction and the cross-scale fusion of features of different scales. To avoid encoder features with low localization confidence being selected as initial queries, uncertainty-minimal query selection is proposed, which improves accuracy by explicitly optimizing the uncertainty and providing high-quality initial queries to the decoder. In addition, thanks to the multi-layer decoder architecture of DETR, RT-DETR supports flexible speed tuning to adapt to different real-time scenarios without retraining.

  4. DETR has several problems: (1) slow convergence; (2) high computational cost; (3) queries that are hard to optimize. Deformable-DETR accelerates training convergence by making the attention mechanism with multi-scale features more efficient. DAB-DETR and DN-DETR improve performance by introducing an iterative refinement scheme and denoising training. Group-DETR introduces group-wise one-to-many assignment. Efficient DETR and Sparse DETR reduce the computational cost by shrinking the number of encoder and decoder layers or the number of updated queries. Lite DETR improves encoder efficiency by lowering the update frequency of low-level features. Conditional DETR and Anchor DETR reduce the optimization difficulty of the queries. Query selection has also been proposed for two-stage DETRs, and DINO uses mixed query selection to initialize queries better.

  5. NMS is widely used as a post-processing step in object detection to eliminate overlapping output boxes. NMS has two thresholds: a confidence threshold and an IoU threshold. Specifically, boxes whose score falls below the confidence threshold are filtered out; whenever the IoU of two boxes exceeds the IoU threshold, the box with the lower score is discarded. This process is repeated until all boxes have been processed, so the execution time of NMS depends on the number of boxes and on the two thresholds. To verify this observation, YOLOv5 (anchor-based) and YOLOv8 (anchor-free) are used for the analysis. A code sketch of the procedure follows.
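    A minimal sketch of this two-threshold procedure, assuming torchvision-style boxes in (x1, y1, x2, y2) format; it illustrates the algorithm, not the YOLO implementation:

    ```python
    import torch
    from torchvision.ops import box_iou

    def nms_sketch(boxes, scores, conf_thresh=0.25, iou_thresh=0.5):
        """Greedy NMS driven by the two thresholds discussed above.

        boxes:  (N, 4) tensor in (x1, y1, x2, y2) format.
        scores: (N,) per-box confidence scores.
        Returns indices (into the confidence-filtered set) of surviving boxes.
        """
        # Confidence threshold: drop low-score boxes before any IoU work.
        keep_mask = scores > conf_thresh
        boxes, scores = boxes[keep_mask], scores[keep_mask]

        order = scores.argsort(descending=True)
        keep = []
        while order.numel() > 0:
            i = order[0]
            keep.append(int(i))
            if order.numel() == 1:
                break
            # IoU threshold: discard lower-scored boxes that overlap too much
            # with the box just kept.
            ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
            order = order[1:][ious <= iou_thresh]
        return keep
    ```

    The loop runs over fewer boxes as the confidence threshold rises, which is exactly why the execution time depends on both thresholds and on the number of boxes.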

  6. First, the number of remaining boxes is counted after filtering the predictions on the same input with different confidence thresholds; values sampled from 0.001 to 0.25 serve as confidence thresholds for counting the remaining boxes of the two detectors. The results show that as the confidence threshold increases, more prediction boxes are filtered out and fewer boxes remain for the IoU computation, which reduces the NMS execution time. The experiments also show that, at the same accuracy, anchor-free detectors outperform anchor-based ones because they spend far less time in NMS. (A sketch of the counting step follows.)
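    The counting step can be sketched as below; the score tensor is a random stand-in for raw detector outputs, not actual YOLOv5/YOLOv8 predictions:

    ```python
    import torch

    scores = torch.rand(8400)  # hypothetical raw per-box confidences for one image
    for conf_thresh in [0.001, 0.01, 0.05, 0.1, 0.25]:
        remaining = int((scores > conf_thresh).sum())
        print(f"conf={conf_thresh:>5}: {remaining} boxes enter the IoU stage")
    ```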

  7. Model: RT-DETR consists of a backbone, an efficient hybrid encoder, and a Transformer decoder with auxiliary prediction heads. The features of the last three backbone stages are fed into the encoder, and the hybrid encoder transforms the multi-scale features into a sequence of image features through intra-scale feature interaction and cross-scale feature fusion. Uncertainty-minimal query selection then selects a fixed number of encoder features as initial queries for the decoder. Finally, the decoder with auxiliary prediction heads iteratively optimizes the object queries to produce categories and boxes. (A dataflow sketch follows.)
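    The dataflow can be summarized in a short sketch; the module classes and the placeholder scoring are assumptions for illustration, not the released implementation:

    ```python
    import torch.nn as nn

    class RTDETRSketch(nn.Module):
        """Dataflow only: backbone -> hybrid encoder -> query selection -> decoder."""

        def __init__(self, backbone, hybrid_encoder, decoder, num_queries=300):
            super().__init__()
            self.backbone = backbone              # yields the last three stage features
            self.hybrid_encoder = hybrid_encoder  # AIFI + CCFF
            self.decoder = decoder                # decoder with auxiliary heads
            self.num_queries = num_queries

        def forward(self, images):
            s3, s4, s5 = self.backbone(images)        # multi-scale features
            memory = self.hybrid_encoder(s3, s4, s5)  # (B, L, D) feature sequence
            # Placeholder for uncertainty-minimal query selection: pick top-K
            # encoder features as initial object queries (scoring sketched later).
            scores = memory.mean(-1)                  # (B, L) stand-in score
            topk = scores.topk(self.num_queries, dim=1).indices
            queries = memory.gather(
                1, topk.unsqueeze(-1).expand(-1, -1, memory.size(-1)))
            return self.decoder(queries, memory)      # categories and boxes
    ```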

  8. The introduction of multi-scale features accelerates training convergence and improves performance. However, although deformable attention reduces the computational cost, the longer sequence turns the encoder into a computational bottleneck. To overcome it, the authors first analyze the computational redundancy in the multi-scale encoder. Intuitively, the high-level features already contain rich semantic information about objects extracted from the low-level features, so performing feature interaction on the concatenated multi-scale features is redundant. The authors therefore design different types of encoders and show that simultaneous intra-scale and cross-scale feature interaction is inefficient. They use DINO-Deformable-R50 with the smaller data reader and lighter decoder used in RT-DETR for the experiments: first the multi-scale Transformer encoder in DINO-Deformable-R50 is removed to obtain variant A, and then different types of encoders are added on top of A to produce the other variants.

  9. Based on the above analysis, a hybrid encoder is proposed, consisting of two modules: Attention-based Intra-scale Feature Interaction (AIFI) and CNN-based Cross-scale Feature Fusion (CCFF). Specifically, AIFI builds on variant D and further reduces the computational cost by performing intra-scale interaction only on \(S_5\). The reason is that applying self-attention to the high-level features with richer semantic concepts captures the connections between conceptual entities, which helps subsequent modules locate and recognize objects. Intra-scale interaction on the lower-level features, in contrast, is unnecessary because they lack semantic concepts, and it risks duplication and confusion with the high-level interaction. To verify this view, intra-scale interaction is performed only on \(S_5\) in variant D; compared with D, \(D_{S5}\) not only reduces latency significantly but also improves accuracy. CCFF is optimized from the cross-scale fusion module by inserting several fusion blocks, composed of convolutional layers, into the fusion path. The role of a fusion block is to fuse two adjacent-scale features into a new feature; it contains two \(1 \times 1\) convolutions to adjust the number of channels and \(N\) RepBlocks for feature fusion. The hybrid encoder thus computes

    \[\mathcal{Q}=\mathcal{K}=\mathcal{V}=\operatorname{Flatten}(\mathcal{S}_{5}),\\ \mathcal{F}_{5}=\operatorname{Reshape}(\operatorname{AIFI}(\mathcal{Q},\mathcal{K},\mathcal{V})),\\ \mathcal{O}=\operatorname{CCFF}(\{\mathcal{S}_{3},\mathcal{S}_{4},\mathcal{F}_{5}\}),\]

    where Reshape restores the flattened feature to the same shape as \(S_5\). A code sketch of AIFI follows.
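    A minimal sketch of the AIFI computation in the equation above, assuming a standard `nn.TransformerEncoderLayer` as the intra-scale attention block and omitting the positional encoding:

    ```python
    import torch
    import torch.nn as nn

    class AIFISketch(nn.Module):
        """Intra-scale interaction on S5 only: Flatten -> self-attention -> Reshape."""

        def __init__(self, dim=256, num_heads=8):
            super().__init__()
            self.layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=num_heads, batch_first=True)

        def forward(self, s5):                  # s5: (B, C, H, W)
            b, c, h, w = s5.shape
            q = s5.flatten(2).transpose(1, 2)   # Q = K = V = Flatten(S5): (B, H*W, C)
            f5 = self.layer(q)                  # self-attention within the scale
            return f5.transpose(1, 2).reshape(b, c, h, w)  # Reshape to S5's shape

    s5 = torch.randn(2, 256, 20, 20)            # a hypothetical S5 feature map
    print(AIFISketch()(s5).shape)               # torch.Size([2, 256, 20, 20])
    ```

    The output \(F_5\) then enters CCFF together with \(S_3\) and \(S_4\).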

  10. Uncertainty-minimal Query Selection: to reduce the difficulty of optimizing object queries, query selection schemes have been proposed. What they have in common is that they use a confidence score to select the top-K features from the encoder to initialize the object queries (or just the position queries); the confidence score indicates the likelihood that a feature contains a foreground object. However, the detector must model both the category and the location of an object, and both determine the quality of the features, so a feature's score should reflect both classification and localization. Current query selection therefore introduces considerable uncertainty and yields a sub-optimal initialization for the decoder, which hinders detector performance. To address this, the uncertainty-minimal query selection scheme is proposed, which explicitly constructs and optimizes the epistemic uncertainty. Concretely, the feature uncertainty \(\mathcal{U}\) is defined as the discrepancy between the predicted distributions of localization \(\mathcal{P}\) and classification \(\mathcal{C}\). To minimize the uncertainty of the queries, the uncertainty is integrated into the loss function for gradient-based optimization:

    \[\mathcal{U}(\hat{\mathcal{X}})=\|\mathcal{P}(\hat{\mathcal{X}})-\mathcal{C}(\hat{\mathcal{X}})\|,\quad\hat{\mathcal{X}}\in\mathbb{R}^{D},\\ \mathcal{L}(\hat{\mathcal{X}},\hat{\mathcal{Y}},\mathcal{Y})=\mathcal{L}_{box}(\hat{\mathbf{b}},\mathbf{b})+\mathcal{L}_{cls}(\mathcal{U}(\hat{\mathcal{X}}),\hat{\mathbf{c}},\mathbf{c})\]

    \(\hat{\mathcal{Y}}\) and \(\mathcal{Y}\) denote the prediction and the ground truth, \(\hat{\mathcal{Y}}=\{\hat{\mathbf{c}},\hat{\mathbf{b}}\}\), where \(\hat{\mathbf{c}}\) and \(\hat{\mathbf{b}}\) denote the category and the bounding box, respectively, and \(\hat{\mathcal{X}}\) denotes the encoder feature. (A sketch of the uncertainty term follows.)
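    A sketch of the uncertainty term, assuming scalar per-feature localization and classification confidences have already been predicted by heads; note that the paper minimizes \(\mathcal{U}\) through the loss above during training rather than computing it this directly:

    ```python
    import torch

    def uncertainty(p_loc, c_cls):
        """U(X) = ||P(X) - C(X)||: gap between the localization and
        classification confidence of each encoder feature."""
        return (p_loc - c_cls).abs()

    # Hypothetical confidences for 8400 encoder features of one image.
    p_loc = torch.rand(8400)   # localization confidence P(X)
    c_cls = torch.rand(8400)   # classification confidence C(X)
    u = uncertainty(p_loc, c_cls)

    # Features with the smallest uncertainty make the best initial queries.
    init_idx = u.topk(300, largest=False).indices
    ```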

  11. Scaled RT-DETR: real-time detectors usually provide models at different scales to suit different scenarios, and RT-DETR supports flexible scaling. Specifically, for the hybrid encoder, the width is controlled by the embedding dimension and the number of channels, and the depth by the number of Transformer layers and RepBlocks; the width and depth of the decoder are controlled by the number of object queries and decoder layers. Moreover, by adjusting the number of decoder layers, the speed of RT-DETR can be tuned flexibly without retraining. (A hypothetical configuration sketch follows.)
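    The scaling knobs can be summarized as a hypothetical configuration; all key names below are invented for illustration, not the released implementation's parameter names:

    ```python
    # Hypothetical scaling config for one RT-DETR variant.
    rt_detr_scale = {
        "encoder": {
            "embed_dim": 256,      # width: embedding dimension
            "ccff_channels": 256,  # width: channel count in CCFF
            "aifi_layers": 1,      # depth: Transformer layers in AIFI
            "rep_blocks": 3,       # depth: RepBlocks per fusion block
        },
        "decoder": {
            "num_queries": 300,    # width: number of object queries
            "num_layers": 6,       # depth: can also be reduced at inference
        },                         # time for speed tuning without retraining
    }
    ```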

Different types of encoders \(Fig.1^{[1]}\) The encoder structure for each variant. SSE represents the single-scale Transformer encoder, MSE represents the multi-scale Transformer encoder, and CSF represents cross-scale fusion. AIFI and CCFF are the two modules designed into our hybrid encoder. Variant B inserts a single-scale Transformer encoder into A, which uses one layer of Transformer block. The multi-scale features share the encoder for intra-scale feature interaction and are then concatenated as output. Variant C introduces cross-scale feature fusion based on B and feeds the concatenated features into the multi-scale Transformer encoder to perform simultaneous intra-scale and cross-scale feature interaction. Variant D decouples intra-scale interaction and cross-scale fusion by utilizing the single-scale Transformer encoder for the former and a PANet-style structure for the latter. Variant E enhances the intra-scale interaction and cross-scale fusion based on D, adopting an efficient hybrid encoder designed by us.

Overview \(Fig.2^{[1]}\) Overview of RT-DETR. We feed the features from the last three stages of the backbone into the encoder. The efficient hybrid encoder transforms multi-scale features into a sequence of image features through the Attention-based Intra-scale Feature Interaction (AIFI) and the CNN-based Cross-scale Feature Fusion (CCFF). Then, the uncertainty-minimal query selection selects a fixed number of encoder features to serve as initial object queries for the decoder. Finally, the decoder with auxiliary prediction heads iteratively optimizes object queries to generate categories and boxes.

CCFF \(Fig.3^{[1]}\) The fusion block in CCFF