LW-DETR

Posted 2024-09-16 | Updated 2024-10-23 | Category: Papers

LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection^[1]

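The authors are Qiang Chen, Xiangbo Su, Xinyu Zhang, and others, from Baidu, the University of Adelaide, Beihang University, the Institute of Automation (Chinese Academy of Sciences), and the Australian National University. Paper reference: [1].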

Key Words

Real-Time Detection With Transformer
interleaved window and global attention
window-major order feature map organization

Time

2024.Jun

Summary

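The authors propose LW-DETR, a light-weight detection transformer that outperforms YOLOs on real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. The approach draws on recent advances: training-effective techniques such as an improved loss and pretraining, and interleaved window and global attention to reduce the complexity of the ViT encoder. The ViT encoder is improved by aggregating multi-level feature maps, that is, the intermediate and final feature maps, into richer feature maps, and a window-major feature map organization is introduced to make the interleaved attention computation more efficient. The results show that the proposed approach outperforms existing real-time detectors, including YOLO and its variants.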

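Real-time object detection has a wide range of real-world applications. Current solutions are based on convolutional networks, such as the YOLO series. Transformer-based methods such as DETR have recently made considerable progress, but real-time detection with DETR has not been fully explored, and it is unclear how its performance compares with SOTA convolutional detectors. This paper proposes a light-weight DETR for object detection. The architecture is simple: a plain ViT encoder and a DETR decoder, connected by a convolutional projector. The authors aggregate multi-level feature maps, the intermediate and final feature maps in the encoder, to form stronger encoded feature maps. The method adopts effective training techniques, for example deformable cross-attention to form the decoder, an IoU-aware classification loss, and an encoder-decoder pretraining strategy. It also adopts inference-efficient techniques, for example interleaved window and global attention, where some global attention layers in the plain ViT encoder are replaced with window attention to reduce complexity. A window-major feature map organization gives an efficient implementation of the interleaved attention and effectively reduces costly memory permutation operations.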
LW-DETR consists of a ViT encoder, a projector, and a DETR decoder.
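- Encoder: A ViT serves as the detection encoder. A plain ViT consists of a patchification layer and transformer encoder layers. Because global self-attention is computationally expensive, some encoder layers use window self-attention to reduce the complexity. Multi-level feature maps, the intermediate and final feature maps in the encoder, are aggregated to form stronger encoded feature maps.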
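- Decoder: The decoder is a stack of transformer decoder layers. Each layer consists of self-attention, cross-attention, and an FFN. Deformable cross-attention is adopted for computational efficiency. DETR and its variants usually use 6 decoder layers; LW-DETR uses 3, which reduces inference time. A mixed-query selection scheme forms the object queries as the sum of content queries and spatial queries. The content queries are learnable embeddings, similar to DETR. The spatial queries come from a two-stage scheme: select the top-K features from the last layer of the projector, predict bounding boxes from them, and transform the corresponding boxes into embeddings that serve as the spatial queries.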
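- Projector: A projector connects the encoder and the decoder. It takes the aggregated encoded feature maps from the encoder as input. The projector is a C2f block (an extension of the cross-stage partial DenseNet), as used in YOLOv8. For the large and xlarge versions of LW-DETR, the projector is modified to output feature maps at two scales, and a multi-scale decoder is used accordingly. This projector contains two parallel C2f blocks. One processes the \(\frac{1}{8}\) feature map, which is obtained by upsampling the input with a deconvolution. The other processes the \(\frac{1}{32}\) feature map, which is obtained by downsampling the input with a stride convolution.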
- Objective function: An IoU-aware classification loss, the IA-BCE loss, is adopted:
\[\ell_{\mathrm{cls}}=\sum_{i=1}^{N_{pos}}\mathrm{BCE}(s_i,t_i)+\sum_{j=1}^{N_{neg}}s_j^2\,\mathrm{BCE}(s_j,0),\]

Here \(N_{pos}\) and \(N_{neg}\) are the numbers of positive and negative samples, \(s\) is the predicted classification score, and \(t\) is the target score absorbing the IoU score \(u\) (with the ground truth), \(t = s^{\alpha}u^{1-\alpha}\), with \(\alpha\) set to 0.25. The overall loss combines the classification loss and the bounding box loss, where the bounding box loss is the same as in the DETR framework:

\[ \ell_{\mathrm{cls}}+\lambda_{\mathrm{iou}} \ell_{\mathrm{iou}}+\lambda_{\ell_{1}} \ell_{1}. \]

\(\lambda_{\mathrm{iou}}\) and \(\lambda_{\ell_{1}}\) are set to 2.0 and 5.0, respectively. \(\ell_{\mathrm{iou}}\) and \(\ell_{1}\) are the generalized IoU (GIoU) loss and the L1 loss for box regression.
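As a rough illustration of the IA-BCE loss and the weighted total loss, here is a minimal NumPy sketch. The function names (`bce`, `ia_bce_loss`, `total_loss`) and the `eps` clipping are assumptions made for this example, and the GIoU and L1 terms are taken as precomputed numbers. It is not the authors' implementation.

```python
import numpy as np

def bce(s, t, eps=1e-12):
    """Element-wise binary cross-entropy between scores s and targets t."""
    s = np.clip(s, eps, 1.0 - eps)  # assumption: clip to avoid log(0)
    return -(t * np.log(s) + (1.0 - t) * np.log(1.0 - s))

def ia_bce_loss(s_pos, u_pos, s_neg, alpha=0.25):
    """IA-BCE classification loss as written in the formula.

    s_pos: predicted scores of the positive samples
    u_pos: IoU of each positive sample with its ground-truth box
    s_neg: predicted scores of the negative samples
    """
    s_pos = np.asarray(s_pos, dtype=float)
    u_pos = np.asarray(u_pos, dtype=float)
    s_neg = np.asarray(s_neg, dtype=float)
    # Target score t = s^alpha * u^(1 - alpha). In an autograd framework this
    # would typically be treated as a constant target (an assumption here).
    t = s_pos ** alpha * u_pos ** (1.0 - alpha)
    pos_term = bce(s_pos, t).sum()
    neg_term = (s_neg ** 2 * bce(s_neg, 0.0)).sum()
    return pos_term + neg_term

def total_loss(cls_loss, giou_loss, l1_loss, lambda_iou=2.0, lambda_l1=5.0):
    """Weighted sum of the classification, GIoU, and L1 terms."""
    return cls_loss + lambda_iou * giou_loss + lambda_l1 * l1_loss

# Usage with made-up numbers: two positives and three negatives.
cls = ia_bce_loss(s_pos=[0.8, 0.6], u_pos=[0.9, 0.5], s_neg=[0.1, 0.3, 0.05])
loss = total_loss(cls, giou_loss=0.4, l1_loss=0.1)
```

The \(s_j^2\) factor down-weights negatives that already have low scores, so confidently rejected queries contribute very little to the classification term.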
Efficient training:
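- More supervision: Various techniques introduce more supervision to accelerate DETR training. LW-DETR adopts Group DETR, which is easy to implement and does not change the inference process. Training uses 13 parallel weight-sharing decoders. For each decoder, the object queries for its group are generated from the output features of the projector. Only the primary decoder is used for inference.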
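- Pretraining on Objects365: Pretraining has two stages. First, the ViT is pretrained on the Objects365 dataset with an MIM method. Then, following earlier work, the encoder is retrained and the projector and decoder are trained on Objects365 in a supervised manner.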
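The Group DETR idea from the "More supervision" item can be pictured as follows. Since self-attention is the only place where queries interact, running several weight-sharing decoders on separate query groups is equivalent to running one decoder on the concatenated queries with a block-diagonal self-attention mask. The sketch below builds such a mask. The function names and the per-group query count are hypothetical, and the mask-based formulation is one possible realization, not necessarily what the authors' code does.

```python
import numpy as np

def group_self_attention_mask(num_queries, num_groups):
    """Boolean mask of shape (G*N, G*N); True means attention is allowed.

    Queries only attend to queries of the same group, which mimics
    num_groups parallel decoders that share weights.
    """
    group_id = np.repeat(np.arange(num_groups), num_queries)
    return group_id[:, None] == group_id[None, :]

def select_queries(queries, num_queries, training):
    """Use every group during training, only the primary group at inference."""
    return queries if training else queries[:num_queries]

# Usage: 13 groups as in the text; 4 queries per group is a toy value.
mask = group_self_attention_mask(num_queries=4, num_groups=13)
all_queries = np.zeros((13 * 4, 8))
inference_queries = select_queries(all_queries, num_queries=4, training=False)
```

Because only the first group is kept at inference, the extra groups add training-time supervision without changing the inference cost.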
Efficient inference:
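- A simple modification of earlier methods is made: interleaved window and global attention, in which some global attention layers are replaced by window attention. For example, in a 6-layer ViT, the first, third, and fifth layers use window attention. Window attention partitions the feature map into non-overlapping windows and performs attention within each window separately.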
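- A window-major feature map organization is adopted for efficient interleaved attention; it organizes the feature maps window by window. ViTDet organizes the feature maps row by row, which requires costly permutation operations to transition the feature maps from row-major to window-major for window attention. The authors' method removes these operations and thus reduces model latency. The window-major organization is illustrated with the following feature map: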
\[\begin{bmatrix}f_{11} & f_{12} & f_{13} & f_{14}\\f_{21} & f_{22} & f_{23} & f_{24}\\f_{31} & f_{32} & f_{33} & f_{34}\\f_{41} & f_{42} & f_{43} & f_{44}\end{bmatrix},\]

The window-major organization for a window size of \(2 \times 2\) is: \[f_{11},f_{12},f_{21},f_{22};f_{13},f_{14},f_{23},f_{24};\\f_{31},f_{32},f_{41},f_{42};f_{33},f_{34},f_{43},f_{44}.\]

This organization works for both window attention and global attention without rearranging the features. The row-major organization is:

\[f_{11},f_{12},f_{13},f_{14};f_{21},f_{22},f_{23},f_{24};\\f_{31},f_{32},f_{33},f_{34};f_{41},f_{42},f_{43},f_{44},\]

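which is fine for global attention but requires costly permutation operations to perform window attention. The large improvement comes from pretraining on Objects365, which indicates that transformers benefit from large-scale data.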
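The window-major organization described above can be reproduced on the \(4 \times 4\) example with a short NumPy sketch. The helper names (`to_window_major`, `attention`) are made up for this illustration, and the attention uses identity Q/K/V projections to keep the example small. It is a toy under those assumptions, not the model's actual layers.

```python
import numpy as np

def to_window_major(feat, win):
    """Reorder an (H, W, C) row-major map into (num_windows, win*win, C)."""
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0
    x = feat.reshape(H // win, win, W // win, win, C)
    # This transpose is the memory permutation that a row-major layout
    # needs before every window-attention layer.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape((H // win) * (W // win), win * win, C)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens):
    """Toy single-head self-attention over the second-to-last axis."""
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

# Label entry f_ij with the number 10*i + j, so f_11 -> 11 and f_44 -> 44.
labels = np.array([[10 * i + j for j in range(1, 5)] for i in range(1, 5)])
window_major_order = to_window_major(labels[..., None], win=2).reshape(-1).tolist()
# [11, 12, 21, 22, 13, 14, 23, 24, 31, 32, 41, 42, 33, 34, 43, 44]

# Once tokens are stored window-major, both attention types are plain reshapes.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 8))
tokens = to_window_major(feat, win=2)              # permute once, up front
window_out = attention(tokens)                     # attention inside each window
global_out = attention(tokens.reshape(1, 16, 8))   # attention over all tokens
```

Global self-attention treats tokens as a set, so the order in which they are stored does not change the result. This is why the window-major layout can be kept through both layer types without permuting back and forth.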

encoder \(Fig.1^{[1]}\)

\(Fig.2^{[2]}\) Single-scale projector and multi-scale projector for (a) the tiny, small, and medium models, and (b) the large and xlarge models.

LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection^[1]