DN-DETR

Posted on 2025-03-17 | Category: Papers | Word count: 3.4k | Reading time ≈ 12 min

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising^[1]

The authors are Feng Li, Hao Zhang, and colleagues from HKUST and other institutions. Citation [1]: Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., & Zhang, L. (2022). DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13609-13617.

Time

2022.Dec

Key Words

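Denoising Training
In one sentence: the authors find that a major cause of the slow convergence of DETR-like methods is bipartite matching, which is unstable during training. They therefore add denoising training for boxes and labels, which accelerates convergence and improves performance.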

Summary

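The authors present a denoising training method that accelerates DETR training and offers a deeper understanding of why DETR-like methods converge slowly. They show that the slow convergence stems from the instability of bipartite matching, which causes inconsistent optimization goals in the early training stages. To address this, in addition to the Hungarian loss, the method feeds noised GT bounding boxes into the Transformer decoder and trains the model to reconstruct the original boxes. This effectively reduces the difficulty of bipartite graph matching and leads to faster convergence. The method is general and can be plugged into any DETR-like model with a clear improvement.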

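Object detection is a fundamental task that aims to predict the bounding boxes and classes of objects. In contrast to earlier detectors, DETR uses learnable queries to probe image features from the output of Transformer encoders and uses bipartite matching to perform set-based box prediction. This design removes the need for hand-designed anchors and NMS and makes object detection end-to-end optimizable. However, DETR converges very slowly: it usually needs 500 epochs to reach good performance, compared with 12 epochs for the original Faster RCNN. Many works have tried to identify the root cause and mitigate the slow convergence. Some address it by improving the model architecture. For example, Sun et al. attribute the slow convergence to the inefficiency of cross-attention and propose an encoder-only DETR. Another work designs an RoI-based dynamic decoder to help the decoder focus on regions of interest. Many recent works propose associating each DETR query with one specific spatial position rather than multiple positions. For example, Conditional DETR decouples each query into a content part and a positional part, forcing the query to correspond to one spatial position. Deformable DETR and Anchor DETR directly treat 2D reference points as queries to perform cross-attention, and DAB-DETR formulates queries as 4D anchor boxes that are refined layer by layer.
Despite this progress, few works pay attention to the bipartite graph matching part. In this work, the authors find that slow convergence also results from the discrete bipartite graph matching, which is unstable in the early training stages because of the stochastic nature of optimization. As a result, for the same image, a query is often matched with different objects in different epochs, which makes the optimization ambiguous and inconsistent. To address this, the authors introduce a query denoising task to help stabilize bipartite graph matching. Previous works have shown that it is effective to interpret queries as reference points or anchor boxes, which carry positional information. Following this line, the authors use 4D anchor boxes as queries. The method feeds noised GT bounding boxes as noised queries into the Transformer decoder together with the learnable anchor queries. Both kinds of queries share the same input format (x, y, w, h), so they can be fed into the decoder simultaneously. For the noised queries, a denoising task reconstructs the corresponding GT boxes. For the learnable anchor queries, the training loss and bipartite matching are the same as in the vanilla DETR. Because the noised bounding boxes do not go through bipartite matching, the denoising task can be regarded as an auxiliary task that helps DETR alleviate the unstable discrete bipartite matching and learn box prediction more quickly. The denoising task also lowers the optimization difficulty, because the added random noise is usually small. To maximize the potential of this auxiliary task, each decoder query is treated as a bounding box plus a class label embedding, so that both box denoising and label denoising can be performed.
In summary, the method is a denoising training approach. The loss function has two components: a reconstruction loss and a Hungarian loss, the latter being the same as in other DETR-like methods. The method can be easily plugged into existing DETR-like models. For convenience, the authors evaluate it on DAB-DETR, because its decoder queries are explicitly formulated as 4D anchor boxes. For DETR variants that only support 2D anchor points, such as Anchor DETR, denoising can be applied to the anchor points. For methods that do not support anchors, such as the vanilla DETR, a linear transformation can map 4D anchor boxes to the same latent space as the learnable queries.
To the best of the authors' knowledge, this is the first work to introduce denoising into detection models. The contributions are as follows: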
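- The authors design a novel training method to accelerate DETR training. Experiments show that the method not only accelerates training but also leads to better results.
- The authors analyze why DETR converges slowly and give a deeper understanding of DETR training. They design a metric to evaluate the instability of bipartite matching and verify that the method effectively lowers this instability.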
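DETR-based methods have also seen many improvements. Zhu et al. design an attention module that attends only to a few sampling points around a reference point. Meng et al. decouple each decoder query into a content part and a position part and use only the content-to-content and position-to-position terms in the cross-attention formulation. Yao et al. use a Region Proposal Network to propose top-K anchor points. DAB-DETR uses 4D box coordinates as queries and updates the boxes layer by layer in a cascade manner. Despite this progress, none of these works identifies the bipartite matching used in the Hungarian loss as a main reason for slow convergence. Sun et al. analyze the impact of the Hungarian loss by using a pre-trained DETR as a teacher that provides the GT label assignment for training a student model. They find that the label assignment only helps convergence in the early stage of training and does not affect the final performance. They therefore conclude that the Hungarian loss is not the main reason for slow convergence. In this work, the authors give a different analysis together with an effective solution, which leads to a different conclusion.
The authors adopt DAB-DETR as the base detection architecture for evaluating the training method, where the decoder embedding part is replaced by a label embedding appended with an indicator to support label denoising. The main difference from other methods lies in the training method: in addition to the Hungarian loss, the authors add a denoising loss as an easier auxiliary task, which accelerates training and improves performance. Another work augments the sequence with synthetic noise objects, but it differs from this method. It sets the targets of the noise objects to a "noise" class so that the End-of-Sentence token is delayed and recall improves. In contrast, the authors set the target of each noised box to the original box, with the motivation of bypassing bipartite matching and directly learning to approximate the GT boxes.
The authors are glad to see that many recent detection models adopt denoising training to accelerate convergence, for example DINO, Mask DINO, Group DETR, and SAM-DETR++. DINO develops denoising training further by feeding hard-negative samples and training the model to reject them. The resulting Contrastive DeNoising (CDN) further improves performance. Mask DINO extends denoising to image segmentation by reconstructing masks from noised boxes. Group DETR and SAM-DETR++ also adopt denoising training in their models to achieve better performance. These models demonstrate the effectiveness and generalization ability of the method.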
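The Hungarian algorithm is a popular algorithm for assignment problems: given a cost matrix, it outputs the optimal matching. DETR is the first to adopt Hungarian matching in object detection, solving detection through a matching between predicted objects and GT objects. DETR turns GT assignment into a dynamic process, and the discrete bipartite matching brings an instability problem. Prior work has shown that Hungarian matching does not yield a stable matching because blocking pairs exist. A small change in the cost matrix can cause a large change in the matching result, which in turn leads to inconsistent optimization goals. The authors view the training of DETR-like methods as two stages: learning good anchors and learning relative offsets. Decoder queries are responsible for learning anchors, and inconsistent updates of the anchors make it hard to learn relative offsets. The method therefore uses a denoising task as a training shortcut to make relative offset learning easier, since the denoising task bypasses bipartite matching. Because each decoder query is interpreted as a 4D anchor box, a noised query can be regarded as a good anchor that has a corresponding GT box nearby. Denoising training thus has a clear optimization goal: predict the original bounding box. This avoids the ambiguity introduced by Hungarian matching.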
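To illustrate how a small change in the cost matrix can flip the matching, here is a minimal sketch. It uses a brute-force search over permutations in place of the Hungarian algorithm (practical only for tiny matrices), and the cost values are made up for illustration. `optimal_assignment` is a hypothetical helper, not code from the paper.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Return the min-cost one-to-one assignment for a small square cost matrix.

    Brute force over all permutations; result[q] is the target index assigned
    to query q. This stands in for the Hungarian algorithm on toy inputs only.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[q][perm[q]] for q in range(n)),
    )
    return list(best)


# Illustrative costs (assumed values): rows are queries, columns are GT objects.
cost_before = [[0.50, 0.51],
               [0.51, 0.50]]
# Each diagonal entry grows by only 0.02.
cost_after = [[0.52, 0.51],
              [0.51, 0.52]]

match_before = optimal_assignment(cost_before)
match_after = optimal_assignment(cost_after)
print(match_before, match_after)
```

In this toy case both queries swap their targets after the perturbation, so the regression target of each query changes from one step to the next.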
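To quantitatively evaluate the instability of bipartite matching, the authors design a metric. During training, for a given image, denote the predicted objects from the Transformer decoder in the i-th epoch as \(O^i = \{O^i_0, O^i_1, \dots, O^i_{N-1}\}\), where N is the number of predicted objects, and the GT objects as \(T = \{T_0, T_1, \dots, T_{M-1}\}\), where M is the number of GT objects. After bipartite matching, an index vector \(V^i = \{V^i_0, V^i_1, \dots, V^i_{N-1}\}\) is computed to store the matching result of epoch i.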

\[V_n^i=\left\{\begin{array}{ll}m, & \mathrm{if} \ O_n^i \ \mathrm{matches} \ T_m \\ -1, & \mathrm{if} \ O_n^i \ \mathrm{matches}\ \mathrm{nothing} \end{array}\right.\]

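The instability of epoch \(i\) for this image is the number of entries that differ between \(V^i\) and \(V^{i-1}\):
\[IS^i=\sum_{n=0}^{N-1}\mathbb{1}(V_n^i\neq V_n^{i-1})\]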
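The metric above can be sketched in a few lines. This is an illustrative sketch for a single image; `matching_index_vector` and `instability` are hypothetical names, and the matching results in the example are made up.

```python
def matching_index_vector(num_queries, matches):
    """Build the index vector V for one epoch.

    `matches` maps a query index n to the GT index m it was matched with.
    Queries that matched nothing get -1.
    """
    return [matches.get(n, -1) for n in range(num_queries)]


def instability(prev_vector, curr_vector):
    """Count the queries whose matched GT index changed between two epochs."""
    if len(prev_vector) != len(curr_vector):
        raise ValueError("index vectors must have the same length")
    return sum(1 for prev, curr in zip(prev_vector, curr_vector) if prev != curr)


# Toy example with N = 4 queries and M = 2 GT objects (assumed values).
v_prev = matching_index_vector(4, {0: 0, 2: 1})  # query 2 matches GT 1
v_curr = matching_index_vector(4, {0: 0, 3: 1})  # now query 3 matches GT 1
is_value = instability(v_prev, v_curr)
print(v_prev, v_curr, is_value)
```

Here GT object 1 moves from query 2 to query 3, so two entries of the index vector change.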
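The authors also show that DN-DETR helps detection by reducing the distance between anchors and their corresponding targets. DETR's visualization shows that its positional queries have multiple operating modes, so a query searches a wide region for a predicted box. DN-DETR, in contrast, has a much smaller mean distance between the initial anchors and the targets. To verify this, the authors compute the mean \(l_1\) distance between initial anchors and matched GT boxes. Because denoising training trains the model to reconstruct boxes from noised ones, the model searches more locally for its prediction, which makes each query focus on nearby regions and prevents potential conflicts between queries.
The authors build the training method on DAB-DETR. As in DAB-DETR, the decoder queries are explicitly formulated as box coordinates. The only architectural difference lies in the decoder embedding, which is specified as a class label embedding to support label denoising. As in DETR, the architecture contains a Transformer encoder and a Transformer decoder. On the encoder side, image features extracted by a CNN are fed into the encoder to obtain refined image features. On the decoder side, queries are fed into the decoder to search for objects through cross-attention. The decoder queries have two parts. One is the matching part, whose inputs are learnable anchors, the same as in DETR. The other is the denoising part, whose inputs are noised GT box-label pairs, referred to as GT objects. The goal of the denoising part's outputs is to reconstruct the GT objects. To increase the efficiency of denoising, the authors propose using multiple noised versions of the GT objects in the denoising part. In addition, an attention mask prevents information from leaking from the denoising part to the matching part.
Many recent works associate DETR queries with different kinds of positional information. DAB-DETR follows this analysis and explicitly formulates each query as 4D anchor coordinates. The anchor coordinates are also updated dynamically layer by layer. The output of each decoder layer contains a tuple \((\Delta x, \Delta y, \Delta w, \Delta h)\), and the anchor is updated to \((x+\Delta x, y+\Delta y, w+\Delta w, h+\Delta h)\).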
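Assuming the tuple produced by each decoder layer is a set of offsets (Δx, Δy, Δw, Δh) that is added to the current anchor, layer-by-layer refinement can be sketched as follows. The function names and numbers are hypothetical, and actual implementations may apply the offsets in a different parameterization that this sketch does not cover.

```python
def refine_anchor(anchor, deltas):
    """Apply one layer's offsets to an (x, y, w, h) anchor by simple addition."""
    return tuple(value + delta for value, delta in zip(anchor, deltas))


def refine_through_layers(anchor, per_layer_deltas):
    """Return the anchor before any layer and after each decoder layer."""
    history = [anchor]
    for deltas in per_layer_deltas:
        anchor = refine_anchor(anchor, deltas)
        history.append(anchor)
    return history


# Toy example: one anchor refined by two decoder layers (assumed offsets).
initial_anchor = (0.5, 0.5, 0.25, 0.25)
layer_outputs = [
    (0.125, 0.0, 0.25, -0.125),
    (-0.0625, 0.0625, 0.0, 0.0),
]
anchor_history = refine_through_layers(initial_anchor, layer_outputs)
for layer_index, box in enumerate(anchor_history):
    print(layer_index, box)
```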
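For each image, all GT objects are collected and random noise is added to both their bounding boxes and class labels. To maximize the utility of denoising learning, multiple noised versions are used for each GT object. Noise is added to boxes in two ways: center shifting and box scaling. Center shifting adds random noise \((\Delta x, \Delta y)\) to the box center, bounded so that the shifted center still lies inside the original bounding box. Box scaling uses a hyperparameter to control a range from which the width and height of the box are randomly sampled. For label noising, the authors adopt label flipping, which randomly flips some GT labels to other labels. Label flipping forces the model to predict the GT labels from the noised boxes, so it better captures the label-box relationship. A hyperparameter controls the ratio of flipped labels. The reconstruction losses are \(l_1\) loss and GIoU loss for boxes and focal loss for class labels. Denoising is used only during training and is removed at inference.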
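The noising procedure can be sketched as below. This is a minimal illustration, not the authors' code: the names `lambda1`, `lambda2`, and `gamma` and their default values are assumptions, as is the exact bound used here (center shift below `lambda1 * w / 2` and `lambda1 * h / 2`, sizes sampled in `[(1 - lambda2) * size, (1 + lambda2) * size]`). Boxes are assumed to be normalized `(cx, cy, w, h)` tuples.

```python
import random


def noise_box(box, lambda1=0.4, lambda2=0.4, rng=random):
    """Return a noised copy of a (cx, cy, w, h) box."""
    cx, cy, w, h = box
    # Center shifting: with lambda1 < 1 the shift stays below half the box
    # size, so the new center remains inside the original box.
    dx = rng.uniform(-1.0, 1.0) * lambda1 * w / 2
    dy = rng.uniform(-1.0, 1.0) * lambda1 * h / 2
    # Box scaling: sample each side within +/- lambda2 of its original size.
    new_w = w * (1 + rng.uniform(-lambda2, lambda2))
    new_h = h * (1 + rng.uniform(-lambda2, lambda2))
    return (cx + dx, cy + dy, new_w, new_h)


def flip_label(label, num_classes, gamma=0.2, rng=random):
    """With probability gamma, replace the label with a different class."""
    if rng.random() < gamma:
        other = rng.randrange(num_classes - 1)
        # Skip over the original label so the flipped label always differs.
        return other if other < label else other + 1
    return label


def make_noised_groups(gt_objects, num_groups, num_classes, rng=random):
    """Create several noised versions of all GT (box, label) pairs."""
    return [
        [(noise_box(box, rng=rng), flip_label(label, num_classes, rng=rng))
         for box, label in gt_objects]
        for _ in range(num_groups)
    ]


gt = [((0.5, 0.5, 0.2, 0.4), 3), ((0.3, 0.3, 0.1, 0.1), 7)]
groups = make_noised_groups(gt, num_groups=5, num_classes=80,
                            rng=random.Random(0))
print(len(groups), len(groups[0]))
```

The reconstruction target of every noised pair is the original `(box, label)` it was derived from, so no matching step is needed for these queries.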
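The attention mask is an important component: without it, denoising training hurts performance instead of improving it. To introduce the attention mask, the noised GT objects are first divided into groups, where each group is one noised version of all GT objects. The purpose of the attention mask is to ensure two things. First, the matching part cannot see the denoising part, because the noised GT objects carry information about the ground truth and would leak it. Second, the denoising groups cannot see each other.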
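A minimal sketch of such a mask, under stated assumptions: the denoising queries come first and the matching queries last, rows index the attending query, columns index the attended query, and `True` means "blocked". `build_dn_attention_mask` is a hypothetical helper written with plain lists for clarity.

```python
def build_dn_attention_mask(num_groups, group_size, num_matching):
    """Return a square boolean mask where mask[i][j] is True if query i
    must not attend to query j.

    Layout (assumed): [group 0 | group 1 | ... | matching queries].
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    num_denoising = num_groups * group_size
    total = num_denoising + num_matching
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(num_denoising):
            # Column j is a denoising query. It is visible only to queries
            # of its own group; other groups and all matching queries are
            # blocked from seeing it.
            if i // group_size != j // group_size:
                mask[i][j] = True
    return mask


mask = build_dn_attention_mask(num_groups=2, group_size=2, num_matching=3)
for row in mask:
    print("".join("1" if blocked else "0" for blocked in row))
```

Note that the denoising queries are still allowed to attend to the matching queries in this sketch, since the learnable anchors contain no ground-truth information.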
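The decoder embedding is specified as a label embedding to support both box denoising and label denoising. Besides the 80 classes in COCO 2017, the authors also use an unknown class embedding in the matching part so that it stays semantically consistent with the denoising part. An indicator is appended to the label embedding: it is 1 if the query belongs to the denoising part and 0 otherwise.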
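The construction above can be sketched as follows. The embedding table here is a randomly initialized stand-in for a learned one, and `EMBED_DIM`, `UNKNOWN_CLASS`, and `decoder_embedding` are hypothetical names chosen for illustration.

```python
import random

NUM_CLASSES = 80             # COCO 2017 classes
UNKNOWN_CLASS = NUM_CLASSES  # extra index reserved for the matching part
EMBED_DIM = 7                # illustrative size; the indicator adds one more

_rng = random.Random(0)
# Stand-in for a learned table with one row per class plus the unknown class.
label_table = [
    [_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(NUM_CLASSES + 1)
]


def decoder_embedding(label, is_denoising):
    """Look up the label embedding and append the indicator (1.0 or 0.0)."""
    indicator = 1.0 if is_denoising else 0.0
    return label_table[label] + [indicator]


# A denoising query carries a (possibly flipped) GT label.
dn_query = decoder_embedding(label=17, is_denoising=True)
# A matching query has no known label, so it uses the unknown class.
match_query = decoder_embedding(label=UNKNOWN_CLASS, is_denoising=False)
print(len(dn_query), dn_query[-1], match_query[-1])
```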
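DN-Deformable-DETR: to show that denoising training is effective with other attention designs, the authors integrate it into Deformable DETR. All other settings follow Deformable DETR, except that the queries are specified as 4D boxes, as in DAB-DETR, to make better use of denoising training.
Conclusion: the paper analyzes why DETR training converges slowly, attributes it to unstable bipartite matching, and proposes a novel denoising training method to address the problem. Based on this analysis, the authors integrate denoising training into DAB-DETR and propose DN-DETR. DN-DETR specifies the decoder embedding as a label embedding and introduces denoising training for both boxes and labels. The authors also add denoising training to Deformable DETR to show its generality. The results show that denoising training significantly accelerates convergence and improves performance. The study shows that denoising training can be easily integrated into DETR-like models as a general training method.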

Comparison of the cross-attention part of DAB-DETR and DN-DETR \(Fig.1^{[1]}\): (a) DAB-DETR directly uses dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h) to improve the cross-attention computation. (b) DN-DETR specifies the decoder embeddings as label embeddings and adds an indicator to distinguish the denoising task from the matching task.

Overview of the training method \(Fig.2^{[1]}\): there are two parts of queries, namely the denoising part and the matching part. The denoising part contains >= 1 denoising groups. The attention masks from the matching part to the denoising part and among denoising groups are set to 1 (block) to prevent information leakage. In the figure, the yellow, brown and green grids in the attention mask represent 0 (not blocked) and the grey grids represent 1 (blocked).
