Mr.DETR

Posted on 2025-05-07, updated on 2025-05-08, category: Papers

Mr.DETR: Instructive Multi-Route Training for Detection Transformers[1]

The authors are Chang-Bin Zhang et al. from Visual AI Lab, HKU, and Meituan. Citation [1]: Zhang, Chang-Bin et al. "Mr. DETR: Instructive Multi-Route Training for Detection Transformers." ArXiv abs/2412.10028 (2024).

Time

2025.Apr

Key Words

one-to-one, one-to-many assignments
Multi-route training
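One-sentence summary: To speed up the convergence of DETR-like models, several methods rely on auxiliary training. Here the authors propose multi-route training with three routes: route-1 uses an independent FFN for o2m, route-2 is the primary route for o2o, and route-3 improves the compatibility of queries across routes by using learnable queries as instructions and applying instructive self-attention. There is not much else to it.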

Summary

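Existing methods enhance detection transformers by introducing an auxiliary one-to-many assignment. In this work, the authors treat the model as a multi-task framework that performs one-to-one and one-to-many predictions simultaneously. Under these two training objectives, they study the role of each component in the Transformer decoder: self-attention, cross-attention, and the FFN. Their results show that any independent component in the decoder can effectively learn both targets at once, even when the other components are shared. This finding leads to a multi-route training paradigm with one primary route for one-to-one prediction and two auxiliary training routes for one-to-many prediction. The authors strengthen the training mechanism with a new instructive self-attention, which dynamically and flexibly guides object queries toward one-to-many prediction. The auxiliary routes are removed at inference, so the model architecture and inference cost are unaffected.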

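End-to-end DETR and its follow-up work stand out in object detection. Unlike traditional methods, DETR-based detectors eliminate the need for NMS and use a one-to-one assignment for supervised training, in which each gt box is matched to a single prediction. Traditional object detectors instead use a one-to-many assignment that lets each gt box match multiple predictions; by comparison, one-to-one assignment converges slowly because its supervision is sparse. To speed up the convergence of DETR-like detectors, many works propose auxiliary training methods that introduce an auxiliary one-to-many assignment or multiple groups of one-to-one assignments to improve the localization quality of predictions. In particular, DN-DETR, Group-DETR, and DINO use multiple parallel auxiliary queries that share the same transformer decoder with the primary object queries. DETA finds that self-attention is necessary when the model performs one-to-one prediction but unnecessary when it performs one-to-many prediction. Building on this observation, DAC-DETR and MS-DETR introduce one-to-many auxiliary training by explicitly constraining self-attention and cross-attention to perform one-to-one and one-to-many prediction, respectively. The authors call a model that performs one-to-one and one-to-many prediction simultaneously a multi-task framework. Previous work, however, examined the role of each component only in a single-task setting.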

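In this work, the authors build a multi-task framework and study the role of each decoder component. The experiments show that when all components are shared between the two tasks, introducing a one-to-many assignment significantly degrades the primary one-to-one prediction. The authors also find that making any single component independent significantly helps the primary one-to-one route. They further explore auxiliary routes with two independent components, which perform similarly to a single independent component but require more training parameters.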

This empirical finding motivates a multi-route training mechanism that combines multiple auxiliary training routes, each with one independent component. Specifically, the primary route handles one-to-one prediction, and each auxiliary route has an independent component for one-to-many prediction. The authors integrate auxiliary training into the primary route using an independent self-attention and an independent FFN, respectively.

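To further reduce the new training parameters introduced by the auxiliary routes and to increase parameter sharing across routes, the authors propose a new instructive self-attention mechanism. The method has three training routes: a primary route for one-to-one prediction and two auxiliary routes, one with instructive self-attention and one with an independent FFN. The first auxiliary route shares its self-attention parameters with the primary route but introduces learnable tokens, called instruction tokens, that are attached to the input object queries in self-attention. The object queries and instruction tokens go through self-attention together, so that the instruction tokens guide the queries toward one-to-many prediction. Since the FFN is a simple two-layer MLP, the authors directly use an independent FFN in the second auxiliary route. At inference, both auxiliary training routes are removed, so the model architecture and inference time stay the same as the baseline.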

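Summary: the authors make three contributions:
- In a multi-task framework, they show that any independent component in the decoder can effectively learn one-to-one and one-to-many targets at the same time, even when the other components are shared.
- Based on this insight, they propose a multi-route training mechanism and strengthen it with an instructive self-attention that dynamically and flexibly guides object queries toward one-to-many prediction.
- They run extensive experiments that demonstrate the improvements.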
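To speed up DETR training, several studies introduce auxiliary training methods. For example, H-DETR uses Hungarian Matching to build a one-to-many matching by duplicating targets. DN-DETR and DINO design multiple groups of denoising queries to accelerate detection transformers. Similarly, Group-DETR uses a group of learnable queries as auxiliary input to the transformer decoder, and DQ-DETR builds the primary query group by dynamically fusing auxiliary queries. StageInteractor introduces one-to-many matching by combining the label assignment results of different decoder layers. SQR-DETR builds auxiliary training by reusing object queries from previous decoder layers. Co-DETR generates diverse supervision signals by using multiple groups of queries with different matching strategies. DAC-DETR develops a parallel decoder that removes self-attention to learn one-to-many prediction. MS-DETR supervises the cross-attention output with one-to-many matching, while the self-attention is supervised with one-to-one matching.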
One-to-One Training Objective: With a one-to-one training objective, DETRs achieve end-to-end detection without NMS. Let \(\bar{\mathbf{B}} = \{ \bar{b}_0, \bar{b}_1, \dots, \bar{b}_t\}\) and \(\bar{\mathbf{S}} = \{ \bar{s}_0, \bar{s}_1, \dots, \bar{s}_t\}\) denote the gt boxes and their corresponding classes. The matching cost between every possible prediction and gt pair combines a cls cost and box costs. The optimal matching, denoted \(\sigma\), is determined by bipartite matching, and the one-to-one training objective is:

\[L = \sum_{i=0}^{t} \left[ L_{cls}(s_{\sigma(i)}, \bar{s}_i) + L_{box}(b_{\sigma(i)}, \bar{b}_i) \right],\]

\(L_{cls}\) and \(L_{box}\) denote the cls and bbox losses, respectively.
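The bipartite matching step can be illustrated with a small brute-force sketch. It is not the authors' implementation: the cost function (negative class probability plus L1 box distance) and the names `match_cost` and `one_to_one_match` are assumptions made for illustration, and real detectors use the Hungarian algorithm rather than enumerating permutations.

```python
from itertools import permutations


def l1_box_cost(box_a, box_b):
    """L1 distance between two boxes given as 4-element sequences."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred_probs, pred_box, gt_class, gt_box):
    # Assumed cost: negative class probability plus L1 box distance.
    return -pred_probs[gt_class] + l1_box_cost(pred_box, gt_box)


def one_to_one_match(pred_probs, pred_boxes, gt_classes, gt_boxes):
    """Return sigma as a list: sigma[i] is the prediction matched to gt i.

    Brute force over all injective assignments, so only usable for tiny
    inputs. Returns None when there are fewer predictions than gt boxes.
    """
    best_sigma, best_cost = None, float("inf")
    for sigma in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            match_cost(pred_probs[p], pred_boxes[p], gt_classes[i], gt_boxes[i])
            for i, p in enumerate(sigma)
        )
        if cost < best_cost:
            best_sigma, best_cost = list(sigma), cost
    return best_sigma


# Toy example: three predictions, two gt boxes.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
boxes = [[0.1, 0.1, 0.4, 0.4], [0.5, 0.5, 0.9, 0.9], [0.12, 0.1, 0.4, 0.42]]
sigma = one_to_one_match(probs, boxes, [0, 1], [boxes[0], boxes[1]])
print(sigma)  # [0, 1]
```

Each gt box receives exactly one prediction, and the third prediction stays unmatched even though it nearly overlaps the first gt box. This is the sparse supervision that the one-to-many routes are meant to compensate for.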
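One-to-many training objective: Traditional object detectors usually use a one-to-many assignment strategy that assigns each gt box to multiple predictions according to specific criteria and then removes duplicated predictions with NMS. In this work, the authors use a straightforward one-to-many assignment strategy that considers both localization quality and cls confidence, which makes it suitable for DETR-like detectors. Specifically, the matching score \(M_{ij}\) between a prediction \((s_i, b_i)\) and a gt \((\bar{s}_j, \bar{b}_j)\) is defined as: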
\[M_{ij} = \alpha \cdot s_i + (1 - \alpha) \cdot \text{IoU}(b_i, \bar{b}_j),\]

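IoU computes the intersection over union between the prediction box and the gt box. Positive predictions are determined by the maximum number of positive candidates \(K\) and an IoU threshold \(\tau\). First, at most \(K\) predictions with the highest matching scores \(M\) are selected as positive candidates; then, for each gt box, predictions whose IoU is below \(\tau\) are filtered out.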
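The selection rule can be sketched in a few lines of plain Python. This is an illustrative reading, not the authors' code: it assumes the top-\(K\) selection is done per gt box, that `pred_scores[i][j]` holds the confidence of prediction `i` for the class of gt `j`, and that boxes are `(x1, y1, x2, y2)`; the values of `alpha`, `k`, and `tau` in the example are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_to_many_assign(pred_scores, pred_boxes, gt_boxes, alpha, k, tau):
    """Return, for each gt box j, the indices of its positive predictions.

    A prediction may end up positive for several gt boxes; resolving such
    conflicts is outside the scope of this sketch.
    """
    positives = []
    for j, gt_box in enumerate(gt_boxes):
        ious = [iou(box, gt_box) for box in pred_boxes]
        # M_ij = alpha * s_i + (1 - alpha) * IoU(b_i, gt_j)
        scores = [
            alpha * pred_scores[i][j] + (1 - alpha) * ious[i]
            for i in range(len(pred_boxes))
        ]
        ranked = sorted(range(len(pred_boxes)), key=lambda i: scores[i], reverse=True)
        top_k = ranked[:k]
        positives.append([i for i in top_k if ious[i] >= tau])
    return positives


# Toy example: one gt box, four predictions.
gt = [(0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (1, 1, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
conf = [[0.9], [0.8], [0.9], [0.9]]
print(one_to_many_assign(conf, preds, gt, alpha=0.3, k=3, tau=0.5))  # [[0, 1]]
```

In the example, the third prediction makes it into the top 3 on matching score but is dropped by the IoU filter, so the single gt box ends up with two positive predictions instead of one.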
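Multi-route Training: The authors aim to introduce one-to-many assignment as an additional training strategy to enhance detection transformers. They first treat a detector with auxiliary one-to-many prediction as a multi-task framework that produces one-to-one and one-to-many predictions simultaneously. Studying the role of each decoder component, they find: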
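- When all components are shared between the two tasks, introducing a one-to-many assignment significantly degrades the primary one-to-one prediction.
- The authors attribute this to interference between the two tasks. For example, a predicted box may be labeled as a positive prediction under the one-to-many assignment but as a negative prediction under the one-to-one assignment.
- Making any single decoder component independent benefits the primary one-to-one prediction route, even when the other components are shared. These observations indicate that any independent component can effectively handle the goals of both one-to-one and one-to-many training, which resolves the conflict between the two tasks: the shared components extract feature cues common to both tasks, while the independent component separates their differing requirements.
- An auxiliary training route with two independent components does not outperform a route with a single independent component.
- Among the variants that combine different auxiliary training routes, combining an auxiliary route with independent self-attention and one with an independent FFN achieves the best performance.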
Based on these findings, the method contains three training routes that share the object queries and the classification and regression heads. Route-2 is the primary route for one-to-one prediction and is identical to the baseline model. Route-1 and route-3 are auxiliary training routes for one-to-many prediction; they are discarded at inference, so the auxiliary training routes do not affect the model structure.
Primary Route for One-to-one Prediction: The architecture and training objective of route-2 are the same as those of the baseline model. For route-2, given object queries \(\mathbf{Q} = \{ q_0, q_1, \dots, q_{n-1} \}\), the query output is defined as:

\[\hat{\mathbf{Q}}_2 = (\text{FFN}_{o2o} \circ \text{CA} \circ \text{SA})(\mathbf{Q}),\]

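SA, CA, and FFN denote self-attention, cross-attention, and the feed-forward network, respectively. The query output of route-2 is supervised by the one-to-one assignment. At inference, route-2 is kept to produce one-to-one predictions at no extra cost.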
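Auxiliary Route with Independent FFN: The authors integrate an auxiliary training route, called Route-1, that introduces an independent FFN, \(\text{FFN}_{o2m}\), while sharing all self-attention and cross-attention components with the primary route. The FFN is chosen for its straightforward architecture and efficient use of parameters. The query output of route-1 is supervised by the one-to-many assignment.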
Auxiliary Route with Instructive Self-Attention: To reduce the number of training parameters and increase parameter sharing with the primary route, the authors propose a new instruction mechanism that guides object queries toward one-to-many prediction. The query output of route-3 is:

\[\hat{\mathbf{Q}}_3 = (\text{FFN}_{o2o} \circ \text{CA} \circ \text{InstructSA})(\mathbf{Q}),\]

InstructSA denotes the proposed instructive self-attention, which shares parameters with the self-attention of the other two routes. The output \(\hat{\mathbf{Q}}_3\) is supervised by the one-to-many assignment.
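How the three routes compose one decoder layer can be sketched as follows. This is a schematic under stated assumptions, not the authors' code: the function `run_routes` and its argument names are hypothetical, each component is passed in as a plain callable, and the Route-1 composition \(\text{FFN}_{o2m} \circ \text{CA} \circ \text{SA}\) is inferred from the description of Route-1 rather than quoted from a formula.

```python
def run_routes(queries, sa, instruct_sa, ca, ffn_o2o, ffn_o2m, training=True):
    """Compose the three routes of a single decoder layer.

    sa, ca, ffn_o2o are the shared components of the primary route;
    ffn_o2m is the independent FFN of Route-1; instruct_sa stands for
    InstructSA, which reuses the parameters of sa.
    """
    shared = ca(sa(queries))  # SA and CA are shared by Route-1 and Route-2
    outputs = {"route2_o2o": ffn_o2o(shared)}  # primary route, always kept
    if training:
        # Auxiliary routes exist only during training.
        outputs["route1_o2m"] = ffn_o2m(shared)
        outputs["route3_o2m"] = ffn_o2o(ca(instruct_sa(queries)))
    return outputs


# Toy stand-ins for the real modules, operating on a list of floats.
toy = dict(
    sa=lambda q: [x + 1 for x in q],
    instruct_sa=lambda q: [x + 2 for x in q],
    ca=lambda q: [x * 2 for x in q],
    ffn_o2o=lambda q: [x - 1 for x in q],
    ffn_o2m=lambda q: [x - 3 for x in q],
)
train_out = run_routes([1.0, 2.0], training=True, **toy)
infer_out = run_routes([1.0, 2.0], training=False, **toy)
print(sorted(train_out))  # ['route1_o2m', 'route2_o2o', 'route3_o2m']
print(sorted(infer_out))  # ['route2_o2o']
```

The primary output is identical in both modes, which is the point of the design: dropping Route-1 and Route-3 at inference leaves the baseline computation untouched.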
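Instructive Self-Attention: Route-3 is implemented with an instructive self-attention that introduces learnable instruction tokens \(\mathbf{Q}^{\text{ins}}\) alongside the object queries to build a combined sequence \(\hat{\mathbf{Q}}\); self-attention is then performed on this combined sequence. A common alternative uses a separate set of queries as inputs to facilitate one-to-many prediction. To improve the compatibility of object queries across routes, the authors instead introduce learnable tokens that act as instructions explaining the prediction objective. As shown in Fig.3, the instruction tokens could be introduced into the shared object queries by addition, but that requires the number of instruction tokens to be fixed to the number of queries. Unlike addition, the authors' method applies the instruction tokens by concatenation, which offers flexibility: the number of instruction tokens can vary, and the learned tokens can pass information to the object queries dynamically through self-attention. The outputs corresponding to the instruction tokens are discarded after self-attention, since they are not used for object localization. In this way, the instruction tokens give route-3 effective and adaptable guidance, letting it produce one-to-many predictions while sharing parameters with the primary route.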
Specifically, the authors construct \(m\) learnable tokens \(\mathbf{Q}^{\text{ins}} = \{ q_0^{\text{ins}}, q_1^{\text{ins}}, \dots, q_{m-1}^{\text{ins}} \}\), called instruction tokens. Initially, these instruction tokens are attached to the input sequence by concatenation, forming composite input queries.
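The concatenate, attend, and discard pattern can be shown with a minimal pure-Python sketch. It makes several simplifying assumptions that are not from the paper: attention is single-head with identity query/key/value projections (a real decoder uses learned, multi-head projections shared with the primary route), the instruction tokens are appended after the object queries, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(logits)
        out.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def instructive_self_attention(queries, instruction_tokens):
    """Attend over [queries ; instruction tokens], keep only the query outputs."""
    combined = list(queries) + list(instruction_tokens)  # concatenation
    attended = self_attention(combined)
    return attended[: len(queries)]  # instruction outputs are discarded


queries = [[1.0, 0.0], [0.0, 1.0]]
instructions = [[1.0, 1.0]]  # m = 1; m need not equal the number of queries
out = instructive_self_attention(queries, instructions)
print(len(out))  # 2
```

Because the tokens are concatenated rather than added, the output always has one row per object query no matter how many instruction tokens are used, and with zero instruction tokens the function reduces to plain self-attention.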

\(Fig.1^{[1]}\) Different configurations of transformer decoder with auxiliary one-to-many training. SA: self-attention, CA: cross-attention, FFN: feed-forward network. o2o: one-to-one, o2m: one-to-many.

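\(Fig.2^{[1]}\) The framework contains three routes. All three share the same object queries and the detection heads for cls and regression. Route-2 is the primary route for one-to-one prediction and is identical to the baseline model. Route-1 shares SA and CA but uses an independent FFN for one-to-many prediction. Route-3 shares all components with the primary route and introduces a new instructive SA, which attaches learnable instruction tokens to the object queries to guide them and the subsequent network toward one-to-many prediction. At inference, route-1 and route-3 are discarded.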

\(Fig.3^{[1]}\) Various implementations of instructive self-attention.
