D-FINE

Posted 2025-03-28 · Updated 2025-03-29 · Category: Papers

D-FINE: Redefine Regression Task in DETRs as Fine-Grained Distribution Refinement^[1]

The authors are Yansong Peng, Hebei Li, and colleagues from USTC and other institutions. The paper is cited as [1].

Time

2024.Oct

Key Words

iteratively refining probability distributions, fine-grained intermediate representation
transfers localization knowledge from refined distributions to shallower layers through self-distillation

Summary

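The authors' D-FINE is a real-time object detector that achieves strong localization by redefining the bounding box regression task in DETR models. D-FINE has two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR turns the regression process from predicting fixed coordinates into iteratively refining probability distributions, which provides a fine-grained intermediate representation and improves localization accuracy. GO-LSD is a bidirectional optimization strategy: it transfers localization knowledge from the refined distributions to shallower layers through self-distillation, and in turn simplifies the residual prediction tasks for deeper layers. In addition, D-FINE applies lightweight optimizations to computationally intensive modules and operations, balancing speed and accuracy.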

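Demand for real-time detection keeps growing. Thanks to their transformer architecture, DETRs offer several advantages: global context modeling, direct set prediction, and no need for NMS or anchor boxes. However, they are often held back by high latency and computational cost. RT-DETR addresses these limitations, and LW-DETR shows that DETRs can reach higher performance ceilings than YOLO. Despite this progress, several unresolved issues still limit detector performance. One key challenge is the formulation of bbox regression. Most modern detectors predict bboxes by regressing fixed coordinates, treating edges as precise values modeled by Dirac delta distributions, and such methods cannot model localization uncertainty. As a result, models are limited to L1 loss and IoU loss, which give insufficient guidance for adjusting each edge. This makes optimization sensitive to small coordinate changes, leading to slow convergence and suboptimal performance. Methods such as GFocal address uncertainty with probability distributions, but they remain limited by anchor dependency, coarse localization, and a lack of iterative refinement. Another challenge is maximizing the efficiency of real-time detectors, which are constrained by limited computation and parameter budgets. Knowledge Distillation (KD) is a promising solution: it transfers knowledge from larger teachers to smaller students to improve performance without increasing cost. However, traditional KD methods such as logit mimicking and feature imitation have proven inefficient for detection tasks and may even cause performance drops. In contrast, localization distillation (LD) has shown better results for detection, yet integrating LD is also challenging because of its substantial training overhead and its incompatibility with anchor-free detectors.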
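To address these problems, the authors propose D-FINE, a new detector that redefines bbox regression and introduces an effective self-distillation strategy. The approach tackles the optimization difficulty of fixed-coordinate regression, the inability to model localization uncertainty, and the need for effective distillation at low training cost. The authors introduce Fine-grained Distribution Refinement (FDR), which turns bbox regression from predicting fixed coordinates into modeling probability distributions, providing a more fine-grained intermediate representation. FDR refines these distributions iteratively in a residual manner, allowing progressively finer adjustments and improving localization precision. Recognizing that deeper layers produce more accurate predictions by capturing richer localization information within their probability distributions, the authors introduce Global Optimal Localization Self-Distillation (GO-LSD). GO-LSD transfers localization knowledge from deeper layers to shallower layers at negligible training cost. By aligning the shallower layers' predictions with the refined outputs of later layers, the model learns to produce better adjustments, which speeds up convergence and improves performance. The authors also streamline computationally intensive modules and operations, making D-FINE faster and lighter. These modifications reduce performance, but FDR and GO-LSD effectively mitigate the degradation, balancing speed and accuracy.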
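YOLO-style methods rely on NMS, which introduces latency and instability. DETR removes hand-crafted components such as NMS and anchors. Traditional DETRs achieve strong performance but are computationally demanding, which makes them unsuitable for real-time applications. More recently, RT-DETR and LW-DETR have adapted DETR for real-time use, and YOLOv10 has removed NMS.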
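Distribution-based object detection: Traditional bbox regression relies on Dirac delta distributions, treating bbox edges as precise and fixed, which makes modeling localization uncertainty challenging. To address this, recent models use Gaussian or discrete distributions to represent bboxes, improving the modeling of uncertainty. However, these methods all rely on anchor-based frameworks, which limits their compatibility with anchor-free detectors. In addition, their distribution representations are usually formulated in a coarse-grained manner and lack effective refinement, which prevents them from reaching more accurate predictions.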
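Knowledge Distillation: Knowledge Distillation (KD) is a powerful model compression technique. Traditional KD focuses on transferring knowledge through logit mimicking. FitNets proposed feature imitation, which inspired a series of follow-up works. Most DETR-based approaches use hybrid distillation of both logits and various intermediate representations. Recently, localization distillation showed that transferring localization knowledge is more effective for detection tasks. Self-distillation is a special case of KD in which earlier layers learn from the model's own refined outputs. It needs very little extra training cost because no separate teacher model has to be trained.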
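Traditional bbox regression relies on modeling Dirac delta distributions, using either the centroid-based form \(\{x, y, w, h\}\) or the edge-distance form \(\{c, d\}\), where the distances \(d = \{t, b, l, r\}\) are measured from the anchor point \(c = \{x_c, y_c\}\). However, the Dirac delta assumption treats bbox edges as precise and fixed, which makes it hard to model localization uncertainty, especially in ambiguous cases. This rigid representation not only limits optimization but also leads to significant localization errors under small prediction shifts. To address this, GFocal regresses the distances from anchor points to the four edges using discretized probability distributions, giving a more flexible model of the bbox. In practice, the bbox distances \(d = \{t, b, l, r\}\) are modeled as: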

\[d = d_{\max} \sum_{n=0}^{N} \frac{n}{N} P(n)\]

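\(d_{\max}\) is a scalar that limits the maximum distance from the anchor center, and \(P(n)\) denotes the probability of each candidate distance for the four edges. Although GFocal makes progress on ambiguity and uncertainty by modeling probability distributions, its regression approach still has specific challenges: (1) Anchor dependency: regression is tied to the anchor box center, which limits prediction diversity and reduces compatibility with anchor-free frameworks. (2) No iterative refinement: predictions are made in a single regression pass, so regression robustness cannot be improved through iterative refinement. (3) Coarse localization: fixed distance ranges and uniform bin intervals lead to coarse localization, especially for small objects, because each bin represents a wide range of possible values.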
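As a minimal sketch of the GFocal expectation above, the distance for one edge can be computed from its bin probabilities. The function name `gfocal_distance` and the example numbers are illustrative assumptions, not taken from the paper.

```python
def gfocal_distance(probs, d_max):
    """Expected edge distance d = d_max * sum_n (n / N) * P(n) for one edge.

    probs: probabilities P(0), ..., P(N) over the N + 1 uniform bins.
    d_max: the largest distance the head can represent.
    """
    N = len(probs) - 1  # bins are indexed 0..N
    return d_max * sum((n / N) * p for n, p in enumerate(probs))


# Hypothetical example: d_max = 16 and N = 4, so neighboring bins are 4 units apart.
probs = [0.0, 0.0, 0.5, 0.5, 0.0]
print(gfocal_distance(probs, d_max=16.0))  # 10.0
```

With only five uniform bins, any distance between 8 and 12 has to be expressed by splitting probability between two bins that are 4 units apart, which illustrates the coarse-localization issue in point (3).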
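Localization Distillation (LD) is a promising approach that shows transferring localization knowledge is effective for detection tasks. Built on GFocal, it strengthens student models by distilling valuable localization knowledge from a teacher model, rather than simply mimicking classification logits or feature maps. Despite these advantages, the method still relies on anchor-based architectures and incurs additional training cost.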
The authors propose D-FINE, a real-time object detector that addresses the shortcomings of existing bbox regression through two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
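- FDR iteratively optimizes probability distributions that act as corrections to the bbox predictions, providing a more fine-grained intermediate representation. This approach captures and optimizes the uncertainty of each edge. By using a non-uniform weighting function, FDR allows more precise and incremental adjustments at each decoder layer, improving localization accuracy and reducing prediction errors. FDR operates within an anchor-free, end-to-end framework, which makes the optimization process more flexible and robust.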
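- GO-LSD distills localization knowledge from the refined distributions into shallower layers. During training, the final layer produces precise soft labels. Through GO-LSD, shallow layers align their predictions with these labels and produce more accurate results. This mutual reinforcement creates a synergistic effect that progressively improves localization accuracy.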
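To further improve D-FINE's efficiency, the authors streamline computationally intensive modules and operations, making D-FINE faster and lighter. Although these modifications reduce performance, FDR and GO-LSD effectively mitigate the degradation.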
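Fine-grained Distribution Refinement: FDR iteratively optimizes a fine-grained distribution across the decoder layers. Initially, the first decoder layer predicts preliminary bboxes and preliminary probability distributions through a traditional bbox regression head and a D-FINE head. Each bbox is associated with four distributions, one for each edge. The initial bboxes serve as reference bboxes, and subsequent layers refine them by adjusting the distributions in a residual manner. The refined distributions are then used to adjust the four edges of the corresponding initial bbox, progressively improving its accuracy with each iteration.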

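Mathematically, let \(\mathbf{b}^0 = \{x, y, W, H\}\) denote the initial bbox prediction, where \(\{x, y\}\) is the predicted center of the bbox and \(\{W, H\}\) are its width and height. \(\mathbf{b}^0\) is converted into the center coordinates \(\mathbf{c}^0 = \{x, y\}\) and the edge distances \(\mathbf{d}^0 = \{t, b, l, r\}\), the distances from the center to the four edges. For the \(l\)-th layer, the refined edge distances \(\mathbf{d}^l = \{t^l, b^l, l^l, r^l\}\) are computed as: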

\[\mathbf{d}^l = \mathbf{d}^0 + \{H, H, W, W\} \cdot \sum_{n=0}^{N} W(n) \operatorname{Pr}^l(n), \quad l \in \{1, 2, \ldots, L\}\]

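\(\operatorname{Pr}^l(n) = \{\operatorname{Pr}_t^l(n), \operatorname{Pr}_b^l(n), \operatorname{Pr}_l^l(n), \operatorname{Pr}_r^l(n)\}\) denotes four separate distributions, one for each edge. Each distribution predicts the probability of the candidate offset values for the corresponding edge. These candidates are determined by the weighting function \(W(n)\), where \(n\) indexes the discrete bins and each bin corresponds to a potential edge offset. The weighted sum of the distributions produces the edge offsets. These offsets are then scaled by the height \(H\) and width \(W\) of the initial bbox, so that the adjustments are proportional to the box size. The refined distributions are updated using residual adjustments: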

\[\operatorname{Pr}^l(n) = \operatorname{Softmax}\left(\text{logits}^l(n)\right) = \operatorname{Softmax}\left(\Delta\text{logits}^l(n) + \text{logits}^{l-1}(n)\right),\]

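where the previous layer's logits \(\text{logits}^{l-1}(n)\) reflect the confidence in each bin's offset value for the four edges. The current layer predicts residual logits \(\Delta\text{logits}^l(n)\), which are added to the previous logits to form the updated logits \(\text{logits}^l(n)\). The updated logits are then normalized with softmax to give the refined probability distributions.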
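The two update equations above can be combined into one refinement step per decoder layer. The following is a minimal pure-Python sketch for a single box, not the authors' implementation: the names `softmax` and `fdr_layer_step` are hypothetical, and the candidate offsets \(W(n)\) are passed in as a plain list `bin_values` so the sketch does not depend on any particular weighting function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one edge's bin logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fdr_layer_step(d0, width, height, prev_logits, delta_logits, bin_values):
    """One refinement step for a single box.

    d0: initial edge distances (t, b, l, r).
    prev_logits, delta_logits: four lists (one per edge) with one entry per bin.
    bin_values: candidate offsets W(n), one per bin (assumed given).
    Returns the updated logits and the refined distances d^l.
    """
    scales = (height, height, width, width)  # {H, H, W, W}
    new_logits, refined = [], []
    for d, scale, prev, delta in zip(d0, scales, prev_logits, delta_logits):
        logits = [p + r for p, r in zip(prev, delta)]  # residual logit update
        probs = softmax(logits)                        # Pr^l(n)
        offset = sum(w * q for w, q in zip(bin_values, probs))
        new_logits.append(logits)
        refined.append(d + scale * offset)             # d^0 plus scaled offset
    return new_logits, tuple(refined)


# Hypothetical example: three bins and a box with W = 4, H = 6.
bin_values = [-0.5, 0.0, 0.5]
d0 = (3.0, 3.0, 2.0, 2.0)
prev = [[0.0, 0.0, 0.0] for _ in range(4)]
delta = [
    [0.0, 0.0, 0.0],       # top: no preference, offset stays at 0
    [-20.0, -20.0, 20.0],  # bottom: strongly prefers the largest offset
    [20.0, -20.0, -20.0],  # left: strongly prefers the smallest offset
    [0.0, 0.0, 0.0],       # right: no preference
]
_, d1 = fdr_layer_step(d0, 4.0, 6.0, prev, delta, bin_values)
print(d1)  # roughly (3.0, 6.0, 0.0, 2.0)
```

Note that the refinement is always applied to the initial distances \(\mathbf{d}^0\); what accumulates from layer to layer is the logits, not the box.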

To enable precise and flexible adjustments, the weighting function \(W(n)\) is defined as:

\[W(n) = \begin{cases} 2 \cdot W(1) = -2a & n = 0 \\ c - c \left(\frac{a}{c} + 1\right)^{\frac{N-2n}{N-2}} & 1 \leq n < \frac{N}{2} \\ -c + c \left(\frac{a}{c} + 1\right)^{\frac{-N+2n}{N-2}} & \frac{N}{2} \leq n \leq N-1 \\ 2 \cdot W(N-1) = 2a & n = N \end{cases}\]

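\(a\) and \(c\) are hyperparameters that control the upper bounds and the curvature of the function. The shape of \(W(n)\) ensures that when the bbox prediction is close to correct, the small curvature of \(W(n)\) allows finer adjustments. Conversely, when the bbox prediction is far from correct, the larger curvature near the edges and the sharp changes at the boundaries of \(W(n)\) give enough flexibility for substantial corrections.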
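To make the shape of \(W(n)\) concrete, here is a small sketch that evaluates the piecewise definition directly. The function name `weighting_function` and the values \(N = 32\), \(a = 0.5\), \(c = 0.25\) are illustrative assumptions chosen for the example.

```python
def weighting_function(n, N, a, c):
    """Evaluate the non-uniform weighting function W(n) for n in 0..N."""
    if n == 0:
        return -2 * a  # 2 * W(1)
    if n == N:
        return 2 * a   # 2 * W(N - 1)
    base = a / c + 1
    if n < N / 2:
        return c - c * base ** ((N - 2 * n) / (N - 2))
    return -c + c * base ** ((-N + 2 * n) / (N - 2))


# Illustrative hyperparameters (assumptions for this example).
N, a, c = 32, 0.5, 0.25
values = [weighting_function(n, N, a, c) for n in range(N + 1)]

# Spacing between neighboring candidate offsets: small near the middle bin
# (fine adjustments), larger toward the ends, with a jump at the two boundary bins.
print(values[N // 2 + 1] - values[N // 2])  # step next to the middle bin
print(values[N - 1] - values[N - 2])        # step near the upper end
print(values[N] - values[N - 1])            # boundary jump, equal to a
```

The middle bin maps to an offset of zero, so a distribution concentrated there leaves the edge unchanged, and the candidates are symmetric around it.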

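To further improve the accuracy of the distribution predictions and align them with the ground-truth (GT) values, the authors propose a new loss function inspired by Distribution Focal Loss, the Fine-Grained Localization (FGL) Loss, computed as: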

\[\mathcal{L}_{\text{FGL}} = \sum_{l=1}^{L} \left( \sum_{k=1}^{K} \text{IoU}_k \left( \omega_{\gets} \cdot \text{CE}\left(\operatorname{Pr}^l(n)_k, n_{\gets}\right) + \omega_{\to} \cdot \text{CE}\left(\operatorname{Pr}^l(n)_k, n_{\to}\right) \right) \right)\]

\[ \omega_{\gets} = \frac{|\phi - W(n_{\to})|}{|W(n_{\gets}) - W(n_{\to})|}, \quad \omega_{\to} = \frac{|\phi - W(n_{\gets})|}{|W(n_{\gets}) - W(n_{\to})|}\]

\(\operatorname{Pr}^l(n)_k\) denotes the probability distributions for the \(k\)-th prediction. \(\phi\) is the relative offset, computed as \(\phi = (\mathbf{d}^{\text{GT}} - \mathbf{d}^0) / \{H, H, W, W\}\), where \(\mathbf{d}^{\text{GT}}\) is the ground-truth edge distance, and \(n_{\gets}, n_{\to}\) are the bin indices adjacent to \(\phi\). The cross-entropy (CE) losses weighted by \(\omega_{\gets}\) and \(\omega_{\to}\) ensure that the interpolation between bins aligns exactly with the GT offset. By adding IoU-based weighting, the FGL loss encourages distributions with lower uncertainty to become more concentrated, leading to more precise and reliable bbox regression.
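The interpolation weights and the per-edge loss term can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the helper names `fgl_targets` and `fgl_term` are hypothetical, the candidate offsets \(W(n)\) are passed in as an increasing list `bin_values`, and offsets outside the representable range are clamped, which is a choice made only for this sketch.

```python
import bisect
import math


def fgl_targets(phi, bin_values):
    """Return (n_left, n_right, w_left, w_right) for a relative offset phi.

    bin_values holds the candidate offsets W(n) in increasing order.
    """
    # Assumption: clamp phi into the range covered by the bins.
    phi = min(max(phi, bin_values[0]), bin_values[-1])
    n_right = bisect.bisect_right(bin_values, phi)
    n_right = min(max(n_right, 1), len(bin_values) - 1)
    n_left = n_right - 1
    span = abs(bin_values[n_left] - bin_values[n_right])
    w_left = abs(phi - bin_values[n_right]) / span
    w_right = abs(phi - bin_values[n_left]) / span
    return n_left, n_right, w_left, w_right


def fgl_term(probs, phi, bin_values, iou, eps=1e-12):
    """IoU-weighted FGL term for one edge of one prediction at one layer."""
    n_left, n_right, w_left, w_right = fgl_targets(phi, bin_values)
    ce_left = -math.log(max(probs[n_left], eps))    # CE against bin n_left
    ce_right = -math.log(max(probs[n_right], eps))  # CE against bin n_right
    return iou * (w_left * ce_left + w_right * ce_right)


bins = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical candidate offsets
print(fgl_targets(0.125, bins))     # (2, 3, 0.75, 0.25)
```

In the example, \(0.75 \cdot 0 + 0.25 \cdot 0.5 = 0.125\), so the two weights reproduce the GT offset exactly when used to interpolate between the neighboring bins.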
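Global Optimal Localization Self-Distillation (GO-LSD) uses the final layer's refined distribution predictions to distill localization knowledge into the shallower layers. The process begins by applying Hungarian matching to the predictions of every layer, identifying the local bbox matches at each stage of the model. To perform a global optimization, GO-LSD aggregates the matching indices from all layers into a unified union set. This union set combines the most accurate candidate predictions across layers, ensuring that they all benefit from the distillation process. Besides refining the global matches, GO-LSD also optimizes unmatched predictions during training to improve overall stability, which improves overall performance. Although localization is optimized through this union set, the classification task still follows a one-to-one matching principle, ensuring there are no redundant boxes. This strict matching means that some predictions in the union set are well localized but have low confidence scores. These low-confidence predictions often represent candidates with precise localization, and they still need to be distilled effectively. To address this, the authors introduce the Decoupled Distillation Focal (DDF) Loss, which uses decoupled weighting strategies to ensure that high-IoU but low-confidence predictions receive appropriate weight. The DDF loss also weights matched and unmatched predictions according to their quantity, balancing their overall contribution against the individual losses. This leads to more stable and effective distillation. The Decoupled Distillation Focal Loss \(\mathcal{L}_{\text{DDF}}\) is computed as: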

\[\mathcal{L}_{\text{DDF}} = T^2 \sum_{l=1}^{L-1} \left( \sum_{k=1}^{K_m} \alpha_k \cdot \text{KL}\left(\operatorname{Pr}^l(n)_k, \operatorname{Pr}^L(n)_k\right) + \sum_{k=1}^{K_u} \beta_k \cdot \text{KL}\left(\operatorname{Pr}^l(n)_k, \operatorname{Pr}^L(n)_k\right) \right)\]

\[\alpha_k = \text{IoU}_k \cdot \frac{\sqrt{K_m}}{\sqrt{K_m} + \sqrt{K_u}}, \quad \beta_k = \text{Conf}_k \cdot \frac{\sqrt{K_u}}{\sqrt{K_m} + \sqrt{K_u}}\]

KL denotes the Kullback-Leibler divergence, and \(T\) is the temperature parameter used to smooth the logits. The distillation loss of the \(k\)-th matched prediction is weighted by \(\alpha_k\), where \(K_m\) and \(K_u\) are the numbers of matched and unmatched predictions. For the \(k\)-th unmatched prediction the weight is \(\beta_k\), and \(\text{Conf}_k\) denotes the classification confidence.
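The bookkeeping behind GO-LSD can be sketched in a few lines: merge the per-layer matches into a union set, then weight the distillation terms with \(\alpha_k\) and \(\beta_k\). This is a simplified illustration, not the authors' implementation. The Hungarian matching results and the per-prediction KL values are assumed to be computed elsewhere and passed in, the function names are hypothetical, and the sketch ignores the case where one query is matched to different GT boxes in different layers.

```python
import math


def union_of_matches(per_layer_matches, num_queries):
    """Merge per-layer (query_index, gt_index) matches into a union set.

    Returns the sorted query indices matched in at least one layer and the
    sorted indices that were never matched.
    """
    matched = set()
    for matches in per_layer_matches:
        matched.update(query for query, _gt in matches)
    unmatched = set(range(num_queries)) - matched
    return sorted(matched), sorted(unmatched)


def ddf_weights(ious, confs):
    """alpha_k for matched predictions (IoU-based), beta_k for unmatched (confidence-based)."""
    k_m, k_u = len(ious), len(confs)
    denom = math.sqrt(k_m) + math.sqrt(k_u)
    if denom == 0:
        return [], []
    alphas = [iou * math.sqrt(k_m) / denom for iou in ious]
    betas = [conf * math.sqrt(k_u) / denom for conf in confs]
    return alphas, betas


def ddf_layer_loss(kl_matched, kl_unmatched, ious, confs, temperature):
    """DDF contribution of one shallow layer, given precomputed KL values."""
    alphas, betas = ddf_weights(ious, confs)
    matched_term = sum(a * kl for a, kl in zip(alphas, kl_matched))
    unmatched_term = sum(b * kl for b, kl in zip(betas, kl_unmatched))
    return temperature ** 2 * (matched_term + unmatched_term)


# Hypothetical matches from two decoder layers with 6 queries.
layers = [[(0, 0), (3, 1)], [(0, 0), (5, 1)]]
print(union_of_matches(layers, num_queries=6))  # ([0, 3, 5], [1, 2, 4])
```

The square-root terms keep a large pool of unmatched predictions from dominating the loss purely by count, while IoU and confidence decide how much each individual prediction contributes.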

Overview of D-FINE \(Fig.1^{[1]}\): The probability distribution serves as a more fine-grained intermediate representation that the decoder layers refine iteratively in a residual manner, with non-uniform weighting functions enabling better localization.

Overview of GO-LSD \(Fig.2^{[1]}\): Localization knowledge from the final layer's refined distributions is distilled into shallower layers through the DDF loss with decoupled weighting strategies.

D-FINE: Redefine Regression Task in DETRs as Fine-Grained Distribution Refinement^[1]