DINO Detection

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection[1]

The authors are Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum, from HKUST (Guangzhou), HKUST, Tsinghua University, and IDEA Research. Citation [1]: Zhang, Hao et al. “DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.” ArXiv abs/2203.03605 (2022): n. pag.

Time

  • 2022.Mar

Key Words

  • DETR
  • DeNoising Anchor
  • mixed query

Motivation

  1. DETR converges slowly during training, and the meaning of its queries is unclear. Earlier DETR-like models also performed worse than classical detectors.
  2. The scalability of DETR-like models had not been studied: there was no reported result on how they perform when scaled to a large backbone and a large-scale dataset.

Summary

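  1. Unlike earlier detectors, DETR treats object detection as a set prediction task and assigns labels by bipartite graph matching. It uses learnable queries to probe for the existence of objects and combines them with features from the image feature map, which behaves like soft ROI pooling. Despite its good performance, DETR converges slowly in training and the meaning of its queries is unclear. Many methods have been proposed to address this: introducing deformable attention, decoupling positional and content information, and providing spatial priors. DAB-DETR formulates DETR queries as dynamic anchor boxes (DAB), which bridges the gap between classical anchor-based detectors and DETR-like ones. DN-DETR introduces denoising to address the instability of bipartite matching. Combining DAB and DN makes DETR-like models competitive with classical detectors in both training efficiency and inference performance. The best detectors at the time were based on DyHead and HTC.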
  2. The authors' improvements: following DAB-DETR, DINO formulates the decoder queries as dynamic anchor boxes and refines them step by step across decoder layers. Following DN-DETR, it feeds ground-truth labels and boxes with added noise into the Transformer decoder to help stabilize bipartite matching during training. It also adopts deformable attention for computational efficiency. On top of these, it proposes several new methods:
    • To improve one-to-one matching, contrastive denoising training adds both positive and negative samples of the same ground truth at the same time. After adding two different noises to the same ground-truth box, the box with the smaller noise is marked as positive and the other as negative. Contrastive denoising training helps the model avoid duplicate outputs for the same target.
    • The dynamic anchor box formulation of queries links DETR-like models with classical two-stage models. Building on this, the authors propose mixed query selection, which helps initialize the queries better.
    • To leverage refined box information from later layers to help optimize the parameters of their adjacent early layers, the authors propose look forward twice, which corrects the updated parameters with gradients from later layers.
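
Look forward twice can be written out roughly as follows. This is a sketch in the notation of the DINO paper as I understand it; the symbols do not appear elsewhere in these notes, so treat them as assumptions. Here b_{i-1} is the (detached) anchor box entering decoder layer i, b'_{i-1} is its undetached version, and Update refines a box by a predicted offset.

```latex
\begin{aligned}
\Delta b_i &= \mathrm{Layer}_i(b_{i-1}), &
b'_i &= \mathrm{Update}(b_{i-1}, \Delta b_i), \\
b_i &= \mathrm{Detach}(b'_i), &
b_i^{(\mathrm{pred})} &= \mathrm{Update}(b'_{i-1}, \Delta b_i).
\end{aligned}
```

Because the prediction of layer i is built from the undetached b'_{i-1}, its loss sends gradients into both layer i and layer i-1. In the "look forward once" scheme used by Deformable DETR, the prediction is simply b'_i, so the Detach blocks any gradient from reaching layer i-1.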
CDN
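
The contrastive denoising idea from the summary can be sketched as below. This is a simplified illustration, not the authors' implementation: the helper name `make_cdn_queries`, the thresholds `lambda1` and `lambda2`, their default values, and the per-coordinate jitter are all assumptions made for this example. The only property it is meant to show is that the positive query carries smaller noise than the negative one, and that the negative one is trained toward "no object".

```python
import random


def make_cdn_queries(gt_box, lambda1=0.4, lambda2=1.0, rng=None):
    """Build one positive and one negative noised query for a ground-truth box.

    gt_box is (cx, cy, w, h) in normalized [0, 1] coordinates.
    lambda1 < lambda2 are hypothetical noise thresholds (assumed values).
    """
    rng = rng or random.Random()
    cx, cy, w, h = gt_box

    def noised(low, high):
        # Move each coordinate by a random fraction in [low, high] of half
        # the box extent, with a random sign, then clamp to [0, 1].
        def jitter(value, extent):
            magnitude = rng.uniform(low, high)
            sign = rng.choice((-1.0, 1.0))
            return value + sign * magnitude * extent / 2.0

        box = (jitter(cx, w), jitter(cy, h), jitter(w, w), jitter(h, h))
        return tuple(min(max(v, 0.0), 1.0) for v in box)

    # Positive: small noise, the decoder should reconstruct the ground truth.
    positive = {"box": noised(0.0, lambda1), "target": gt_box}
    # Negative: larger noise, the decoder should predict "no object" (None).
    negative = {"box": noised(lambda1, lambda2), "target": None}
    return positive, negative


if __name__ == "__main__":
    pos, neg = make_cdn_queries((0.5, 0.5, 0.2, 0.2), rng=random.Random(0))
    print("positive:", pos)
    print("negative:", neg)
```

Both queries sit near the same ground-truth box, so the model has to learn to keep the closer one and reject the farther one, which is the behavior the notes credit with reducing duplicate outputs.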
Framework
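
Mixed query selection can be illustrated with a small sketch. It reflects my reading of the paper: the top-K encoder tokens, ranked by classification score, supply the initial anchor boxes (positional queries), while the content queries stay as learned embeddings. The function name, argument names, and plain-list data layout are hypothetical and chosen only for this example.

```python
def mixed_query_selection(class_scores, proposal_boxes, learned_content_queries):
    """Select initial anchors from encoder outputs; keep content queries learned.

    class_scores[i]   : foreground score of encoder token i (assumed given)
    proposal_boxes[i] : box (cx, cy, w, h) predicted from encoder token i
    learned_content_queries : K static embeddings, one per decoder query
    """
    k = len(learned_content_queries)
    if k > len(class_scores):
        raise ValueError("need at least as many encoder tokens as queries")

    # Rank encoder tokens by score and keep the indices of the K best.
    ranked = sorted(range(len(class_scores)),
                    key=lambda i: class_scores[i], reverse=True)
    top_indices = ranked[:k]

    # Positional part: anchors come from the selected encoder proposals.
    anchor_boxes = [proposal_boxes[i] for i in top_indices]
    # Content part: unchanged learned embeddings, not taken from the encoder.
    content_queries = list(learned_content_queries)
    return top_indices, anchor_boxes, content_queries


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.4, 0.7]
    boxes = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3),
             (0.8, 0.2, 0.1, 0.1), (0.3, 0.7, 0.2, 0.4)]
    learned = [[0.0, 0.0], [1.0, 1.0]]  # two toy content embeddings
    print(mixed_query_selection(scores, boxes, learned))
```

Only the positional half of each query depends on the image here. The content half stays static, which is what makes the selection "mixed".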