Grounding DINO

Posted on 2024-04-20, updated on 2024-11-24, filed under Papers

Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection^[1]

The authors are Shilong Liu, Zhaoyang Zeng, Tianhe Ren, and colleagues from Tsinghua University, HKUST, CUHKsz, IDEA Research, and MSRR. Paper reference [1]:

Time

2023.Mar

Key Words

Extend DINO by performing vision-language modality fusion at multiple phases: a feature enhancer, language-guided query selection, and a cross-modality decoder.
Extend the evaluation of open-set object detection to REC datasets.

Summary

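In this paper, the authors present an open-set object detector called Grounding DINO. By marrying the Transformer-based detector DINO with grounded pre-training, it can detect arbitrary objects specified by human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization. To fuse the language and vision modalities effectively, the authors conceptually divide a closed-set detector into three phases and propose a tight fusion solution consisting of a feature enhancer, a language-guided query selection, and a cross-modality decoder. Previous work mainly evaluates open-set object detection on novel categories; the authors additionally propose evaluating on referring expression comprehension (REC), for objects specified with attributes. Grounding DINO performs well in all three settings, including benchmarks on COCO, LVIS, and ODinW.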

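Understanding novel concepts is a fundamental capability of visual intelligence. This work aims to develop a strong system that detects arbitrary objects specified by human language inputs, a task the authors call open-set object detection. The task has a wide range of applications.
The key to open-set detection is introducing language so that the detector generalizes to unseen objects. For example, GLIP reformulates object detection as a phrase grounding task and introduces contrastive training between object regions and language phrases. It shows great flexibility on heterogeneous datasets. Despite its strong results, GLIP's performance is constrained because it is built on a traditional one-stage detector, Dynamic Head. Since open-set and closed-set detection are closely related, the authors believe that a stronger closed-set detector results in a better open-set detector.
Motivated by progress in Transformer-based detectors, the authors propose a stronger open-set detector built on DINO. It not only offers better detection performance but also integrates multi-level text information through grounded pre-training. The model is named Grounding DINO. It has several advantages over GLIP: its Transformer architecture resembles language models, which makes it easy to process both image and language data; Transformer-based detectors have shown a strong ability to leverage large-scale datasets; and DINO can be optimized end to end without any hand-crafted modules, which simplifies the design of the grounding model.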
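Existing open-set detectors extend closed-set detectors to open-set scenarios by using language information. A closed-set detector typically has three important modules: a backbone for feature extraction, a neck for feature enhancement, and a head for region refinement. A closed-set detector can be generalized to detect novel objects by learning language-aware region embeddings, so that each region can be classified into novel categories in a language-aware semantic space. The key to this is a contrastive loss between region outputs and language features at the neck or head outputs. To help the model align cross-modality information, some work fuses features before the final loss stage. Feature fusion can be performed in three phases: the neck, query initialization, and the head.
The authors argue that more feature fusion in the pipeline makes the model perform better. In open-set detection, the model is given an image and a text input that specifies target categories or a specific object. In this setting, a tight (early) fusion model is better placed to achieve strong performance. Although the idea is conceptually simple, it was hard for previous work to perform feature fusion in all three phases. The design of classical detectors such as Faster R-CNN makes it hard for them to interact with language information. Unlike classical detectors, the Transformer-based detector DINO has a structure consistent with language blocks, and this layer-by-layer design lets it interact with language information easily. Under this principle, the authors design three feature fusion approaches, in the neck, query initialization, and head phases. More specifically, they design a feature enhancer as the neck by stacking self-attention, text-to-image cross-attention, and image-to-text cross-attention. They then develop a language-guided query selection method to initialize the queries for the head. Finally, they design a cross-modality decoder for the head, with image and text cross-attention layers, to boost query representations. These three fusion phases effectively help the model achieve better performance.
Although multi-modal learning has seen significant improvements, most existing open-set detectors evaluate their models only on objects of novel categories. An important scenario, objects described by their attributes, should also be considered. In the literature this task is called referring expression comprehension (REC). It is a closely related field but has been overlooked in previous open-set detection work. In this work, the authors extend open-set detection to support REC and also evaluate performance on REC datasets.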
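Related work:
- Grounding DINO is built on DINO. Since DETR was proposed, many improvements have followed over the past few years. DAB-DETR introduces anchor boxes as DETR queries for more accurate box prediction. DN-DETR proposes a query denoising approach to stabilize bipartite matching. DINO further develops several techniques, including contrastive denoising. However, such detectors focus only on closed-set detection and are difficult to generalize to novel classes because of their limited predefined categories.
- Open-set object detection is trained on existing bounding box annotations and aims to detect arbitrary classes with the help of language generalization. OV-DETR uses image and text embeddings encoded by CLIP as queries to decode category-specified boxes in the DETR framework. GLIP formulates object detection as a grounding problem and leverages additional grounding data to help learn aligned semantics at the phrase and region levels. It shows that such a formulation can achieve even stronger performance on fully supervised detection benchmarks. DetCLIP involves large-scale image captioning datasets and uses generated pseudo labels to expand the knowledge database. The generated pseudo labels effectively help extend generalization ability.
However, previous work fuses multi-modal information only in some of the phases, which may lead to sub-optimal language generalization. For example, GLIP considers fusion only in the feature enhancement phase, and OV-DETR only injects language information at the decoder inputs. In addition, the REC task is overlooked in evaluation, even though it is an important scenario for open-set detection.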
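Given an (image, text) pair, Grounding DINO outputs multiple pairs of object boxes and noun phrases. For example, the model locates a cat and a table in the input image and extracts "cat" and "table" from the input text as the corresponding labels. Both object detection and REC tasks can be aligned with this pipeline. Following GLIP, the names of all categories are concatenated as the input text for object detection tasks. REC requires one bounding box for each text input, so the output object with the largest score is used as the output for the REC task.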
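
The two task formats described above can be sketched in a few lines. This is only an illustration, not the authors' code: the function names, the `" . "` separator, and the lowercasing are assumptions made for the example, and the boxes and scores are made-up values.

```python
def build_detection_prompt(category_names, separator=" . "):
    """Concatenate category names into a single text prompt.

    The separator and the lowercasing are assumptions for this sketch;
    the text above only says that category names are concatenated.
    """
    cleaned = [name.strip().lower() for name in category_names if name.strip()]
    return separator.join(cleaned)


def select_rec_box(boxes, scores):
    """REC needs exactly one box per text input: keep the highest-scoring one."""
    if not boxes or len(boxes) != len(scores):
        raise ValueError("boxes and scores must be non-empty and aligned")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return boxes[best]


# Detection: all category names go into one prompt.
prompt = build_detection_prompt(["cat", "table", "Dog"])

# REC: hypothetical predictions for one referring expression.
boxes = [(10, 20, 50, 80), (5, 5, 30, 30), (60, 40, 90, 95)]
scores = [0.41, 0.87, 0.12]
best_box = select_rec_box(boxes, scores)
```

Detection keeps every confident box together with its matched phrase, while REC collapses the same predictions to the single top-scoring box.
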
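Grounding DINO is a dual-encoder, single-decoder architecture. It contains an image backbone for image feature extraction, a text backbone for text feature extraction, a feature enhancer for image and text feature fusion, a language-guided query selection module for query initialization, and a cross-modality decoder for box refinement. For each (image, text) pair, vanilla image features and text features are first extracted. Both are fed into the feature enhancer module for cross-modality fusion. After obtaining cross-modality text and image features, a language-guided query selection module selects cross-modality queries from the image features. Like the object queries in most DETR-like models, these cross-modality queries are fed into a cross-modality decoder to probe desired features from the two modal features. The output queries of the last decoder layer are used to predict object boxes and extract the corresponding phrases.
Feature Extraction and Enhancer: Given an (image, text) pair, an image backbone extracts multi-scale image features, and a text backbone such as BERT extracts text features. As in previous DETR-like detectors, the multi-scale features are taken from the outputs of different blocks. After extracting the vanilla image and text features, both are fed into a feature enhancer for cross-modality feature fusion. The feature enhancer consists of multiple feature enhancer layers. Each layer uses deformable self-attention to enhance image features and vanilla self-attention to enhance text features. Inspired by GLIP, an image-to-text cross-attention and a text-to-image cross-attention are added for feature fusion. These modules help align features of the different modalities.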
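
To make the bidirectional fusion concrete, here is a heavily simplified sketch of the cross-attention part of one enhancer layer. It is an illustration under stated assumptions, not the paper's implementation: attention is single-head with no learned projections, the self-attention and deformable self-attention sublayers, layer normalization, and FFNs are omitted, and both updates are computed from the pre-update features. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Each query row becomes a softmax-weighted average of the memory rows.

    Single-head, scaled dot-product attention without learned projections.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        out.append([sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)])
    return out


def add_rows(x, update):
    """Residual connection: element-wise sum of two token matrices."""
    return [[a + b for a, b in zip(rx, ru)] for rx, ru in zip(x, update)]


def enhancer_cross_fusion(image_feats, text_feats):
    """Image tokens attend to text tokens and text tokens attend to image tokens."""
    image_update = cross_attention(image_feats, text_feats)
    text_update = cross_attention(text_feats, image_feats)
    return add_rows(image_feats, image_update), add_rows(text_feats, text_update)


image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 image tokens, dim 2
text_feats = [[1.0, 0.0], [0.0, 1.0]]               # 2 text tokens, dim 2
new_image, new_text = enhancer_cross_fusion(image_feats, text_feats)
```

After the call, every image token carries a mixture of text features and every text token a mixture of image features, which is the alignment effect the enhancer is meant to provide.
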
Language-Guided Query Selection: Grounding DINO aims to detect the objects in an image that are specified by the input text. To use the input text to guide object detection, the authors design a language-guided query selection module that selects the features most relevant to the input text as decoder queries. The module outputs num_query indices, and the queries are initialized from the features at the selected indices. Following DINO, mixed query selection is used to initialize the decoder queries. Each decoder query has two parts: a content part and a positional part. The positional part is formulated as dynamic anchor boxes, which are initialized with encoder outputs. The content part is set to be learnable during training.
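
One way to read the selection step is: score every image token by its best match to any text token, then keep the top num_query tokens. The sketch below follows that reading in plain Python. The dot-product similarity and the max over text tokens are assumptions of this sketch, and the function name and toy features are hypothetical.

```python
def language_guided_query_selection(image_feats, text_feats, num_query):
    """Return the indices of the num_query image tokens most relevant to the text.

    Relevance of an image token is taken to be its largest dot product
    with any text token (an assumption of this sketch).
    """
    max_scores = []
    for img in image_feats:
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        max_scores.append(max(sims))
    # Sort token indices by relevance, highest first, and keep the top num_query.
    order = sorted(range(len(image_feats)), key=lambda i: max_scores[i], reverse=True)
    return order[:num_query]


image_feats = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.8], [0.3, 0.3]]  # 4 image tokens
text_feats = [[1.0, 0.0], [0.0, 1.0]]                            # 2 text tokens
selected = language_guided_query_selection(image_feats, text_feats, num_query=2)
```

The returned indices only pick which encoder positions seed the decoder queries; per the description above, the positional part comes from those positions while the content part stays learnable.
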
Cross-Modality Decoder: A cross-modality decoder is developed to combine image and text modality features. In each cross-modality decoder layer, every cross-modality query is fed into a self-attention layer, an image cross-attention layer that combines image features, a text cross-attention layer that combines text features, and an FFN layer. Each decoder layer has an extra text cross-attention layer because text information needs to be injected into the queries for better modality alignment.
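
The order of sublayers in one decoder layer can be sketched as follows. This is a structural illustration only: attention is parameter-free and single-head, the FFN is replaced by a ReLU placeholder, and the residual connections without layer normalization are an assumption borrowed from standard Transformer practice. None of the names come from the paper.

```python
import math


def attend(queries, memory):
    """Parameter-free single-head attention standing in for a learned layer."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * m[d] for w, m in zip(weights, memory))
                    for d in range(len(q))])
    return out


def residual(x, update):
    """Add a sublayer output back onto its input."""
    return [[a + b for a, b in zip(rx, ru)] for rx, ru in zip(x, update)]


def ffn(x):
    """Placeholder feed-forward: element-wise ReLU instead of learned linear layers."""
    return [[max(0.0, v) for v in row] for row in x]


def cross_modality_decoder_layer(queries, image_feats, text_feats):
    queries = residual(queries, attend(queries, queries))      # self-attention
    queries = residual(queries, attend(queries, image_feats))  # image cross-attention
    queries = residual(queries, attend(queries, text_feats))   # text cross-attention
    return residual(queries, ffn(queries))                     # FFN


queries = [[1.0, 0.0]]
image_feats = [[0.0, 1.0], [1.0, 1.0]]
text_feats = [[1.0, 0.0]]
updated = cross_modality_decoder_layer(queries, image_feats, text_feats)
```

The third line of the layer body is the extra text cross-attention: removing it leaves a layer that only looks at image features.
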
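Sub-Sentence Level Text Feature: Previous work explores two kinds of text prompts, which the authors call sentence-level representation and word-level representation. Sentence-level representation encodes a whole sentence into one feature. If a sentence in phrase grounding data contains multiple phrases, it extracts these phrases and discards the other words. This removes the influence between words but loses the fine-grained information in the sentence. Word-level representation can encode multiple category names in one forward pass, but it introduces unnecessary dependencies among categories, especially when the input text is a concatenation of multiple category names in arbitrary order. As the figure in the paper shows, some unrelated words interact during attention. To avoid these unwanted word interactions, the authors introduce attention masks that block attention among unrelated category names, which they call sub-sentence level representation. It eliminates the influence between different category names while keeping per-word features for fine-grained understanding.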
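
The masking idea can be illustrated with a block-diagonal attention mask: a token may attend only to tokens from the same category name. The sketch below assumes whitespace tokenization and ignores special and separator tokens, which a real text backbone such as BERT would add; the function name is hypothetical.

```python
def sub_sentence_attention_mask(phrases):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Tokens are whitespace-split words (an assumption of this sketch), and
    attention is allowed only between tokens of the same phrase.
    """
    tokens = []
    group_ids = []
    for group, phrase in enumerate(phrases):
        for word in phrase.split():
            tokens.append(word)
            group_ids.append(group)
    n = len(tokens)
    mask = [[group_ids[i] == group_ids[j] for j in range(n)] for i in range(n)]
    return tokens, mask


tokens, mask = sub_sentence_attention_mask(["cat", "baseball bat", "table"])
for token, row in zip(tokens, mask):
    print(f"{token:>8}", ["x" if allowed else "." for allowed in row])
```

In this example "baseball" and "bat" can see each other, so the multi-word name keeps its internal context, while neither can see "cat" or "table".
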
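Loss function: Following previous DETR-like work, the L1 loss and the GIoU loss are used for bounding box regression. Following GLIP, a contrastive loss between predicted objects and language tokens is used for classification. Specifically, each query is dot-multiplied with the text features to predict a logit for each text token, and a focal loss is computed for each logit. Box regression and classification costs are first used for bipartite matching between predictions and ground truths. The final losses are then computed between ground truths and matched predictions using the same loss components. Following DETR-like models, an auxiliary loss is added after each decoder layer and after the encoder outputs.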
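
The loss components for a single matched prediction can be sketched as below. This is a minimal illustration, not the training code: bipartite matching and auxiliary losses are left out, boxes are assumed to be in (x1, y1, x2, y2) format with positive area, and the focal-loss parameters (alpha=0.25, gamma=2.0) and the loss weights (5.0 for L1, 2.0 for GIoU) are assumptions chosen as common defaults.

```python
import math


def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one logit; target is 1 (positive token) or 0."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def classification_loss(query_feat, text_feats, targets):
    """Dot the query with every text token to get logits, then sum focal losses."""
    logits = [sum(q * t for q, t in zip(query_feat, txt)) for txt in text_feats]
    return sum(sigmoid_focal_loss(l, y) for l, y in zip(logits, targets))


def l1_loss(box_a, box_b):
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def giou(box_a, box_b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, l1_weight=5.0, giou_weight=2.0):
    """Weighted sum of L1 and GIoU losses (weights are assumed values)."""
    return l1_weight * l1_loss(pred, target) + giou_weight * (1.0 - giou(pred, target))


# A query matched to the first of two text tokens, with an imperfect box.
cls = classification_loss([2.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], targets=[1, 0])
reg = box_loss((0.1, 0.1, 0.5, 0.5), (0.1, 0.1, 0.6, 0.5))
```

Because the classification target is defined per text token, the same loss works for category-name prompts and for free-form referring expressions.
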

\(Fig.1^{[1]} The framework of Grounding DINO. We present the overall framework, a feature enhancer layer, and a decoder layer in block 1, block 2, and block 3, respectively.\)

Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection[1]
