Grounding DINO
Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection[1]
The authors are Shilong Liu, Zhaoyang Zeng, Tianhe Ren, and others from Tsinghua University, HKUST, CUHKsz, IDEA Research, and MSRR. Paper citation [1]:
Time
- 2023.Mar
Key Words
- Extend DINO by performing vision-language modality fusion at multiple phases: a feature enhancer, language-guided query selection, and a cross-modality decoder.
- Extend the evaluation of open-set object detection to referring expression comprehension (REC) datasets.
Summary
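- In this paper, the authors propose an open-set object detector called Grounding DINO. It combines the Transformer-based detector DINO with grounded pre-training, so that it can detect arbitrary objects specified by inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization. To fuse the language and vision modalities effectively, the authors divide a closed-set detector into three phases and propose a tight fusion solution consisting of a feature enhancer, language-guided query selection, and a cross-modality decoder. Whereas prior work mainly evaluates open-set object detection on novel categories, the authors also propose evaluating on referring expression comprehension, where objects are specified with attributes. Grounding DINO performs well in all three settings, including benchmarks on COCO, LVIS, and ODinW.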
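
To make the language-guided query selection step more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each image token is scored by its highest dot-product similarity to any text token, and that the top-scoring image tokens are then used to initialize the decoder queries. The function name `select_queries` and the toy feature values are hypothetical and exist only for illustration.

```python
from typing import List, Sequence


def select_queries(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens that best match any text token.

    Assumption (illustrative only): similarity is a plain dot product, and
    each image token is scored by its maximum similarity over text tokens.
    """
    scores = []
    for img in image_feats:
        # Dot-product similarity between this image token and every text token.
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        # Keep only the best-matching text token for this image token.
        scores.append(max(sims))
    # Rank image tokens by score, highest first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: 4 image tokens and 2 text tokens, each of dimension 2.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.9, 0.1]]
selected = select_queries(image_feats, text_feats, num_queries=2)
print(selected)
```

In a real detector, the features would be batched tensors and the selection would typically be a single top-k call over the similarity matrix. The loop form above only spells out the idea that text relevance decides which image locations become queries.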