GLIP
Grounded Language-Image Pre-training[1]
The authors are Liunian Harold Li, Pengchuan Zhang, and others from UCLA, Microsoft Research, UW, and other institutions. Citation [1]: Li, Liunian Harold et al. "Grounded Language-Image Pre-training." 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 10955-10965.
Time
- 2022.Jun
Key words
- object-level representation
- One-sentence summary: GLIP reformulates detection as a grounding task. By aligning each region/box with phrases in a text prompt, GLIP jointly trains the image and language encoders to predict the correct region-word pairings. It also adds deep fusion between the two modalities to learn language-aware visual representations.
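The region-word pairing above can be sketched numerically. This is a toy illustration, not GLIP's actual code: the feature dimension, the prompt tokens, and the random features are all assumptions; only the scoring scheme (dot product between region and word features) follows the paper's formulation.

```python
import numpy as np

# Toy sketch of detection-as-grounding (illustrative, not official GLIP code).
# Region features O (N regions x d) would come from the visual encoder;
# word features P (M tokens x d) from the language encoder. Here both are
# random stand-ins; d=8 and the prompt below are assumed for illustration.
rng = np.random.default_rng(0)
d = 8
prompt_tokens = ["person", ".", "bicycle", ".", "car"]  # class names as a text prompt
O = rng.normal(size=(3, d))                 # 3 candidate regions
P = rng.normal(size=(len(prompt_tokens), d))

# Alignment scores: S = O @ P^T, one score per (region, word) pair.
# Training supervises S so that each region matches its ground-truth phrase.
S = O @ P.T                                 # shape (3, 5)

# Inference view: each region is "classified" by its best-aligned word.
best_word = [prompt_tokens[j] for j in S.argmax(axis=1)]
print(S.shape, best_word)
```

The key design choice this illustrates: classification logits are replaced by region-word alignment scores, so detection and phrase grounding share one output space.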
Summary:
- The paper proposes a grounded language-image pre-training model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. This unification brings two benefits: 1. GLIP can learn from both detection and grounding data, improving both tasks and bootstrapping a good grounding model. 2. Through self-training, GLIP generates grounding boxes for large numbers of image-text pairs, making the learned representations semantic-rich.
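The self-training step in benefit 2 can be sketched as pseudo-labeling: a teacher model scores region-phrase pairs on web image-text data, and high-confidence pairs become grounding supervision for the student. The threshold, box format, and helper name below are assumptions for illustration, not GLIP's actual pipeline.

```python
import numpy as np

# Toy sketch of self-training on image-text pairs (assumed details,
# not GLIP's real pipeline): threshold a teacher's region-phrase
# alignment scores to produce pseudo grounding labels.
def pseudo_label(scores, boxes, phrases, thresh=0.5):
    """Keep (box, phrase) pairs whose sigmoid alignment score passes thresh."""
    probs = 1.0 / (1.0 + np.exp(-scores))   # (num_boxes, num_phrases)
    labels = []
    for i in range(probs.shape[0]):
        j = int(probs[i].argmax())          # best-aligned phrase per box
        if probs[i, j] >= thresh:
            labels.append((boxes[i], phrases[j]))
    return labels

# Hypothetical teacher output for one image-caption pair.
boxes = [(10, 10, 50, 60), (5, 40, 90, 120)]      # xyxy, illustrative
phrases = ["a woman", "her dog"]                   # phrases from the caption
scores = np.array([[2.0, -1.0],                    # box 0 aligns with "a woman"
                   [-3.0, 0.2]])                   # box 1 weakly with "her dog"
print(pseudo_label(scores, boxes, phrases))
```

These pseudo grounding boxes are what let the semantics of free-form captions (not just a fixed label set) flow into the learned visual representation.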