BLIP2
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[1]
The authors are Junnan Li et al. from Salesforce Research. Reference [1]: Li, Junnan et al. "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." International Conference on Machine Learning (2023).
Time
- 2023.Jun
Key Words
- One-sentence summary: BLIP-2 is a vision-language pre-training method that bootstraps from frozen pre-trained unimodal models. To bridge the modality gap, it introduces a Querying Transformer (Q-Former), which is pre-trained in two stages: the first stage performs vision-language representation learning with a frozen image encoder; the second stage performs vision-to-language generative learning with a frozen LLM (a minimal architecture sketch follows below).
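The sketch below is not the paper's implementation; the class name `TinyQFormer`, the layer dimensions, and the dummy input tensor are illustrative assumptions. It only shows the core bridging idea of the second stage: a small set of learnable query tokens cross-attends to features from a frozen image encoder, and the query outputs are linearly projected into the frozen LLM's embedding space as soft visual prompts, so that only the Q-Former side receives gradients.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Toy single-block Q-Former: learnable queries -> cross-attention -> FFN -> projection."""
    def __init__(self, num_queries=32, dim=768, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)  # learnable query tokens
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.proj = nn.Linear(dim, llm_dim)  # map query outputs to the LLM embedding width

    def forward(self, image_feats):
        # image_feats: (B, N_patches, dim) features from a frozen image encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = q + self.cross_attn(q, image_feats, image_feats, need_weights=False)[0]
        q = q + self.ffn(q)
        return self.proj(q)  # (B, num_queries, llm_dim) soft visual prompts for the frozen LLM

# Frozen encoder/LLM parameters would be kept with requires_grad=False;
# only the Q-Former (and projection) is trained.
image_feats = torch.randn(2, 257, 768)   # e.g. patch features from a frozen ViT (assumed shape)
soft_prompts = TinyQFormer()(image_feats)
print(soft_prompts.shape)                 # torch.Size([2, 32, 2560])
```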
Summary
- End-to-end training of large-scale models has made the cost of vision-and-language pre-training increasingly prohibitive. BLIP-2 bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. It bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages: the first stage bootstraps vision-language representation learning from a frozen image encoder, and the second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on multiple vision-language tasks while requiring far fewer trainable parameters (a usage sketch follows below).
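As a practical follow-up, the snippet below is a minimal usage sketch assuming the Hugging Face transformers implementation of BLIP-2 and the public "Salesforce/blip2-opt-2.7b" checkpoint (frozen ViT + OPT-2.7B); it is not part of the paper, and the local image path is hypothetical.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the BLIP-2 model (frozen image encoder + Q-Former + frozen OPT LLM)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
model.to("cuda")

image = Image.open("example.jpg")  # hypothetical local image
prompt = "Question: what is in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

# Generate an answer conditioned on the image and the text prompt
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```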