BLIP2

Posted on 2025-03-28, updated on 2025-03-29, filed under Papers

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models^[1]

The authors are Junnan Li and colleagues from Salesforce Research. Reference [1]: Li, Junnan et al. “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.” International Conference on Machine Learning (2023).

Time

2023.Jun

Key Words

One-sentence summary: BLIP-2 is a vision-language pre-training method that bootstraps from frozen pre-trained unimodal models. To bridge the modality gap, it proposes the Querying Transformer (Q-Former), which is pre-trained in two stages: a vision-language representation learning stage with a frozen image encoder, followed by a vision-to-language generative learning stage with a frozen LLM.

Summary

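The cost of vision-and-language pre-training has become increasingly prohibitive because large-scale models are trained end to end. This paper proposes BLIP-2, which bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer (Q-Former) that is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves SOTA performance on several vision-language tasks while having fewer trainable parameters.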

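Vision-language pre-training (VLP) has advanced rapidly in recent years, with large-scale pre-trained models continually pushing the SOTA on downstream tasks. However, most SOTA vision-language models incur a high computation cost during pre-training. Vision-language research sits at the intersection of vision and language, so it is natural to expect vision-language models to take advantage of the readily available unimodal models from the vision and natural language communities. In this paper, the authors propose a generic and compute-efficient VLP method that bootstraps from off-the-shelf pre-trained vision models and language models. Pre-trained vision models provide high-quality visual representations, while pre-trained language models, LLMs in particular, provide strong language generation and zero-shot transfer abilities. To reduce computation cost and counteract catastrophic forgetting, the unimodal pre-trained models stay frozen during pre-training.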

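To leverage pre-trained unimodal models for VLP, the key is to facilitate cross-modal alignment. However, since LLMs have not seen images during their unimodal pre-training, freezing them makes vision-language alignment particularly challenging. Existing methods in this setting resort to an image-to-text generation loss, which the authors show is insufficient to bridge the modality gap.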

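To achieve vision-language alignment with frozen unimodal models, the authors propose the Querying Transformer (Q-Former), pre-trained with a new two-stage strategy. Q-Former is a lightweight transformer that uses a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM, feeding the most useful visual features to the LLM so that it outputs the desired text. In the first pre-training stage, vision-language representation learning pushes Q-Former to learn the visual representation most relevant to the text. In the second stage, vision-to-language generative learning connects the output of Q-Former to a frozen LLM and trains Q-Former so that its output visual representation can be interpreted by the LLM.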

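The authors name this VLP framework BLIP-2: Bootstrapping Language-Image Pre-training with Frozen unimodal models. Its main advantages are:
- It makes effective use of both frozen pre-trained image models and language models, bridging the modality gap with a Q-Former pre-trained in two stages: a representation learning stage and a generative learning stage. BLIP-2 achieves SOTA performance on several vision-language tasks, including visual question answering, image captioning, and image-text retrieval.
- Powered by LLMs, BLIP-2 can be prompted to perform zero-shot image-to-text generation that follows natural language instructions, which enables capabilities such as visual knowledge reasoning and visual conversation.
- Because it uses frozen unimodal models and a lightweight Q-Former, BLIP-2 is compute-efficient.

Vision-language pre-training aims to learn multimodal foundation models that improve performance on a range of vision-and-language tasks. Depending on the downstream task, different model architectures have been proposed, including the dual-encoder architecture, the fusion-encoder architecture, the encoder-decoder architecture, and others. Many pre-training objectives have also been proposed over the years, and they have progressively converged to a few time-tested ones such as image-text contrastive learning and (masked) language modeling. Most VLP methods pre-train end to end on large-scale image-text pair datasets. As model size grows, this pre-training incurs a very high computation cost. In addition, end-to-end pre-trained models are inflexible when it comes to reusing existing unimodal pre-trained models.

Some methods freeze the image encoder: early work adopts a frozen object detector to extract visual features, and the more recent LiT uses a frozen pre-trained image encoder for CLIP pre-training. Other methods freeze the language model in order to use the knowledge of LLMs for vision-to-language generation tasks. The main challenge with a frozen LLM is aligning visual features to the text space. To this end, Frozen fine-tunes an image encoder whose outputs are used directly as soft prompts for the LLM, while Flamingo inserts new cross-attention layers into the LLM to inject visual features and pre-trains these new layers on billions of image-text pairs. Both methods adopt a language modeling loss, where the language model generates text conditioned on the image.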

Unlike existing methods, BLIP-2 can efficiently use both frozen image encoders and frozen LLMs for a variety of vision-language tasks, and it achieves better performance.
The authors propose BLIP-2, a new VLP method that bootstraps from pre-trained unimodal models. To bridge the modality gap, they propose a Q-Former pre-trained in two stages: a vision-language representation learning stage with a frozen image encoder and a vision-to-language generative learning stage with a frozen LLM.
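Model architecture: the authors propose Q-Former as the trainable module that bridges the gap between a frozen image encoder and a frozen LLM. It extracts a fixed number of output features from the image encoder, independent of the input image resolution. As shown in Fig. 2, Q-Former consists of two transformer submodules that share the same self-attention layers: an image transformer that interacts with the frozen image encoder for visual feature extraction, and a text transformer that can function as both a text encoder and a text decoder. The authors create a set of learnable query embeddings as input to the image transformer. The queries interact with each other through the self-attention layers and with the frozen image features through the cross-attention layers. The queries can also interact with the text through the same self-attention layers. Depending on the pre-training task, different self-attention masks are applied to control the query-text interaction. Q-Former is initialized with the pre-trained weights of BERT, whereas the cross-attention layers are randomly initialized. The experiments use 32 queries, each with a dimension of 768. Writing \(Z\) for the output query representation, the size of \(Z\) is much smaller than the size of the frozen image features. This bottleneck architecture works together with the pre-training objectives to force the queries to extract the visual information most relevant to the text.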
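To make the size of this bottleneck concrete, the sketch below compares the number of values in \(Z\) with the number of values in the frozen image features. The 32 x 768 shape of \(Z\) comes from the text; the class token, the 1024-wide features, and the example resolutions are assumptions chosen for illustration, and the helper names are hypothetical.

```python
# Shape of Z as stated in the text.
NUM_QUERIES, QUERY_DIM = 32, 768
# Assumption for illustration: feature width of the frozen ViT.
VIT_WIDTH = 1024


def vit_token_count(image_size: int, patch_size: int = 14, cls_token: bool = True) -> int:
    """Number of tokens a ViT emits for a square image (assumes one class token)."""
    per_side = image_size // patch_size
    return per_side * per_side + (1 if cls_token else 0)


def bottleneck_ratio(image_size: int, patch_size: int = 14) -> float:
    """How many times larger the image features are than Z, counted in scalar values."""
    image_values = vit_token_count(image_size, patch_size) * VIT_WIDTH
    query_values = NUM_QUERIES * QUERY_DIM
    return image_values / query_values


for size in (224, 448):
    print(size, vit_token_count(size), round(bottleneck_ratio(size), 1))
```

The number of image tokens grows with the input resolution, while \(Z\) keeps the same shape, which is what "independent of the input image resolution" means in practice.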
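In the representation learning stage, Q-Former is connected to a frozen image encoder and pre-trained on image-text pairs. The goal is to train Q-Former so that the queries learn to extract the visual representation that is most informative of the text. Inspired by BLIP, the pre-training objectives are optimized jointly. They share the same input format and model parameters, and each objective uses a different attention masking strategy between queries and text to control their interaction.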
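Image-Text Contrastive Learning (ITC) learns to align the image representation and the text representation so that their mutual information is maximized. It does so by contrasting the image-text similarity of a positive pair against that of negative pairs. The authors align the output query representation \(Z\) from the image transformer with the text representation \(t\) from the text transformer, where \(t\) is the output embedding of the \([CLS]\) token. Since \(Z\) contains multiple output embeddings, the pairwise similarity between each query output and \(t\) is computed first, and the highest one is selected as the image-text similarity. To avoid information leakage, a unimodal self-attention mask is used, so queries and text cannot see each other. Because the image encoder is frozen, more samples fit on each GPU than with end-to-end methods. Therefore, in-batch negatives are used instead of the momentum queue in BLIP.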
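The max-over-queries similarity and the in-batch contrast can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the cosine normalization, the temperature value of 0.07, and the function names are assumptions, and the toy vectors are made up.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def image_text_similarity(query_outputs, text_cls):
    """Highest cosine similarity between any query output and the text [CLS] embedding."""
    t = normalize(text_cls)
    return max(dot(normalize(z), t) for z in query_outputs)


def cross_entropy_diag(rows):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(rows)


def itc_loss(batch_queries, batch_text, temperature=0.07):
    """Symmetric contrastive loss using the other samples in the batch as negatives."""
    n = len(batch_queries)
    sim = [[image_text_similarity(batch_queries[i], batch_text[j]) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy batch: 2 images with 2 query outputs each, 3-dimensional embeddings.
batch_queries = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                 [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]]
batch_text = [[1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(itc_loss(batch_queries, batch_text))
```

Taking the maximum lets a single query carry the match with the text, so the other queries remain free to encode different aspects of the image.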
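Image-grounded Text Generation (ITG) loss trains Q-Former to generate text given the input image as the condition. Since the architecture of Q-Former does not allow direct interaction between the frozen image encoder and the text tokens, the information required for generating the text must first be extracted by the queries and then passed to the text tokens through the self-attention layers. The queries are therefore forced to extract visual features that capture all the information about the text. The authors use a multimodal causal self-attention mask to control the query-text interaction, similar to the one used in UniLM. The queries can attend to each other but not to the text tokens. Each text token can attend to all queries and to the text tokens before it. The \([CLS]\) token is replaced with a new \([DEC]\) token as the first text token to signal the decoding task.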
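The multimodal causal mask can be written out as a small boolean matrix. The sketch below assumes the sequence is laid out as all queries followed by the text tokens and that `True` means "may attend"; the function name is hypothetical. Each text token is also allowed to attend to itself, as in a standard causal mask.

```python
def multimodal_causal_mask(num_queries: int, num_text: int):
    """mask[i][j] is True when position i may attend to position j.

    Positions 0..num_queries-1 are queries; the remaining positions are text tokens.
    """
    total = num_queries + num_text
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_queries:
                mask[i][j] = True       # queries and text both see every query
            elif i >= num_queries:
                mask[i][j] = j <= i     # text sees itself and earlier text only
            # a query row never sees a text column, so it stays False
    return mask


# Print a 2-query, 3-token example as a grid of 1s and 0s.
for row in multimodal_causal_mask(2, 3):
    print(" ".join("1" if allowed else "0" for allowed in row))
```

The top-right block of the grid is all zeros, which is what forces the queries to pull everything the decoder needs out of the image before any text is seen.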
Image-Text Matching (ITM) aims to learn fine-grained alignment between the image and text representations. It is a binary classification task in which the model predicts whether an image-text pair is positive (matched) or negative (unmatched). A bi-directional self-attention mask is used, so all queries and text tokens can attend to each other, and the output query embeddings \(Z\) therefore capture multimodal information. Each output query embedding is fed into a two-class linear classifier to obtain a logit, and the logits are averaged across all queries to give the output matching score. The authors adopt a hard negative mining strategy to create informative negative pairs.
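The scoring step can be sketched as follows: classify every query output separately, then average the logits across queries. The weights, the toy inputs, and the convention that class index 1 means "matched" are assumptions for illustration only.

```python
import math


def two_class_logits(x, weight, bias):
    """Linear classifier: weight has shape [2][dim], bias has shape [2]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def itm_score(query_outputs, weight, bias):
    """Average the per-query logits, then turn them into a match probability."""
    per_query = [two_class_logits(z, weight, bias) for z in query_outputs]
    n = len(per_query)
    avg = [sum(logits[c] for logits in per_query) / n for c in range(2)]
    m = max(avg)
    exps = [math.exp(a - m) for a in avg]
    match_prob = exps[1] / sum(exps)  # assumed convention: index 1 = matched
    return avg, match_prob


# Toy example: 2 query outputs of dimension 2 and a hand-picked classifier.
weight = [[-1.0, 0.0], [1.0, 0.0]]
bias = [0.0, 0.0]
avg_logits, prob = itm_score([[1.0, 0.0], [3.0, 0.0]], weight, bias)
print(avg_logits, prob)
```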
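In the generative pre-training stage, Q-Former (with the frozen image encoder attached) is connected to a frozen LLM to harvest the LLM's generative language capability. As shown in Fig. 3, a fully connected (FC) layer linearly projects the output query embeddings \(Z\) into the same dimension as the text embedding of the LLM. The projected query embeddings are then prepended to the input text embeddings. They function as soft visual prompts that condition the LLM on the visual representation extracted by Q-Former. Since Q-Former has been pre-trained to extract language-informative visual representation, it effectively acts as an information bottleneck that feeds the most useful information to the LLM while removing irrelevant visual information. This reduces the burden on the LLM to learn vision-language alignment and thus mitigates the catastrophic forgetting problem.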
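The projection and prepending step amounts to a small shape transformation, sketched below in plain Python. The toy dimensions, the random initialization, and the helper names are assumptions; in the paper's setting there would be 32 query outputs of width 768 and the target width would be that of the chosen LLM.

```python
import random


def make_linear(in_dim: int, out_dim: int, seed: int = 0):
    """Return a toy fully connected layer with random weights (illustrative only)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def forward(x):
        return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

    return forward


def build_llm_inputs(query_outputs, text_embeddings, project):
    """Project each query output to the LLM width and place it before the text."""
    visual_prompts = [project(z) for z in query_outputs]
    return visual_prompts + list(text_embeddings)


# Toy sizes: 4 query outputs of width 8, an LLM embedding width of 16, 5 text tokens.
Q_DIM, LLM_DIM = 8, 16
query_outputs = [[1.0] * Q_DIM for _ in range(4)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(5)]
inputs = build_llm_inputs(query_outputs, text_embeddings, make_linear(Q_DIM, LLM_DIM))
print(len(inputs), len(inputs[0]))
```

Only the FC layer and Q-Former receive gradients in this stage; the LLM sees the projected queries as if they were extra token embeddings at the start of the sequence.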

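The authors experiment with two types of LLMs: decoder-based LLMs and encoder-decoder-based LLMs. For decoder-based LLMs, pre-training uses the language modeling loss, where the frozen LLM is tasked with generating the text conditioned on the visual representation from Q-Former. For encoder-decoder-based LLMs, pre-training uses a prefix language modeling loss: the text is split into two parts, the prefix text is concatenated with the visual representation as input to the LLM's encoder, and the suffix text is used as the generation target for the LLM's decoder.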
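The prefix language modeling setup for encoder-decoder LLMs can be illustrated with plain lists. How the split point is chosen is not stated in this note, so the uniform random cut below is an assumption, as are the function names and the placeholder strings standing in for embeddings.

```python
import random


def split_prefix_suffix(tokens, rng):
    """Split a caption into a non-empty prefix and a non-empty suffix (assumed random cut)."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to split")
    cut = rng.randint(1, len(tokens) - 1)
    return tokens[:cut], tokens[cut:]


def prefix_lm_example(visual_prompts, tokens, rng):
    """Encoder input is the visual representation plus the prefix; the decoder target is the suffix."""
    prefix, suffix = split_prefix_suffix(tokens, rng)
    encoder_input = list(visual_prompts) + prefix
    decoder_target = suffix
    return encoder_input, decoder_target


rng = random.Random(0)
visual = ["<q0>", "<q1>"]  # placeholders for projected query embeddings
caption = ["a", "dog", "runs", "on", "the", "beach"]
enc, tgt = prefix_lm_example(visual, caption, rng)
print(enc, tgt)
```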
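Pre-training data: the same pre-training dataset as BLIP is used, 129M images in total, including COCO, Visual Genome, and others. The CapFilt method is used to create synthetic captions for the web images. The \(BLIP_{large}\) captioning model generates 10 captions per image, and the synthetic captions are ranked together with the original web caption by the image-text similarity produced by a CLIP ViT-L/14 model. The top two captions per image are kept as training data, and one of them is sampled at random at each pre-training step.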
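The caption selection procedure can be sketched as a rank-and-keep step followed by per-step sampling. In the sketch below, `score_fn` is a hypothetical stand-in for the CLIP ViT-L/14 image-text similarity, and the captions and scores are made up for illustration.

```python
import random


def select_captions(web_caption, synthetic_captions, score_fn, keep=2):
    """Rank the web caption and the synthetic captions together; keep the best `keep`."""
    candidates = [web_caption] + list(synthetic_captions)
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]


def sample_caption(kept_captions, rng):
    """Pick one of the kept captions for the current pre-training step."""
    return rng.choice(kept_captions)


# Made-up similarity scores standing in for a real image-text similarity model.
fake_scores = {"buy now!!": 0.05, "a dog on a beach": 0.31, "a brown dog running": 0.29,
               "a photo": 0.12}
kept = select_captions("buy now!!",
                       ["a dog on a beach", "a brown dog running", "a photo"],
                       fake_scores.get)
rng = random.Random(0)
print(kept, sample_caption(kept, rng))
```

Ranking the original web caption alongside the synthetic ones means a noisy web caption is dropped only when better alternatives exist.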
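Pre-trained image encoder and LLM: for the frozen image encoder, two SOTA pre-trained vision transformer models are explored: ViT-L/14 from CLIP and ViT-g/14 from EVA-CLIP. The last layer of the ViT is removed and the output features of the second-to-last layer are used, which leads to better performance. For the frozen language model, the authors explore the unsupervised-trained OPT model family for decoder-based LLMs and the instruction-trained FlanT5 model family for encoder-decoder-based LLMs.
Pre-training setting: pre-training runs for 250k steps in the first stage and 80k steps in the second stage. The batch size is 2320/1680 for ViT-L/ViT-g in the first stage and 1920/1520 for OPT/FlanT5 in the second stage. During pre-training, the parameters of the frozen ViTs and LLMs are converted to FP16, except for FlanT5, which uses BFloat16. The authors found no performance degradation compared with 32-bit models. Because the models are frozen, pre-training is more computation-friendly than existing VLP methods. For example, on a single 16-A100 machine, the largest model with ViT-g and FlanT5-XXL requires less than 6 days for the first stage and less than 3 days for the second stage. The same pre-training hyperparameters are used for all models, with the AdamW optimizer.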

Overview of BLIP-2 \(Fig.1^{[1]}\): a lightweight Querying Transformer is pre-trained following a two-stage strategy to bridge the modality gap. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen LLM, which enables zero-shot instructed image-to-text generation.

Model Architecture \(Fig.2^{[1]}\): three objectives are optimized jointly so that the queries extract the visual representation most relevant to the text. The self-attention masking strategy for each objective controls the query-text interaction.

BLIP-2 second-stage generative pre-training \(Fig.3^{[1]}\), which bootstraps from frozen LLMs. (Top) Bootstrapping a decoder-based LLM. (Bottom) Bootstrapping an encoder-decoder-based LLM. The FC layer adapts the output dimension of Q-Former to the input dimension of the chosen LLM.
