Qwen2.5-VL

Posted on 2025-07-08. Category: papers. Length: 1.5k characters. Reading time: ≈ 5 minutes.

Qwen2.5-VL Technical Report^[1]

The authors are the Qwen Team at Alibaba. Citation [1]: Bai, Shuai et al. “Qwen2.5-VL Technical Report.” ArXiv abs/2502.13923 (2025): n. pag.

Time

2025.March

Key Words

dynamic resolution processing
window attention

Summary

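Qwen2.5-VL makes substantial progress in both foundational capabilities and new features. One notable feature is that it can localize objects precisely with bounding boxes and points. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, which let it process images of varying sizes and very long videos. The model perceives spatial scale and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution ViT from scratch and incorporating Window Attention, it reduces computational overhead substantially while preserving native resolution. As a result, Qwen2.5-VL handles static images and document understanding well and can also act as an interactive visual agent capable of reasoning, tool usage, and task execution. The model achieves strong generalization across domains without task-specific fine-tuning. Qwen2.5-VL is available in three sizes to cover a range of use cases.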

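The authors' main contributions are:
- Window attention in the visual encoder to optimize inference efficiency.
- FPS sampling, which extends dynamic resolution to the temporal dimension and enables comprehensive video understanding across varied sampling rates.
- An upgraded MRoPE whose temporal component is aligned to absolute time, which facilitates learning of complex temporal sequences.
- Significant effort in curating high-quality data for both pre-training and supervised fine-tuning.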
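The Qwen2.5-VL architecture consists of three main parts:
- LLM: Qwen2.5-VL uses a large language model as its foundational component, initialized with the pre-trained weights of the Qwen2.5 LLM. To better meet the demands of multimodal understanding, the authors replace 1D RoPE with Multimodal Rotary Position Embedding (MRoPE) aligned to absolute time.
- Vision Encoder: The vision encoder is a redesigned ViT that incorporates 2D-RoPE and window attention to support native input resolutions. The height and width of input images are resized to multiples of 28.
- MLP-based Vision-Language Merger: To address the efficiency cost of long image feature sequences, a simple yet effective method compresses the feature sequences before they are passed to the LLM. Specifically, instead of using raw patch features directly, the model first groups sets of four spatially adjacent patch features. The grouped features are then concatenated and passed through a two-layer MLP that projects them to a dimension aligned with the text embeddings. This reduces computational cost and also provides a flexible way to dynamically compress feature sequences of varying lengths.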
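
The merger step described in the last bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`merge_2x2`, `merger`), the toy dimensions, the random weights, and the GELU activation between the two layers are all hypothetical choices made for the example.

```python
import math
import random

def merge_2x2(features, grid_h, grid_w):
    """Concatenate each 2x2 block of spatially adjacent patch features.

    features: list of grid_h * grid_w vectors in row-major order.
    Returns (grid_h // 2) * (grid_w // 2) vectors, each four times as long.
    """
    assert grid_h % 2 == 0 and grid_w % 2 == 0, "grid must split into 2x2 blocks"
    merged = []
    for r in range(0, grid_h, 2):
        for c in range(0, grid_w, 2):
            group = []
            for dr in (0, 1):
                for dc in (0, 1):
                    group.extend(features[(r + dr) * grid_w + (c + dc)])
            merged.append(group)
    return merged

def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(v):
    # Assumption: the notes only say "two-layer MLP"; GELU is an illustrative choice.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def merger(features, grid_h, grid_w, w1, b1, w2, b2):
    """Group, concatenate, then project every merged token with a two-layer MLP."""
    out = []
    for token in merge_2x2(features, grid_h, grid_w):
        hidden = [gelu(v) for v in linear(token, w1, b1)]
        out.append(linear(hidden, w2, b2))
    return out

rng = random.Random(0)
grid_h, grid_w, vit_dim, llm_dim = 4, 6, 3, 5  # toy sizes, not the real model's
features = [[rng.random() for _ in range(vit_dim)] for _ in range(grid_h * grid_w)]
w1, b1 = random_matrix(4 * vit_dim, 4 * vit_dim, rng), [0.0] * (4 * vit_dim)
w2, b2 = random_matrix(llm_dim, 4 * vit_dim, rng), [0.0] * llm_dim
tokens = merger(features, grid_h, grid_w, w1, b1, w2, b2)
print(len(features), "patch features ->", len(tokens), "tokens of width", len(tokens[0]))
```

In this toy run 24 patch features become 6 tokens, so the sequence handed to the language model is four times shorter regardless of the image's grid size.
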
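The vision encoder plays a central role in MLLMs. To address the computational load imbalance that native-resolution inputs cause during training and inference, the authors redesign the ViT. Full self-attention has quadratic complexity in the number of patches, which becomes costly when processing images of varying sizes. To mitigate this, the authors introduce windowed attention, so that computational cost scales linearly with the number of patches rather than quadratically. In their architecture only four layers use full self-attention, while the remaining layers use windowed attention with a maximum window size of \(112 \times 112\). Regions smaller than \(112 \times 112\) are processed without padding and keep their original resolution. This design lets the model operate at the native input resolution and avoids unnecessary scaling or distortion. For positional encoding, the authors adopt 2D Rotary Positional Embedding to capture spatial relationships in 2D space. To better handle video inputs, they extend the approach to 3D patch partitioning. Specifically, they use \(14 \times 14\) patches as the basic unit, consistent with traditional ViTs for static images. For video data, two consecutive frames are grouped together, which substantially reduces the number of tokens fed into the language model. This design keeps compatibility with existing architectures and improves efficiency on video data.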
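
To make the linear-versus-quadratic claim concrete, the sketch below counts query-key pairs for one attention layer on a grid of patches. It assumes non-overlapping windows of 8 patches per side (112 / 14 = 8) and lets edge windows shrink instead of padding them. The function names are hypothetical, and the count ignores heads, feature dimension, and the four full-attention layers.

```python
def full_attention_pairs(grid_h, grid_w):
    """Every patch attends to every patch: (H * W) ** 2 pairs."""
    n = grid_h * grid_w
    return n * n

def windowed_attention_pairs(grid_h, grid_w, window=8):
    """Patches attend only within their own window (8 patches = 112 pixels / 14).

    Windows at the right and bottom edges are simply smaller; nothing is padded.
    """
    total = 0
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            h = min(window, grid_h - top)
            w = min(window, grid_w - left)
            total += (h * w) ** 2
    return total

for side in (64, 128):  # 64 patches per side corresponds to an 896 x 896 image
    print(side * side, "patches:",
          "full =", full_attention_pairs(side, side),
          "windowed =", windowed_attention_pairs(side, side))
```

Going from 4096 to 16384 patches multiplies the full-attention count by 16 (16,777,216 to 268,435,456) but the windowed count only by 4 (262,144 to 1,048,576), which is the linear scaling the paragraph describes.
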

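To streamline the overall network structure, the authors adopt RMSNorm for normalization and SwiGLU as the activation function. The redesigned ViT is trained from scratch in several stages, including CLIP pre-training, vision-language alignment, and end-to-end fine-tuning. To ensure robustness across input resolutions, they use dynamic sampling at native resolution during training: images are randomly sampled according to their original aspect ratios, which lets the model generalize to inputs of diverse resolutions. This approach improves adaptability and also keeps training stable and efficient across data of different sizes.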
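
For readers unfamiliar with the two components named above, here is a minimal sketch of their standard textbook definitions. It is not taken from the Qwen2.5-VL code; the weight names (`w_gate`, `w_up`, `w_down`) and toy values are assumptions, and bias terms are omitted.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by its root mean square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for g, v in zip(gain, x)]

def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: W_down (SiLU(W_gate x) * (W_up x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

x = [3.0, 4.0]
normed = rms_norm(x, gain=[1.0, 1.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn(normed, w_gate=identity, w_up=identity, w_down=[[1.0, 1.0]])
print("normalized:", normed)
print("ffn output:", out)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, and SwiGLU replaces a single activation with a gated product of two linear projections.
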
Multimodal Rotary Position Embedding Aligned to Absolute Time: positional embeddings are crucial for modeling sequential data in both the vision and language modalities. Building on the MRoPE introduced in Qwen2-VL, the authors extend it to better handle temporal information in videos.

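In Qwen2-VL, MRoPE decomposes the position embedding into three components (temporal, height, and width) to model multimodal inputs effectively. For textual inputs, all three components use identical position IDs, which makes MRoPE functionally equivalent to traditional 1D RoPE. For images, the temporal ID stays constant, while unique IDs are assigned to the height and width components based on each token's spatial position. Videos are treated as sequences of frames: the temporal ID increments with each frame, while the height and width components are handled the same way as for static images.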
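
The three assignment rules can be written out as a small sketch that produces (temporal, height, width) ID triples. The function names and the `start` offset, which here shifts all three components when a segment follows earlier tokens, are assumptions made for illustration and may differ from the actual implementation.

```python
def text_ids(start, n_tokens):
    """Text: all three components share one ID, so this reduces to 1D RoPE."""
    return [(start + i, start + i, start + i) for i in range(n_tokens)]

def image_ids(start, grid_h, grid_w):
    """Image: temporal ID is constant; height/width IDs follow the token's grid position."""
    return [(start, start + h, start + w)
            for h in range(grid_h)
            for w in range(grid_w)]

def video_ids(start, n_frames, grid_h, grid_w):
    """Video: temporal ID increases by one per frame; spatial IDs are as for images."""
    return [(start + t, start + h, start + w)
            for t in range(n_frames)
            for h in range(grid_h)
            for w in range(grid_w)]

print("text :", text_ids(0, 3))
print("image:", image_ids(0, 2, 2))
print("video:", video_ids(0, 2, 1, 2))
```
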

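In Qwen2-VL, however, the temporal position IDs are tied to the input frames, which accounts for neither the speed of content change nor the absolute timing of events in the video. To address this limitation, Qwen2.5-VL introduces a key improvement: it aligns the temporal component of MRoPE with absolute time. By leveraging the intervals between temporal IDs, the model can learn consistent temporal alignment across videos sampled at different FPS rates.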
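
A small sketch shows why index-based IDs are ambiguous and how time-aligned IDs fix that. The constant `IDS_PER_SECOND` and both function names are hypothetical; the notes state only that temporal IDs are aligned with absolute time, not the exact scaling used.

```python
IDS_PER_SECOND = 2  # hypothetical scale: how many temporal IDs span one second

def index_based_ids(n_steps):
    """Qwen2-VL style: the temporal ID is just the index of the sampled step."""
    return list(range(n_steps))

def time_aligned_ids(n_steps, steps_per_second, ids_per_second=IDS_PER_SECOND):
    """Qwen2.5-VL style (sketch): the temporal ID is proportional to the timestamp."""
    return [round(i / steps_per_second * ids_per_second) for i in range(n_steps)]

# Two samplings of the same 4-second clip: 1 step/s and 2 steps/s.
slow = time_aligned_ids(4, steps_per_second=1)
fast = time_aligned_ids(8, steps_per_second=2)
print("index-based :", index_based_ids(4), index_based_ids(8))
print("time-aligned:", slow, fast)
```

With index-based IDs, the ID 3 means second 3 in the slow sampling but second 1.5 in the fast one. With time-aligned IDs, the moment at second 3 gets ID 6 in both, and the gap between consecutive IDs (2 versus 1) tells the model how densely the video was sampled.
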

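\(Fig.1^{[1]}\) The Qwen2.5-VL framework combines a vision encoder with a language model decoder to process multimodal inputs, including images and videos. The vision encoder handles inputs at their native resolution and supports dynamic FPS sampling. Images of different sizes and video frames at different FPS rates are dynamically mapped to token sequences of varying lengths. Notably, MRoPE aligns time IDs with absolute time along the temporal dimension, which helps the model better understand temporal dynamics such as the pace of events and precise moment localization. The processed visual data is then fed into the Qwen2.5 LM decoder.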
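
As a rough illustration of how input size maps to sequence length, the sketch below combines three facts from the notes: \(14 \times 14\) patches, merging of four adjacent patches (so one token covers \(28 \times 28\) pixels), and grouping of two consecutive video frames. The function names, the nearest-multiple rounding rule, and the handling of an odd frame count are assumptions, not details confirmed by the source.

```python
PATCH = 14            # pixels per patch side
MERGE = 2             # 2 x 2 patches are merged into one token
UNIT = PATCH * MERGE  # 28 pixels per token side

def round_to_unit(pixels):
    """Assumed rule: snap a side length to the nearest multiple of 28 (at least 28)."""
    return max(UNIT, round(pixels / UNIT) * UNIT)

def image_tokens(height, width):
    """Tokens reaching the language model for one image."""
    h, w = round_to_unit(height), round_to_unit(width)
    return (h // UNIT) * (w // UNIT)

def video_tokens(n_frames, height, width, frames_per_step=2):
    """Tokens for a video: two consecutive frames share one temporal step.

    Assumption: an odd trailing frame still occupies a full step.
    """
    steps = (n_frames + frames_per_step - 1) // frames_per_step
    return steps * image_tokens(height, width)

print("1092 x 8204 image :", image_tokens(1092, 8204), "tokens")
print("8 frames, 644x1204:", video_tokens(8, 644, 1204), "tokens")
```
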
