OpenVLA

Posted on 2025-04-09 | Category: Papers | Word count: 1.1k | Reading time ≈ 4 min

OpenVLA: An Open-Source Vision-Language-Action Model^[1]

The authors are Moo Jin Kim et al., from Stanford, UC Berkeley, and other institutions. Citation [1]: Kim, Moo Jin et al. "OpenVLA: An Open-Source Vision-Language-Action Model." ArXiv abs/2406.09246 (2024): n. pag.

Time

2024.Sep

Key Words

Open model, pretrained on internet-scale vision-language datasets, and a visual encoder that fuses DINOv2 and SigLIP features.

Summary

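Policies pretrained on a combination of internet-scale vision-language data and diverse robot demonstrations have the potential to change how robots are taught new skills: rather than training new behaviors from scratch, one can fine-tune vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Two challenges hold VLAs back in robotics today: existing VLAs are largely closed-source and inaccessible to the public, and prior work has not explored methods for efficiently fine-tuning VLAs for new tasks. The authors propose OpenVLA to address both. It is a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on Llama 2, combined with a visual encoder that fuses features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X with 7x fewer parameters. The authors further show that OpenVLA can be fine-tuned effectively for new settings, with especially strong generalization in multi-task environments involving multiple objects and strong language grounding abilities, where it outperforms from-scratch imitation learning methods such as Diffusion Policy.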

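A key weakness of learned policies for robotic manipulation is that they fail to generalize beyond their training data. Existing policies trained for individual skills or language instructions can extrapolate behaviors to new initial conditions such as object positions or lighting, but they lack robustness to scene distractors or novel objects and struggle to execute unseen task instructions. Existing foundation models for vision and language, such as CLIP, SigLIP, and Llama 2, do generalize in these ways, thanks to the priors captured by their internet-scale pretraining datasets. Reproducing pretraining at that scale for robotics remains an open challenge, however: even the largest robot manipulation datasets contain only 100k to 1M examples. This imbalance suggests an opportunity: use existing vision and language foundation models as a core building block for training robotic policies that generalize to objects, scenes, and tasks beyond their training data.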
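Toward this goal, existing work has explored integrating pretrained language models and VLMs for robotic representation learning, and as components in modular systems for task planning and execution. They have also been used to directly learn VLA models for control. VLAs offer a direct way to bring pretrained vision-language foundation models to robotics: visually-conditioned language models (VLMs) such as PaLI are fine-tuned directly to produce robot control actions. By building on strong foundation models trained on internet-scale data, VLAs such as RT-2 show impressive results and generalize to novel objects and tasks, setting a new standard for generalist robot policies. Two key reasons prevent the widespread use of existing models: they are closed, so little is known about the model architecture, training procedure, and data mixture; and existing work does not provide best practices for deploying VLAs and adapting them to new robots, environments, and tasks, especially on consumer-grade hardware.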
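OpenVLA consists of a pretrained visually-conditioned language model backbone that captures visual features at multiple granularities and is fine-tuned on a large dataset. The authors state that they are the first to demonstrate an efficient fine-tuning approach for VLAs based on low-rank adaptation (LoRA).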
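To make the LoRA idea concrete, here is a minimal pure-Python sketch of a single linear layer with a low-rank update. It is an illustration of the general technique, not OpenVLA's training code: the class name `LoRALinear`, the helper `matmul`, and the `rank`, `alpha`, and `seed` parameters are all assumptions introduced for this example. The frozen weight `W` stays untouched, and only the two small matrices `A` and `B` would be trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical layer: frozen weight W (d_out x d_in) plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, never updated
        self.scale = alpha / rank     # common LoRA scaling convention (assumed here)
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """Compute W x + scale * B (A x) for an input vector x of length d_in."""
        col = [[v] for v in x]
        base = matmul(self.weight, col)
        delta = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * d[0] for b, d in zip(base, delta)]

    def trainable_params(self):
        """Number of entries in A and B, the only parameters that would be trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    W = [[float(i + j) for j in range(8)] for i in range(8)]
    layer = LoRALinear(W, rank=1)
    print("full weight entries:", len(W) * len(W[0]))
    print("trainable LoRA entries:", layer.trainable_params())
```

Because `B` is initialized to zero, the layer reproduces the pretrained output exactly before any training step, and the number of trainable entries grows with `rank * (d_in + d_out)` rather than `d_in * d_out`, which is what makes this style of fine-tuning attractive on limited hardware.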
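Octo trains a generalist policy that can control multiple robots out of the box and supports flexible fine-tuning to new robot setups. The main difference between approaches of this kind and OpenVLA is the model architecture.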
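VLA models: many works have explored using VLMs for robotics, for example for visual state representation, object detection, high-level planning, and providing a feedback signal. Others integrate VLMs into end-to-end visuomotor manipulation policies, but they introduce significant structure into the policy architecture or require calibrated cameras, which limits their applicability. A number of recent works have explored a similar recipe: directly fine-tuning pretrained VLMs to predict robot actions. Such models are called VLA models because they fuse robot control actions directly into VLM backbones. This has three benefits: 1) the pretrained vision and language components are aligned on large internet-scale vision-language datasets; 2) the architecture is generic rather than custom-made for robot control, so it can use the scalable infrastructure underlying modern VLM training; 3) it provides a direct pathway for robotics to benefit from rapid progress in VLMs.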

OpenVLA architecture \(Fig.1^{[1]}\): given an image observation and a language instruction, the model predicts 7-dimensional robot control actions. The architecture has three key components: a vision encoder that concatenates DINOv2 and SigLIP features; a projector that maps the visual features into the language embedding space; and an LLM backbone, a Llama 2 7B model.
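The data flow through the first two components can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the actual implementation: the function names `fuse_patch_features`, `project_to_llm_space`, and `build_llm_input` are hypothetical, the projector is reduced to a single linear map, the feature sizes are made up, and placing visual tokens before the instruction tokens is an assumed ordering. The LLM backbone, which would consume the resulting sequence and predict the 7-dimensional action, is left out.

```python
def fuse_patch_features(dino_feats, siglip_feats):
    """Concatenate per-patch features from the two encoders along the channel axis."""
    if len(dino_feats) != len(siglip_feats):
        raise ValueError("both encoders must produce the same number of patches")
    return [d + s for d, s in zip(dino_feats, siglip_feats)]


def project_to_llm_space(patches, weight):
    """Map each fused patch vector into the language embedding space.

    `weight` has shape (llm_dim x fused_dim). A single linear map is an
    assumption made to keep the sketch short.
    """
    return [[sum(w * v for w, v in zip(row, patch)) for row in weight] for patch in patches]


def build_llm_input(visual_tokens, text_tokens):
    """Place projected visual tokens ahead of the instruction tokens (assumed ordering)."""
    return visual_tokens + text_tokens


if __name__ == "__main__":
    # Toy sizes: 2 patches, 2 DINOv2 channels, 1 SigLIP channel, LLM dim 2.
    dino = [[1.0, 2.0], [3.0, 4.0]]
    siglip = [[5.0], [6.0]]
    projector = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    instruction = [[0.0, 0.0]]  # stand-in for embedded instruction tokens

    fused = fuse_patch_features(dino, siglip)
    visual_tokens = project_to_llm_space(fused, projector)
    sequence = build_llm_input(visual_tokens, instruction)
    print(len(sequence), "tokens of width", len(sequence[0]))
```

The point of the sketch is the shape bookkeeping: concatenation widens each patch vector to the sum of the two encoders' channel counts, and the projector brings it to the same width as the language token embeddings so that both can sit in one input sequence.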
