Young's Blog

Text-guided Video MAE

Posted on 2025-07-08, Category: Papers, Word count: 2.5k, Reading time ≈ 9 min

Text-guided Video Masked Autoencoder^[1]

The authors are David Fan et al. from Amazon. Citation [1]: Fan, David et al. “Text-Guided Video Masked Autoencoder.” European Conference on Computer Vision (2024).

Time

2024.Aug

Key Words

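One-sentence summary: captions, as information-dense natural language, can capture the salient content of a video without relying on prior assumptions.

Summary

Recent Video MAE work has designed improved masking algorithms that use visual cues such as motion to mask the most salient regions. However, the robustness of these visual cues depends on how well the input video matches the underlying assumptions. Natural language descriptions, by contrast, are an information-dense representation that needs no modality-specific assumptions and implicitly captures what is salient in a video, yet this has not been explored for video MAE. To this end, the authors introduce a new text-guided masking algorithm (TGM) that masks the video regions most closely corresponding to the paired captions, without using any explicit visual cues for saliency. TGM is competitive with motion-guided masking. To further exploit the semantics of natural language for masked reconstruction, the authors then introduce a unified framework for joint MAE and masked video-text contrastive learning. They show that, across existing masking algorithms, unifying MAE with masked video-text contrastive learning improves downstream performance over pure MAE.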
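The idea of masking the regions most related to the caption can be illustrated with a minimal sketch. It assumes that patch embeddings and the caption embedding already live in a shared space (for example from a CLIP-style encoder), and it uses cosine similarity with a top-k rule. The function names, the `mask_ratio` parameter, and the scoring rule are illustrative assumptions, and the paper's exact procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def text_guided_mask(patch_embs, caption_emb, mask_ratio=0.75):
    """Return a boolean list where True marks a masked patch.

    Hypothetical helper: patches whose embeddings are most similar to the
    caption embedding are masked first, so the caption decides what counts
    as salient instead of a visual cue such as motion.
    """
    scores = [cosine(p, caption_emb) for p in patch_embs]
    num_masked = int(mask_ratio * len(patch_embs))
    # Rank patch indices from most to least caption-aligned.
    ranked = sorted(range(len(patch_embs)), key=lambda i: scores[i], reverse=True)
    masked = set(ranked[:num_masked])
    return [i in masked for i in range(len(patch_embs))]

# Toy example: 4 patches in a 2-D embedding space, caption points along x.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
mask = text_guided_mask(patches, [1.0, 0.0], mask_ratio=0.5)
```

In the toy example the two patches best aligned with the caption direction are the ones that get masked, while the orthogonal and opposite patches stay visible to the encoder.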


Qwen2.5-VL

Posted on 2025-07-08, Category: papers, Word count: 1.5k, Reading time ≈ 5 min

Qwen2.5-VL Technical Report^[1]

The authors are the Qwen Team at Alibaba. Citation [1]: Bai, Shuai et al. “Qwen2.5-VL Technical Report.” ArXiv abs/2502.13923 (2025): n. pag.

Time

2025.March

Key Words

dynamic resolution processing
window attention

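Summary

Qwen2.5-VL makes substantial progress in both foundational capabilities and new functionality. One notable feature is its ability to localize objects precisely with bounding boxes and points. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, which let it process images of varying sizes and very long videos. The model perceives spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution ViT from scratch and incorporating Window Attention, it substantially reduces computational overhead while preserving native resolution. As a result, Qwen2.5-VL not only excels at static image and document understanding but can also act as an interactive visual agent capable of reasoning, tool usage, and task execution. The model generalizes strongly across domains without task-specific fine-tuning. Qwen2.5-VL is available in three sizes to cover a range of use cases.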
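To see why window attention lowers the cost at native resolution, the sketch below counts query-key pairs for full attention versus attention restricted to non-overlapping windows. The patch size of 14 and window size of 112 pixels are assumptions made for illustration and are not stated in the summary above. The function name is hypothetical, and the count ignores details such as any layers that still use full attention.

```python
import math

def attention_pairs(height, width, patch=14, window=112):
    """Count tokens and query-key pairs for full vs. windowed attention.

    Illustrative only: `patch` and `window` are assumed values, and the
    image is tiled into non-overlapping windows with smaller edge windows
    where the size does not divide evenly.
    """
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    n = rows * cols
    full = n * n  # every token attends to every token
    side = window // patch  # patches per window side
    windowed = 0
    for r0 in range(0, rows, side):
        for c0 in range(0, cols, side):
            tokens = min(side, rows - r0) * min(side, cols - c0)
            windowed += tokens * tokens  # attention stays inside the window
    return n, full, windowed

tokens, full_cost, window_cost = attention_pairs(224, 224)
```

For a 224×224 input under these assumptions there are 256 tokens, and restricting attention to 8×8-patch windows cuts the pair count by a factor of four. The saving grows with image size, because the full cost is quadratic in the number of tokens while the windowed cost is linear.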


DiffiT

Posted on 2025-06-26, Category: Papers, Word count: 778, Reading time ≈ 3 min

DiffiT: Diffusion Vision Transformers for Image Generation^[1]

The authors are Ali Hatamizadeh et al. from NVIDIA. Citation [1]: Hatamizadeh, Ali et al. “DiffiT: Diffusion Vision Transformers for Image Generation.” European Conference on Computer Vision (2023).

Time

2024.Aug

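Key Words

Summary

Diffusion models, with their strong expressivity and high sample quality, have achieved SOTA results in generative tasks, and ViTs have shown strong modeling capabilities. In this paper, the authors study the effectiveness of ViTs in diffusion-based generative learning and propose a new model called Diffusion Vision Transformer (DiffiT). They propose a method for fine-grained control of the denoising process by introducing a Time-dependant Multihead Self Attention mechanism. DiffiT is effective at generating high-fidelity images. The authors also propose latent and image space DiffiT models, which achieve SOTA on multiple class-conditional and unconditional synthesis tasks at different resolutions.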
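One way time-dependence can enter self-attention is to let the queries, keys, and values depend linearly on both the spatial tokens and a time-step embedding. The formulation below is a sketch under that assumption. The symbols $x_s$ (spatial token), $x_t$ (time embedding), the projection matrices $W$, and the head dimension $d$ are notation introduced here and are not taken from the summary above.

```latex
\begin{aligned}
q_s &= x_s W_{qs} + x_t W_{qt},\\
k_s &= x_s W_{ks} + x_t W_{kt},\\
v_s &= x_s W_{vs} + x_t W_{vt},\\
\mathrm{Attention}(Q, K, V) &= \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V.
\end{aligned}
```

Because the time embedding shifts $q_s$, $k_s$, and $v_s$ directly, the attention pattern itself can change across denoising steps instead of only the features being rescaled.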


Qwen3

Posted on 2025-06-24, Updated on 2025-07-08, Category: Papers, Word count: 2.1k, Reading time ≈ 8 min

Qwen3 Technical Report^[1]

The authors are the Qwen Team. Citation [1]: Yang, An et al. “Qwen3 Technical Report.” (2025).

Time

2025.May

Key Words

thinking control

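Summary

Qwen3 is a series of LLMs spanning dense and MoE architectures, with parameter counts ranging from 0.6B to 235B. A key innovation in Qwen3 is the integration of a thinking mode (multi-step reasoning) and a non-thinking mode (rapid, context-driven responses) into a single framework. Qwen3 also introduces a thinking budget mechanism that lets users flexibly allocate computational resources at inference time, balancing latency and performance. In addition, by leveraging the knowledge of the flagship models, the computational resources required to build the smaller models are greatly reduced.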


ViC-MAE

Posted on 2025-06-22, Updated on 2025-06-24, Category: Papers, Word count: 1.6k, Reading time ≈ 6 min

ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders^[1]

The authors are Jefferson Hernandez et al. from Rice University and Google DeepMind. Citation [1]: Hernandez, Jefferson et al. “ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders.” (2023).

Time

2024.Oct

Key Words

MAE
contrastive learning
treat short videos as temporal augmentations

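Summary

The authors propose ViC-MAE, a model that combines MAE and contrastive learning. ViC-MAE is trained using a global representation obtained by pooling the local features learned under an MAE reconstruction loss, and a contrastive objective is applied to this representation across images and video frames. The authors show that the visual representations learned with ViC-MAE generalize well to both video and image classification tasks. Compared with the recently proposed OmniMAE, ViC-MAE achieves SOTA transfer learning performance.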
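The pooling-then-contrast step can be sketched as follows: local patch features are mean-pooled into one global vector per frame, and a contrastive loss pulls together the global vectors of two frames from the same video. This is a minimal illustration that assumes mean pooling and an InfoNCE-style loss with a temperature. The function names and the specific loss are assumptions and may not match the formulation used in the paper.

```python
import math

def mean_pool(patch_feats):
    """Average a list of patch feature vectors into one global vector."""
    n = len(patch_feats)
    return [sum(col) / n for col in zip(*patch_feats)]

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss where anchors[i] and positives[i] form the positive pair.

    All other positives in the batch act as negatives for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)

# Two videos, two frames each; each frame has two patch features.
frame_a = [mean_pool([[1.0, 0.0], [3.0, 0.0]]), mean_pool([[0.0, 2.0], [0.0, 4.0]])]
frame_b = [mean_pool([[2.0, 0.0], [2.0, 0.0]]), mean_pool([[0.0, 1.0], [0.0, 5.0]])]
aligned_loss = info_nce(frame_a, frame_b)
shuffled_loss = info_nce(frame_a, frame_b[::-1])
```

The loss is close to zero when each frame is paired with the other frame of its own video and becomes large when the pairing is shuffled, which is the signal that treats short videos as temporal augmentations.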


c-jepa

Posted on 2025-06-22, Category: Papers, Word count: 2.2k, Reading time ≈ 8 min

Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning^[1]

The authors are Shentong Mo and Shengbang Tong from CMU and NYU. Citation [1]: Mo, Shentong and Shengbang Tong. “Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning.” ArXiv abs/2410.19560 (2024): n. pag.

Time

2024.Oct

Key Words

entire collapsing and mean of patch representation

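Summary

In recent unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has used an innovative masking strategy to extract visual features from unlabeled imagery. Despite its success, two main limitations remain: the EMA used in I-JEPA cannot effectively prevent entire collapse of the model's feature representations, and its prediction falls short in accurately learning the mean of patch representations. This paper introduces a new framework, C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. The combination learns variance and covariance efficiently to prevent entire collapse and ensures invariance in the mean of augmented views, thereby overcoming these limitations.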
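The variance and covariance terms are what guard against collapse, and they can be sketched in a few lines. The code below follows the standard VICReg definitions (a hinge on the per-dimension standard deviation, and a penalty on off-diagonal covariance). The function names, `gamma`, and `eps` are illustrative choices, and how C-JEPA weights and combines these terms with the JEPA objective is not shown here.

```python
import math

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge loss pushing the std of every embedding dimension above gamma.

    z is a batch of embeddings (list of equal-length lists). A fully
    collapsed batch, where all rows are identical, gets the largest penalty.
    """
    n, d = len(z), len(z[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d

def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, divided by d.

    Penalizes redundant dimensions that carry the same information.
    """
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    total = 0.0
    for j in range(d):
        for k in range(d):
            if j != k:
                cov = sum((row[j] - means[j]) * (row[k] - means[k]) for row in z) / (n - 1)
                total += cov ** 2
    return total / d

collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
```

A collapsed batch is penalized by the variance term even though its covariance is zero, while a batch with spread-out, decorrelated dimensions incurs no covariance penalty. This is why both terms are used together.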


BERT

Posted on 2025-06-17, Word count: 0, Reading time ≈ 1 min

SmolVLA

Posted on 2025-06-10, Category: Papers, Word count: 1.6k, Reading time ≈ 6 min

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics^[1]

The authors are Mustafa Shukor et al. from Hugging Face, Sorbonne University, and other institutions. Citation [1]: Shukor, Mustafa et al. “SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics.” (2025).

Time

2025.Jun

Key Words

action expert with flow matching
SmolVLM-2
skip computations

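Summary

VLMs pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, which makes them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into VLA models that enable natural language-driven perception and control. However, existing VLAs are too large, typically with billions of parameters, which leads to high training costs and limited real-world deployment. They also rely on academic and industrial datasets and overlook data collected from affordable robotic platforms. In this work, the authors propose SmolVLA, a small, efficient, community-driven VLA that drastically reduces training and inference costs while retaining competitive performance. SmolVLA is trained on a single GPU and can be deployed on consumer-grade GPUs. To further improve responsiveness, the authors introduce an asynchronous inference stack that decouples perception and action prediction from action execution, enabling higher control rates through chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs 10 times its size.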


RT2

Posted on 2025-06-09, Category: Papers, Word count: 2.5k, Reading time ≈ 9 min

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control^[1]

The authors are Anthony Brohan et al. from DeepMind. Citation [1]: Brohan, Anthony et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” ArXiv abs/2307.15818 (2023): n. pag.

Time

2023.July

Key Words

map robot observations to actions
leverage the benefits of large-scale pretraining on language and vision-language data
represent actions as text tokens
co-fine-tuning

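Summary

The authors study how VLMs trained on Internet-scale data can be incorporated directly into end-to-end robotic control to improve generalization and enable semantic reasoning. Their goal is for a single end-to-end trained model to learn to map robot observations to actions while benefiting from large-scale pretraining on language and vision-language data from the web. To this end, they propose to co-fine-tune SOTA VLMs on both robotic trajectory data and Internet-scale vision-language tasks such as VQA. In contrast to other approaches, the proposed recipe is simple and general. To put natural language responses and robotic actions into the same format, the authors express actions as text tokens and include them in the model's training set in the same way as natural language tokens. They call this category of models VLA and instantiate one example of it, named RT-2. Their extensive evaluation shows that the approach yields performant robotic policies and lets RT-2 obtain emergent capabilities from Internet-scale training. These include generalizing to novel objects, interpreting commands not present in the training data, and performing rudimentary reasoning in response to user commands. The authors further show that incorporating chain-of-thought reasoning lets RT-2 perform multi-stage semantic reasoning, for example figuring out which object to pick up as an improvised hammer (a rock) or which drink is best suited for someone who is tired (an energy drink).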
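Representing actions as text tokens can be illustrated by uniformly discretizing each continuous action dimension into integer bins and writing the bin indices as a space-separated string. The sketch below assumes 256 bins and a normalized range of [-1, 1]. These values and the function names are illustrative assumptions and are not stated in the summary above, and the mapping from bin indices to actual vocabulary tokens is left out.

```python
def action_to_text(action, low=-1.0, high=1.0, bins=256):
    """Encode a continuous action vector as a string of integer bin indices.

    Each dimension is clipped to [low, high] and rounded to the nearest of
    `bins` evenly spaced levels, so the result can be treated like text.
    """
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)
        idx = int((clipped - low) / (high - low) * (bins - 1) + 0.5)
        tokens.append(str(idx))
    return " ".join(tokens)

def text_to_action(text, low=-1.0, high=1.0, bins=256):
    """Decode a string of bin indices back into continuous action values."""
    return [low + int(tok) / (bins - 1) * (high - low) for tok in text.split()]

encoded = action_to_text([-1.0, 0.0, 1.0])
decoded = text_to_action(action_to_text([0.3, -0.7]))
```

Decoding recovers each value to within one bin width, so the language model can emit actions with the same next-token machinery it uses for words, at the cost of a small quantization error.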


Octo

Posted on 2025-06-09, Updated on 2025-06-10, Category: Papers, Word count: 2.8k, Reading time ≈ 10 min

Octo: An Open-Source Generalist Robot Policy^[1]

The authors are Sudeep Dasari et al. from UCB, Stanford, CMU, and DeepMind. Citation [1]: Team, Octo Model et al. “Octo: An Open-Source Generalist Robot Policy.” ArXiv abs/2405.12213 (2024): n. pag.

Time

2024.May

Key Words

input tokenizer
transformer backbone
readout head: diffusion denoising
can be fine-tuned to new robot setups

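Summary

Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies can be fine-tuned with only a little in-domain data and still generalize well. However, to be applicable across a range of robotic learning scenarios, such policies need to handle multiple sensors and action spaces, accommodate commonly used robotic platforms, and fine-tune efficiently to new domains. In this paper, the authors aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. They propose Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images, and it can be fine-tuned efficiently to new sensory inputs and action spaces on standard consumer GPUs.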
