ViT
An Image is Worth \(16 \times 16\) Words: Transformers for Image Recognition at Scale[1]
There are many authors, all from Google Research, Brain Team: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. Citation [1]: Dosovitskiy, Alexey et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ArXiv abs/2010.11929 (2020): n. pag.
Time
- 2020.Oct
Key Words
- Vision Transformer
- Image patches (in Vision) \(\Leftrightarrow\) tokens (words) in NLP
- larger scale training
Summary
- The dominant approach with self-attention is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset. Thanks to Transformers' computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters. As models and datasets keep growing, there is still no sign of saturating performance.
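To make the "image patches as tokens" idea from the key words concrete, below is a minimal sketch of a ViT-style patch embedding, assuming PyTorch; the strided-Conv2d projection, the \(224 \times 224\) input size, and the parameter names are common illustrative choices, not the paper's exact code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 16x16 patches and project each to a D-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, D) -- one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend a [CLS] token, as in BERT
        return x + self.pos_embed              # add learned position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))  # (2, 197, 768), ready for a Transformer encoder
```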
AVA Dataset
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions[1]
The authors are Chunhui Gu, Chen Sun, David A. Ross, and others from Google Research, Inria Laboratoire Jean Kuntzmann (Grenoble, France), and UC Berkeley. Citation [1]: Gu, Chunhui et al. “AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017): 6047-6056.
Time
- 2017.May
Key Words
- atomic visual actions rather than composite actions
- precise spatio-temporal annotations with possibly multiple annotations for each person
- exhaustive annotation of these atomic actions over 15-minute video clips
- people temporally linked across consecutive segments
Summary
- The dataset is sourced from the 15th to 30th minute of 430 different movies; at a 1 Hz sampling frequency this gives nearly 900 keyframes per movie. In each keyframe, every person is labeled with (possibly multiple) actions from the AVA vocabulary. Each person is linked across consecutive keyframes to provide short temporal sequences of action labels.
KalmanFiltering
Kalman Filtering
- The Kalman filter is one of the most commonly used and most important state-estimation algorithms. It estimates hidden states from uncertain, imprecise measurements and can also predict future system states from past estimates. The filter is named after Rudolf E. Kalman, who in 1960 published his famous paper describing a recursive solution to the discrete-data linear filtering problem. Today it is widely used in target tracking, localization and navigation systems, control systems, and other fields.
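As a concrete illustration, here is a minimal sketch of a 1D Kalman filter. The constant-state model and the noise variances `q` and `r` are assumptions chosen for simplicity, not tied to any particular application.

```python
def kalman_1d(measurements, x0=0.0, p0=1.0, q=1e-4, r=0.1):
    """Estimate a scalar hidden state from noisy measurements.

    x0, p0 : initial state estimate and its variance
    q, r   : process-noise and measurement-noise variances
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: with a constant-state model, only the uncertainty grows.
        p = p + q
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

# Example: noisy readings of a true value around 1.0
print(kalman_1d([1.1, 0.9, 1.05, 0.98, 1.2]))
```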
Python Notes
Usage and meaning of some function parameters
*args and **kwargs are mainly used in function definitions to pass a variable number of arguments to a function. "Variable" here means that you do not know in advance how many arguments callers will pass, which is exactly when these two forms are used. *args passes a variable-length list of non-keyword (positional) arguments to a function.
**kwargs
- **kwargs allows you to pass a variable number of keyword (key-value) arguments to a function; use **kwargs when the function needs to handle named arguments.
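A minimal example (the function name `greet_me` is just illustrative):

```python
def greet_me(**kwargs):
    # kwargs is a dict of all keyword arguments passed in.
    for key, value in kwargs.items():
        print(f"{key} = {value}")

greet_me(name="yasoob", city="Karachi")
# name = yasoob
# city = Karachi
```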
*args
- *args passes a variable-length list of non-keyword (positional) arguments to a function.
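A minimal example (`test_var_args` is an illustrative name):

```python
def test_var_args(first, *args):
    # args is a tuple holding all extra positional arguments.
    print("first normal arg:", first)
    for arg in args:
        print("another arg through *args:", arg)

test_var_args("yasoob", "python", "eggs", "test")
```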
Order of standard arguments, *args, and **kwargs:
some_func(fargs, *args, **kwargs)
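A small sketch showing this ordering in practice (`some_func` here is a toy definition, not a real API):

```python
def some_func(fargs, *args, **kwargs):
    print("formal arg:", fargs)
    print("extra positional args:", args)
    print("keyword args:", kwargs)

some_func(1, 2, 3, a="x", b="y")
# formal arg: 1
# extra positional args: (2, 3)
# keyword args: {'a': 'x', 'b': 'y'}
```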
HieraViT
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles[1]
The authors are Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer, from Meta, Georgia Tech, and Johns Hopkins. Citation [1]: Ryali, Chaitanya K. et al. “Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles.” ArXiv abs/2306.00989 (2023): n. pag.
Time
- 2023.Jun
Key Words
- visual pretext task: MAE
- hierarchical (multiscale) vision transformer
- Mask unit attention vs Window attention
- add spatial bias by teaching the model with a strong pretext task like MAE, instead of relying on vision-specific modules such as shifted windows or convolutions.
- One-sentence summary: pre-train MViTv2 (the encoder) with MAE rather than a vanilla ViT, remove some of MViTv2's design components, and use mask unit attention; this yields strong results (a minimal sketch of mask-unit-style local attention is given after the motivation below).
Motivation
- Many recent hierarchical ViTs add vision-specific components in pursuit of supervised classification performance. While these components deliver good accuracy and attractive FLOP counts, the added complexity makes these transformers slower than their vanilla ViT counterparts. The authors argue that this extra bulk is unnecessary: by pre-training with a strong visual pretext task (MAE), the bells and whistles can be removed without losing accuracy. Hence they propose Hiera.
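The sketch below illustrates the "mask unit attention" idea from the key words: self-attention restricted to fixed groups of tokens (mask units). The shapes, the group size `unit`, and the single-head formulation are simplifications for illustration, not Hiera's actual implementation.

```python
import torch
import torch.nn as nn

def mask_unit_attention(x, qkv, unit=64):
    """Single-head self-attention computed independently within each group ("mask unit") of tokens.

    x   : (B, N, D) token embeddings, with N divisible by `unit`
    qkv : a linear layer mapping D -> 3*D
    """
    B, N, D = x.shape
    x = x.view(B, N // unit, unit, D)            # group tokens into mask units
    q, k, v = qkv(x).chunk(3, dim=-1)            # each (B, N/unit, unit, D)
    attn = (q @ k.transpose(-2, -1)) / D ** 0.5  # attention is confined to each unit
    out = attn.softmax(dim=-1) @ v
    return out.view(B, N, D)

x = torch.randn(2, 256, 96)
qkv = nn.Linear(96, 3 * 96)
print(mask_unit_attention(x, qkv).shape)         # torch.Size([2, 256, 96])
```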
MVD
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning[1]
The authors are Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang, from Fudan University and the Microsoft Cloud + AI team. Citation [1]: Wang, Rui et al. “Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning.” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 6312-6322.
Time
- 2022.Dec
Key Words
- Masked Video Modeling/Masked Image Modeling
- High-level features from video teacher and image teacher for continued masked feature prediction
- spatial-temporal co-teaching
- Put simply: image and video models pre-trained with MIM/MVM serve as teachers whose features are used as masked feature prediction targets for the student, enabling video representation learning.
Motivation
- For self-supervised visual representation learning, recent MIM methods such as MAE, BEiT, and PeCo achieve strong performance with vision transformers. This pre-training paradigm has been carried over to the video domain and brings significant improvements to video transformers; representative MVM (masked video modeling) works include BEVT, VideoMAE, and ST-MAE. Following MAE and BEiT, existing masked video modeling methods pre-train video transformers by reconstructing low-level features, such as raw pixel values or low-level VQ-VAE tokens. However, low-level reconstruction targets are usually noisy, and because video data is highly redundant, MVM easily learns shortcuts, which limits transfer performance on downstream tasks. To mitigate this, MVM methods typically use larger masking ratios.
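A minimal sketch of the masked-feature-distillation objective described above, assuming PyTorch. The linear "student" and "teacher" are stand-ins for the real encoder/decoder and the frozen MIM/MVM-pretrained teachers; MVD's actual architecture, loss weighting, and spatial-temporal co-teaching with both an image and a video teacher are more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_feature_distillation_loss(student, teacher, tokens, mask):
    """Student predicts the frozen teacher's high-level features at masked token positions.

    tokens : (B, N, D) patchified video tokens
    mask   : (B, N) boolean, True where a token is masked for the student
    """
    with torch.no_grad():
        target = teacher(tokens)                    # teacher features, (B, N, D_t)
    # Toy student: sees zeroed-out masked tokens and predicts teacher features everywhere.
    pred = student(tokens * (~mask).unsqueeze(-1))
    # Supervise only the masked positions, as in masked feature modeling.
    return F.mse_loss(pred[mask], target[mask])

B, N, D, D_t = 2, 196, 768, 384
student = nn.Linear(D, D_t)                         # stand-in for the student encoder + decoder
teacher = nn.Linear(D, D_t)                         # stand-in for a frozen MIM/MVM-pretrained teacher
tokens = torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.9                       # high masking ratio, as noted above
print(masked_feature_distillation_loss(student, teacher, tokens, mask))
```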
DEiT
DDAE
Denoising Diffusion Autoencoders are Unified Self-supervised Learners[1]
The authors are Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang from Beihang University. Citation [1]: Xiang, Weilai et al. “Denoising Diffusion Autoencoders are Unified Self-supervised Learners.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 15756-15766.
Time
- 2023.Mar
Key Words
- generative (translation,...) and discriminative (classification, recognition) tasks
- generative pre-training and denoising autoencoding
- DDAE as generative models and competitive recognition models
- extend generative models for discriminative purposes
- linearly-separable features obtained in an unsupervised manner (see the linear-probe sketch below)
- latent space vs. pixel space
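As a rough illustration of how "linearly-separable features" are usually measured, here is a minimal linear-probing sketch. The random "features", the optimizer settings, and the single-layer classifier are generic assumptions for illustration, not the paper's exact evaluation protocol; in DDAE the features would be intermediate activations of the frozen denoising network.

```python
import torch
import torch.nn as nn

def linear_probe(features, labels, num_classes, epochs=10, lr=1e-2):
    """Fit a single linear layer on frozen features; accuracy reflects linear separability.

    features : (N, D) activations from an intermediate layer of a frozen model
    labels   : (N,) integer class labels
    """
    probe = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(features), labels)
        loss.backward()
        opt.step()
    acc = (probe(features).argmax(dim=1) == labels).float().mean()
    return acc.item()

# Toy usage with random "features" and labels.
feats, labels = torch.randn(512, 256), torch.randint(0, 10, (512,))
print(linear_probe(feats, labels, num_classes=10))
```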