Young's Blog

LW-DETR

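Posted on 2024-09-16, updated on 2024-10-23, category: Papers, word count: 1.6k, reading time ≈ 6 min

LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection^[1]

The authors are Qiang Chen, Xiangbo Su, Xinyu Zhang et al., from Baidu, the University of Adelaide, Beihang University, the Institute of Automation (CAS), and the Australian National University. Paper citation [1]: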

Key Words

Real-Time Detection With Transformer
interleaved window and global attention
window-major order feature map organization

Time

2024.Jun

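Summary

The authors propose a lightweight transformer, LW-DETR, that outperforms YOLOs on real-time detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. The method draws on recent techniques: training-effective techniques such as an improved loss and pretraining, and interleaved window and global attention to reduce the complexity of the ViT encoder. The ViT encoder is improved by aggregating multi-level feature maps, that is, its intermediate and final feature maps, to form richer feature maps. A window-major feature map organization is introduced to make the interleaved attention computation more efficient. The results show that the proposed method outperforms existing detectors, including YOLO and its variants.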
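To make the window-major idea a bit more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a feature map stored as a flat row-major token list and a square window size that divides both height and width. The function names `window_major_order` and `split_windows` are made up for illustration. The point is only that, once tokens are stored window by window, each attention window is a contiguous slice of the sequence.

```python
def window_major_order(height, width, window):
    """Return row-major token indices rearranged so that the tokens of
    each window are contiguous (window-major order).

    Assumption for this sketch: `window` divides both `height` and `width`.
    """
    if height % window or width % window:
        raise ValueError("window must divide both height and width")
    order = []
    for win_row in range(0, height, window):      # top row of each window
        for win_col in range(0, width, window):   # left column of each window
            for r in range(win_row, win_row + window):
                for c in range(win_col, win_col + window):
                    order.append(r * width + c)   # row-major index of (r, c)
    return order


def split_windows(tokens, window):
    """Cut a window-major token sequence into per-window contiguous slices."""
    size = window * window
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# A 4x4 feature map with 2x2 windows.
order = window_major_order(4, 4, 2)
windows = split_windows(order, 2)
print(order)
print(windows)
```

In row-major storage the four tokens of the top-left window sit at indices 0, 1, 4, 5, so they have to be gathered before window attention. In window-major storage they are simply the first four tokens.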


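InternLM Learning

Posted on 2024-09-07, updated on 2024-09-20, category: Learning, word count: 24, reading time ≈ 1 min

Notes on what I learned while taking part in the InternLM program.

The InternLM tutorial repository is at https://github.com/InternLM/Tutorial.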

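Docker Setup and Usage

Posted on 2024-09-04, category: Tools, word count: 95, reading time ≈ 1 min

Docker setup and usage

Installing Docker differs slightly between Windows and Linux, but neither is complicated. The main step is setting up a registry mirror, although I am not sure whether mirrors still work. If you have unrestricted internet access, everything is much easier.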
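As a rough illustration of the registry mirror setting, the sketch below builds the JSON that the Docker daemon reads from its `daemon.json` file, where mirrors are listed under the `registry-mirrors` key. The mirror URL is a placeholder, not a real or recommended mirror, and the helper name `build_daemon_config` is made up for this example. The script only prints the JSON and does not touch any Docker configuration.

```python
import json


def build_daemon_config(mirrors):
    """Build the daemon.json content that lists registry mirrors.

    `mirrors` is a list of mirror URLs; the one used below is a placeholder.
    """
    return {"registry-mirrors": list(mirrors)}


# Placeholder URL: replace it with a mirror you have verified yourself.
config_text = json.dumps(build_daemon_config(["https://mirror.example.com"]), indent=2)
print(config_text)
```

On Linux this file usually lives at `/etc/docker/daemon.json` and the daemon has to be restarted after editing it. With Docker Desktop on Windows the same JSON is edited in the Docker Engine settings page.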


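SLAM Study Notes

Posted on 2024-09-03, category: Learning, word count: 351, reading time ≈ 1 min

Introduction to SLAM

SLAM stands for Simultaneous Localization and Mapping. It refers to an agent carrying specific sensors that, with no prior information about the environment, builds a model of the environment while moving and at the same time estimates its own motion. If the sensor is mainly a camera, it is called "visual SLAM".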

Visual SLAM

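The classic visual SLAM framework consists of the following steps:
- Sensor data acquisition: in visual SLAM this is mainly reading and preprocessing camera images; on a robot it may also include reading and synchronizing data from wheel encoders, an IMU, and other sensors.
- Front-end visual odometry (Visual Odometry, VO): VO estimates the camera motion between adjacent images and the shape of the local map. VO is also called the front end.
- Back-end (nonlinear) optimization: the back end takes the camera poses measured by VO at different times, together with loop closure information, and optimizes them to obtain a globally consistent trajectory and map. Because it follows VO, it is called the back end.
- Loop closure detection: determines whether the robot has returned to a previously visited place. If a loop is detected, the information is passed to the back end.
- Mapping: builds a map that matches the task requirements, based on the estimated trajectory.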
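The way these stages hand data to each other can be sketched with a toy 2D example. This is only an illustration under strong assumptions. Poses are plain (x, y) points, the "front end" just sums given relative motions, loop closure is detected from hypothetical place labels rather than from images, and the "back end" spreads the accumulated drift linearly along the trajectory instead of running a real nonlinear optimization. All function names are made up for this sketch.

```python
def integrate_odometry(start, motions):
    """Front-end stand-in: chain relative motions (dx, dy) into a trajectory."""
    trajectory = [start]
    for dx, dy in motions:
        x, y = trajectory[-1]
        trajectory.append((x + dx, y + dy))
    return trajectory


def detect_loop(place_ids):
    """Loop-closure stand-in: return (i, j) for the first revisited place, else None."""
    first_seen = {}
    for j, place in enumerate(place_ids):
        if place in first_seen:
            return first_seen[place], j
        first_seen[place] = j
    return None


def close_loop(trajectory, i, j):
    """Back-end stand-in: frames i and j are the same place, so pose j should
    equal pose i. Spread the observed drift linearly over frames i..j and
    apply the full correction to every frame after j."""
    err_x = trajectory[j][0] - trajectory[i][0]
    err_y = trajectory[j][1] - trajectory[i][1]
    corrected = list(trajectory)
    for k in range(i, len(trajectory)):
        weight = min(1.0, (k - i) / (j - i))
        corrected[k] = (trajectory[k][0] - weight * err_x,
                        trajectory[k][1] - weight * err_y)
    return corrected


# Walk around a unit square; the third motion carries 0.1 of drift in y.
motions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.1), (0.0, -1.0)]
places = ["A", "B", "C", "D", "A"]            # hypothetical place labels per frame

raw = integrate_odometry((0.0, 0.0), motions)
loop = detect_loop(places)
fixed = close_loop(raw, *loop) if loop else raw
print(raw[-1], fixed[-1])
```

Before the correction the last pose does not coincide with the start even though the agent is back at place "A". After the correction the two ends of the loop agree, which is the "globally consistent trajectory" the back end is after.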

WOO

Posted on 2024-08-27, updated on 2024-08-29, category: Papers, word count: 3.4k, reading time ≈ 13 min

Watch Only Once: An End-to-end Video Action Detection Framework^[1]

The authors are Shoufa Chen, Peize Sun, Enze Xie et al., from Ping Luo's group at the University of Hong Kong. Paper citation [1]: Chen, Shoufa et al. “Watch Only Once: An End-to-End Video Action Detection Framework.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 8158-8167.

Time

2021.Oct

Key Words

end-to-end unified network
task-specific features

Summary

The paper proposes an end-to-end pipeline for video action detection. Current methods either decouple the task into two separate stages, actor localization and action classification, or train two separate models within one stage. In contrast, the authors handle actor localization and action classification in a single network. Unifying the backbone network and removing many hand-crafted components simplifies the whole pipeline. WOO uses a unified video backbone to extract features for both actor localization and action classification. It also introduces spatial-temporal action embeddings and a spatial-temporal fusion module to obtain more informative, discriminative features, which improves action classification performance.


YOWOv3

Posted on 2024-08-26, updated on 2024-10-25, category: Papers, word count: 2.1k, reading time ≈ 8 min

YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition^[1]

The authors are Nguyen Dang Duc Manh, Duong Viet Hang et al. Paper citation [1]: Dang, Duc M et al. “YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition.” (2024).

Time

2024.Aug

Key Words

one-stage detector
different configurations to customize different model components
efficient while reducing computational resource requirements

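Summary

YOWOv3 is an enhanced version of YOWOv2. It offers more approaches and uses different configurations to customize different models, and it performs better than YOWOv2.
STAD (spatio-temporal action detection) is a common task in computer vision. It involves detecting the location (bbox), timing (exact frame), and type (class of action), so both temporal and spatial features have to be modeled. Many methods address STAD, for example ViT-based ones, which work well but are computationally heavy: the Hiera model has over 600M parameters and VideoMAEv2 has over 1B, which raises training cost and resource consumption. To address STAD while keeping training and inference cost as low as possible, the YOWO method was proposed. It can run in real time but has a limitation: it is not an efficient model with low computational requirements. The framework's authors have stopped maintaining it, and many issues remain. The contributions of this paper are:
- new lightweight framework for STAD
- efficient model
- multiple pretrained resources for application: creating a range of pretrained resources spanning from lightweight to sophisticated models to cater to diverse requirements for real-world applications.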


YOWOv2

Posted on 2024-08-26, updated on 2024-09-01, category: Papers, word count: 2.8k, reading time ≈ 10 min

YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time STAD^[1]

The authors are Jianhua Yang and Kun Dai from Harbin Institute of Technology. Paper citation [1]: Yang, Jianhua and Kun Dai. “YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection.” ArXiv abs/2302.06848 (2023): n. pag.

Time

2023.Feb

Key Words

combined 2D CNNs of different sizes with a 3D CNN
anchor-free mechanism
dynamic label assignment
multi-level detection structure

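Summary

YOWOv2 exploits the strengths of both a 3D backbone and a 2D backbone for accurate action detection. It uses a multi-level detection pipeline to detect action instances at different scales. To this end, a simple and efficient 2D backbone with FPN is built to extract classification features and regression features at different levels. For the 3D backbone, an existing 3D CNN is adopted. By combining the 3D CNN with 2D CNNs of different sizes, the authors design the YOWOv2 family: YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large. A dynamic label assignment strategy and an anchor-free mechanism are also introduced to bring YOWOv2 in line with advanced model architectures. YOWOv2 is much better than YOWO while still running in real time.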


TAAD

Posted on 2024-08-24, updated on 2024-08-30, category: Papers, word count: 2k, reading time ≈ 7 min

Spatio-Temporal Action Detection Under Large Motion^[1]

The authors are Gurkirt Singh, Vasileios Choutas, Suman Saha, Fisher Yu, and Luc Van Gool from ETH Zurich. Paper citation [1]: Singh, Gurkirt et al. “Spatio-Temporal Action Detection Under Large Motion.” 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2022): 5998-6007.

Time

2022.Oct

Key Words

track information for feature aggregation rather than tube from proposals
3 motion categories: large motion, medium motion, small motion

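Summary

Current tube-detection methods for STAD often extend a bbox proposal on a given keyframe into a 3D temporal cuboid and then pool features from neighboring frames. If the actor's position or shape exhibits large 2D motion and variability through frames, such pooling fails to accumulate meaningful spatio-temporal features. In this work, the authors study cuboid-aware feature aggregation in action detection under large action. They further propose to enhance the actor feature representation under large motion by tracking actors and performing temporal feature aggregation along the respective tracks, and they define actor motion through the IoU between an actor's boxes at different fixed time scales. Actions with large motion lead to lower IoU over time, while slower actions maintain higher IoU. The authors find that track-aware feature aggregation consistently yields large improvements in action detection.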
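The IoU-over-time idea can be sketched in a few lines of Python. This is only a sketch of the general principle described above, not the paper's exact protocol. Boxes are assumed to be (x1, y1, x2, y2) tuples, the per-track score is a plain mean over box pairs `step` frames apart, and the thresholds 0.3 and 0.7 used to split tracks into three categories are arbitrary placeholders chosen for this example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def motion_iou(track, step):
    """Mean IoU between boxes of the same actor that are `step` frames apart."""
    pairs = [(track[t], track[t + step]) for t in range(len(track) - step)]
    if not pairs:
        raise ValueError("track is too short for this step")
    return sum(box_iou(a, b) for a, b in pairs) / len(pairs)


def motion_category(iou, large_below=0.3, small_above=0.7):
    """Map a mean IoU to a motion label; both thresholds are placeholders."""
    if iou < large_below:
        return "large"
    if iou > small_above:
        return "small"
    return "medium"


static_track = [(0, 0, 10, 10)] * 3                       # actor stays in place
fast_track = [(0, 0, 10, 10), (8, 0, 18, 10), (16, 0, 26, 10)]  # shifts 8 px per frame

for name, track in [("static", static_track), ("fast", fast_track)]:
    iou = motion_iou(track, step=1)
    print(name, round(iou, 3), motion_category(iou))
```

A cuboid built from the first box of `fast_track` would cover almost none of the actor two frames later, which is why pooling along the track rather than inside a fixed cuboid helps in the large-motion case.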


TubeR

Posted on 2024-08-24, updated on 2024-11-08, category: Papers, word count: 3.7k, reading time ≈ 13 min

TubeR: Tubelet Transformer for Video Action Detection^[1]

The authors are Jiaojiao Zhao, Yanyi Zhang et al., from the University of Amsterdam, Rutgers University, and AWS AI Labs. Paper citation [1]: Zhao, Jiaojiao et al. “TubeR: Tubelet Transformer for Video Action Detection.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 13588-13597.

Time

2021.April

Key Words

learns a set of tubelet queries to pull action-specific tubelet-level features from a spatio-temporal video representation
spatial and temporal tubelet attention allows tubelets to be unrestricted in spatial location and scale over time
context-aware classification head: along with the tubelet feature, it takes the full clip feature, from which the classification head can draw contextual information
end-to-end without person detectors, anchors or proposals.

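Summary

Unlike existing methods that rely on offline detectors or hand-designed actor-positional hypotheses such as proposals or anchors, TubeR directly detects action tubelets in a video by performing action localization and recognition simultaneously from a single representation. TubeR learns a set of tubelet queries and uses a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip. Compared with using actor-positional hypotheses in the spatio-temporal space, this effectively strengthens the model's capacity. For videos containing transitional states or scene changes, the authors propose a context-aware classification head that uses short-term and long-term context to strengthen action classification, and an action switch regression head to detect the precise temporal extent of an action. TubeR directly produces action tubelets of variable length and maintains good results even on long video clips.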


EVAD

Posted on 2024-08-24, updated on 2024-08-27, category: Papers, word count: 3k, reading time ≈ 11 min

Efficient Video Action Detection with Token Dropout and Context Refinement^[1]

The authors are Lei Chen, Zhan Tong, Yibing Song et al., from Nanjing University, Ant Group, Fudan University, and Shanghai AI Lab. Paper citation [1]: Chen, Lei et al. “Efficient Video Action Detection with Token Dropout and Context Refinement.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 10354-10365.

Time

2023.Aug

Key Words

spatiotemporal token dropout
maintain all tokens in keyframe representing scene context
select tokens from other frames representing actor motions
drop out irrelevant tokens.

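Summary

Streaming video clips with large-scale video tokens impede ViTs for efficient recognition, especially in video action detection, where extensive spatiotemporal representations are needed for precise actor identification. This work proposes an end-to-end framework for efficient video action detection (EVAD) based on vanilla ViTs. EVAD has two designs specific to video action detection. First, it proposes spatiotemporal token dropout from a keyframe-centric perspective: in a video clip, all tokens from the keyframe are maintained, and tokens relevant to actor motions are preserved from the other frames. Second, it refines scene context using the remaining tokens to better recognize actor identities. The RoI in the action detector is extended into the temporal domain, and the resulting spatiotemporal actor identity representations are refined via scene context in a decoder with the attention mechanism. These two designs make EVAD efficient while maintaining accuracy.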
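The keyframe-centric selection rule can be illustrated with a small sketch. It only shows the keep/drop logic described above. How token importance is actually computed is not covered in this excerpt, so the per-token scores here are hypothetical inputs, and the function name `keyframe_centric_dropout` and the `keep_ratio` parameter are made up for the example.

```python
def keyframe_centric_dropout(scores, keyframe, keep_ratio):
    """Select which tokens to keep in a clip.

    scores[f][i] is a hypothetical importance score of token i in frame f.
    All keyframe tokens are kept; from the other frames only the
    highest-scoring fraction `keep_ratio` of tokens is kept.
    Returns a sorted list of (frame, token) index pairs.
    """
    # Every token of the keyframe survives, regardless of its score.
    kept = [(keyframe, i) for i in range(len(scores[keyframe]))]

    # Tokens from all non-keyframes compete for a shared budget.
    candidates = [
        (score, f, i)
        for f, frame_scores in enumerate(scores)
        if f != keyframe
        for i, score in enumerate(frame_scores)
    ]
    budget = int(len(candidates) * keep_ratio)
    candidates.sort(key=lambda item: item[0], reverse=True)
    kept += [(f, i) for _, f, i in candidates[:budget]]
    return sorted(kept)


# Three frames with two tokens each; frame 1 is the keyframe.
scores = [
    [0.9, 0.1],   # frame 0
    [0.5, 0.5],   # frame 1 (keyframe)
    [0.2, 0.8],   # frame 2
]
kept = keyframe_centric_dropout(scores, keyframe=1, keep_ratio=0.5)
print(kept)
```

With half of the non-keyframe tokens dropped, the encoder in this toy case processes four tokens instead of six, while the keyframe, which carries the scene context, stays complete.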

