BMViT
Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-Stage Action Localization [1]
The authors are Ioanna Ntinou, Enrique Sanchez, and Georgios Tzimiropoulos, from Queen Mary University of London, the Samsung AI Center Cambridge, and other institutions. Reference [1]: Ntinou, Ioanna, Enrique Sanchez, and Georgios Tzimiropoulos. "Multiscale vision transformers meet bipartite matching for efficient single-stage action localization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Time
- 2024.May
Key Words
- bipartite matching loss
- Video Transformer trained with a bipartite matching loss, without learnable queries or a decoder
Summary
- Action localization is a challenging problem that combines detection and recognition, two tasks that are usually handled separately. SOTA methods rely on off-the-shelf bounding-box detections and then apply a transformer model that focuses on the classification task. Such two-stage approaches are ill-suited for real-time deployment. Single-stage methods, in contrast, handle both tasks by sharing most of the computation, trading accuracy for speed, and DETR-like architectures are notoriously hard to train. This paper observes that a straightforward bipartite matching loss can be applied directly to the output tokens of a ViT, yielding a backbone + MLP architecture that handles both tasks simultaneously without an additional encoder-decoder head or learnable queries (a minimal sketch of this matching loss is given below). A single MViTv2-S architecture trained with bipartite matching to perform both tasks outperforms an MViTv2-S trained with RoI align on pre-computed bounding boxes. With the designed token pooling and the proposed training pipeline, the resulting Bipartite-Matching Vision Transformer (BMViT) achieves strong results.