Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection[1]

The authors are Shilong Liu, Zhaoyang Zeng, Tianhe Ren, et al. from Tsinghua University, HKUST, CUHK (Shenzhen), IDEA Research, and MSRR. Paper citation [1]:

Time

  • 2023.Mar

Key Words

  • extends DINO by performing vision-language modality fusion at multiple phases: a feature enhancer, language-guided query selection, and a cross-modality decoder.
  • extends the evaluation of open-set object detection to referring expression comprehension (REC) datasets.

Summary

  1. In this paper, the authors propose an open-set object detector, called Grounding DINO, that combines the Transformer-based detector DINO with grounded pre-training so that it can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set detection is introducing language into a closed-set detector for open-set concept generalization. To fuse the language and vision modalities effectively, the authors divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, language-guided query selection, and a cross-modality decoder for cross-modality fusion. Whereas previous work mainly evaluates open-set object detection on novel categories, the authors also propose evaluating on referring expression comprehension for objects specified with attributes. Grounding DINO performs well in all three settings, including COCO, LVIS, and ODinW.
Read more »

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection[1]

The authors are Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum from HKUST (Guangzhou), HKUST, Tsinghua University, and IDEA Research. Paper citation [1]: Zhang, Hao et al. “DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.” ArXiv abs/2203.03605 (2022): n. pag.

Time

  • 2022.Mar

Key Words

  • DETR
  • DeNoising Anchor
  • mixed query

Motivation

  1. DETR converges slowly during training, and the meaning of its queries is unclear. Previous DETR-like models performed worse than classic detectors.
  2. The scalability of DETR-like models has not been studied; there are no reported results on how DETR-like models perform when scaled up to a large backbone and a large-scale dataset.

Summary

  1. DINO improves the performance and efficiency of DETR-like models through a contrastive way of denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction.
Read more »

Video Swin Transformer[1]

The authors are Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu from MSRA, USTC, HUST, and THU. Paper citation [1]: Liu, Ze et al. “Video Swin Transformer.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 3192-3201.

Time

  • 2021.Jun

Summary

  1. Video models built on Transformers can connect patches globally across the spatial and temporal dimensions. In this paper, the authors instead advocate an inductive bias of locality in video Transformers, which achieves a better speed-accuracy trade-off; this locality is realized by adapting the Swin Transformer. Previous convolutional video backbones were adapted from image backbones by simply extending them along the temporal axis: for example, 3D convolution is a direct extension of 2D convolution for spatial and temporal modeling at the operator level, as the sketch below illustrates.
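As a minimal sketch of that operator-level extension (the tensor and kernel sizes below are illustrative assumptions, not values from the paper), a 3D convolution processes a clip shaped (B, C, T, H, W) exactly the way a 2D convolution processes a frame shaped (B, C, H, W), with one extra temporal axis:

```python
import torch
import torch.nn as nn

# 2D convolution on a single frame: (B, C, H, W)
frame = torch.randn(1, 3, 224, 224)
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(conv2d(frame).shape)   # torch.Size([1, 64, 224, 224])

# 3D convolution on a 16-frame clip: (B, C, T, H, W) -- the kernel simply
# gains a temporal dimension, extending the 2D operator to space-time.
clip = torch.randn(1, 3, 16, 224, 224)
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
print(conv3d(clip).shape)    # torch.Size([1, 64, 16, 224, 224])
```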
Read more »

ViViT: A Video Vision Transformer[1]

The authors are Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid from Google Research. Paper citation [1]: Arnab, Anurag et al. “ViViT: A Video Vision Transformer.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 6816-6826.

Time

  • 2021.Jun

Key Words

  • spatio-temporal tokens
  • transformer
  • regularising the model and factorising it along spatial and temporal dimensions to increase efficiency and scalability
Read more »

A summary of some common tips and related knowledge points from working with torch.

  1. dim in torch:
Dimension
  2. The Softmax and Sigmoid functions:

https://zhuanlan.zhihu.com/p/525276061

https://www.cnblogs.com/cy0628/p/13921725.html

  3. TorchScript is an intermediate representation of a PyTorch model. PyTorch provides a set of JIT (Just-In-Time) tools that let users convert a model into the TorchScript format; a saved TorchScript model can run in high-performance environments such as C++. TorchScript is a way to create serializable and optimizable models from PyTorch code: any TorchScript program can be saved from a Python process and loaded in an environment that has no Python interpreter. TorchScript can turn a dynamic graph into a static graph and is usually used together with torch.jit. There are two ways to produce it:
    • torch.jit.trace: pass in the model and an example input; the model is called once and the operations performed during that run are recorded. With decision branches such as if-else, torch.jit.trace only records the path taken by the example input, so the control flow is erased. The generated TorchScript model can be used directly for inference without a Python interpreter, but it only supports the forward pass, so it cannot be used for training or back-propagation. Moreover, because the model is traced with a concrete input, it may not handle some edge cases or unusual inputs.
    • torch.jit.script: use this when the model contains branches. The forward method is compiled by default, and the methods it calls are compiled in the order they are invoked. In contrast to tracing, torch.jit.script can convert the whole training loop (including the forward and backward passes) into a TorchScript model, so it can be used directly for training and validation. It handles a wider range of models and computation graphs, copes better with corner cases, and supports custom classes and functions, which makes it more flexible and powerful. A minimal sketch comparing the two appears after this list.
    • If you want a method not to be compiled, decorate it with @torch.jit.ignore or @torch.jit.unused.
    • The workflow for deploying a PyTorch model on a C++ platform is roughly: convert the model, save the serialized model, load the serialized PyTorch model in C++, and execute the script module.
    • Related links:
      • https://mp.weixin.qq.com/s/7JjRGgg1mKlIRuSyPC9tmg
      • https://blog.csdn.net/hxxjxw/article/details/120835884
      • A Chinese translation of the PyTorch documentation: https://pytorch.ac.cn/docs/stable/index.html
      • https://developer.baidu.com/article/detail.html?id=2995518
      • https://pytorch.panchuang.net/EigthSection/torchScript/
  4. Some links on commonly used attention modules:
    • https://www.cnblogs.com/wxkang/p/17133460.html, various attention mechanisms
    • https://www.cnblogs.com/Fish0403/p/17221430.html, SE and CBAM
    • https://cloud.tencent.com/developer/article/1776357, a survey of Vision Transformers
  5. A handy visualization tool: torchinfo:
    • pip install torchinfo; it can show a model's inputs and outputs, tensor shapes, parameter counts, and other metrics, which makes it easier to understand the network.
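To make the trace/script difference concrete, here is a minimal sketch; the TinyNet module, its sizes, and the file names are assumptions for illustration, built on the standard torch.jit API:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical module with an if-else branch in forward."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        y = self.fc(x)
        if y.sum() > 0:   # data-dependent control flow
            return y
        return -y

model = TinyNet().eval()
example = torch.randn(2, 8)

# trace: runs the model once and records only the path taken by `example`,
# so the if-else is baked in (a TracerWarning is emitted for the branch).
traced = torch.jit.trace(model, example)

# script: compiles forward including the branch, so control flow is preserved.
scripted = torch.jit.script(model)

# Both can be saved and later loaded without a Python interpreter,
# e.g. from LibTorch in C++ via torch::jit::load.
traced.save("tiny_traced.pt")
scripted.save("tiny_scripted.pt")

print(traced(example).shape, scripted(example).shape)
```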

The RCNN[1], Fast RCNN[2], Faster RCNN[3], and Mask RCNN[4] Series

  1. This is the series of two-stage object detection methods, from RCNN to Faster RCNN. The authors of RCNN are Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik from UC Berkeley; the author of Fast RCNN is Ross Girshick from Microsoft Research; the authors of Faster RCNN are Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun from Microsoft Research; the authors of Mask RCNN are Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick from Facebook AI Research. Paper citation [1]: Girshick, Ross B. et al. “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.” 2014 IEEE Conference on Computer Vision and Pattern Recognition (2013): 580-587. [2]: Girshick, Ross B. “Fast R-CNN.” (2015). [3]: [4]:

Related Resources

This is a collection of tutorials and learning materials for the algorithms in OpenMMLab; the introductions are quite good.

openmmlab Book

This is the material from Baidu PaddlePaddle:

PaddlePaddle Edu

Time

  • RCNN: 2013.Nov
  • Fast RCNN: 2015.Apr
  • Faster RCNN: 2015.Jun
Read more »

An Image is Worth \(16 \times 16\) Words: Transformers For Image Recognition at Scale[1]

There are many authors, all from Google Research, Brain Team: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. Paper citation [1]: Dosovitskiy, Alexey et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ArXiv abs/2010.11929 (2020): n. pag.

Time

  • 2020.Oct

Key Words

  • Vision Transformer
  • Image patches (in Vision) \(\Leftrightarrow\) tokens (words) in NLP
  • larger scale training

Summary

  1. The dominant approach with self-attention models is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset. Thanks to Transformers' computational efficiency and scalability, it has become possible to train models of unprecedented size with over 100B parameters. As the models and datasets keep growing, there is still no sign of saturating performance.
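As a rough sketch of the "image patches as tokens" idea from the Key Words above (the \(224 \times 224\) input size, \(16 \times 16\) patch size, and 768-dimensional embedding are assumptions for illustration, not the paper's code), each flattened patch is linearly projected the way a word is mapped to an embedding:

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)   # (B, C, H, W)
patch, d_model = 16, 768

# Extract non-overlapping 16x16 patches; each is flattened to 3*16*16 = 768 values.
patches = nn.functional.unfold(img, kernel_size=patch, stride=patch)  # (1, 768, 196)
tokens = patches.transpose(1, 2)    # (1, 196, 768): a sequence of 196 patch "words"

proj = nn.Linear(3 * patch * patch, d_model)  # the linear patch-embedding projection
embeddings = proj(tokens)           # (1, 196, d_model), fed to a standard Transformer
print(embeddings.shape)
```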
Read more »

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions[1]

The authors are Chunhui Gu, Chen Sun, David A. Ross, et al. from Google Research, Inria Laboratoire Jean Kuntzmann (Grenoble, France), and UC Berkeley. Paper citation [1]: Gu, Chunhui et al. “AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017): 6047-6056.

Time

  • 2017.May

Key Words

  • atomic visual actions rather than composite actions
  • precise spatio-temporal annotations with possibly multiple annotations for each person
  • exhaustive annotation of these atomic actions over 15-minute video clips
  • people temporally linked across consecutive segments

Summary

  1. The dataset is sourced from the 15th to 30th minute intervals of 430 different movies, which, at a 1 Hz sampling frequency, gives nearly 900 keyframes for each movie. In each keyframe, every person is labeled with (possibly multiple) actions from the AVA vocabulary. Each person is linked across consecutive keyframes to provide short temporal sequences of action labels.
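A quick back-of-the-envelope check of that figure: the 15th to 30th minute interval spans \(15 \times 60 = 900\) seconds, so sampling at 1 Hz gives one keyframe per second, i.e. roughly 900 keyframes per movie.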

Read more »

Kalman Filtering

  1. The Kalman filter is one of the most commonly used and most important state-estimation algorithms. It can estimate hidden states from uncertain and imprecise measurements, and it can also predict future system states from past estimates. The algorithm is named after Rudolf E. Kalman, who in 1960 published his famous paper describing a recursive solution to the discrete-data linear filtering problem. Today it is widely used in target tracking, positioning and navigation systems, control systems, and other fields.
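As a minimal, self-contained sketch of the predict/update cycle (the constant-value model, the noise variances, and the initial guesses below are assumptions for the demo, not details from the article):

```python
import numpy as np

def kalman_1d(measurements, q=1e-5, r=0.1 ** 2, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a constant hidden value observed with noise.

    q: process-noise variance, r: measurement-noise variance,
    x0, p0: initial state estimate and its variance (all assumed for the demo).
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the constant model keeps x unchanged; uncertainty grows by q.
        p = p + q
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates

# Noisy observations of a true value of 1.0.
rng = np.random.default_rng(0)
zs = 1.0 + 0.1 * rng.standard_normal(50)
print(kalman_1d(zs)[-1])   # the estimate converges toward 1.0
```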
Read more »