AVA Dataset
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions[1]
The authors are Chunhui Gu, Chen Sun, David A. Ross, and others from Google Research, Inria Laboratoire Jean Kuntzmann (Grenoble, France), and UC Berkeley. Citation [1]: Gu, Chunhui et al. "AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018): 6047-6056.
Time
- May 2017
Key Words
- atomic visual actions rather than composite actions
- precise spatio-temporal annotations with possibly multiple annotations for each person
- exhaustive annotation of these atomic actions over 15-minute video clips
- people temporally linked across consecutive segments
Summary
The dataset is sourced from the 15th-30th minute of 430 different movies; at a 1 Hz sampling frequency this yields nearly 900 keyframes per movie. In each keyframe, every person is labeled with one or more actions from the AVA vocabulary. Each person is linked across consecutive keyframes, providing short temporal sequences of action labels.
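The sampling arithmetic and the per-person annotation structure described above can be sketched as follows. This is a minimal illustration only: the interval endpoints follow the stated 15th-30th-minute range at 1 Hz, while the `PersonAnnotation` layout, field names, and action strings are assumptions for illustration, not the dataset's official file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Keyframes are sampled at 1 Hz over the 15th-30th minute of each movie
# (assuming an inclusive-start, exclusive-end interval).
START_SEC = 15 * 60   # start of the annotated interval (900 s)
END_SEC = 30 * 60     # end of the annotated interval (1800 s)

keyframe_timestamps = list(range(START_SEC, END_SEC))  # one keyframe per second
print(len(keyframe_timestamps))  # 900 keyframes per movie

# Each keyframe carries per-person annotations; a person may have multiple
# action labels, and a person id links the same person across consecutive
# keyframes. This record layout is illustrative, not the official format.
@dataclass
class PersonAnnotation:
    person_id: int                  # links this person across consecutive keyframes
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) bounding box
    actions: List[str] = field(default_factory=list)  # possibly multiple labels

example = PersonAnnotation(person_id=7, box=(0.1, 0.2, 0.4, 0.9),
                           actions=["stand", "talk"])
print(example.actions)  # ['stand', 'talk']
```

Linking `person_id` across keyframes is what turns per-frame labels into the short temporal action sequences mentioned above.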