YOLO

Posted on 2024-01-24. Updated on 2024-03-01. Category: Papers.

YOLO Paper Series

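A few off-topic words to start: over the past few days I decided to use this blog to keep notes on the papers I read, as a way to hold myself accountable. AI is moving fast, and since ChatGPT came out there have been so many new papers that I no longer know where to start; it is a bit dizzying. After thinking it over, I will take it one step at a time: start with the classic papers, also read the new papers that are getting a lot of attention, and improve little by little this way. Without accumulating small steps, one cannot travel a thousand miles; a journey of a thousand miles begins with a single step. Keep going!!! As long as you want to do it, it is never too late!! 🏃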

You Only Look Once: Unified, Real-Time Object Detection^[1]🚀

The authors are from the University of Washington, the Allen Institute for AI, and Facebook AI Research (FAIR), and include Joseph Redmon, Santosh Divvala, and Ross Girshick, among others. Source: [1] Redmon, Joseph et al. “You Only Look Once: Unified, Real-Time Object Detection.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 779-788.

Summary:

In what follows, "we" refers to the paper's authors.

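Prior work repurposes classifiers to perform detection; we instead frame object detection as a regression problem to spatially separated bounding boxes (bboxes) and associated class probabilities. A single neural network predicts bboxes and class probabilities directly from the full image in one evaluation. Because the whole pipeline is a single network, it can be optimized end-to-end directly on detection performance. It is very fast, but it also makes many localization errors.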
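During training and testing, YOLO sees the whole image, so it can encode contextual information; Fast R-CNN cannot see the larger context. YOLO also generalizes better. Trade-off: it is fast, but it struggles to localize objects precisely, especially small ones.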
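The input image is divided into an \(S \times S\) grid. Each grid cell predicts \(B\) bboxes, each described by 5 values: \(x, y, w, h\) and a confidence score \(p_c\). Each grid cell also predicts \(C\) conditional class probabilities.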
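YOLO: 24 convolutional layers followed by 2 fully connected layers; the input image is \(448\times448\) and the final output is \(7\times7\times30\). Fast YOLO: 9 convolutional layers. Training optimizes sum-squared error. Limitations: each grid cell predicts only 2 bboxes, which limits how many nearby objects can be detected. The loss function treats errors in large and small bboxes the same, even though a small error affects a small box far more than a large one.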
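
To make the grid encoding and the box-size limitation above more concrete, here is a minimal Python sketch. It is my own illustration, not the authors' code. The function names are hypothetical, \(C = 20\) is assumed so that \(B \times 5 + C = 30\) matches the \(7\times7\times30\) output, and the box parameterization (center offsets relative to the grid cell, width and height relative to the whole image) follows my reading of the paper. For simplicity the width error is measured in pixels rather than in normalized coordinates.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (C = 20 is an assumption)


def output_shape(s=S, b=B, c=C):
    """Prediction tensor shape: each cell holds B boxes of 5 values plus C class scores."""
    return (s, s, b * 5 + c)


def encode_box(cx, cy, w, h, img_w, img_h, s=S):
    """Map a pixel-space box (center cx, cy; size w, h) to its grid cell and targets.

    Returns (row, col, x, y, w_rel, h_rel), where x and y are offsets inside
    the responsible cell and w_rel, h_rel are relative to the whole image.
    """
    gx = cx / img_w * s  # center position measured in grid-cell units
    gy = cy / img_h * s
    col = min(int(gx), s - 1)  # clamp so a center on the right/bottom edge stays in range
    row = min(int(gy), s - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h


def iou_after_width_error(w, h, dw):
    """IoU between a w x h box and the same box (same center) with width w + dw, dw >= 0."""
    return (w * h) / ((w + dw) * h)


if __name__ == "__main__":
    print(output_shape())
    print(encode_box(224, 224, 100, 50, 448, 448))
    # The same 5-pixel width error gives the same squared error (25) for both boxes,
    # but the overlap with the ground truth differs a lot.
    print(iou_after_width_error(20, 20, 5))
    print(iou_after_width_error(200, 200, 5))
```

With a \(448\times448\) image, a box centered at (224, 224) falls into cell (3, 3) with offsets (0.5, 0.5). A 5-pixel width error costs the same in sum-squared error for a \(20\times20\) box and a \(200\times200\) box, yet the IoU drops to 0.8 for the small box and stays at about 0.976 for the large one.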

Figure 1^[1]: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.
