FCOS

FCOS: Fully Convolutional One-Stage Object Detection[1]

The authors are Zhi Tian, Chunhua Shen, Hao Chen, and Tong He from the University of Adelaide, Australia. Reference [1]: Tian, Zhi et al. "FCOS: Fully Convolutional One-Stage Object Detection." 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019): 9626-9635.

Time

  • 2019.Apr

Key Words

  • one-stage
  • FCN
  • per-pixel prediction fashion

Motivation

  1. Anchor-based detectors have several drawbacks: they are sensitive to hyper-parameters such as aspect ratios; they incur heavy computation; and they struggle with objects that have large shape variations.
  2. FCN-based networks have achieved strong results on dense prediction tasks such as semantic segmentation. Object detection may be the only such task deviating from the neat fully convolutional per-pixel prediction framework, mainly due to the use of anchor boxes. A natural question follows: can object detection be solved in a neat per-pixel prediction fashion, analogous to FCN for semantic segmentation, so that these fundamental vision tasks are unified in one framework? The answer is yes.

Summary

  1. A detection network built on an FCN predicts a 4D vector plus a class category at each spatial location on a level of feature maps. The 4D vector encodes the relative offsets from the four sides of a bounding box to the location. Highly overlapping bounding boxes cause an ambiguity: in the overlapped region, it is unclear which bounding box a pixel should regress.

  2. To suppress the low-quality boxes this produces, a "center-ness" branch is introduced to predict the deviation of a pixel to the center of its corresponding bounding box.

  3. FCOS can also serve as the RPN in two-stage detectors and achieves strong results in that role.

  4. For each location \((x,y)\) on feature map \(F_i\), let \(s\) be the total stride up to that layer. The location is mapped back onto the input image as \((\lfloor \frac{s}{2}\rfloor + xs, \lfloor \frac{s}{2} \rfloor +ys)\), which is near the center of the receptive field of \((x,y)\). Unlike anchor-based detectors, which treat each location on the input image as the center of anchor boxes and regress the target bounding box with these anchor boxes as references, FCOS directly regresses the target bounding box at the location. In other words, the detector directly views locations as training samples, instead of anchor boxes as in anchor-based detectors.
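The mapping above can be sketched as a small helper (a hypothetical function written for illustration, not taken from the official FCOS code):

```python
def location_to_image(x, y, s):
    """Map feature-map location (x, y) at total stride s back onto the
    input image as (floor(s/2) + x*s, floor(s/2) + y*s), which lands
    near the center of the location's receptive field."""
    return (s // 2 + x * s, s // 2 + y * s)

# e.g. at stride 8, feature location (0, 0) maps to image pixel (4, 4)
```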

  5. A location \((x,y)\) is considered a positive sample if it falls into any ground-truth bounding box, and the class label \(c^*\) of the location is the label of that ground-truth box; otherwise it is a negative sample and \(c^*=0\). Besides the classification label, a 4D vector \(\boldsymbol{t}^*=(l^*,t^*,r^*,b^*)\) serves as the regression target for the location, where \(l^*,t^*,r^*,b^*\) are the distances from the location to the four sides of the bounding box. If a location falls into multiple bounding boxes, it is considered an ambiguous sample; the box with minimal area is simply chosen as its regression target (multi-level prediction, described later, further reduces ambiguous samples). If a location is associated with bounding box \(B_i\), the training targets for the location can be formulated as:

\[\begin{aligned}l^*&=x-x_0^{(i)},&t^*&=y-y_0^{(i)},\\r^*&=x_1^{(i)}-x,&b^*&=y_1^{(i)}-y.\end{aligned}\]
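The four targets follow directly from the box corners. A minimal sketch, with the box given as \((x_0, y_0, x_1, y_1)\) (function name is ours):

```python
def regression_targets(x, y, box):
    """Distances (l*, t*, r*, b*) from a positive location (x, y)
    to the left, top, right, and bottom sides of box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)

# A location at (50, 60) inside box (10, 20, 110, 120)
# yields targets (40, 40, 60, 60).
```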

  6. The loss function is: \[\begin{aligned} L(\{\boldsymbol{p}_{x,y}\},\{\boldsymbol{t}_{x,y}\})& =\frac1{N_{\mathrm{pos}}}\sum_{x,y}L_{\mathrm{cls}}(\boldsymbol{p}_{x,y},c_{x,y}^{*}) \\ &+\frac{\lambda}{N_{\mathrm{pos}}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{\mathrm{reg}}(\boldsymbol{t}_{x,y},\boldsymbol{t}_{x,y}^{*}), \end{aligned}\]

    where \(L_{\mathrm{cls}}\) is the focal loss and \(L_{\mathrm{reg}}\) is the IoU loss.
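A minimal per-location sketch of the two terms, assuming the standard focal loss and the \(-\log(\mathrm{IoU})\) form of the IoU loss (the hyper-parameters here are illustrative, not the paper's tuned values):

```python
import math

def focal_loss_term(p, positive, alpha=0.25, gamma=2.0):
    """Focal loss for a single predicted class probability p in (0, 1):
    standard cross-entropy down-weighted by a (1 - p_t)^gamma factor."""
    if positive:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)

def iou_loss(pred, target):
    """IoU loss -log(IoU) between predicted and target (l, t, r, b)
    distance vectors measured at the same location."""
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    pred_area = (pl + pr) * (pt + pb)
    target_area = (tl + tr) * (tt + tb)
    # Boxes anchored at the same point: intersection width/height are
    # the smaller distances on each side.
    inter = (min(pl, tl) + min(pr, tr)) * (min(pt, tt) + min(pb, tb))
    iou = inter / (pred_area + target_area - inter)
    return -math.log(iou)
```

A perfect regression gives IoU = 1 and hence zero loss; confident correct classifications are down-weighted relative to uncertain ones.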

  7. Unlike anchor-based detectors, which assign anchor boxes of different sizes to different feature levels, FCOS directly limits the range of bounding box regression at each level. If a location satisfies \[\max(l^*,t^*,r^*,b^*)>m_i\ \text{or}\ \max(l^*,t^*,r^*,b^*) < m_{i-1},\] it is set as a negative sample at that level and is no longer required to regress a bounding box, where \(m_i\) is the maximum distance that feature level \(i\) needs to regress. If, even with multi-level prediction, a location is still assigned to multiple ground-truth boxes, the box with minimal area is simply chosen as its target.
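The per-level range check can be sketched as follows, using the paper's thresholds \(m_2,\dots,m_7 = 0, 64, 128, 256, 512, \infty\) for P3 to P7 (function name and tie-breaking at exact boundaries are our own choices):

```python
def assign_level(target, bounds=(0, 64, 128, 256, 512, float("inf"))):
    """Return the 0-based FPN level (P3..P7) whose regression range
    contains max(l*, t*, r*, b*), or None if no level accepts it.
    A location is negative at level i when the max falls outside
    [bounds[i], bounds[i+1]]; exact-boundary ties go to the lower level."""
    m = max(target)
    for i in range(len(bounds) - 1):
        if bounds[i] <= m <= bounds[i + 1]:
            return i
    return None

# A small object (max distance 30) lands on P3; a larger one
# (max distance 100) lands on P4.
```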

  8. Even with multi-level prediction, FCOS still trails anchor-based detectors; the remaining gap comes from low-quality bounding boxes produced by locations far away from the center of an object. A simple strategy is proposed to suppress these low-quality boxes: a single-layer branch is added to predict the center-ness of a location: \[\text{centerness}^*=\sqrt{\frac{\min(l^*,r^*)}{\max(l^*,r^*)}\times\frac{\min(t^*,b^*)}{\max(t^*,b^*)}}.\]

    It is trained with a BCE loss added to the loss function. At test time, the final score is computed by multiplying the predicted center-ness with the corresponding classification score; center-ness thus down-weights the scores of bounding boxes produced far from an object's center, and NMS then filters out these low-quality boxes.
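The center-ness target and its test-time use can be sketched as (helper names are ours):

```python
import math

def centerness(l, t, r, b):
    """Center-ness target from distances to the four box sides:
    1.0 at the box center, decaying toward 0 near the borders."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def final_score(cls_score, predicted_centerness):
    """Test-time score used to rank boxes before NMS: classification
    score down-weighted by the predicted center-ness."""
    return cls_score * predicted_centerness

# A location at the exact box center scores 1.0; one hugging the
# left border scores much lower.
```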

FCOS \(Fig.2^{[1]}\): The network architecture of FCOS, where C3, C4, and C5 denote the feature maps of the backbone network and P3 to P7 are the feature levels used for the final prediction. \(H \times W\) are the height and width of the feature maps. '/s' (s = 8, 16, ..., 128) is the down-sampling ratio of the feature maps at that level relative to the input image. As an example, all the numbers are computed with an \(800 \times 1024\) input.