RetinaNet
Focal Loss for Dense Object Detection[1]
The authors are Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár from FAIR. Reference [1]: Lin, Tsung-Yi et al. "Focal Loss for Dense Object Detection." IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2017): 318-327.
Time
- Aug 2017
Key Word
- Focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
- Class imbalance between foreground and background classes during training.
- Easy negatives.
Motivation
- One-stage detectors have the potential to be faster and simpler, but their accuracy still lags behind two-stage detectors. Why is that? The authors find the main cause to be the extreme foreground-background class imbalance encountered while training dense detectors. They therefore propose to reshape the cross-entropy loss so that it down-weights the loss assigned to well-classified examples, addressing the imbalance directly.
Summary
In two-stage detectors, the proposal stage filters out most background samples, and in the second, classification stage, sampling heuristics such as a fixed foreground-to-background ratio or online hard example mining (OHEM) maintain a manageable balance between foreground and background. In contrast, a one-stage detector must process a much larger set of candidate object locations sampled across an image, so easily classified background samples dominate training and similar sampling heuristics become inefficient. This inefficiency is a classic problem in object detection, typically addressed via techniques such as bootstrapping or hard example mining.
The paper proposes a new loss function: a dynamically scaled cross-entropy loss whose scaling factor decays to zero as confidence in the correct class increases. This scaling factor automatically down-weights the contribution of easy examples during training, letting the model focus on a sparse set of hard examples.
Focal Loss:
\[ \mathrm{FL}(p_t) = -(1-p_t)^{\gamma} \log(p_t) \]
Here \(p_t = p\) if the ground-truth class is \(y = 1\) and \(p_t = 1 - p\) otherwise. When an example is classified correctly, \(p_t \rightarrow 1\), the modulating factor \((1-p_t)^{\gamma}\) goes to 0, and the loss for well-classified examples is down-weighted.
The \(\alpha\)-balanced variant of the focal loss: \[ \mathrm{FL}(p_t) = -\alpha_t (1-p_t)^{\gamma} \log(p_t) \]
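To make the two formulas concrete, here is a minimal PyTorch-style sketch of the \(\alpha\)-balanced focal loss for binary (sigmoid) classification. The defaults \(\gamma = 2\) and \(\alpha = 0.25\) follow the paper's best-reported setting; the function name and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """alpha-balanced focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw, pre-sigmoid scores, shape (N,)
    targets: binary ground-truth labels in {0, 1}, shape (N,)
    """
    p = torch.sigmoid(logits)
    # p_t = p for positives, 1 - p for negatives (same convention as the paper)
    p_t = torch.where(targets == 1, p, 1 - p)
    alpha_t = torch.where(targets == 1,
                          alpha * torch.ones_like(p),
                          (1 - alpha) * torch.ones_like(p))
    # plain binary cross entropy is exactly -log(p_t); the modulating factor
    # (1 - p_t)^gamma then decays the loss of easy examples toward zero
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()
```

In the paper, the total focal loss of an image is normalized by the number of anchors assigned to ground-truth boxes, since the loss contributed by the vast number of easy negatives is negligible.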
For a binary classification model, the default initialization gives equal probability to y = 1 and y = -1. Under heavy class imbalance, the loss from the frequent class then dominates the total loss, causing instability early in training. The authors therefore introduce a "prior" \(\pi\) for the value of \(p\) estimated by the model for the rare class (foreground) at the start of training.
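In practice, this prior is realized by initializing the bias of the final classification layer to \(b = -\log((1-\pi)/\pi)\) with \(\pi = 0.01\), so that every anchor starts with a foreground score of about 0.01. A minimal sketch, assuming a 256-channel head; the A and K values are illustrative:

```python
import math
import torch.nn as nn

pi = 0.01                          # prior foreground probability at init
A, K = 9, 80                       # anchors per location; classes (COCO, assumed)
cls_predictor = nn.Conv2d(256, A * K, kernel_size=3, padding=1)
# sigmoid(bias) = pi  =>  bias = -log((1 - pi) / pi)
nn.init.constant_(cls_predictor.bias, -math.log((1 - pi) / pi))
```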
Anchors: the anchors have areas of \(32^{2}\) to \(512^{2}\) on pyramid levels \(P_3\) to \(P_7\). At each level, three aspect ratios \(\{1{:}2, 1{:}1, 2{:}1\}\) and three octave scales \(\{2^{0}, 2^{1/3}, 2^{2/3}\}\) are used, giving \(A = 9\) anchors per spatial location.
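As a sketch of how those sizes work out (the helper name is hypothetical; the base-size doubling per level and the scale/ratio sets follow the paper):

```python
ASPECT_RATIOS = (0.5, 1.0, 2.0)                 # ratio = height / width
OCTAVE_SCALES = (2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))

def anchor_shapes(level):
    """Return the 9 (width, height) pairs for pyramid level P{level}, 3 <= level <= 7."""
    base = 32 * 2 ** (level - 3)                # 32 at P3, doubling up to 512 at P7
    shapes = []
    for scale in OCTAVE_SCALES:
        area = (base * scale) ** 2
        for ratio in ASPECT_RATIOS:
            w = (area / ratio) ** 0.5           # w * h = area, h = ratio * w
            shapes.append((w, w * ratio))
    return shapes
```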
Classification subnet: a small FCN attached to each FPN level; the parameters of this subnet are shared across all pyramid levels. It does not share parameters with the box regression subnet.
Box regression subnet: also a small FCN attached to each pyramid level. It differs from the classification subnet only in its output: \(4A\) linear outputs per spatial location. For each of the A anchors at a spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box.

Figure 3 (from the paper): The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which enables this work to focus on a novel focal loss function that eliminates the accuracy gap between the one-stage detector and state-of-the-art two-stage detectors such as Faster R-CNN with FPN, while running at faster speeds.
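A minimal sketch of the two heads, following the paper's description of four 3×3, 256-channel conv+ReLU layers before a final 3×3 predictor; the `make_head` helper and the COCO class count are illustrative assumptions:

```python
import torch.nn as nn

def make_head(in_channels, out_channels):
    """Four 3x3 conv + ReLU layers (256 filters) followed by a 3x3 predictor;
    the same head is applied to, and shared across, every pyramid level."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True)]
        in_channels = 256
    layers.append(nn.Conv2d(256, out_channels, 3, padding=1))
    return nn.Sequential(*layers)

A, K = 9, 80                        # anchors per location; classes (COCO, assumed)
cls_subnet = make_head(256, K * A)  # K*A sigmoid class scores per spatial location
box_subnet = make_head(256, 4 * A)  # 4A linear box offsets per spatial location
```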