FPN
Feature Pyramid Networks for Object Detection[1]
The authors are Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, from FAIR and Cornell. Reference [1]: Lin, Tsung-Yi et al. “Feature Pyramid Networks for Object Detection.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 936-944.
Time
- 2017.Apr
Key Word
- multi-scale, pyramidal hierarchy
- top-down architecture with lateral connections
- high-level semantic feature maps at all scales
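Motivation
- Feature pyramids are a basic component of recognition systems for detecting objects at different scales, but recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. This paper exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.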
Summary
FPN shows significant improvement as a generic feature extractor in several applications. Featurized image pyramids were heavily used in the era of hand-engineered features.
The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels. However, featurizing each level of an image pyramid has limitations: inference time increases, and training on an image pyramid is infeasible in terms of memory.
An image pyramid is not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer; that is, the network itself already produces a feature hierarchy with a multi-scale, pyramidal shape. The goal of the paper is to leverage the pyramidal shape of a ConvNet's feature hierarchy while creating a feature pyramid that has strong semantics at all scales.
There are often many layers that produce output feature maps of the same size; these layers are said to be in the same network stage. For the feature pyramid, one pyramid level is defined per stage. The output of the last layer of each stage is chosen as the reference set of feature maps. Specifically, for ResNet, the feature activations output by each stage's last residual block are used. These outputs are denoted {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5, and they have strides of {4, 8, 16, 32} pixels with respect to the input image.
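To make the strides concrete, here is a minimal sketch that computes the spatial size of each reference map from the input size. The helper name `feature_map_sizes` and the 224 x 224 input are illustrative assumptions, and the floor division assumes the input dimensions are divisible by 32.

```python
# Strides of the ResNet stage outputs relative to the input image,
# as listed in the notes above.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def feature_map_sizes(height, width, strides=STRIDES):
    """Return {stage: (h, w)} for an input of the given size.

    Hypothetical helper for illustration; assumes the input size is
    divisible by every stride, so plain floor division is exact.
    """
    return {name: (height // s, width // s) for name, s in strides.items()}


sizes = feature_map_sizes(224, 224)  # 224 x 224 is an assumed input size
for name, hw in sizes.items():
    print(name, hw)
```

Each stage halves the resolution of the previous one, which is what gives the hierarchy its pyramidal shape.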
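The upsampled map is merged with the corresponding bottom-up map (which first passes through a \(1 \times 1\) convolution to reduce its channel dimensions) by element-wise addition. Finally, a \(3 \times 3\) convolution is applied to each merged map to generate the final feature map, which reduces the aliasing effect of upsampling.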
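The merge step can be sketched in plain Python on toy data. This is only an illustration under stated assumptions: the upsampling is nearest-neighbor by a factor of 2 (the notes do not specify the method), each map is reduced to a single channel, the final \(3 \times 3\) convolution is omitted, and the function names and weights are hypothetical.

```python
def conv1x1(channels, weights):
    """1x1 convolution producing ONE output channel.

    `channels` is a list of 2D maps (one per input channel) and `weights`
    holds one scalar per input channel; each output pixel is the weighted
    sum of the input channels at that pixel.
    """
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [sum(wt * ch[i][j] for wt, ch in zip(weights, channels)) for j in range(w)]
        for i in range(h)
    ]


def upsample_nearest_2x(fmap):
    """Nearest-neighbor upsampling of a 2D map by a factor of 2 (assumed)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def merge_top_down(top, lateral):
    """Upsample the coarser map and add it element-wise to the lateral map."""
    up = upsample_nearest_2x(top)
    return [[a + b for a, b in zip(r_up, r_lat)] for r_up, r_lat in zip(up, lateral)]


# Toy data: a 2-channel, 4x4 bottom-up map and a 1-channel, 2x2 top-down map.
c_fine = [[[1.0] * 4 for _ in range(4)], [[3.0] * 4 for _ in range(4)]]
top = [[1.0, 2.0], [3.0, 4.0]]

lateral = conv1x1(c_fine, [0.5, 0.5])  # channel reduction: 2 channels -> 1
merged = merge_top_down(top, lateral)
for row in merged:
    print(row)
```

In a real network the \(1 \times 1\) convolution has learned weights and produces many output channels; the single-channel version here only shows how the shapes line up before the addition.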
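The RPN is adapted by replacing the single-scale feature map with FPN. Because the head slides over all locations in all pyramid levels, multi-scale anchors are not needed on any specific level. Instead, each level is assigned anchors of a single scale: areas of \(\{32^{2}, 64^{2}, 128^{2}, 256^{2}, 512^{2}\}\) pixels on {P2, P3, P4, P5, P6} respectively. With three aspect ratios at each level, there are 15 anchors in total over the pyramid.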
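The anchor count can be checked with a short sketch. It assumes the five anchor areas \(\{32^{2}, 64^{2}, 128^{2}, 256^{2}, 512^{2}\}\) on {P2, ..., P6} and the aspect ratios {1:2, 1:1, 2:1} from the FPN paper; the helper `anchor_shapes` and the height/width ratio convention are illustrative choices, not the authors' code.

```python
import math

# One anchor area (in pixels) per pyramid level, assumed from the FPN paper.
ANCHOR_AREAS = {"P2": 32**2, "P3": 64**2, "P4": 128**2, "P5": 256**2, "P6": 512**2}
# Aspect ratios expressed as height / width (an assumed convention).
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_shapes(area, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs with the given area, one per aspect ratio."""
    shapes = []
    for r in ratios:
        w = math.sqrt(area / r)  # from w * h = area and h = r * w
        h = w * r
        shapes.append((w, h))
    return shapes


anchors = {level: anchor_shapes(area) for level, area in ANCHOR_AREAS.items()}
total = sum(len(shapes) for shapes in anchors.values())
print(total)  # 5 levels x 3 aspect ratios
```

Every anchor on a given level has the same area; only its shape changes, and scale variation is handled by moving between pyramid levels.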
\(Fig. 1^{[1]}\). (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.
\(Fig. 2^{[1]}\). Top: a top-down architecture with skip connections, where predictions are made on the finest level. Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels.
\(Fig. 3^{[1]}\). A building block illustrating the lateral connection and the top-down pathway, merged by addition.