CenterNet
Objects as Points[1]
The authors are Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl, from UT Austin and UC Berkeley. Reference [1]: Zhou, Xingyi et al. “Objects as Points.” ArXiv abs/1904.07850 (2019).
Time
- 2019.Apr
Key Words
- model an object as a single point -- the center point of its bounding box
- keypoint estimation
Motivation
- Most object detectors enumerate a large set of potential object locations and classify each one, which is wasteful, inefficient, and requires extensive post-processing.
Summary
CenterNet uses keypoint estimation to find center points and regresses all other object properties, such as size, 3D location, orientation, and even pose. Two-stage detectors need post-processing to remove duplicated detections of the same instance by computing bounding-box IoU; this post-processing is hard to differentiate and train end-to-end.
The image is fed into a fully convolutional network that generates a heatmap; peaks in this heatmap correspond to object centers, and the image features at each peak predict the object's bounding-box height and width. Inference is a single network forward pass, with no non-maximum suppression needed as post-processing. Earlier work has also used keypoint estimation for object detection, e.g., CornerNet and ExtremeNet. These build on the same robust keypoint estimation network as CenterNet, but they require a combinatorial grouping stage after keypoint detection, which slows the algorithm down. CenterNet simply extracts a single center point per object, with no need for grouping or post-processing.
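Since each detection is just a local peak, decoding at inference time is cheap: the paper replaces NMS with a 3×3 max-pooling over the heatmap and keeps the top 100 peaks. Below is a minimal PyTorch sketch of that step (the function name `extract_peaks` is illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap: torch.Tensor, k: int = 100):
    """Keep only local maxima of the predicted heatmap, then take the top-k.

    heatmap: (B, C, H, W) per-class center scores in [0, 1].
    The 3x3 max-pool stands in for NMS: a pixel survives only if it
    equals the maximum over its 8 neighbours.
    """
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()    # zero out non-maxima

    b, c, h, w = peaks.shape
    scores, idx = torch.topk(peaks.view(b, -1), k)   # over classes and space
    classes = torch.div(idx, h * w, rounding_mode='floor')
    spatial = idx % (h * w)
    ys = torch.div(spatial, w, rounding_mode='floor')
    xs = spatial % w
    return scores, classes, ys, xs
```

The recovered \((x, y)\) are heatmap coordinates; multiplying by the stride \(R\) and adding the predicted offset maps them back to input-image coordinates.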
The network predicts a keypoint heatmap \(\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}\), where \(R\) is the output stride and \(C\) is the number of keypoint types. Keypoint types include \(C = 17\) human joints in human pose estimation, or \(C = 80\) object categories in object detection. For each ground-truth keypoint \(p \in R^2\) of class \(c\), compute a low-resolution equivalent \(\tilde{p} = \lfloor{\frac{p}{R}}\rfloor\). All ground-truth keypoints are then splatted onto a heatmap \(Y \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}\) using a Gaussian kernel \[Y_{xyc}=\exp\left(-\frac{(x-\tilde{p}_{x})^{2}+(y-\tilde{p}_{y})^{2}}{2\sigma_{p}^{2}}\right)\] where \(\sigma_p\) is an object size-adaptive standard deviation.
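A minimal NumPy sketch of this splatting step, assuming a single-class heatmap and an externally supplied \(\sigma_p\) (`splat_keypoint` is a hypothetical helper; real implementations also restrict the write to a truncated radius around the center for speed):

```python
import numpy as np

def splat_keypoint(heatmap: np.ndarray, center, sigma: float) -> np.ndarray:
    """Draw one ground-truth keypoint onto a single-class (H, W) heatmap.

    `center` is the low-resolution keypoint (cx, cy) = floor(p / R).
    Writing with np.maximum merges overlapping Gaussians of the same
    class by element-wise maximum, as described below.
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]                      # pixel coordinate grids
    cx, cy = center
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)              # never lower a peak
    return heatmap
```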
If two Gaussians of the same class overlap, we take the element-wise maximum. The training objective is a penalty-reduced pixel-wise logistic regression with focal loss:
\[L_k=\frac{-1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right)&\text{if }Y_{xyc}=1\\\left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right)&\text{otherwise}\end{cases}\quad(1)\]
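A sketch of Eq. (1) in PyTorch, with \(\alpha = 2\) and \(\beta = 4\) as in the paper (following CornerNet); `penalty_reduced_focal_loss` is an illustrative name:

```python
import torch

def penalty_reduced_focal_loss(pred, gt, alpha=2.0, beta=4.0):
    """Eq. (1): pixel-wise focal loss on the splatted Gaussian heatmap.

    pred: (B, C, H, W) sigmoid outputs; gt: same shape, Gaussian targets.
    """
    pred = pred.clamp(1e-6, 1 - 1e-6)        # numerical safety for the logs
    pos = gt.eq(1).float()                   # exact keypoint locations
    neg = 1.0 - pos

    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred)
    # (1 - Y)^beta reduces the penalty for pixels near a ground-truth center
    neg_loss = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred)

    n = pos.sum().clamp(min=1)               # number of keypoints N
    return -(pos_loss.sum() + neg_loss.sum()) / n
```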
\(\alpha\) and \(\beta\) are hyper-parameters of the focal loss, and \(N\) is the number of keypoints in image \(I\). To recover the discretization error caused by the output stride, we additionally predict a local offset \(\hat{O} \in R^{\frac{W}{R} \times \frac{H}{R} \times 2}\) for each center point. All classes \(c\) share the same offset prediction. The offset is trained with an L1 loss:
\[L_{off}=\frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}}-\left(\frac{p}{R}-\tilde{p}\right)\right|.\quad(2)\]
The supervision acts only at keypoint locations \(\tilde{p}\); all other locations are ignored. For each object \(k\), we additionally regress its size \(s_k = (x_2^{(k)} - x_1^{(k)},\, y_2^{(k)} - y_1^{(k)})\), computed from its bounding box \((x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})\), using a single size prediction \(\hat{S} \in R^{\frac{W}{R} \times \frac{H}{R} \times 2}\) and an L1 loss:
\[L_{size}=\frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_{k}}-s_{k}\right|.\quad(3)\]
The overall training objective is:
\[L_{det}=L_{k}+\lambda_{size}L_{size}+\lambda_{off}L_{off}.\quad(4)\]
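The paper uses \(\lambda_{size} = 0.1\) and \(\lambda_{off} = 1\). Below is a sketch of the two regression terms, gathered only at ground-truth center locations. It assumes, for simplicity, that every image contains exactly \(K\) objects (real implementations mask padded slots); adding \(L_k\) from the focal-loss sketch above yields \(L_{det}\):

```python
import torch
import torch.nn.functional as F

def regression_losses(off_pred, size_pred, inds, off_gt, size_gt,
                      lambda_size=0.1, lambda_off=1.0):
    """Eqs. (2)-(4): L1 losses evaluated only at keypoint locations.

    off_pred, size_pred: (B, 2, H, W) dense prediction maps.
    inds: (B, K) flattened spatial indices (y * W + x) of the K centers.
    off_gt, size_gt: (B, K, 2) regression targets.
    """
    b, _, h, w = off_pred.shape

    def gather(feat):                        # (B, 2, H, W) -> (B, K, 2)
        feat = feat.view(b, 2, h * w).permute(0, 2, 1)
        return feat.gather(1, inds.unsqueeze(-1).expand(-1, -1, 2))

    l_off = F.l1_loss(gather(off_pred), off_gt)
    l_size = F.l1_loss(gather(size_pred), size_gt)
    return lambda_off * l_off + lambda_size * l_size
```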
Hourglass: The stacked Hourglass Network downsamples the input by \(4 \times\), followed by two sequential hourglass modules. Each hourglass module is a symmetric 5-layer down- and up-convolutional network with skip connections.
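A much-simplified PyTorch sketch of one symmetric hourglass module, only to make the down/up-with-skip structure concrete; the actual stacked hourglass uses residual blocks and varying channel counts per level:

```python
import torch.nn as nn

class Hourglass(nn.Module):
    """Symmetric recursive hourglass: down, recurse, up, add skip."""

    def __init__(self, depth: int = 5, channels: int = 256):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.inner = (Hourglass(depth - 1, channels) if depth > 1
                      else nn.Conv2d(channels, channels, 3, padding=1))
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        skip = self.skip(x)                  # same-resolution branch
        y = self.up(self.inner(self.down(x)))
        return self.out(y) + skip            # merge the two branches
```

Input height and width must be divisible by \(2^{5}\) for the shapes to line up in this sketch.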
\(Fig. 1^{[1]}\): difference between anchor-based detectors and the center point detector
\(Fig. 2^{[1]}\): model an object as the center point of its bounding box. The bounding box size and other object properties are inferred from the keypoint feature at the center.