imTED

Integrally Migrating Pre-trained Transformer Encoder-Decoders for Visual Object Detection[1]

The authors are from Qixiang Ye's group at UCAS and Tsinghua. Reference [1]: Zhang, Xiaosong, et al. "Integrally Migrating Pre-trained Transformer Encoder-Decoders for Visual Object Detection." 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 6802-6811.

Time

  • 2022.Dec

Key Words

  • Pre-trained encoder-decoder for object detection
  • Multi-scale feature modulator
  • Few-shot object detection

Motivation

  • MAE pre-trains encoder-decoder representation models with an MIM (masked image modeling) pretext task: the encoder performs feature extraction, while the decoder performs image context modeling. Is the spatial context modeling learned by the MAE decoder beneficial for object localization?

  • After reading MAE and DETR, and before coming across this paper, I had arrived at roughly the same idea.

Summary

  1. In imTED, the pre-trained encoder extracts features and the pre-trained decoder serves as the detector head, establishing a fully pre-trained feature extraction path. An RPN is kept in between for proposal generation, i.e., to produce RoIs; it takes no part in object feature extraction or transformation. Only the RPN's parameters are randomly initialized, so the detector's generalization ability is not compromised (see the sketch below).
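
A minimal PyTorch sketch of this wiring, under my own assumptions: the class and argument names (`ImTEDSketch`, `mae_encoder`, `roi_align`, etc.) are illustrative placeholders, not the authors' implementation, and the actual imTED code differs in detail.

```python
import torch.nn as nn

class ImTEDSketch(nn.Module):
    """Sketch of the imTED pipeline: a pre-trained MAE encoder as the
    backbone, a randomly initialized RPN that only produces proposals,
    and the pre-trained MAE decoder reused as the detector head."""

    def __init__(self, mae_encoder, mae_decoder, rpn, roi_align,
                 num_classes, embed_dim=512):
        super().__init__()
        self.encoder = mae_encoder    # pre-trained: feature extraction
        self.rpn = rpn                # randomly initialized: RoIs only
        self.roi_align = roi_align    # gathers per-RoI token features
        self.decoder = mae_decoder    # pre-trained: per-RoI transformation
        # lightweight task heads on top of the decoder output
        self.cls_head = nn.Linear(embed_dim, num_classes + 1)
        self.reg_head = nn.Linear(embed_dim, 4)

    def forward(self, images):
        feats = self.encoder(images)                  # fully pre-trained path
        proposals = self.rpn(feats)                   # no feature transform here
        roi_feats = self.roi_align(feats, proposals)  # (num_rois, tokens, dim)
        decoded = self.decoder(roi_feats)             # pre-trained context modeling
        pooled = decoded.mean(dim=1)                  # average over RoI tokens
        return self.cls_head(pooled), self.reg_head(pooled), proposals
```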

Framework \(Fig.1^{[1]}\). Comparison of a baseline detector, e.g., Faster RCNN equipped with a transformer backbone (upper), with the proposed imTED (lower). The baseline detector transfers only a pre-trained backbone network, e.g., the transformer encoder, while training the detector head and FPN from scratch. By contrast, imTED integrally migrates the pre-trained transformer encoder-decoder, which significantly reduces the proportion of randomly initialized parameters and improves the detector's generalization capability.
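
To make the "integral migration" concrete, here is a hedged sketch of loading both halves of an MAE checkpoint into such a detector. The checkpoint key prefixes (`encoder.`, `decoder.`) and the `"model"` key are assumptions about the checkpoint layout, not imTED's actual format.

```python
import torch

def migrate_mae_weights(detector, mae_ckpt_path):
    """Copy both encoder and decoder weights from an MAE checkpoint,
    so only the RPN and task heads remain randomly initialized."""
    state = torch.load(mae_ckpt_path, map_location="cpu")["model"]
    enc = {k[len("encoder."):]: v for k, v in state.items() if k.startswith("encoder.")}
    dec = {k[len("decoder."):]: v for k, v in state.items() if k.startswith("decoder.")}
    detector.encoder.load_state_dict(enc, strict=False)  # the baseline stops here
    detector.decoder.load_state_dict(dec, strict=False)  # imTED also migrates this

def random_init_fraction(detector, pretrained_modules):
    """Fraction of detector parameters NOT covered by pre-trained
    modules -- the quantity imTED aims to shrink."""
    total = sum(p.numel() for p in detector.parameters())
    pretrained = sum(p.numel() for m in pretrained_modules for p in m.parameters())
    return 1.0 - pretrained / total
```

Comparing `random_init_fraction(detector, [detector.encoder])` (backbone-only transfer) against `random_init_fraction(detector, [detector.encoder, detector.decoder])` (imTED-style migration) illustrates the reduction in randomly initialized parameters that the caption refers to.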