ViViT
ViViT: A Video Vision Transformer[1]
The authors are Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid from Google Research. Citation [1]: Arnab, Anurag et al. "ViViT: A Video Vision Transformer." 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 6816-6826.
Time
- 2021.Jun
Key Words
- spatio-temporal tokens
- transformer
- regularizing the model during training; factorising the model along the spatial and temporal dimensions to increase efficiency and scalability
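To make the efficiency argument concrete, here is a back-of-the-envelope sketch (not the paper's code). It assumes the video is cut into non-overlapping spatio-temporal tubelets of size t×h×w, giving nt·nh·nw tokens. It then counts query-key pairs for two cases: full joint spatio-temporal self-attention, and attention factorised into a spatial step (within each time index) followed by a temporal step (across time at each spatial position). The helper names and the 32×224×224 video / 2×16×16 tubelet sizes are hypothetical illustrative choices. The count ignores the CLS token, attention heads, and the number of layers.

```python
def tubelet_grid(video_shape, tubelet_shape):
    """Return (nt, nh, nw): tubelet counts along time, height, width.

    Assumes non-overlapping tubelets that divide the video exactly (illustrative).
    """
    T, H, W = video_shape
    t, h, w = tubelet_shape
    if T % t or H % h or W % w:
        raise ValueError("video dims must be divisible by tubelet dims")
    return T // t, H // h, W // w


def joint_attention_pairs(nt, nh, nw):
    """Every spatio-temporal token attends to every other token: N^2 pairs."""
    n = nt * nh * nw
    return n * n


def factorised_attention_pairs(nt, nh, nw):
    """Spatial attention within each time index, then temporal attention per position."""
    ns = nh * nw                  # tokens per time index
    spatial = nt * ns * ns        # nt groups, each of ns tokens attending to ns tokens
    temporal = ns * nt * nt       # ns groups, each of nt tokens attending to nt tokens
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    nt, nh, nw = tubelet_grid((32, 224, 224), (2, 16, 16))
    joint = joint_attention_pairs(nt, nh, nw)
    fact = factorised_attention_pairs(nt, nh, nw)
    print((nt, nh, nw), joint, fact, round(joint / fact, 1))
```

With these sizes there are 16×14×14 = 3136 tokens. Joint attention touches 3136² = 9,834,496 pairs, while the factorised scheme touches 614,656 + 50,176 = 664,832 pairs, roughly 15× fewer. Under these assumptions, this is why factorising along space and time scales better as the number of frames grows.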