aed

Posted on 2025-04-20 · Updated on 2025-05-04 · Category: Papers · Word count: 3.2k · Reading time ≈ 12 min

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors^[1]

The authors are Nicolae-Catalin Ristea et al., from the University of Bucharest and other institutions. Citation [1]: Ristea, Nicolae-Cătălin et al. “Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors.” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023): 15984-15995.

Time

2024.Mar

Key Words

Tokens are weighted by motion, self-distillation is applied, and synthetic anomaly data is added to the training data to improve video anomaly detection performance.

Summary

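The authors propose an efficient abnormal event detection model based on a lightweight AE applied at the video frame level. The novelty of the model is threefold: (1) it introduces a way to weight tokens based on motion gradients, shifting the focus from the static background scene to the foreground objects; (2) it integrates a teacher decoder and a student decoder and uses the discrepancy between the two decoder outputs to improve anomaly detection; (3) it generates synthetic abnormal events to augment the training videos, and trains the masked AE model to reconstruct both the original frames and the corresponding pixel-level anomaly maps. The resulting design is both efficient and effective.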

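Video anomaly detection is hard because abnormal events are context-dependent and occur rarely. This makes it difficult to collect a representative set of abnormal events for training state-of-the-art models in a fully supervised manner. Since fully supervised training of anomaly detectors is not feasible, most studies take a different route and propose variants of outlier detection methods, which treat abnormal event detection as an outlier detection task: a normality model trained only on normal events is applied to both normal and abnormal events during inference, and events that deviate from the learned model are labeled abnormal. Unlike mainstream outlier-detection approaches, the authors augment each training video scene with synthetic segments by randomly superimposing temporal action segments from the synthetic UBnormal data set onto the real-world data sets. Introducing synthetic anomalies at training time lets the model learn in an open-set supervised manner. In addition, the model is trained to reconstruct the original training frames (without anomalies), which limits its ability to reconstruct anomalies and therefore yields higher errors when anomalies appear.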
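Many video anomaly detection works rely on autoencoders (AEs) to address the task, counting on the poor reconstruction ability of these models on out-of-distribution data. Because training is performed only on normal examples, AEs are expected to exhibit high reconstruction errors when anomalies occur. However, several researchers have observed that AEs generalize too well and can reconstruct anomalies with high precision. To make better use of the AE reconstruction error for anomaly detection, prior studies have explored alternatives ranging from dummy or pseudo-anomalies to the integration of memory modules. With the same goal, the authors propose using masked auto-encoders for anomaly detection and introduce new ways to constrain their generalization capacity. Going beyond the standard MAE framework, they propose three changes that improve anomaly detection performance: (1) tokens are weighted by motion gradients so that tokens with higher motion count more in the reconstruction loss, which makes the model focus on reconstructing high-motion tokens rather than the background scene; (2) a classification (cls) head is added to discriminate between normal and pseudo-abnormal instances in the latent encoding space; (3) a teacher decoder and a student decoder are integrated into the MAE framework, with the student decoder distilling knowledge from the already optimized teacher. To reduce processing time, the teacher and student share a single encoder, a procedure referred to as self-distillation. During self-distillation the shared encoder is frozen, and the discrepancy between the teacher and student decoder outputs is combined with the teacher's reconstruction error to improve anomaly detection. State-of-the-art anomaly detectors mostly rely on object detection methods to increase precision, which limits the processing bandwidth of each GPU to a single video stream. The authors instead develop a lightweight model that can process 66 video streams at 25 FPS, significantly reducing processing costs. Unlike competing models, which perform anomaly detection on objects or spatio-temporal cubes, the proposed model takes whole video frames as input, which is more efficient.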
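Anomaly detection is usually formulated as a one-class learning problem: only normal data is available during training, while both normal and abnormal data appear at test time. Methods fall into several categories, including dictionary learning methods, probabilistic models, change detection frameworks, distance-based models, and reconstruction-based methods. Because reconstruction-based methods often reach SOTA results in anomaly detection, many recent works adopt the reconstruction-based paradigm. Depending on the level at which anomaly detection is performed, methods can also be divided into spatio-temporal cube-level, frame-level, and object-level methods.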
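- Frame-level and cube-level: Before the deep learning era, abnormal event detection models mostly took short video sequences and split them into spatio-temporal cuboids, each treated as an independent example and fed to the model. Other studies take whole video frames as input. For example, Liu et al. proposed an effective algorithm that learns to reconstruct the next frame of a short video sequence; a more complex approach uses optical-flow reconstruction to predict the anomalous regions of the input image; and another line of work detects anomalies at the frame level with GANs. Frame-level and cube-level methods share some characteristics, notably relatively fast processing. Frame-level methods have the stronger time advantage, because cube-level methods must treat every cube as an independent example, and processing one mini-batch of frames is more efficient than processing several mini-batches of spatio-temporal cubes. However, cube-level methods usually outperform frame-level ones. The authors propose a masked AE that takes whole frames as input and learns the interactions between video patches.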
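To improve frame-level or cube-level methods, researchers have explored several additional components, such as memory modules and masked convolutional blocks. Although integrating extra modules into these frameworks improves performance, it usually comes at a cost in efficiency. In contrast, the authors aim for a trade-off between performance and speed, with more emphasis on efficiency. They therefore design a lightweight masked AE based on convolutional vision transformer (CvT) blocks and propose several upgrades to it. For example, they use knowledge distillation to exploit the discrepancy between the teacher and student models; to keep the processing time low, they adopt self-distillation, with a shared encoder for the teacher and student models.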
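- Object-level methods: To reduce false positive detections, some studies focus on anomalous objects rather than anomalous frames or cubes. Object-centric methods use the prior provided by an object detector so that the anomaly detector looks only at the detected objects. This type of framework substantially improves accuracy and reaches SOTA results. The drawback is that the inference speed of the whole framework is bounded by the speed of the object detector, which is typically slower than the anomaly detection framework itself, so processing time is severely limited. In contrast, the authors perform anomaly detection at the frame level.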
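- Masked auto-encoders in anomaly detection: Kaiming He et al. proposed MAE as a pre-training method for obtaining strong backbones for downstream tasks. Since then, the method has been applied in many areas, such as video processing and multimodal learning, and masking frameworks have also been used for anomaly detection. To the best of the authors' knowledge, they are the first to propose a masked transformer-based AE for video anomaly detection. Beyond applying a standard MAE, they propose several modifications: emphasizing tokens with higher motion, augmenting training videos with synthetic anomalies, and employing self-distillation.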
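- Knowledge distillation in anomaly detection: Knowledge distillation was originally used to compress one or more large models (teachers) into a lighter model (student). It has recently proved useful for anomaly detection, because the representation discrepancy between teacher and student is larger on anomalies. Unlike previous studies, the authors are the first to introduce self-distillation into anomaly detection. Self-distillation in its original form attaches cls heads at multiple depths to improve the classifier's performance. In contrast, the authors integrate self-distillation into a masked AE and use two decoders of different depths. Thanks to the shared encoder, they can exploit the reconstruction discrepancy between teacher and student with very little computational overhead.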
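The method is a teacher-student transformer-based masked AE trained with a two-stage pipeline. In the first stage, the teacher masked AE is optimized with a reconstruction loss that uses the new weighting mechanism based on motion gradients. In the second stage, only the last decoder block of the student masked AE is optimized; the student shares most of its backbone with the teacher to preserve efficiency. The authors then describe how training videos are built with synthetic anomalies and how the masked AEs are trained to jointly predict the anomaly maps while ignoring the anomalies in the training frames. Finally, they introduce a cls head that distinguishes frames with and without synthetic anomalies, which further improves performance. The overall architecture is built from visual transformer blocks; unlike the original approach, the ViT blocks are replaced with CvT blocks, and, for speed, the dense layers are replaced with pointwise convolutions.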
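- Motion gradient weighting: Natural images vary in both foreground and background. Abnormal event detection data sets, however, contain videos from fixed cameras with a static background, so learning to reconstruct the static background with an MAE is trivial and useless. Naively reconstructing masked tokens is therefore suboptimal for video anomaly detection, and the authors propose taking motion gradients into account when computing the reconstruction loss.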
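Let \(X_t\) be the video frame at index \(t\), divided into \(n\) non-overlapping visual tokens. Following the work of Ionescu et al., the gradient map of frame \(X_t\) is estimated as the difference between two consecutive frames, each first passed through a \(3 \times 3\) median filter. Next, the gradient magnitude map \(g_t\) is divided into non-overlapping patches, yielding a set of gradient patches. Within each gradient patch, the maximum gradient magnitude is computed per channel, and these maximum magnitudes are then averaged across channels. Finally, the token-wise weights for the reconstruction loss are computed from these values.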

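Introducing these weights into the conventional token-level reconstruction loss makes the MAE focus on reconstructing patches with high motion magnitude. Although the reconstruction loss emphasizes high-motion tokens, the masked tokens are still chosen at random.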
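
The weighting steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the patch size of 16, the use of the absolute frame difference, and the normalization of the weights so that they sum to 1 are all assumptions, since the notes do not give the exact formulas.

```python
import numpy as np


def median_filter_3x3(frame):
    """Apply a 3x3 median filter to every channel of an (H, W, C) frame."""
    padded = np.pad(frame, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # windows has shape (H, W, C, 3, 3): one 3x3 neighborhood per pixel and channel.
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3), axis=(0, 1))
    return np.median(windows, axis=(-2, -1))


def token_weights(prev_frame, frame, patch=16, eps=1e-8):
    """Token-wise weights from motion gradients (hypothetical helper).

    Steps: median-filter both frames, take the absolute difference as the
    gradient magnitude map, split it into non-overlapping patches, take the
    per-channel maximum in each patch, then average over channels.
    """
    grad = np.abs(median_filter_3x3(frame) - median_filter_3x3(prev_frame))
    h, w, c = grad.shape
    gh, gw = h // patch, w // patch
    patches = grad[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    per_channel_max = patches.max(axis=(1, 3))          # (gh, gw, C)
    magnitude = per_channel_max.mean(axis=-1).ravel()   # (n,)
    # Assumption: normalize so the weights sum to 1 (all zeros if nothing moves).
    return magnitude / (magnitude.sum() + eps)


def weighted_reconstruction_loss(target_tokens, predicted_tokens, weights):
    """Weighted token-level MSE; tokens have shape (n, d), weights shape (n,)."""
    per_token = ((target_tokens - predicted_tokens) ** 2).mean(axis=1)
    return float((weights * per_token).sum())
```

With this weighting, a token that covers a static background region contributes almost nothing to the loss, while tokens that overlap moving objects dominate it.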
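- Self-distillation: Knowledge distillation has proven its utility in anomaly detection. Intuitively, because the teacher and student models are both trained on normal data, their reconstructions are similar for normal test samples, but their behavior is not guaranteed to be similar on abnormal samples. The teacher-student output gap (discrepancy) can therefore serve as a measure of the anomaly level of a given sample. Unfortunately, this approach needs both the teacher and the student model at inference time, which halves the speed. To avoid the extra burden of running a second model at inference, the authors propose a new variant of self-distillation with a shared encoder and two decoders (teacher and student). The student decoder branches off after the first transformer block of the teacher decoder and adds one more transformer block of its own.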
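- The training process has two stages. In the first stage, the teacher is trained with the weighted reconstruction loss. In the second stage, the weights of the shared backbone are frozen and only the student decoder is trained via self-distillation. The self-distillation loss is analogous to the teacher's reconstruction loss; the main difference is that, instead of reconstructing the patches of the real image, the student learns to reconstruct the patches produced by the teacher.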
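- MAEs tend to generalize too well to out-of-distribution (OOD) data. This behavior is undesirable in anomaly detection, because AE-based methods depend on high reconstruction errors for abnormal examples and low reconstruction errors for normal ones. To this end, the authors propose augmenting the training videos with abnormal events. Since collecting abnormal training examples from the real world is not feasible, they use synthetic anomalies: they leverage the recent UBnormal data set and its accurate pixel-level annotations to crop out abnormal events and blend them into the training videos, while ensuring the temporal consistency of the added events. The synthetic examples help the model in three ways. First, the reconstruction loss takes the original training frames (without superimposed anomalies) as ground truth, so the model learns to ignore anomalies; in the loss, the target patches come from the normal version of frame \(X_t\). Second, the anomaly map is added to the target image as an extra channel, with normal pixels set to 0 and abnormal pixels set to 1, which means that every patch has one additional channel. Third, the ground-truth anomaly map is used to boost the weights. The added synthetic anomalies do not necessarily produce motion gradients of high magnitude, so patches covering anomaly regions may receive low weights, which is undesirable if the model is supposed to detect anomalies. The authors therefore add the anomaly maps to the gradients before computing the weights.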
- Classification head: The authors further exploit the synthetic anomalies to train a cls head applied to the final [cls] token of the shared encoder. The head is trained with binary cross-entropy to distinguish frames with synthetic anomalies from frames without them.
- Inference: At inference time, each frame \(X_t\) is passed through the teacher and student models to obtain the reconstructed frames \(\hat{X}_t\) and \(\tilde{X}_t\), respectively. The output pixel-level anomaly map is then computed as \[o_t = \alpha \cdot \|X_t - \hat{X}_t\|^2 + \beta \cdot \|\tilde{X}_t - \hat{X}_t\|^2 + \gamma \cdot \dot{y}_t,\] where \(\alpha, \beta, \gamma\) control the importance of the individual anomaly score components. The anomaly volumes are smoothed with spatio-temporal 3D filtering. To obtain frame-level anomaly scores, the maximum value of each output map \(o_t\) is kept, and a further temporal Gaussian filter is then applied to smooth these values.
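
The inference step can be illustrated with a small NumPy sketch. It is written under several assumptions that are not stated in the notes: the squared norms are taken per pixel over the channel axis, the third term `y_map` is treated as a per-pixel anomaly map prediction, the default values of `alpha`, `beta`, `gamma` and `sigma` are placeholders, and the spatio-temporal 3D smoothing of the anomaly volume is omitted for brevity. All function names are hypothetical.

```python
import numpy as np


def pixel_anomaly_map(x, x_teacher, x_student, y_map,
                      alpha=1.0, beta=1.0, gamma=1.0):
    """Combine the three score components into a pixel-level anomaly map.

    x, x_teacher, x_student: (..., H, W, C) frames; y_map: (..., H, W).
    """
    teacher_error = ((x - x_teacher) ** 2).sum(axis=-1)        # teacher reconstruction error
    discrepancy = ((x_student - x_teacher) ** 2).sum(axis=-1)  # teacher-student gap
    return alpha * teacher_error + beta * discrepancy + gamma * y_map


def gaussian_kernel(sigma, radius):
    """1D Gaussian kernel normalized to sum to 1."""
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    return kernel / kernel.sum()


def frame_scores(anomaly_maps, sigma=2.0):
    """Frame-level scores from a (T, H, W) stack of pixel-level anomaly maps.

    Keeps the maximum of each map, then smooths the resulting sequence
    with a temporal Gaussian filter.
    """
    raw = anomaly_maps.reshape(anomaly_maps.shape[0], -1).max(axis=1)
    radius = int(3 * sigma)
    padded = np.pad(raw, radius, mode="edge")
    # 'valid' convolution on the padded signal returns exactly T values.
    return np.convolve(padded, gaussian_kernel(sigma, radius), mode="valid")
```

Taking the maximum per map means that a single strongly anomalous region is enough to raise the frame score, and the temporal filter then suppresses isolated one-frame spikes.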

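Overview (Fig. 1^[1]): During training, video frames are augmented with synthetic anomalies. The teacher decoder learns to reconstruct the original frames (without anomalies) and to predict the anomaly maps. The student decoder learns to reproduce the teacher's output. Motion gradients are aggregated at the token level and used as weights for the reconstruction loss. Red dashed lines mark steps that are executed only during training.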
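
The construction of a training example from the overview can be sketched as below. This is a single-frame illustration under stated assumptions: the helper names, the hard (0/1) blending, and the patch size of 16 are hypothetical, the weight normalization is assumed, and the temporal consistency of the pasted events, which the authors enforce, is not modeled here.

```python
import numpy as np


def superimpose_anomaly(frame, anomaly_rgb, anomaly_mask):
    """Paste synthetic anomaly pixels onto an (H, W, C) frame where the (H, W) mask is 1."""
    mask = anomaly_mask[..., None].astype(frame.dtype)
    return mask * anomaly_rgb + (1.0 - mask) * frame


def build_target(clean_frame, anomaly_mask):
    """Target = clean frame plus the anomaly map as an extra channel, shape (H, W, C + 1)."""
    extra = anomaly_mask[..., None].astype(clean_frame.dtype)
    return np.concatenate([clean_frame, extra], axis=-1)


def combined_token_weights(gradient_map, anomaly_mask, patch=16, eps=1e-8):
    """Token weights computed after adding the anomaly map to the motion gradients."""
    boosted = gradient_map + anomaly_mask[..., None]
    h, w, c = boosted.shape
    gh, gw = h // patch, w // patch
    patches = boosted[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    magnitude = patches.max(axis=(1, 3)).mean(axis=-1).ravel()
    # Assumption: normalize so the weights sum to 1.
    return magnitude / (magnitude.sum() + eps)
```

The model input would be the output of `superimpose_anomaly`, while the loss compares the prediction with the output of `build_target`, so the network is rewarded for erasing the anomaly from the image channels and marking it in the extra channel.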
