Return of Unconditional Generation: A Self-supervised Representation Generation Method[1]

The authors are Tianhong Li, Dina Katabi, and Kaiming He from MIT CSAIL. Citation: Li, Tianhong et al. “Return of Unconditional Generation: A Self-supervised Representation Generation Method.” (2023).

The aim is to understand the essential semantic information of object images rather than stopping at surface-level patterns and features, so as to improve generalization: understanding and learning feature representations from small to large, from fine-grained to broad, from local to global.

Key Words

  • unconditional generation with unlabeled data.
  • self-supervised encoder: MoCov3 ViT-B
  • Representation Generation: RDM 12-block, 1536-hid-dim for 100 epochs
  • Image generation: MAGE-B for 200 epochs
  • Representation-Conditioned Generation(RCG)
  • generate semantic representations in the representation space

Summary

  1. Generative models have long been developed as unsupervised methods, with landmark works such as GANs, VAEs, and diffusion models. These foundational methods focus on the probability distribution of the data and do not rely on the availability of human annotations. The problem is often categorized as unconditional generation: learning complex distributions from large amounts of unlabeled data. Closing the gap between conditional and unconditional generation is a valuable problem, and unleashing the power of large-scale unlabeled data is a necessary step toward it.
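
To make the pipeline listed above concrete, here is a minimal inference-time sketch of representation-conditioned generation. The RDM and PixelGenerator classes, their sizes, and the simplistic denoising loop are toy placeholders of my own, not the paper's actual RDM or MAGE implementation.

    # Toy sketch of RCG inference: sample a representation, then decode pixels conditioned on it.
    import torch
    import torch.nn as nn

    class RDM(nn.Module):
        """Toy stand-in for the representation diffusion model (an MLP acting in representation space)."""
        def __init__(self, dim=768, hidden=1536):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))

        @torch.no_grad()
        def sample(self, batch, dim=768, steps=50):
            rep = torch.randn(batch, dim)          # start from Gaussian noise
            for _ in range(steps):                 # iteratively "denoise" in representation space
                rep = rep - 0.1 * self.net(rep)
            return rep

    class PixelGenerator(nn.Module):
        """Toy stand-in for the representation-conditioned pixel generator (MAGE-B in the paper)."""
        def __init__(self, dim=768, image_shape=(3, 64, 64)):
            super().__init__()
            self.image_shape = image_shape
            self.decode = nn.Linear(dim, image_shape[0] * image_shape[1] * image_shape[2])

        @torch.no_grad()
        def generate(self, rep):
            return self.decode(rep).view(rep.size(0), *self.image_shape)

    rdm, pixel_gen = RDM(), PixelGenerator()
    rep = rdm.sample(batch=4)                      # stage 1: generate semantic representations
    images = pixel_gen.generate(rep)               # stage 2: generate pixels conditioned on them
    print(images.shape)                            # torch.Size([4, 3, 64, 64])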

    Read more »

Deconstructing Denoising Diffusion Models for Self-supervised Learning[1]

The authors are Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He, from FAIR and NYU. Citation: Chen, Xinlei et al. “Deconstructing Denoising Diffusion Models for Self-Supervised Learning.” ArXiv abs/2401.14404 (2024): n. pag.

Key Words

  • Denoising Diffusion Models
  • Denoising Autoencoder
  • low-dimensional latent space

Summary

  1. Denoising is at the core of today's generative models, e.g., DDMs. These models generate very well and appear to learn useful representations of visual content. Two open questions:
    • Current studies of DDMs' representation ability use off-the-shelf pre-trained DDMs, which were originally built for generation, and evaluate their representations for recognition;
    • It is unclear whether the representation ability comes from the denoising-driven process or from the diffusion-driven process.
  2. The paper's approach is to deconstruct a DDM, step by step turning it into a classical DAE, and to examine each component along the way. The main finding is that the key component is the tokenizer, which creates a low-dimensional latent space; the role of using multiple levels of noise is analogous to a form of data augmentation.
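
As a toy illustration of that finding (my own sketch, not the paper's exact l-DAE configuration): project the data into a low-dimensional latent space with PCA as the "tokenizer", corrupt the latent with noise of varying levels, and train a small denoiser to recover the clean latent.

    # Toy sketch: a denoising autoencoder in a low-dimensional PCA latent space.
    import torch
    import torch.nn as nn

    images = torch.randn(512, 3 * 32 * 32)                   # toy "dataset", flattened
    _, _, v = torch.pca_lowrank(images, q=16)                 # tokenizer: a 16-dim latent basis
    latents = images @ v                                      # project into the latent space

    denoiser = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 16))
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

    for step in range(100):
        sigma = torch.rand(latents.size(0), 1)                # multiple noise levels ~ data augmentation
        noisy = latents + sigma * torch.randn_like(latents)   # corrupt the clean latent
        loss = ((denoiser(noisy) - latents) ** 2).mean()      # learn to predict the clean latent
        opt.zero_grad()
        loss.backward()
        opt.step()
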
Read more »

Is there a tool that can visualize how functions are executed and called in code? When reading large projects, the call relationships between functions look messy and are hard to remember and untangle.
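
One standard-library starting point (my suggestion, not something from the original post) is Python's profiler, which records caller/callee relations that graphical viewers can then render:

    # Sketch: collect caller/callee relations with the standard library (Python assumed).
    import cProfile
    import pstats

    def helper(n):
        return sum(range(n))

    def main():
        return [helper(i) for i in range(200)]

    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()

    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_callees("main")     # which functions does main() call, and how often?
    stats.print_callers("helper")   # which functions call helper()?
    stats.dump_stats("calls.prof")  # the dump can be loaded by graphical profile viewers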

Dual viewpoints, 3D trajectories

BEiT: BERT Pre-Training of Image Transformers[1]

The authors are Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei from Harbin Institute of Technology and Microsoft. Citation: Bao, Hangbo et al. “BEiT: BERT Pre-Training of Image Transformers.” ArXiv abs/2106.08254 (2021): n. pag.

Time

  • 2021.Jun

Key Words

  • Self-supervised vision representation model: BEiT
  • pre-training task: masked image modeling (MIM)
  • two views of image representation: image patches (input) and visual tokens (output)

Problem Addressed

  1. Directly applying the BERT recipe to image data is challenging:

    • There is no pre-existing vocabulary for ViT's input units, i.e., image patches, so we cannot simply use a softmax classifier to predict over all possible candidates for the masked patches.
    • A straightforward alternative is to treat the task as a regression problem that predicts the raw pixels of the masked patches, but such pixel-level recovery tends to waste modeling capability on pre-training short-range dependencies and high-frequency details.
      Read more »
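
To make the masked-image-modeling setup above concrete, here is a minimal sketch: masked patch positions are replaced by a learnable mask embedding, and a classifier predicts discrete visual-token ids only at the masked positions. The tokenizer output is faked with random ids, and all sizes are illustrative rather than BEiT's actual configuration.

    # Sketch: masked image modeling against a discrete visual-token target (illustrative sizes).
    import torch
    import torch.nn as nn

    num_patches, dim, vocab_size, mask_ratio = 196, 768, 8192, 0.4

    patch_embed = nn.Linear(16 * 16 * 3, dim)                # image patches -> embeddings
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=2)
    mim_head = nn.Linear(dim, vocab_size)                    # predict a visual-token id per patch
    mask_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable mask embedding

    patches = torch.randn(2, num_patches, 16 * 16 * 3)       # toy batch of patchified images
    token_ids = torch.randint(0, vocab_size, (2, num_patches))  # stand-in for the image tokenizer

    x = patch_embed(patches)
    mask = torch.rand(2, num_patches) < mask_ratio           # randomly choose patches to mask
    x = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)

    logits = mim_head(encoder(x))                            # classify over the visual vocabulary
    loss = nn.functional.cross_entropy(logits[mask], token_ids[mask])  # loss only on masked patches
    loss.backward()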

Masked Autoencoders As Spatiotemporal Learners[1]

The authors are Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He from FAIR. Citation [1]: Feichtenhofer, Christoph et al. “Masked Autoencoders As Spatiotemporal Learners.” ArXiv abs/2205.09113 (2022): n. pag.

Key Words

  • extension of MAE to video

  • minimal domain knowledge

Read more »

Bugs encountered when training or fine-tuning VideoMAE

  1. Error reported when training VideoMAE:

      File "/home/MAE-Action-Detection/run_class_finetuning.py", line 404, in main
        train_stats = train_one_epoch(
      File "/home/MAE-Action-Detection/engine_for_finetuning.py", line 59, in train_one_epoch
        for step, (samples, boxes, _) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
      File "/home/MAE-Action-Detection/utils.py", line 141, in log_every
        for obj in iterable:
      File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
        data = self._next_data()
      File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
        return self._process_data(data)
      File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
        data.reraise()
      File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
        raise self.exc_type(msg)
      File "av/error.pyx", line 78, in av.error.FFmpegError.__init__
      TypeError: __init__() takes at least 3 positional arguments (2 given)

The fix: upgrade torch, torchvision, and the related packages to version 1.13 (they were 1.9 before); the error then goes away.

  2. Issues with the AVA dataset format

Following AlphAction's format: the bbox format in the ava_det.json file under the boxs folder defaults to x1,y1,w,h rather than x1,y1,x2,y2, so there is a spot later in AVADataset that does Box(mode="xyxy").convert("xywh"). If your bboxes are in x1,y1,x2,y2 format, the convert("xywh") is not needed; if they are in the default x1,y1,w,h format, the conversion is needed. This is a big pitfall, and the authors do not seem to mention it anywhere.
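
For reference, a minimal conversion helper in plain NumPy (not AlphAction's actual Box/BoxList API) showing what the two formats mean:

    # Sketch: convert between (x1, y1, w, h) and (x1, y1, x2, y2) box formats.
    import numpy as np

    def xywh_to_xyxy(boxes):
        """boxes: (N, 4) array of x1, y1, w, h -> x1, y1, x2, y2."""
        boxes = np.asarray(boxes, dtype=np.float32)
        x1, y1, w, h = boxes.T
        return np.stack([x1, y1, x1 + w, y1 + h], axis=1)

    def xyxy_to_xywh(boxes):
        """boxes: (N, 4) array of x1, y1, x2, y2 -> x1, y1, w, h."""
        boxes = np.asarray(boxes, dtype=np.float32)
        x1, y1, x2, y2 = boxes.T
        return np.stack([x1, y1, x2 - x1, y2 - y1], axis=1)

    print(xywh_to_xyxy([[10, 20, 30, 40]]))  # [[10. 20. 40. 60.]]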

SlowFast Networks for Video Recognition[1]

The authors are Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He from FAIR. Citation [1]: Feichtenhofer, Christoph et al. “SlowFast Networks for Video Recognition.” 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2018): 6201-6210.

Time

  • 2018.Dec

Key Words

  • Slow pathway to capture spatial semantics
  • lightweight Fast pathway to capture temporal motion at fine temporal resolution

Motivation

  1. All spatiotemporal orientations are not equally likely, so there is no reason to treat space and time symmetrically.
  2. Inspired by biological studies of the retinal ganglion cells in the primate visual system: roughly 80% are Parvocellular (P-cells) and roughly 20% are Magnocellular (M-cells).
    • M-cells operate at high temporal frequency \(\rightarrow\) fast temporal changes
    • P-cells detect spatial information: spatial detail and color, at lower temporal resolution
      Read more »
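
To make the two-pathway idea concrete, a small sketch of the frame sampling; the stride \(\tau=16\), speed ratio \(\alpha=8\), and channel ratio \(\beta=1/8\) follow the paper's defaults, while the tensors are toy placeholders.

    # Sketch: temporal sampling for the Slow and Fast pathways (toy tensor, paper-default ratios).
    import torch

    video = torch.randn(1, 3, 64, 224, 224)    # (batch, channels, frames, height, width)

    tau, alpha, beta = 16, 8, 1 / 8             # slow stride, speed ratio, channel ratio

    slow_frames = video[:, :, ::tau]            # Slow pathway: 64 / 16 = 4 frames
    fast_frames = video[:, :, ::tau // alpha]   # Fast pathway: 64 / 2 = 32 frames (alpha x more)

    slow_channels = 64                          # e.g. first-stage width of the Slow pathway
    fast_channels = int(slow_channels * beta)   # Fast pathway is lightweight: beta x fewer channels

    print(slow_frames.shape, fast_frames.shape, slow_channels, fast_channels)
    # torch.Size([1, 3, 4, 224, 224]) torch.Size([1, 3, 32, 224, 224]) 64 8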

Training Detectron2 on your own object detection dataset

  1. The main steps are to register your own dataset and then use it for training.

    from detectron2.data.datasets import register_coco_instances

    # register_coco_instances(name, metadata, json_file, image_root)
    register_coco_instances("train", {}, "json_annotation.json", "path/to/image/dir")

  2. After that, it is mostly a matter of the config file and the trainer; a sketch follows below.
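
A minimal training sketch with Detectron2's config system and DefaultTrainer; the model choice, class count, and solver settings below are illustrative values, not something prescribed by this post:

    # Sketch: train on the registered "train" split with DefaultTrainer (illustrative settings).
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
    cfg.DATASETS.TRAIN = ("train",)         # the name used in register_coco_instances above
    cfg.DATASETS.TEST = ()
    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1     # set to the number of classes in your dataset
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.SOLVER.BASE_LR = 2.5e-4
    cfg.SOLVER.MAX_ITER = 1000

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()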

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pretraining[1]

The authors are Zhan Tong, Yibing Song, Jue Wang, and Limin Wang, from Nanjing University, Tencent, and Shanghai AI Lab respectively. Citation [1]: Tong, Zhan et al. “VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training.” ArXiv abs/2203.12602 (2022): n. pag.

Time

  • 2022.Mar

Key Words

  • video masked autoencoder using plain ViT backbones, tube masking with a high ratio (see the sketch after this list)
  • data-efficient learner that can be successfully trained with only 3.5k videos. Data quality matters more than quantity for SSVP (self-supervised video pre-training) when a domain shift exists between the source and target datasets.
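
A minimal sketch of the tube masking mentioned above: one spatial mask is sampled and shared across all temporal positions, with a very high masking ratio. The token-grid size and the 90% ratio are illustrative values in the spirit of the paper's defaults.

    # Sketch: tube masking -- the same spatial mask is repeated across time (illustrative sizes).
    import torch

    t_tokens, h_tokens, w_tokens, mask_ratio = 8, 14, 14, 0.9

    num_spatial = h_tokens * w_tokens
    num_masked = int(num_spatial * mask_ratio)

    # Sample one random spatial mask ...
    perm = torch.randperm(num_spatial)
    spatial_mask = torch.zeros(num_spatial, dtype=torch.bool)
    spatial_mask[perm[:num_masked]] = True

    # ... and share it across every temporal position ("tube" masking).
    tube_mask = spatial_mask.unsqueeze(0).expand(t_tokens, num_spatial)  # (T, H*W)

    print(tube_mask.shape, tube_mask.float().mean().item())  # torch.Size([8, 196]) ~0.9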

Motivation

  1. Video transformers are usually derived from image-based transformers and rely heavily on models pre-trained on large-scale image data. Efficiently training a vanilla vision transformer on a video dataset, without any pre-trained model or extra image data, is a challenge.

    Read more »

《围城》 (Fortress Besieged)

This is a work by Qian Zhongshu. The phrase "fortress besieged" comes up often in everyday life, and I had heard about the book long ago. Recently I finally bought a copy, planning to read it in my spare time, spend less time on phone and web feeds, and more time on books I am interested in. I am leaving this as a placeholder to fill in gradually; I will read more books and keep updating the homepage.

Chapter Notes

So far I have read to the part after Fang Hongjian returns home from studying abroad. Various things happen between him, Miss Su, and Miss Tang; later he has a falling-out with the Zhou family, and after Miss Su gets married he no longer stays at the Zhous'. He receives an offer from Sanlü University, goes back home and tells his parents, meets Zhao Xinmei again, and prepares to leave for Sanlü University.

I will keep updating this as I read on.
