MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking[1]

The authors are En Yu et al. from MEGVII and other institutions. Paper citation [1]: Yu, En et al. “MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking.” ArXiv abs/2305.14298 (2023): n. pag.

Time

  • 2023.May

Key Words

  • conflict between detection and association
  • detect query only for newly appearing targets
  • track queries for localizing previously detected targets (handling the association part in an implicit manner)

Summary

  1. In short, MOTR's problem lies in the conflict between detection and association, which MOTRv2 addressed with an extra detection network. The authors instead attribute this conflict to the unfair label assignment between detect queries and track queries during training, where detect queries recognize targets and track queries then associate them. Based on this observation, they propose MOTRv3, which balances the label assignment process with a release-fetch supervision strategy: labels are first released for detection and then gradually fetched back for association. Two further strategies, pseudo label distillation and track group denoising, improve the supervision for detection and association, all without requiring an extra detection network. A hypothetical schedule for the release-fetch idea is sketched below.
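To make the release-fetch idea concrete, here is a minimal, purely illustrative sketch of a label-assignment schedule; the function name, the linear ramp, and the epoch thresholds are assumptions, not the paper's exact formulation:

```python
def track_label_ratio(epoch: int, release_end: int, fetch_end: int) -> float:
    """Hypothetical release-fetch schedule (illustrative, not the paper's exact curve).

    Before `release_end`, all labels are released to detect queries (ratio 0.0);
    between `release_end` and `fetch_end`, labels of previously tracked targets
    are gradually fetched back to track queries; afterwards the standard MOTR
    assignment (ratio 1.0) is used.
    """
    if epoch < release_end:
        return 0.0
    if epoch >= fetch_end:
        return 1.0
    return (epoch - release_end) / (fetch_end - release_end)
```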

Grounded Language-Image Pre-training[1]

The authors are Liunian Harold Li, Pengchuan Zhang, et al. from UCLA, Microsoft Research, UW, and other institutions. Paper citation [1]: Li, Liunian Harold et al. “Grounded Language-Image Pre-training.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 10955-10965.

Time

  • 2022.Jun

Key Words

  • object-level representation
  • One-sentence summary: GLIP reformulates detection as a grounding task by aligning each region/box with phrases in a text prompt. GLIP jointly trains the image and language encoders to predict the correct pairings of regions and words, and adds deep fusion between the two modalities to learn language-aware visual representations.

Summary

  1. The paper proposes a grounded language-image pre-training model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training, which brings two benefits: 1. GLIP can learn from both detection and grounding data, improving both tasks and bootstrapping a good grounding model; 2. GLIP can exploit massive image-text pairs by generating grounding boxes via self-training, making the learned representations semantic-rich. A simplified sketch of the region-word alignment is given below.
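A minimal sketch of the core alignment computation, assuming simple dot-product logits between region and token embeddings (the function name and shapes are illustrative):

```python
import torch

def alignment_scores(region_feats: torch.Tensor, token_feats: torch.Tensor) -> torch.Tensor:
    """Region-word alignment logits as dot products (simplified sketch).

    region_feats: (N, d) region/box embeddings from the image encoder.
    token_feats:  (M, d) (sub-)word embeddings of the text prompt.
    In grounding-style detection, these logits replace a detector's fixed
    classification layer: each region is scored against every prompt token.
    """
    return region_feats @ token_feats.t()  # (N, M) region-word logits
```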

3D Gaussian Splatting for Real-Time Radiance Field Rendering[1]

The authors are Bernhard Kerbl et al. from Inria, France. Paper citation [1]:

Time

  • 2023.Aug

Key Words

Summary

  1. Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) rendered at 1080p resolution, no prior method achieves real-time display rates. The authors introduce three key elements that reach SOTA visual quality with competitive training times and enable high-quality, real-time novel-view synthesis at 1080p. First, starting from the sparse points produced during camera calibration, the scene is represented with 3D Gaussians, which preserve the desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space. Second, interleaved optimization/density control of the 3D Gaussians is performed, notably optimizing anisotropic covariance to achieve an accurate representation of the scene. Third, a fast visibility-aware rendering algorithm is developed that supports anisotropic splatting, accelerating training and enabling real-time rendering. The covariance parameterization is sketched below.
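As a minimal sketch of the anisotropic covariance mentioned above: 3DGS factorizes each Gaussian's covariance as Sigma = R S S^T R^T, with a diagonal scale S and a rotation R stored as a quaternion, which keeps Sigma positive semi-definite during optimization. The helper below is illustrative (plain numpy, no activations on the raw parameters):

```python
import numpy as np

def covariance_from_scale_rotation(scale, quat):
    """Build an anisotropic 3D Gaussian covariance Sigma = R S S^T R^T.

    scale: (3,) per-axis scales (kept positive via an activation in practice).
    quat:  (4,) quaternion (w, x, y, z), normalized here to a unit rotation.
    """
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    S = np.diag(scale)
    return R @ S @ S.T @ R.T
```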

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[1]

The authors are Junnan Li et al. from Salesforce Research. Paper citation [1]: Li, Junnan et al. “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.” International Conference on Machine Learning (2023).

Time

  • 2023.Jun

Key Words

  • One-sentence summary: BLIP-2 is a vision-language pre-training method that bootstraps from frozen pre-trained unimodal models. To bridge the modality gap, it proposes the Querying Transformer, pre-trained in two stages: the first stage performs vision-language representation learning with a frozen image encoder; the second is a vision-to-language generative learning stage with a frozen LLM.

Summary

  1. The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. BLIP-2 bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models, bridging the modality gap with a lightweight Querying Transformer that is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder; the second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves SOTA performance on multiple vision-language tasks while having far fewer trainable parameters; a minimal sketch of the querying idea follows.
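A minimal sketch of the querying idea, assuming a single cross-attention layer; the class name `QueryingBridge` and all dimensions are illustrative assumptions, and the real Q-Former is a full BERT-style transformer with additional training objectives:

```python
import torch
import torch.nn as nn

class QueryingBridge(nn.Module):
    """Learned queries cross-attend to frozen image features, then are
    projected into the frozen LLM's input space; only this bridge trains."""

    def __init__(self, n_queries: int = 32, dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)  # maps into the LLM embedding size

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, L, dim), produced by the frozen image encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(out)  # (B, n_queries, llm_dim) soft prompts for the LLM
```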

D-FINE: Redefine Regression Task in DETRs as Fine-Grained Distribution Refinement[1]

The authors are Yansong Peng, Hebei Li, et al. from USTC and other institutions. Paper citation [1]:

Time

  • 2024.Oct

Key Words

  • iteratively refining probability distributions, fine-grained intermediate representation
  • transfers localization knowledge from refined distributions to shallower layers through self-distillation

Summary

  1. D-FINE is a real-time object detector that achieves strong localization by redefining the regression task in DETR models. It consists of two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR replaces the regression of fixed coordinates with iteratively refining probability distributions, providing a fine-grained intermediate representation that improves localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, simplifying the residual prediction tasks for deeper layers. In addition, D-FINE introduces lightweight optimizations in compute-intensive modules and operations, striking a balance between speed and accuracy. A simplified sketch of distribution-based regression appears below.
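A simplified sketch of distribution-based regression in the spirit of FDR, assuming a discretized set of candidate offsets per box edge; the function names and the residual-update form are illustrative, not D-FINE's exact formulation:

```python
import torch
import torch.nn.functional as F

def distribution_to_offset(logits: torch.Tensor, bins: torch.Tensor) -> torch.Tensor:
    """Turn per-edge logits over discrete bins into continuous offsets.

    logits: (N, 4, n_bins) raw scores for each of the four box edges.
    bins:   (n_bins,) candidate offset values.
    Returns (N, 4) offsets as the expectation under softmax(logits).
    """
    probs = F.softmax(logits, dim=-1)   # a fine-grained distribution per edge
    return (probs * bins).sum(dim=-1)   # expected offset per edge

def refine(prev_logits: torch.Tensor, residual_logits: torch.Tensor) -> torch.Tensor:
    """Deeper layers predict residual adjustments to the previous layer's
    distribution instead of absolute coordinates (iterative refinement)."""
    return prev_logits + residual_logits
```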

Notes

  1. An SSM (state space model) describes these state representations and predicts what the next state may be given some input; the input is generally a continuous sequence. The core SSM equations are: \[ \begin{align*} \text{State equation} & \quad h'(t) = A h(t) + B x(t) \\ \text{Output equation} & \quad y(t) = C h(t) + D x(t) \end{align*}\] To handle discrete data, the continuous formulation is discretized with the zero-order hold (ZOH) technique: each discrete input is held constant until the next one arrives, yielding a continuous signal that can be sampled at the input timestamps. HiPPO initialization is used to handle long-range dependencies. A minimal ZOH sketch follows.
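A minimal sketch of ZOH discretization under the equations above, assuming an invertible A (the helper name is an illustrative assumption): the discrete system is h[k+1] = A_bar h[k] + B_bar x[k] with A_bar = exp(dt * A) and B_bar = A^{-1}(exp(dt * A) - I)B.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A: np.ndarray, B: np.ndarray, dt: float):
    """Zero-order-hold discretization of h'(t) = A h(t) + B x(t).

    Returns (A_bar, B_bar) such that h[k+1] = A_bar h[k] + B_bar x[k],
    assuming the input x is held constant over each step of length dt.
    """
    A_bar = expm(A * dt)                                           # exp(dt * A)
    B_bar = np.linalg.solve(A, (A_bar - np.eye(A.shape[0])) @ B)   # A^{-1}(exp(dt*A) - I) B
    return A_bar, B_bar

# Toy 2-state SSM sampled at dt = 0.1
A = np.array([[-1.0, 0.0], [0.5, -2.0]])
B = np.array([[1.0], [0.0]])
A_bar, B_bar = zoh_discretize(A, B, dt=0.1)
```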

References

  • https://blog.csdn.net/v_JULY_v/article/details/134923301

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising[1]

The authors are Feng Li, Hao Zhang, et al. from HKUST and other institutions. Paper citation [1]: Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., & Zhang, L. (2022). DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13609-13617.

Time

  • 2022.Dec

Key Words

  • Denoising Training
  • In one sentence: the authors find that a main cause of the slow convergence of DETR-like methods is bipartite matching, which is unstable during training. They therefore add denoising training for boxes and labels, which accelerates convergence and improves performance.

Summary

  1. The authors show that a denoising training approach can accelerate DETR training, and offer a deepened understanding of the slow convergence of DETR-like methods. They show that the slow convergence results from the instability of bipartite matching, which causes inconsistent optimization goals in early training stages. To address this, apart from the Hungarian loss, their method additionally feeds noised GT bounding boxes into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the difficulty of bipartite graph matching and leads to faster convergence. The method is general and can be easily plugged into any DETR-like model to achieve solid improvements. A hedged sketch of the box noising is shown below.
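A hedged sketch of the box-noising step, assuming normalized (cx, cy, w, h) boxes; the hyperparameter names follow the paper's center-shift/box-scale noise (lambda1, lambda2), but the exact sampling here is illustrative:

```python
import torch

def noise_boxes(gt_boxes: torch.Tensor, lambda1: float = 0.4, lambda2: float = 0.4) -> torch.Tensor:
    """Jitter GT boxes (cx, cy, w, h, all normalized) for denoising training.

    Centers shift by up to lambda1 * (w/2, h/2); width/height scale by a
    factor in [1 - lambda2, 1 + lambda2]. The decoder is then trained to
    reconstruct the original boxes from these noised queries.
    """
    cx, cy, w, h = gt_boxes.unbind(-1)
    cx = cx + (torch.rand_like(cx) * 2 - 1) * lambda1 * w / 2
    cy = cy + (torch.rand_like(cy) * 2 - 1) * lambda1 * h / 2
    w = w * (1 + (torch.rand_like(w) * 2 - 1) * lambda2)
    h = h * (1 + (torch.rand_like(h) * 2 - 1) * lambda2)
    return torch.stack([cx, cy, w, h], dim=-1).clamp(0.0, 1.0)
```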

Learning Data Association for Multi-Object Tracking[1]

The authors are Mehdi Miah et al. from Polytechnique Montréal. Paper citation [1]:

Time

  • 2024.Mar

Key Words

Summary
