Momentor++: Advancing Video Large Language Models With Fine-Grained Long Video Reasoning [0.03%]
Juncheng Li, Minghe Gao, Xiangnan He et al.
Large Language Models (LLMs) exhibit remarkable proficiency in understanding and managing text-based tasks. Many works try to transfer these capabilities to the video domain; the resulting models are referred to as Video-LLMs. However, current Video-LLMs ...
Generalizable Egocentric Task Verification Via Cross-Modal Hybrid Hypergraph Matching [0.03%]
Xun Jiang, Xing Xu, Zheng Wang et al.
Egocentric Task Verification (ETV) aims to determine if the operation flows of procedural tasks in egocentric videos align with the logic of given rules. Early works adopt the video-based verification paradigm that compares a reference vide...
Scalable Semi-supervised Learning with Discriminative Label Propagation and Correction [0.03%]
Bingbing Jiang, Jie Wen, Zidong Wang et al.
Semi-supervised learning can leverage both labeled and unlabeled samples simultaneously to improve performance. However, existing methods often present the following issues: (1) The emphasis of learning is put on either the similarity struc...
Defying Distractions in Multimodal Tasks: A Novel Benchmark for Large Vision-Language Models [0.03%]
Jinhui Yang, Ming Jiang, Qi Zhao
Large Vision-Language Models (LVLMs) suffer from "multimodal distractibility," where plausible but irrelevant visual or textual inputs cause significant drops in reasoning consistency and lead to unreliable outputs. This paper introduces a compreh...
Toward Accurate Image Generation via Dynamic Generative Image Transformer [0.03%]
Zhendong Mao, Mengqi Huang, Yijing Lin et al.
Existing generative image transformers follow a two-stage generation paradigm, where the first stage learns a codebook to encode images into discrete codes via vector quantization, and the second stage completes the image generation based o...
A General Image Fusion Approach Exploiting Gradient Transfer Learning and Fusion Rule Unfolding [0.03%]
Wu Wang, Liang-Jian Deng, Qi Cao et al.
The goal of a deep learning-based general image fusion method is to solve multiple image fusion tasks with a single model, thereby facilitating the deployment of models in practical applications. However, existing methods fail to provide an...
Enhance Before Fusion: Multi-View Graph Clustering With Graph Trend Filter [0.03%]
Penglei Wang, Jitao Lu, Danyang Wu et al.
Recently, Multi-View Graph Clustering (MVGC) methods have achieved significant progress, leading to their wide adoption in various applications. However, most MVGC methods merely pursue consistent information by simply fusing multi-view gra...
Isolating Interference Factors for Robust Cloth-Changing Person Re-Identification [0.03%]
De Cheng, Yubo Li, Chaowei Fang et al.
Cloth-Changing Person Re-Identification (CC-ReID) aims to recognize individuals across camera views despite clothing variations, a crucial task for surveillance and security systems. Existing methods typically frame it as a cross-modal alig...
Fangjinhua Wang, Qingtian Zhu, Di Chang et al.
3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene capture...
Non-Gradient Hash Factor Learning for High-Dimensional and Incomplete Data Representation Learning [0.03%]
Di Wu, Shihui Li, Yi He et al.
High-dimensional and incomplete (HDI) data are ubiquitous in various Big Data-related industrial applications, such as drug innovation and recommender systems. Hash learning is the most efficient representation learning approach to extract ...