Cross-modal video retrieval algorithm based on multi-semantic clues
Citation: DING Luo, LI Yifan, YU Chenglong, LIU Yang, WANG Xuan, QI Shuhan. Cross-modal video retrieval algorithm based on multi-semantic clues[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 596-604.
Authors: DING Luo  LI Yifan  YU Chenglong  LIU Yang  WANG Xuan  QI Shuhan
Institution: 1. School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
Funding: National Natural Science Foundation of China; Natural Science Foundation of Guangdong Province
Abstract: Most existing cross-modal video retrieval algorithms ignore the rich semantic clues in the data, so the generated features have poor representational power. To address this problem, a cross-modal video retrieval model based on multi-semantic clues is designed. Through a multi-head self-attention mechanism, the model captures the frames that play an important semantic role within the video modality and selectively attends to the important information of the video data to obtain its global features; a bidirectional gated recurrent unit (GRU) captures the interaction between contexts within the multi-modal data; and local information in the video and text data is mined by jointly encoding the subtle differences between local data. The global features, contextual interaction features, and local features together form the multi-semantic clues of the multi-modal data, which better mine the semantic information in the data and thus improve retrieval performance. On this basis, an improved triplet distance metric loss function is proposed, which adopts a hard negative mining method based on similarity ranking and improves the learning of cross-modal features. Experiments on the MSR-VTT dataset show that, compared with state-of-the-art methods, the proposed algorithm improves text-to-video retrieval by 11.1%; experiments on the MSVD dataset show that, compared with current advanced methods, the overall recall on text-to-video retrieval improves by 5.0%.

Keywords: cross-modal video retrieval; multi-semantic clues; multi-head self-attention mechanism; distance metric loss function; multi-modal
Received: 2020-08-26
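The global video feature described above — frames weighted by a multi-head self-attention mechanism, then pooled into a single vector — can be sketched minimally in NumPy. The head count, the projection matrices, and the mean-pooling step are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(frames, wq, wk, wv, num_heads=4):
    """Attend over video frames so semantically important ones dominate.

    frames: (T, d) sequence of frame features; wq/wk/wv: (d, d) projections.
    Returns a (d,) global feature: attention output mean-pooled over time.
    """
    T, d = frames.shape
    hd = d // num_heads                              # per-head dimension
    q = (frames @ wq).reshape(T, num_heads, hd)
    k = (frames @ wk).reshape(T, num_heads, hd)
    v = (frames @ wv).reshape(T, num_heads, hd)
    out = np.empty_like(q)
    for h in range(num_heads):                       # scaled dot-product attention per head
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd)   # (T, T) frame-to-frame relevance
        out[:, h] = softmax(scores, axis=-1) @ v[:, h]
    return out.reshape(T, d).mean(axis=0)            # pool over frames -> global feature
```

In a full model the pooled vector would feed the cross-modal embedding space alongside the GRU context features and the local features; here only the attention step is shown.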

Cross-modal video retrieval algorithm based on multi-semantic clues
DING Luo, LI Yifan, YU Chenglong, LIU Yang, WANG Xuan, QI Shuhan. Cross-modal video retrieval algorithm based on multi-semantic clues[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 596-604.
Authors: DING Luo  LI Yifan  YU Chenglong  LIU Yang  WANG Xuan  QI Shuhan
Institution: 1. School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China; 2. School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China; 3. Peng Cheng Laboratory, Shenzhen 518055, China
Abstract: Most existing cross-modal video retrieval algorithms map heterogeneous data into a common space, so that semantically similar data lie close together and semantically dissimilar data lie far apart; that is, they establish a global similarity relationship between data of different modalities. However, these methods ignore the rich semantic clues in the data, which makes the generated features perform poorly. To solve this problem, we propose a cross-modal retrieval model based on multi-semantic clues. The model captures the data frames that play an important role in semantics within the video modality through a multi-head self-attention mechanism, selectively attending to the important information of the video data to obtain its global features. A bidirectional gated recurrent unit (GRU) is used to capture the interaction between contexts within the multi-modal data. Our method also mines the local information in the video and text data by jointly encoding the subtle differences between local data. The global features, contextual interaction features, and local features together form the multi-semantic clues of the multi-modal data, which better mine the semantic information in the data and improve the retrieval effect. In addition, an improved triplet distance metric loss function is proposed; it adopts hard negative mining based on similarity ranking and improves the learning of cross-modal features. Experiments on the MSR-VTT dataset show that the proposed method improves text-to-video retrieval by 11.1% compared with state-of-the-art methods; experiments on the MSVD dataset show an overall recall improvement of 5.0% on the same task.
Keywords: cross-modal video retrieval; multi-semantic clues; multi-head self-attention mechanism; distance metric loss function; multi-modal
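The improved triplet loss with hard negative mining based on similarity ranking can be illustrated with a small NumPy sketch. The margin value, cosine similarity, and the bidirectional sum-then-mean reduction are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def hard_negative_triplet_loss(video_emb, text_emb, margin=0.2):
    """Bidirectional triplet ranking loss with hard negative mining.

    video_emb, text_emb: (n, d) arrays where row i of each is a matched pair.
    For each anchor, candidates are ranked by cosine similarity and the
    top-ranked non-matching item is taken as the hardest negative.
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = v @ t.T                        # sim[i, j] = cos(video_i, text_j)
    pos = np.diag(sim)                   # matched-pair similarities
    n = sim.shape[0]
    mask = np.eye(n, dtype=bool)
    neg = np.where(mask, -np.inf, sim)   # exclude the positive pair
    hard_t = neg.max(axis=1)             # hardest text negative per video
    hard_v = neg.max(axis=0)             # hardest video negative per text
    loss_v2t = np.maximum(0.0, margin + hard_t - pos)
    loss_t2v = np.maximum(0.0, margin + hard_v - pos)
    return float(np.mean(loss_v2t + loss_t2v))
```

Taking only the single top-ranked negative is the simplest instance of similarity-ranked hard negative mining; in training, this loss would be computed over each mini-batch of video-text pairs.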
This article is indexed by CNKI, Wanfang Data, and other databases.