Blog: 大工至善|大学至真分享 http://blog.sciencenet.cn/u/lcj2212916


[Repost] [Computer Science] [2019.05] Ensemble Learning for Image Caption Generation Based on Deep Neural Networks

1044 views 2021-8-10 19:16 | Category: Research notes | Source: Repost



This is a master's thesis from Arizona State University, USA (author: Harshitha Katpally), 131 pages long.

 

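Converting the information in an image into a natural-language sentence is considered a hard problem for computers. Image captioning involves not only detecting objects in an image but also understanding the interactions between those objects so that they can be rendered as a relevant caption. Expertise in both computer vision and natural language processing is therefore essential. Sequence-to-sequence modeling with deep neural networks is the conventional approach: the model generates a sequence of words that together describe the image. However, these models suffer from high variance, meaning that they fit the training data but generalize poorly beyond it.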

 

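This thesis focuses on reducing that variance in order to generate better captions. To this end, it explores ensemble learning, a family of techniques known for addressing high variance in machine learning algorithms. Three ensemble techniques are evaluated: k-fold ensembles, bootstrap aggregation (bagging) ensembles, and boosting ensembles. For each technique, three methods of combining the models' outputs are analyzed. Extensive experiments were conducted on the Flickr8k dataset, which contains 8,000 images with 5 different captions each. Predictions are evaluated with the BLEU score, which the thesis treats as the standard metric for natural language processing (NLP) problems. By this metric, the analysis shows that ensemble learning performs significantly better than any of the individual models and produces more meaningful captions.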
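As a rough illustration of how the training data could be divided among ensemble members, the sketch below builds k-fold and bootstrap index subsets and combines per-model word predictions by majority vote. It is not taken from the thesis. The function names are hypothetical, and majority voting is only an assumed stand-in, since the abstract does not say which three output combination methods were used. Boosting is omitted because it depends on model-specific error weighting.

```python
import random
from collections import Counter


def k_fold_subsets(n_items, k):
    """Return k training index lists, each leaving out one contiguous fold."""
    indices = list(range(n_items))
    fold_size = n_items // k
    subsets = []
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_items is not divisible by k.
        end = n_items if fold == k - 1 else start + fold_size
        subsets.append(indices[:start] + indices[end:])
    return subsets


def bootstrap_subsets(n_items, n_models, seed=0):
    """Return n_models index lists, each sampled with replacement (bagging)."""
    rng = random.Random(seed)
    return [[rng.randrange(n_items) for _ in range(n_items)]
            for _ in range(n_models)]


def majority_vote(candidate_words):
    """Pick the most frequent word; ties go to the earliest model's choice.

    Assumed combination rule for illustration only.
    """
    counts = Counter(candidate_words)
    best = max(counts.values())
    for word in candidate_words:
        if counts[word] == best:
            return word
```

Each index list would be used to train one captioning model. At every decoding step the models' proposed next words are passed to `majority_vote`, so `majority_vote(["dog", "cat", "dog"])` selects `"dog"`.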

 

Capturing the information in an image into a natural language sentence is considered a difficult problem to be solved by computers. Image captioning involves not just detecting objects from images but understanding the interactions between the objects to be translated into relevant captions. So, expertise in the fields of computer vision paired with natural language processing are supposed to be crucial for this purpose. The sequence to sequence modelling strategy of deep neural networks is the traditional approach to generate a sequential list of words which are combined to represent the image. But these models suffer from the problem of high variance by not being able to generalize well on the training data. The main focus of this thesis is to reduce the variance factor which will help in generating better captions. To achieve this, Ensemble Learning techniques have been explored, which have the reputation of solving the high variance problem that occurs in machine learning algorithms. Three different ensemble techniques namely, k-fold ensemble, bootstrap aggregation ensemble and boosting ensemble have been evaluated in this thesis. For each of these techniques, three output combination approaches have been analyzed. Extensive experiments have been conducted on the Flickr8k dataset which has a collection of 8000 images and 5 different captions for every image. The bleu score performance metric, which is considered to be the standard for evaluating natural language processing (NLP) problems, is used to evaluate the predictions. Based on this metric, the analysis shows that ensemble learning performs significantly better and generates more meaningful captions compared to any of the individual models used.
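To make the evaluation metric concrete, here is a minimal sketch of a sentence-level BLEU score using the standard definition: clipped n-gram precision, a geometric mean over n-gram orders, and a brevity penalty. It is an assumption-laden illustration, not the thesis's evaluation code. The function names are hypothetical, no smoothing is applied, and the thesis may well have used a library implementation or a corpus-level variant.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)
```

With Flickr8k, `references` would hold the five tokenized human captions for an image and `candidate` the tokenized caption produced by a model or ensemble. Short captions often score zero at `max_n=4` without smoothing, which is one reason lower-order scores are commonly reported alongside it.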

 

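1. Introduction

2. Project Background

3. Related Work

4. Image Captioning System Architecture

5. Ensemble Learning for Image Captioning

6. Analysis

7. Conclusion and Future Work

Appendix: Selected Python Source Code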






https://wap.sciencenet.cn/blog-69686-1299167.html
