
[Repost] [Information Technology] [2013] Egocentric Audio-Visual Scene Analysis



This post introduces the doctoral thesis of Xavier Alameda-Pineda at the University of Grenoble, France; 144 pages in total.

 

Over the past two decades, industry has developed a variety of commercial products with audio-visual sensing capabilities. Most of them consist of a video camera with an embedded microphone (mobile phones, tablets, etc.). Others, such as the Kinect, include depth sensors and/or small microphone arrays, and some mobile phones are equipped with a stereo camera pair. At the same time, many research-oriented systems have become available (e.g., humanoid robots such as NAO). Since all these systems are small in volume, their sensors sit close to one another. They therefore cannot capture the global scene, but only one point of view of the ongoing social interplay. We refer to this as "egocentric audio-visual scene analysis".

 

This thesis contributes to the field in several respects. First, it provides a publicly available dataset targeting applications such as action/gesture recognition, speaker localization, tracking and diarization, sound source localization, and dialogue modelling. This work was published in [Alameda-Pineda 12b] and later used both inside and outside the thesis. We also investigated the problem of audio-visual event detection: in [Alameda-Pineda 11] we showed how the trust placed in one modality (the visual one, to be precise) can be modelled and used to bias the method, leading to a visually-supervised EM algorithm (ViSEM). This paper received the Outstanding Paper Award at ICMI'11. The approach was subsequently adapted to audio-visual speaker detection, yielding an on-line method running on the humanoid robot NAO; details can be found in [Sanchez-Riera 12b].
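To make the biased-EM idea concrete, here is a minimal sketch, assuming a 1-D auditory feature modelled by a Gaussian mixture and a per-observation confidence score supplied by the visual modality; the weighting scheme, the function name `weighted_em_gmm`, and all parameter choices are illustrative assumptions, not the actual ViSEM of [Alameda-Pineda 11].

```python
import numpy as np

def weighted_em_gmm(x, vis_conf, n_components=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture in which each observation carries a
    weight derived from visual confidence, so visually trusted samples
    dominate the parameter updates (illustrative stand-in for ViSEM)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, n_components, replace=False)   # initial means
    var = np.full(n_components, np.var(x))            # shared initial variance
    priors = np.full(n_components, 1.0 / n_components)
    w = vis_conf / vis_conf.sum()                     # normalised sample weights
    for _ in range(n_iter):
        # E-step: posterior responsibilities under the current parameters.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(priors))
        r = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: sufficient statistics re-weighted by visual trust.
        nk = (w[:, None] * r).sum(axis=0)
        mu = (w[:, None] * r * x[:, None]).sum(axis=0) / nk
        var = (w[:, None] * r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        priors = nk / nk.sum()
    return mu, var, priors
```

Setting `vis_conf` to a constant vector recovers plain EM; the only change the visual modality introduces in this sketch is the per-sample re-weighting of the M-step statistics.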

 

In parallel with the work on audio-visual speaker detection, we developed a new approach to audio-visual command recognition. In [Sanchez-Riera 12a] we explored different features and classifiers and confirmed that audio-visual data improves performance compared with auditory-only and video-only classifiers. Later, in [Alameda-Pineda 13c], we sought the best method when using tiny training sets (5-10 samples per class). This is interesting because real systems need to adapt and learn new commands from the user: to be usable by the general public, such systems must become operational from only a few examples. Finally, we contributed to sound source localization in the particular case of non-coplanar microphone arrays. This is interesting because the microphone geometry can then be arbitrary, which opens the door to dynamic microphone arrays that adapt their geometry to particular tasks; it also matters because the design of commercial systems may be subject to constraints for which circular or linear arrays are not suited. In a first stage we published [Alameda-Pineda 12a], which presented the general framework and an algorithm that worked up to a certain extent; later we submitted the full geometric model together with a much more solid algorithm in [Alameda-Pineda 13b].
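The full geometric model of [Alameda-Pineda 13b] is considerably richer, but a minimal far-field sketch already shows why non-coplanarity matters: under a plane-wave assumption the TDOAs form a linear system in the direction of arrival, and that system has full rank only when the microphone baselines span all three dimensions. The function name, the tetrahedral test array, and the simple least-squares solve below are all illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def far_field_direction(mic_pos, tdoas, c=SPEED_OF_SOUND):
    """Estimate a far-field source direction from TDOAs measured against
    microphone 0. mic_pos: (M, 3) microphone positions in metres;
    tdoas: (M-1,) delays of mics 1..M-1 relative to mic 0, in seconds.
    Plane-wave model: (m_i - m_0) . u = -c * tdoa_i, a linear system in
    the unit direction u that is only full-rank for non-coplanar arrays."""
    baselines = mic_pos[1:] - mic_pos[0]        # (M-1, 3) baseline vectors
    rhs = -c * np.asarray(tdoas)                # (M-1,) path-length differences
    u, *_ = np.linalg.lstsq(baselines, rhs, rcond=None)
    return u / np.linalg.norm(u)                # normalise to a unit direction

# Self-check with a synthetic tetrahedral (hence non-coplanar) array.
mics = np.array([[0.0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1]])
true_u = np.array([1.0, 2.0, 3.0]) / np.linalg.norm([1.0, 2.0, 3.0])
tdoas = -(mics[1:] - mics[0]) @ true_u / SPEED_OF_SOUND
print(far_field_direction(mics, tdoas))         # ~ true_u
```

With a linear or planar array the baseline matrix is rank-deficient, and the least-squares solve can only return the projection of the direction onto the array's span, which is exactly the ambiguity a non-coplanar geometry removes.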

 

In summary, we tackled a range of real problems in audio-visual scene analysis using egocentric data. The methods vary from statistical and discriminative learning to nonlinear programming, always built on a solid basis of signal processing. The results and contributions have been peer-reviewed by the international research community and published in top international conferences and journals.

 


 

1. Introduction

2. The Ravel Dataset

3. Audio-Visual Speaker Localization

4. Audio-Visual Command Recognition

5. Multichannel Sound Source Localization

6. Conclusion

