Abstract
This paper describes speaker-independent multimodal speech recognition toward the construction of multimodal spoken dialogue systems. To build a multimodal speech recognition system, we first collected an audio-visual speech database from 25 male speakers. Our system uses a multi-stream HMM technique to integrate audio and visual information. We propose a multi-stream HMM construction method in which audio-only and visual-only models are trained separately and then integrated at the state level. In this framework, the state-tying structure of the target audio-visual model is inherited from the audio-only triphone HMM. Experimental results show that the proposed method is effective under various noise conditions. We also compared two visual features, optical-flow-based features and PCA (Principal Component Analysis)-based features, within our recognition framework. The results show that the optical-flow-based features yield better performance than the PCA-based features.
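The state-level integration described in the abstract rests on the standard multi-stream HMM formulation, in which each state's output log-likelihood is a weighted sum of per-stream log-likelihoods. The following is a minimal sketch of that combination rule; the function names, Gaussian parameterization, and stream-weight values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def stream_log_likelihood(obs, mean, var):
    """Log-likelihood of one stream's observation under a
    diagonal-covariance Gaussian state output density."""
    obs, mean, var = map(np.asarray, (obs, mean, var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (obs - mean) ** 2 / var)

def multistream_log_b(audio_obs, visual_obs, audio_pdf, visual_pdf,
                      lambda_audio=0.7, lambda_visual=0.3):
    """Multi-stream HMM state output score:
        log b(o) = lambda_a * log b_a(o_a) + lambda_v * log b_v(o_v)
    The weights (here 0.7 / 0.3) are hypothetical; in practice they are
    tuned to the noise condition."""
    log_b_audio = stream_log_likelihood(audio_obs, *audio_pdf)
    log_b_visual = stream_log_likelihood(visual_obs, *visual_pdf)
    return lambda_audio * log_b_audio + lambda_visual * log_b_visual
```

Because the weights apply in the log domain, lowering `lambda_audio` under heavy acoustic noise shifts the decision toward the visual stream without retraining either single-modality model, which is what makes the separate-training-then-integration scheme attractive.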
| Translated title of the contribution | A Study on Multimodal Speech Recognition for Spoken Dialogue Systems |
|---|---|
| Original language | Japanese |
| Pages (from-to) | 19-24 |
| Journal | IEICE technical report |
| Volume | 107 |
| Issue number | 77 |
| State | Published - 24 May 2007 |