Abstract
This paper proposes a multi-modal speech recognition method that uses lip movement extracted from side-face images to increase noise robustness in mobile environments. Most previous multi-modal speech recognition methods use frontal face (lip) images, which is inconvenient for users because they must hold a device with a camera in front of their face while talking. Our proposed method, which captures lip movement with a small camera installed in a handset, is more natural, easier, and more convenient. Visual features are extracted by optical-flow analysis and combined with audio features, and the models are built with the multi-stream HMM technique. Experiments on connected-digit speech contaminated with white noise show that the visual information improves digit accuracy across various SNR conditions. The best improvement is approximately 6% at 5 dB SNR.
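As an illustration of the multi-stream idea mentioned above, the following is a minimal sketch of how per-stream log-likelihoods are commonly combined in a multi-stream HMM state: each stream's log-likelihood is scaled by a stream weight and the results are summed. The function names, the diagonal-Gaussian observation model, and the convention that the audio and visual weights sum to 1 are assumptions made for this example; the abstract does not specify the weighting scheme or model details actually used.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def multi_stream_log_likelihood(audio_ll, visual_ll, audio_weight):
    """Combine per-stream log-likelihoods with stream weights.

    Assumption: the visual weight is 1 - audio_weight, a common
    convention that is not confirmed by the abstract.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return audio_weight * audio_ll + visual_weight * visual_ll


# Hypothetical feature vectors and model parameters, for illustration only.
audio_ll = gaussian_log_pdf([0.2, -0.1], mean=[0.0, 0.0], var=[1.0, 1.0])
visual_ll = gaussian_log_pdf([0.5], mean=[0.4], var=[0.25])

# In noisier conditions the audio weight would typically be lowered.
combined = multi_stream_log_likelihood(audio_ll, visual_ll, audio_weight=0.7)
```

Setting `audio_weight` to 1.0 reduces the score to audio-only recognition, which gives a natural baseline for measuring what the visual stream contributes.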
| Translated title of the contribution | A Multi-Modal Speech Recognition Using Side-Face Images |
|---|---|
| Original language | Japanese |
| Pages (from-to) | 61-66 |
| Journal | IPSJ SIG Notes |
| Volume | 2003 |
| Issue number | 58 |
| State | Published - 27 May 2003 |