Acoustics, Speech, and Signal Processing, IEEE International Conference on

Abstract

In this paper, we propose a two-step processing algorithm that adaptively normalizes the temporal modulation of speech to extract robust speech features for automatic speech recognition systems. The first step normalizes the temporal modulation contrast (TMC) of the cepstral time series for both clean and noisy speech. The second step smooths the normalized temporal modulation structure to reduce noise-induced artifacts while preserving the speech modulation events (edges). We tested the algorithm in speech recognition experiments under an additive noise condition (AURORA-2J data corpus), a reverberant condition (clean AURORA-2J utterances convolved with a smart-room impulse response), and a condition with both reverberant and additive noise (air-conditioner noise in a smart room). The ETSI advanced front-end (AFE) algorithm was used for comparison. Our results showed that the algorithm provided: (1) under additive noise, a 57.26% relative word error reduction (RWER) rate for clean-condition training (59.37% for AFE) and a 33.52% RWER rate for multi-condition training (35.77% for AFE); (2) under reverberation, a 51.28% RWER rate (10.17% for AFE); and (3) with both reverberant and additive noise, a 71.74% RWER rate (48.86% for AFE).
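The abstract does not give the exact formulation of the two steps, but their general shape can be illustrated. The sketch below is a hypothetical reading, not the authors' method: step one is taken to be mean/variance normalization of each cepstral coefficient's temporal trajectory (one plausible way to equalize modulation contrast across clean and noisy speech), and step two is a temporal median filter, a standard smoother that suppresses impulsive artifacts while keeping abrupt transitions (edges) sharper than a moving average would. All function names and parameters are illustrative.

```python
import numpy as np

def tmc_normalize(cep, eps=1e-8):
    """Hypothetical step 1: normalize each cepstral trajectory.

    cep: (num_frames, num_coeffs) cepstral time series.
    Zero-means and unit-variances each coefficient over time, so the
    temporal modulation contrast is matched between clean and noisy input.
    """
    mu = cep.mean(axis=0)
    sigma = cep.std(axis=0)
    return (cep - mu) / (sigma + eps)

def edge_preserving_smooth(cep, win=3):
    """Hypothetical step 2: temporal median filter per coefficient.

    A median over a short window removes noise-induced spikes while
    preserving step-like modulation events better than linear smoothing.
    """
    pad = win // 2
    padded = np.pad(cep, ((pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(cep)
    for t in range(cep.shape[0]):
        out[t] = np.median(padded[t:t + win], axis=0)
    return out

# Combined two-step front-end processing on a cepstral feature matrix:
# features = edge_preserving_smooth(tmc_normalize(features))
```

In an actual front-end, such processing would be applied per utterance (or with a sliding window for online use) before the features are passed to the recognizer.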
