Abstract
We present experimental results showing improved speaker normalization using our previously reported frequency warping function, which is derived purely from speech data. In our previous work, we numerically computed a frequency warping function for non-uniform scaling, similar to the mel scale, such that spectral envelopes from different speakers enunciating the same sound are similar except for a possible translation factor. In this paper, we perform a maximum likelihood search for these translation parameters. We show that this non-uniform normalization scheme provides about 18% improvement over normalization based on the maximum likelihood estimate of uniform scaling parameters, and about 30% improvement over a mel filterbank cepstral coefficient baseline, on a telephone-based continuous digit recognition task. Another attractive attribute of the proposed method is that generating features with different shifts is simpler than generating features with different warping factors, as earlier methods require.