Mini-seminar on speech and audio classification

School of Computing, University of Eastern Finland

Thursday 14.6.2012, Edtech lab, Joensuu Science Park

14:00 - 14:25
Audio context recognition in variable mobile environments from short segments using speaker and language recognizers
Tomi Kinnunen
University of Eastern Finland

The problem of context recognition from mobile audio data is considered. We consider ten audio contexts (such as car, bus, office and outdoors) prevalent in daily-life situations. We adopt mel-frequency cepstral coefficient (MFCC) parametrization and present an extensive comparison of six classifiers: k-nearest neighbor (kNN), vector quantization (VQ), Gaussian mixture models trained with both maximum likelihood (GMM-ML) and maximum mutual information (GMM-MMI) criteria, GMM supervector support vector machine (GMM-SVM) and, finally, SVM with generalized linear discriminant sequence kernel (GLDS-SVM). After all parameter optimizations, the GMM-MMI and VQ classifiers perform best, with context identification rates of 52.01 % and 50.34 %, respectively, on 3-second data records. Our analysis further reveals that none of the six classifiers is uniformly superior to the others when class-, user- or phone-specific accuracies are considered.
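As an illustration of the simplest classifier in the comparison, the following sketch implements kNN context identification by majority vote over Euclidean distances. The 2-D toy "features" and the two contexts are invented for illustration; the actual system uses MFCC vectors and ten contexts.

```python
import numpy as np

def knn_classify(train_feats, train_labels, test_feat, k=3):
    """Classify one feature vector by majority vote among its k
    nearest training vectors (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    classes, counts = np.unique(train_labels[nearest], return_counts=True)
    return classes[np.argmax(counts)]

# Toy 2-D "context" features: two well-separated clusters
# standing in for, e.g., "car" (label 0) and "bus" (label 1).
rng = np.random.default_rng(0)
car = rng.normal([0.0, 0.0], 0.1, size=(20, 2))
bus = rng.normal([1.0, 1.0], 0.1, size=(20, 2))
X = np.vstack([car, bus])
y = np.array([0] * 20 + [1] * 20)

print(knn_classify(X, y, np.array([0.05, -0.02])))  # point near the "car" cluster
```

In the actual task each 3-second record yields a sequence of MFCC vectors, so the vote is taken over frame-level decisions rather than a single vector.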

14:25 - 14:50
Regularization of All-Pole Models for Speaker Verification Under Additive Noise
Cemal Hanilci
Uludag University, Bursa, Turkey

Regularization of linear prediction based mel-frequency cepstral coefficient (MFCC) extraction in speaker verification is considered. Commonly, MFCCs are extracted from the discrete Fourier transform (DFT) spectrum of speech frames. In our recent study, it was shown that replacing the DFT spectrum estimation step with conventional and temporally weighted linear prediction (LP), and their regularized versions, increases recognition performance considerably. In this paper, we provide a thorough analysis of the regularization of conventional and temporally weighted LP methods. Experiments on the NIST 2002 corpus indicate that regularized all-pole methods yield large improvements in recognition accuracy under additive factory and babble noise conditions, in terms of both equal error rate (EER) and minimum detection cost function (MinDCF).
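The core idea can be sketched as solving the LP normal equations with a penalty term added to the autocorrelation matrix. The ridge-style penalty `lam * I` below is a generic stand-in chosen for illustration; the paper's exact regularizers and the temporal weighting are not reproduced here.

```python
import numpy as np

def regularized_lp(frame, order=10, lam=1e-4):
    """All-pole (LP) coefficients via the autocorrelation method,
    with a ridge-style term lam*I added to the normal equations to
    stabilise the inversion (a sketch; the paper's regularizers differ)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz autocorrelation matrix R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + lam * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a_k z^{-k}

# A noisy sinusoid should yield a well-conditioned low-order all-pole fit.
t = np.arange(400)
frame = np.sin(0.3 * t) + 0.01 * np.random.default_rng(1).normal(size=400)
a = regularized_lp(frame, order=4)
print(a[0])  # leading coefficient of A(z) is always 1.0
```

The resulting all-pole spectrum 1/|A(e^{jw})|^2 then replaces the DFT spectrum before the mel filterbank in MFCC extraction.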

14:50 - 15:15
Variational Bayes logistic regression as regularized fusion for NIST SRE 2010
Ville Hautamäki
University of Eastern Finland

Fusion of base classifiers is seen as a way to achieve high performance in state-of-the-art speaker verification systems. Typically, we look for base classifiers that are complementary. We might also be interested in reinforcing good base classifiers by including others similar to them. In any case, the final ensemble size is typically small and has to be formed based on rules of thumb. We are interested in finding a subset of classifiers with good generalization performance. We approach the problem from a sparse learning point of view: we assume that the true, but unknown, fusion weights are sparse. As a practical solution, we regularize the weighted logistic regression loss function with elastic-net and LASSO constraints. However, all regularization methods have an additional parameter that controls the amount of regularization employed, and this needs to be tuned separately. In this work, we use a variational Bayes approach to obtain sparse solutions automatically, without additional cross-validation. The variational Bayes method improves on the baseline method in 3 out of 4 sub-conditions.
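To make the LASSO-regularized fusion baseline concrete, the sketch below trains a weighted-sum fusion of base-classifier scores with logistic loss plus an L1 penalty, optimised by proximal gradient descent (ISTA). The data, the fixed penalty `lam`, and the step size are illustrative assumptions; the paper's contribution is precisely to replace the hand-tuned `lam` with a variational Bayes prior.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lasso_fusion(S, y, lam=0.05, lr=0.1, iters=2000):
    """Fuse base-classifier scores S (trials x systems) by logistic
    regression with an L1 (LASSO) penalty, via proximal gradient (ISTA).
    Uninformative systems are driven to (near-)zero weight."""
    n, m = S.shape
    w = np.zeros(m)
    for _ in range(iters):
        grad = S.T @ (sigmoid(S @ w) - y) / n      # logistic-loss gradient
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

# Two synthetic base systems: one informative score, one pure noise.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200).astype(float)          # target/non-target labels
informative = y + 0.3 * rng.normal(size=200)
noise = rng.normal(size=200)
S = np.column_stack([informative, noise])
w = lasso_fusion(S, y)
print(w)  # weight on the informative system dominates; noise weight shrinks toward 0
```

The variational Bayes formulation keeps the same sparse-weight assumption but infers the amount of shrinkage from the data, removing the cross-validation step over `lam`.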