Mini-seminar on speech and audio classification
School of Computing, University of Eastern Finland
Thursday 14.6.2012, Edtech lab, Joensuu Science Park
14.00 - 14.25
Audio context recognition in variable
mobile environments from short segments using speaker and language
recognizers
Tomi Kinnunen
University of Eastern Finland
The problem of context recognition from mobile audio data is
considered. We consider ten different audio contexts (such as car, bus,
office and outdoors) prevalent in daily life situations. We choose
mel-frequency cepstral coefficient (MFCC) parametrization and present an
extensive comparison of six different classifiers: k-nearest neighbor
(kNN), vector quantization (VQ), Gaussian mixture model trained with
both maximum likelihood (GMM-ML) and maximum mutual information
(GMM-MMI) criteria, GMM supervector support vector machine (GMM-SVM)
and, finally, SVM with generalized linear discriminant sequence
(GLDS-SVM). After all parameter optimizations, the GMM-MMI and VQ
classifiers perform best, with 52.01 % and 50.34 % context
identification rates, respectively, using 3-second data records. Our
analysis further reveals that none of the six classifiers is uniformly
superior when class-, user- or phone-specific accuracies are considered.
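As an illustration of the vector quantization (VQ) classifier mentioned in the abstract, here is a minimal sketch, not the talk's actual configuration: one toy k-means codebook per context class, with a test segment assigned to the class whose codebook quantizes its feature frames with the lowest average distortion. All names and parameters here are illustrative assumptions; in the actual system the frames would be MFCC vectors.

```python
import numpy as np

def train_codebook(frames, k=4, iters=10, seed=0):
    """Toy k-means codebook; the VQ classifier stores one per context class."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest code vector
        d = ((frames[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = frames[labels == j].mean(0)
    return centers

def vq_score(frames, codebook):
    """Average quantization distortion of a segment against one codebook."""
    d = ((frames[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.min(1).mean()

def classify(frames, codebooks):
    """Pick the context whose codebook gives the lowest distortion."""
    return min(codebooks, key=lambda c: vq_score(frames, codebooks[c]))
```

With synthetic two-dimensional "features" for two contexts, a short segment is classified by distortion: `classify(segment, {"car": cb_car, "bus": cb_bus})`.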
14.25 - 14.50
Regularization of All-Pole Models for
Speaker Verification Under Additive Noise
Cemal Hanilci
Uludag University, Bursa, Turkey
Regularization of linear prediction based mel-frequency cepstral
coefficient (MFCC) extraction in speaker verification is considered.
Commonly, MFCCs are extracted from the discrete Fourier transform (DFT)
spectrum of speech frames. In our recent study, it was shown that
replacing the DFT spectrum estimation step with the conventional and
temporally weighted linear prediction (LP) and their regularized
versions increases the recognition performance considerably. In this
paper, we provide a thorough analysis on the regularization of
conventional and temporally weighted LP methods. Experiments on the
NIST 2002 corpus indicate that regularized all-pole methods yield large
improvements on recognition accuracy under additive factory and babble
noise conditions in terms of both equal error rate (EER) and minimum
detection cost function (MinDCF).
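To make the all-pole modelling concrete, here is a minimal sketch of conventional LP via the autocorrelation method, with simple diagonal loading of the autocorrelation matrix as one illustrative form of regularization. This is an assumption for illustration only; the talk's regularized and temporally weighted LP variants differ.

```python
import numpy as np

def lp_coefficients(x, order=10, reg=1e-3):
    """All-pole (LP) coefficients via the autocorrelation method.
    `reg` adds diagonal loading (a white-noise correction) to the
    autocorrelation matrix -- one simple regularization scheme."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += reg * r[0] * np.eye(order)           # regularize the normal equations
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))        # A(z) = 1 - sum_k a_k z^-k

def allpole_spectrum(a, nfft=256):
    """Magnitude spectrum of the all-pole model 1/|A(z)|; in the MFCC
    pipeline this would replace the DFT spectrum estimate."""
    A = np.fft.rfft(a, nfft)
    return 1.0 / np.abs(A)
```

On a synthetic first-order autoregressive signal, `lp_coefficients(x, order=1)` recovers the generating coefficient up to the small bias introduced by the diagonal loading.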
14.50 - 15.15
Variational Bayes logistic regression
as regularized fusion for NIST SRE 2010
Ville Hautamäki
University of Eastern Finland
Fusion of the base classifiers is seen as a way to achieve high
performance in state-of-the-art speaker verification systems. Typically,
we are looking for base classifiers that would be complementary. We
might also be interested in reinforcing good base classifiers by
including others that are similar to them. In any case, the final
ensemble size is typically small and has to be formed based on some
rules of thumb. We are interested in finding a subset of classifiers that
has a good generalization performance. We approach the problem from
sparse learning point of view. We assume that the true, but unknown,
fusion weights are sparse. As a practical solution, we regularize
weighted logistic regression loss function by elastic-net and LASSO
constraints. However, all regularization methods have an additional
parameter that controls the amount of regularization employed. This
needs to be separately tuned. In this work, we use a variational Bayes
approach to automatically obtain sparse solutions without additional
cross-validation. The variational Bayes method improves the baseline method
in 3 out of 4 sub-conditions.
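To illustrate the sparse-fusion idea, here is a toy sketch of LASSO-regularized logistic regression over base-classifier scores, solved with proximal gradient descent (ISTA). It is not the talk's variational Bayes method, which avoids precisely the manual choice of the regularization strength `lam` made here; all names and values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lasso_logistic_fusion(scores, labels, lam=0.1, lr=0.1, iters=500):
    """L1-regularized logistic fusion of base-classifier scores.
    Proximal gradient (ISTA): gradient step on the logistic loss,
    then soft-thresholding, which drives weak classifiers' fusion
    weights exactly to zero. The bias term is left unregularized."""
    n, m = scores.shape
    w, b = np.zeros(m), 0.0
    for _ in range(iters):
        p = sigmoid(scores @ w + b)
        g = scores.T @ (p - labels) / n       # gradient of the logistic loss
        w = w - lr * g
        b -= lr * np.mean(p - labels)
        # proximal step for the L1 penalty: soft-threshold the weights
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b
```

On synthetic scores where only the first of three base classifiers carries information, the L1 penalty concentrates the fusion weight on that classifier and shrinks the uninformative ones toward zero.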