TY - GEN
T1 - I-vectors for image classification
AU - Smith, David C.
N1 - Publisher Copyright: © 2014 SPIE.
PY - 2014
Y1 - 2014
N2 - Recent state-of-the-art work on speaker recognition and verification uses a simple factor analysis to derive a low-dimensional "total variability space" which simultaneously captures speaker and channel variability. This approach simplified earlier work using joint factor analysis to separately model speaker and channel differences. Here we adapt this "i-vector" method to image classification by replacing speakers with image categories, voice cuts with images, and cepstral features with SURF local descriptors, and where the role of channel variability is attributed to differences in image backgrounds or lighting conditions. A Universal Gaussian mixture model (UGMM) is trained (unsupervised) on SURF descriptors extracted from a varied and extensive image corpus. Individual images are modeled by additively perturbing the supervector of stacked means of this UGMM by the product of a low-rank total variability matrix (TVM) and a normally distributed hidden random vector, X. The TVM is learned by applying an EM algorithm to maximize the sum of log-likelihoods of descriptors extracted from training images, where the likelihoods are computed with respect to the GMM obtained by perturbing the UGMM means via the TVM as above, and leaving UGMM covariances unchanged. Finally, the low-dimensional i-vector representation of an image is the expected value of the posterior distribution of X conditioned on the image's descriptors, and is computed via straightforward matrix manipulations involving the TVM and image-specific Baum-Welch statistics. We compare classification rates found with (i) i-vectors, (ii) PCA, (iii) Discriminant Attribute Projection (the last two trained on Gaussian MAP-adapted supervector image representations), and (iv) replacing the TVM with the matrix of dominant PCA eigenvectors before i-vector extraction.
AB - Recent state-of-the-art work on speaker recognition and verification uses a simple factor analysis to derive a low-dimensional "total variability space" which simultaneously captures speaker and channel variability. This approach simplified earlier work using joint factor analysis to separately model speaker and channel differences. Here we adapt this "i-vector" method to image classification by replacing speakers with image categories, voice cuts with images, and cepstral features with SURF local descriptors, and where the role of channel variability is attributed to differences in image backgrounds or lighting conditions. A Universal Gaussian mixture model (UGMM) is trained (unsupervised) on SURF descriptors extracted from a varied and extensive image corpus. Individual images are modeled by additively perturbing the supervector of stacked means of this UGMM by the product of a low-rank total variability matrix (TVM) and a normally distributed hidden random vector, X. The TVM is learned by applying an EM algorithm to maximize the sum of log-likelihoods of descriptors extracted from training images, where the likelihoods are computed with respect to the GMM obtained by perturbing the UGMM means via the TVM as above, and leaving UGMM covariances unchanged. Finally, the low-dimensional i-vector representation of an image is the expected value of the posterior distribution of X conditioned on the image's descriptors, and is computed via straightforward matrix manipulations involving the TVM and image-specific Baum-Welch statistics. We compare classification rates found with (i) i-vectors, (ii) PCA, (iii) Discriminant Attribute Projection (the last two trained on Gaussian MAP-adapted supervector image representations), and (iv) replacing the TVM with the matrix of dominant PCA eigenvectors before i-vector extraction.
KW - Baum-Welch statistics
KW - Dimension reduction
KW - Factor analysis
KW - Fisher vectors
KW - Gaussian MAP-adapted supervectors
KW - Gaussian mixture model
KW - I-vectors
KW - Image classification
KW - Total variability matrix
UR - http://www.scopus.com/inward/record.url?scp=84922727035&partnerID=8YFLogxK
U2 - 10.1117/12.2060207
DO - 10.1117/12.2060207
M3 - Conference contribution
AN - SCOPUS:84922727035
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - Applications of Digital Image Processing XXXVII
A2 - Tescher, Andrew G.
PB - SPIE
T2 - Applications of Digital Image Processing XXXVII
Y2 - 18 August 2014 through 21 August 2014
ER -