TY - GEN
T1 - A quadratic mean based supervised learning model for managing data skewness
AU - Liu, Wei
AU - Chawla, Sanjay
PY - 2011
Y1 - 2011
N2 - In this paper, we study the problem of data skewness. A data set is skewed (imbalanced) if its dependent variable is asymmetrically distributed. Dealing with skewed data sets has been identified as one of the ten most challenging problems in data mining research. We address the problem of class skewness for supervised learning models that are based on optimizing a regularized empirical risk function. These include both classification and regression models, for discrete and continuous dependent variables respectively. Classical empirical risk minimization is akin to minimizing the arithmetic mean of prediction errors, so on skewed data the induction process is biased towards the majority class. To overcome this drawback, we propose a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness. We note that minimizing the quadratic mean is a convex optimization problem and hence can be solved efficiently for large and high-dimensional data. Comprehensive experiments demonstrate that the QMLearn model significantly outperforms existing statistical learners, including logistic regression, support vector machines, linear regression, support vector regression, and quantile regression.
AB - In this paper, we study the problem of data skewness. A data set is skewed (imbalanced) if its dependent variable is asymmetrically distributed. Dealing with skewed data sets has been identified as one of the ten most challenging problems in data mining research. We address the problem of class skewness for supervised learning models that are based on optimizing a regularized empirical risk function. These include both classification and regression models, for discrete and continuous dependent variables respectively. Classical empirical risk minimization is akin to minimizing the arithmetic mean of prediction errors, so on skewed data the induction process is biased towards the majority class. To overcome this drawback, we propose a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness. We note that minimizing the quadratic mean is a convex optimization problem and hence can be solved efficiently for large and high-dimensional data. Comprehensive experiments demonstrate that the QMLearn model significantly outperforms existing statistical learners, including logistic regression, support vector machines, linear regression, support vector regression, and quantile regression.
KW - Convex optimization
KW - Data skewness
KW - Quadratic mean
UR - http://www.scopus.com/inward/record.url?scp=84857182086&partnerID=8YFLogxK
U2 - 10.1137/1.9781611972818.17
DO - 10.1137/1.9781611972818.17
M3 - Conference contribution
AN - SCOPUS:84857182086
SN - 9780898719925
T3 - Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011
SP - 188
EP - 198
BT - Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011
PB - Society for Industrial and Applied Mathematics Publications
T2 - 11th SIAM International Conference on Data Mining, SDM 2011
Y2 - 28 April 2011 through 30 April 2011
ER -