Tuesday, March 26, 2013

Learning SVM with imbalanced data

Recently, I wrote a paper that used a Support Vector Machine (SVM) classifier for a standard two-class problem. SVMs are widely used in machine learning for classification and regression, perhaps because of their good generalization properties (a consequence of maximizing the margin) and readily available code (the LibSVM library).

However, standard SVM learning doesn't give the desired results when the learning data is unbalanced, i.e. the number of positive samples is not comparable to the number of negative samples. Consider a hypothetical problem where 10% of your learning data samples are positive and the remaining 90% are negative. A trivial classifier that labels everything as negative will reach 90% accuracy on the learning data, which is a bad result because you want to find the positive samples as well. Accuracy is a poor measure for this type of data. One could instead use measures such as the G-mean, the geometric mean of the accuracy on the positive class and the accuracy on the negative class. More on that here. There are two simple ways of dealing with the imbalance.
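To make the 10%/90% example concrete, here is a minimal sketch in plain Python that compares accuracy with the G-mean for a classifier that labels everything as negative. The function names and the +1/-1 label encoding are my own choices for illustration, and the G-mean is taken as the square root of sensitivity times specificity.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp

def accuracy(y_true, y_pred, positive=1):
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + fn + tn + fp)

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (positive-class accuracy)
    and specificity (negative-class accuracy)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# 10 positive and 90 negative samples; the classifier says "negative" for all.
y_true = [1] * 10 + [-1] * 90
y_pred = [-1] * 100

print(accuracy(y_true, y_pred))  # 0.9
print(g_mean(y_true, y_pred))    # 0.0
```

The G-mean drops to zero as soon as one class is missed completely, which is exactly the failure that accuracy hides here.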

1) Under-sampling: Take a random subset of the negative data so that the numbers of positive and negative learning examples are roughly the same. In MATLAB you can use the randperm function to accomplish this.
2) Using weights on learning: Use a different weight on the slack variables of each class, which intuitively weighs each class inversely proportional to its share of the samples. The LibSVM package provides support for this. More details on how to set these parameters can be found here.
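Both options can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the helper names are made up, `random.sample` plays the role that randperm plays in MATLAB, and the weights are simply the ratio of the largest class count to each class count. The last helper formats those weights in the style of LibSVM's per-class `-wi` training option (which scales C for class i); check the LibSVM README for your version before relying on the exact syntax.

```python
import random
from collections import Counter

def undersample(samples, labels, seed=0):
    """Randomly keep the same number of samples from every class,
    namely the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    n_keep = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()  # preserve the original sample order
    return [samples[i] for i in kept], [labels[i] for i in kept]

def class_weights(labels):
    """Weight each class inversely proportional to its sample count,
    normalized so that the largest class gets weight 1."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}

def libsvm_weight_options(weights):
    """Format weights as '-w<label> <weight>' pairs (assumed option syntax)."""
    return " ".join(
        f"-w{label} {weight:g}" for label, weight in sorted(weights.items())
    )

# 10 positive and 90 negative samples, as in the example above.
labels = [1] * 10 + [-1] * 90
samples = list(range(100))  # stand-ins for feature vectors

balanced_x, balanced_y = undersample(samples, labels)
print(Counter(balanced_y))  # 10 samples of each class

weights = class_weights(labels)
print(libsvm_weight_options(weights))  # -w-1 1 -w1 9
```

With under-sampling the classifier sees 20 samples instead of 100; with weighting it sees all 100, but a slack on a positive sample costs nine times as much as a slack on a negative one.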

During my experiments, I found under-sampling to be more effective than using weights on learning, perhaps because an SVM really only cares about the few data points (the support vectors) that make up the learnt hyperplane. It might be interesting to first learn an SVM classifier on the under-sampled data, then test it on the remaining data (the samples that were not used in learning) and do some hard example mining, i.e. collect the data points that the learnt SVM classifies wrongly and that lie farthest from the separating hyperplane. If you have any questions or want to share your experiences, please post in the comments.
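I have not tried the hard example mining idea, but the selection step could look roughly like the sketch below. Everything here is hypothetical: `decision_value` stands for whatever function returns the signed SVM score f(x) of a sample (positive score means the positive class), labels are assumed to be +1/-1, and the toy "classifier" is just a threshold on a single number so that the example runs on its own.

```python
def mine_hard_examples(decision_value, samples, labels, top_k):
    """Return up to top_k misclassified held-out samples, ordered from the
    one farthest on the wrong side of the hyperplane to the nearest.

    decision_value: callable giving the signed score f(x) (hypothetical).
    labels: +1 / -1 ground-truth labels of the held-out samples.
    """
    hard = []
    for x, y in zip(samples, labels):
        signed_margin = y * decision_value(x)
        if signed_margin < 0:  # the sign of f(x) disagrees with the label
            hard.append((signed_margin, x, y))
    hard.sort(key=lambda item: item[0])  # most negative margin first
    return [(x, y) for _, x, y in hard[:top_k]]

# Toy stand-in for a trained SVM: a 1-D threshold at 0.5.
toy_decision_value = lambda x: x - 0.5

held_out_x = [0.1, 0.9, 2.0, 0.4]
held_out_y = [-1, -1, -1, 1]

print(mine_hard_examples(toy_decision_value, held_out_x, held_out_y, top_k=2))
# [(2.0, -1), (0.9, -1)]
```

The mined samples would then be added to the learning set and the SVM learnt again, possibly repeating the loop a few times.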

P.S.: I found a paper that discusses exactly these things in more detail here: http://library.natural-selection.com/Library/2010/An_approach_for_classification_of_highly_imbalanced_data.pdf
