Wednesday, March 27, 2013

Feature Normalization for Learning a Classifier

Data normalization is sometimes necessary when learning a classifier like Support Vector Machines (SVM). It is especially important when combining features that may have different ranges (min to max). Additionally, if some feature dimensions have high variance, learning a classifier like SVM will take longer, and those dimensions may dominate learning, which hurts the classifier's generalization to test data. There are two common ways to do normalization.

1) Range scaling : Getting all the feature dimensions to lie within a certain range, e.g. $[0,1]$ or $[-1,1]$. This can be easily accomplished by setting $x_{i,j} \leftarrow l + \frac{(x_{i,j}-\min x_{:,j})(u-l)}{\max x_{:,j}-\min x_{:,j}}$ where $i = 1,2,\dots,N$ indexes the data samples, $j = 1,2,\dots,M$ indexes the feature dimensions, and $l$ and $u$ are the lower and upper ends of the range that we want our data scaled to.

2) Standard normal : Another way of normalizing is to transform each feature dimension to have mean $0$ and standard deviation $1$, i.e. make each dimension standard normal (a short MATLAB sketch of both methods follows below).

More details on both methods can be found here.
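As a quick illustration, here is a minimal MATLAB sketch of both methods; X is assumed to be an N-by-M matrix with one sample per row, and the variable names are my own:

% X : N-by-M data matrix (N samples, M feature dimensions)

% 1) Range scaling of every feature dimension to [l, u]
l = -1;  u = 1;
Xmin = min(X, [], 1);                              % 1-by-M per-dimension minima
Xmax = max(X, [], 1);                              % 1-by-M per-dimension maxima
Xrange = bsxfun(@rdivide, bsxfun(@minus, X, Xmin), Xmax - Xmin);
Xrange = l + (u - l) * Xrange;                     % every column now lies in [l, u]

% 2) Standard normal: zero mean, unit standard deviation per dimension
mu    = mean(X, 1);                                % 1-by-M means
sigma = std(X, 0, 1);                              % 1-by-M standard deviations
Xstd  = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);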

I now show an example where I perform standard normalization. I generated two-class data which appears linearly separable.
Now, if one normalizes the positive and negative classes separately, we end up with data that is inseparable, as in the figure below.
So, whenever one performs normalization, one should pool the positive and negative class data together; then the normalization process will yield the desired result.
Additionally, one should store the mean and variance of each feature dimension computed on the training data and apply the same transform to the testing data so that it lies in the same range. This is necessary because in all pattern recognition problems we assume that the test data follows the same distribution as the training data.

Here is the MATLAB code that generated the data and the normalization in the pictures above.
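A minimal sketch along those lines; the class means and the plotting details below are illustrative placeholders rather than the exact values behind the figures:

% Generate linearly separable two-class data (means chosen arbitrarily)
N    = 200;
Xpos = bsxfun(@plus, randn(N, 2), [ 3,  3]);       % positive class around ( 3,  3)
Xneg = bsxfun(@plus, randn(N, 2), [-3, -3]);       % negative class around (-3, -3)

% Wrong: normalizing each class separately maps both classes onto the
% same region around the origin, making them inseparable
ZposWrong = bsxfun(@rdivide, bsxfun(@minus, Xpos, mean(Xpos, 1)), std(Xpos, 0, 1));
ZnegWrong = bsxfun(@rdivide, bsxfun(@minus, Xneg, mean(Xneg, 1)), std(Xneg, 0, 1));

% Right: compute the statistics over both classes pooled together
Xall  = [Xpos; Xneg];
mu    = mean(Xall, 1);
sigma = std(Xall, 0, 1);
Zall  = bsxfun(@rdivide, bsxfun(@minus, Xall, mu), sigma);

% Store mu and sigma and apply the SAME transform to any test data Xtest:
% Ztest = bsxfun(@rdivide, bsxfun(@minus, Xtest, mu), sigma);

figure; hold on;
plot(Zall(1:N, 1), Zall(1:N, 2), 'b.');            % normalized positives
plot(Zall(N+1:end, 1), Zall(N+1:end, 2), 'r.');    % normalized negatives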


Tuesday, March 26, 2013

Learning SVM with imbalanced data

Recently, I wrote a paper which used a Support Vector Machines (SVM) classifier for a standard 2-class problem. SVM is widely used in machine learning for classification and regression, perhaps because of its good generalization properties (which come from maximizing the margin) and readily available code (the LibSVM library).

However, standard SVM learning doesn't give the desired results when the training data is imbalanced, i.e. the number of positive samples is not comparable to the number of negative samples. Consider a hypothetical problem where 10% of your training samples are positive and the remaining 90% are negative. A trivial classifier that labels everything as negative will achieve 90% accuracy on this data, which is a bad result because you want to find the positive samples as well. Accuracy is a poor measure for this type of data; one could instead use measures such as the G-mean (the geometric mean of the true positive and true negative rates). More on that here. There are two simple ways of overcoming this.

1) Under-sampling : One thing we can do is subsample the negative data so that the numbers of positive and negative training examples are roughly the same. In MATLAB you can use the randperm function to accomplish this (see the sketch after this list).
2) Using weights during learning : You can assign different weights to the slack variables of each class, intuitively weighting each class inversely proportional to its share of the samples. The LibSVM package provides support for this. More details on how to set these parameters can be found here.
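Here is a rough MATLAB sketch of both options using LibSVM's MATLAB interface; Xpos/Xneg and the svmtrain option strings (linear kernel, C = 1) are my own assumptions, so check the LibSVM README for the exact flags in your version:

% Xpos, Xneg : positive / negative training samples (one sample per row)

% 1) Under-sampling the negative class with randperm
np      = size(Xpos, 1);
idx     = randperm(size(Xneg, 1));
XnegSub = Xneg(idx(1:np), :);                      % keep as many negatives as positives
Xtrain  = [Xpos; XnegSub];
Ytrain  = [ones(np, 1); -ones(np, 1)];
model1  = svmtrain(Ytrain, Xtrain, '-t 0 -c 1');   % LibSVM: linear kernel, C = 1

% 2) Class weights on the slack penalties, inversely proportional to class size
Xall   = [Xpos; Xneg];
Yall   = [ones(size(Xpos, 1), 1); -ones(size(Xneg, 1), 1)];
wpos   = size(Xneg, 1) / size(Xpos, 1);            % e.g. 9 if negatives are 9x positives
opts   = sprintf('-t 0 -c 1 -w1 %g -w-1 1', wpos); % -wi scales C for class i
model2 = svmtrain(Yall, Xall, opts);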

During my experiments, I found under-sampling to be more effective than using weights during learning, perhaps because the SVM really just cares about a few data points (the support vectors) which make up the learnt hyperplane. It might be interesting to first learn an SVM classifier by under-sampling, then test on the remaining data (the non-sampled points not used in learning) and do some hard example mining (collect the points that are wrongly classified by the learnt SVM and deviate the most from the margin or separating hyperplane), as sketched below. If you have any questions or want to share your experiences, please post in the comments.
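Continuing the under-sampling sketch above, one hypothetical way to do that hard example mining with LibSVM (the details, such as retraining just once, are my own choices):

% Score the negatives that were left out of the under-sampled training set
XnegRest = Xneg(idx(np+1:end), :);
YnegRest = -ones(size(XnegRest, 1), 1);
[pred, ~, dec] = svmpredict(YnegRest, XnegRest, model1);

% Hard examples: left-out negatives that are misclassified, ordered by how
% far they fall on the wrong side of the separating hyperplane
hard       = find(pred ~= YnegRest);
[~, order] = sort(abs(dec(hard)), 'descend');
XnegHard   = XnegRest(hard(order), :);

% Retrain with the hard negatives added to the balanced training set
Xtrain2 = [Xtrain; XnegHard];
Ytrain2 = [Ytrain; -ones(size(XnegHard, 1), 1)];
model3  = svmtrain(Ytrain2, Xtrain2, '-t 0 -c 1');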

P.S.: I found a paper that discusses exactly these things in more detail here: http://library.natural-selection.com/Library/2010/An_approach_for_classification_of_highly_imbalanced_data.pdf