Monday, December 26, 2016

Checking for (Near-)Identical Images in a Large Dataset

Quite often, we run into the problem of duplicate or near-duplicate image detection in large image datasets. This is especially relevant for deep learning approaches, which learn better with increased diversity in their training image sets.

The problem of exact duplicates can be easily resolved by hashing the image content and comparing the hashes. However, because of minor distortions (such as a small crop) and/or image noise, we sometimes have images that look almost the same but are not exact duplicates. In such cases, it is advisable to extract features from the entire dataset and compare the feature vectors to find near-duplicate images.
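For the exact-duplicate case, a minimal sketch in Python might look like the following (hashing raw file bytes; MD5 is used here, but any stable hash works):

```python
import hashlib
from collections import defaultdict

def file_hash(data: bytes) -> str:
    """Hash raw image bytes; identical files yield identical digests."""
    return hashlib.md5(data).hexdigest()

def group_exact_duplicates(images: dict) -> list:
    """Group image names whose contents hash identically.

    `images` maps an image name to its raw bytes.
    Returns only the groups with more than one member (the duplicates).
    """
    buckets = defaultdict(list)
    for name, data in images.items():
        buckets[file_hash(data)].append(name)
    return [names for names in buckets.values() if len(names) > 1]
```

Note that a single changed bit produces a completely different hash, which is exactly why near-duplicates need a feature-based comparison instead.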

There are various ways of doing this. I will list two common feature extraction techniques that are often used:


  1. GIST feature extraction and comparison: "The second check uses GIST [4] descriptor matching, which was shown in [2] to have excellent performance at near-duplicate image detection in large (> 1 million) image collections." - from the supplementary material of Ross Girshick et al., https://people.eecs.berkeley.edu/~rbg/papers/r-cnn-cvpr-supp.pdf
  2. Extracting CNN (e.g., AlexNet) features from all the images, computing the Euclidean distance between the feature representations of image pairs, and then using a small threshold to find near-duplicate images.
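Once the features have been extracted (whether GIST or CNN), flagging near-duplicate pairs reduces to thresholding pairwise Euclidean distances. A sketch with NumPy, where the feature matrix and the threshold value are placeholders you would supply:

```python
import numpy as np

def near_duplicate_pairs(features: np.ndarray, threshold: float):
    """Return index pairs (i, j), i < j, whose feature vectors lie
    within `threshold` Euclidean distance of each other.

    `features` is an (N, D) array with one row per image.
    """
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (features ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    dists = np.sqrt(np.maximum(sq_dists, 0.0))  # clamp tiny negative round-off
    i, j = np.where(np.triu(dists < threshold, k=1))  # upper triangle: i < j
    return list(zip(i.tolist(), j.tolist()))
```

For a real dataset of a million images the full N x N matrix would be too large; the same idea is usually applied block-wise or with an approximate nearest-neighbor index.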

Wednesday, March 27, 2013

Feature Normalization for Learning Classifier

Data normalization is sometimes necessary when learning a classifier like a Support Vector Machine (SVM). It's essential especially when combining features which might have different ranges (min to max). Additionally, if some feature dimensions have high variation, learning a classifier like an SVM will take longer, and those feature dimensions may dominate the learning, which hurts the classifier's generalization to test data. There are two different ways in which you could do normalization.

1) Range Scaling : Getting all the feature dimensions to lie within a certain range, e.g. $[0,1]$ or $[-1,1]$. This can be easily accomplished by just defining $x_{i,j} = l + \frac{(x_{i,j}-\min{x_{:,j}})(u-l)}{\max{x_{:,j}}-\min{x_{:,j}}}$ where $i = 1,2,...,N$ indexes the data samples, $j = 1,2,...,M$ indexes the feature dimensions, and $l$ and $u$ are the lower and upper values of the range that we want our data to scale to.

2) Standard Normal : Another way of normalizing is to scale each feature dimension to have mean $0$ and standard deviation $1$, as in a standard Gaussian.

More details on both the methods here.
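As a concrete sketch of the first method, here is per-column range scaling in NumPy (the target range defaults to $[0,1]$; constant columns are assumed not to occur):

```python
import numpy as np

def range_scale(X: np.ndarray, l: float = 0.0, u: float = 1.0) -> np.ndarray:
    """Scale each feature column of X (N samples x M features) to [l, u].

    Assumes no column is constant (max > min in every dimension).
    """
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return l + (X - col_min) * (u - l) / (col_max - col_min)
```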

I now show an example where I perform standard normalization. I generated two-class data which appears linearly separable.
Now, if one normalizes the positive and negative classes separately, we end up with data that is inseparable, as in the figure below.
So, whenever one is performing normalization, one should use the positive and negative class data together; then the normalization process will yield the desired result.
Additionally, one should store the mean and variance of each feature dimension and apply them to the test data so that it lies in the same range. This is necessary because in all pattern recognition problems we assume that our test data follows the same distribution as the training data.

Here is the MATLAB code that generated the data and the normalization shown in the pictures above.
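The joint standardization described above can also be sketched in Python (the two-class synthetic data below is purely illustrative, and the training statistics are reused on the test data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two roughly linearly separable classes (illustrative synthetic data).
pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(100, 2))
neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(100, 2))

# Normalize using BOTH classes together, not each class separately.
X_train = np.vstack([pos, neg])
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_norm = (X_train - mu) / sigma

# Store (mu, sigma) and apply the SAME statistics to the test data,
# so that test features land in the same range as training features.
X_test = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(10, 2))
X_test_norm = (X_test - mu) / sigma
```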


Tuesday, March 26, 2013

Learning SVM with Imbalanced Data

Recently, I wrote a paper which used a Support Vector Machine (SVM) classifier for a standard 2-class problem. SVM is widely used in machine learning for classification and regression, perhaps because of its good generalization properties (which come from maximizing the margin) and readily available code (the LibSVM library).

However, standard SVM learning doesn't give the desired results when the training data is imbalanced, i.e. the number of positive samples is not comparable to the number of negative samples. Consider a hypothetical problem where 10% of your data samples are positive and the remaining 90% belong to the negative class. A trivial classifier which labels everything as negative will end up with 90% accuracy, which is a bad result because you want to find the positive samples as well. Accuracy is a bad measure for this type of data; one could instead use measures such as the G-mean. More on that here. There are two simple ways of overcoming this.

1) Under-sampling : Sample from the negative data so that the numbers of positive and negative training examples are roughly the same. In MATLAB you can use the randperm function to accomplish this.
2) Using weights during learning : You can associate different weights with the slack variables of each class, intuitively weighing each class inversely proportionally to its share of the samples. The LibSVM package provides support for this. More details on how to set these parameters can be found here.
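Both options can be sketched in Python with NumPy (the random permutation of negative indices plays the role of MATLAB's randperm, and the per-class weights mirror LibSVM's per-class weight flags; labels are assumed to be +1/-1):

```python
import numpy as np

def undersample_negatives(y: np.ndarray, rng=None) -> np.ndarray:
    """Return indices keeping all positives and an equal-sized random
    subset of negatives (labels assumed to be +1 / -1)."""
    rng = rng or np.random.default_rng(0)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == -1)[0]
    neg_keep = rng.permutation(neg_idx)[: len(pos_idx)]  # like randperm
    return np.concatenate([pos_idx, neg_keep])

def class_weights(y: np.ndarray) -> dict:
    """Per-class weights inversely proportional to class frequency,
    the kind of values one would pass to LibSVM's per-class weights."""
    n = len(y)
    n_pos = int((y == 1).sum())
    n_neg = n - n_pos
    return {1: n / (2.0 * n_pos), -1: n / (2.0 * n_neg)}
```

With 10 positives and 90 negatives, the weights come out to 5.0 for the positive class and roughly 0.56 for the negative class, so each class contributes equally to the penalty term overall.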

During my experiments, I found under-sampling to be more effective than using weights during learning, perhaps because an SVM really only cares about a few data points (the support vectors) which make up the learnt hyperplane. It would be interesting to first learn an SVM classifier by under-sampling, then test it on the remaining (non-sampled) data and do some hard example mining (collecting the data points that are wrongly classified by the learnt SVM and have the maximum deviation from the margin, or separating hyperplane). If you have any questions or want to share your experiences, please post in the comments.
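The hard-example-mining step can be sketched as follows (the scores are assumed to come from the decision function of a previously trained SVM, evaluated on the held-out negative samples):

```python
import numpy as np

def hard_negatives(scores: np.ndarray, k: int) -> np.ndarray:
    """Given decision scores f(x) for held-out NEGATIVE samples
    (a negative score means a correct classification), return the indices
    of the k samples misclassified with the largest deviation from the
    separating hyperplane."""
    wrong = np.where(scores > 0)[0]              # misclassified negatives
    order = np.argsort(scores[wrong])[::-1]      # most confident mistakes first
    return wrong[order[:k]]
```

These hard negatives can then be added back to the training set and the SVM retrained, in the spirit of the procedure described above.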

P.S.: I found a paper that discusses exactly these things in more detail here: http://library.natural-selection.com/Library/2010/An_approach_for_classification_of_highly_imbalanced_data.pdf

Friday, July 27, 2012

Tips and Tricks for Linux

I am going to put down useful Linux commands here and continuously update this page. If you want to contribute, please post your suggestions in the comments.
1) Installing from a .iso file
The easiest thing to do is to mount the .iso at a temporary location and install it from there.
sudo mkdir /tmp/matlab 
sudo mount /<path to .iso file>/matlab.iso /tmp/matlab -t iso9660  -o loop=/dev/loop0
2) Using FFmpeg to break a video into parts
ffmpeg -i STOPS_20111214_CR1_02_C5.mov -sameq -ss 00:00:00 -t 00:00:30 videopart1.mov

So this command takes a video named STOPS_20111214_CR1_02_C5.mov, keeps (roughly) the same quality via -sameq, extracts the segment starting at 0 seconds with a duration of 30 seconds (-t specifies a duration, not an end time), and writes it out as videopart1.mov.
3) Overcoming too many threads error of ffmpeg
http://crazedmuleproductions.blogspot.com/2007/10/multithreading-in-ffmpeg-and-mpstat.html

Wednesday, June 27, 2012

Renaming Files in a folder using Shell in Linux

I work with a lot of different image datasets and find it helpful when the image files are named in a particular order. Below is a script that renames all the 'png' files in a folder to the frame_%05d.png format.

#!/bin/sh
# Rename every .png file in the current directory to frame_<prefix><NNNNN>.png
prefix=$1   # optional prefix, passed as the first argument

count=1
for f in *.png; do
    nn=`printf %05d $count`      # zero-pad the counter to 5 digits
    mv "$f" "frame_$prefix$nn".png
    count=`expr $count + 1`
done

Save this script as a .sh file (e.g. frameconvert.sh) and then run it from the terminal using 'sh frameconvert.sh'. You can modify this script for your own purposes.

Friday, June 8, 2012

Getting the Kinect to Work

This post is about how I got the Kinect to work on my machine, which runs Ubuntu 10.04 (Lucid), using ROS.
What didn't Work:
I was trying to get the Kinect for Windows sensor working on my system, but apparently it's not supported by the OpenNI drivers. Find the discussion about it here: http://answers.ros.org/question/12876/kinect-for-windows/ . I then installed a Windows 7 virtual environment using VMware Player. Microsoft released an SDK for Windows which is pretty cool; you can download the SDK from here. I was really hoping for it to work, but came to know that the current version of the SDK doesn't support virtual environments yet. You can find some discussion here: http://social.msdn.microsoft.com/Forums/br/kinectsdk/thread/86528a22-0643-4a1f-819f-8125d7668a68 . I didn't want to do a dual boot and started exploring other options.
What works:
OpenNI does support the older Xbox 360 Kinect sensor, which we had in the lab. I tried it out and it worked (nearly) perfectly. The major steps are outlined as follows:
1) Install ROS. I installed Fuerte, for which you can find the Ubuntu installation directions here.
2) Install the OpenNI drivers using apt-get install ros-fuerte-openni-kinect, or follow the directions here.

That's it and you are done. To launch the Xbox 360 sensor, use
roslaunch openni_launch openni.launch
You will see in the command window:
[ INFO] [1339168119.174802439]: Number devices connected: 1
[ INFO] [1339168119.174938804]: 1. device on bus 002:21 is a Xbox NUI Camera (2ae) from Microsoft (45e) with serial id 'B00367200497042B'
If you want to visualize the RGB image, you can use
rosrun image_view image_view image:=/camera/rgb/image_color
For the depth image, use
rosrun image_view disparity_view image:=/camera/depth_registered/disparity

If you want to save data from the Kinect, follow the instructions here.
Additionally, you can install rviz, a visualization utility for ROS, using
sudo apt-get install ros-fuerte-visualization
You can visualize the RGB image, depth image, and point cloud using rviz.
Some more things to do are to import this data into MATLAB by following the tutorial here.
I also want to use PCL (the Point Cloud Library), because they have pretty neat things in there.

If you have any questions or feedback, please comment below. If you are interested in my work, follow my website (just beginning to make it).