Monday, December 26, 2016

Checking for (Near-)Identical Images in a Large Dataset

Quite often, we run into the problem of detecting duplicate or near-duplicate images in large image datasets. This is especially relevant for deep learning approaches, which learn better with increased diversity in the training image set.

Exact duplicates can be found easily by hashing the image content and comparing the hashes. However, minor distortions (such as a small crop) and/or image noise can produce images that look almost the same but are not exact duplicates. In such cases, it is advisable to extract features from every image in the dataset and compare the feature vectors to find near-duplicate images.
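The exact-duplicate check can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function names `file_digest` and `find_exact_duplicates` are hypothetical, SHA-256 is just one reasonable choice of hash, and the sketch assumes that hashing the raw file bytes is acceptable (two files with identical pixels but different metadata or encoding would not match).

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(paths):
    """Group paths whose file contents hash to the same digest.

    Returns a list of groups (each a sorted list of paths) that
    contain more than one file.
    """
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(str(p))
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Because each file is hashed once and grouped by digest, the cost grows linearly with the number of images rather than with the number of image pairs.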

There are various ways of doing this. Two feature extraction techniques are commonly used:


  1. GIST feature extraction and comparison. "The second check uses GIST [4] descriptor matching, which was shown in [2] to have excellent performance at near-duplicate image detection in large (> 1 million) image collections." (from the supplementary material of Ross Girshick et al., https://people.eecs.berkeley.edu/~rbg/papers/r-cnn-cvpr-supp.pdf)
  2. Extracting CNN (e.g., AlexNet) features for all the images, computing the Euclidean distance between the feature representations of each pair of images, and then using a small threshold to find near-duplicate images.
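The distance-and-threshold step that both techniques share can be sketched as below. This is only an illustration under stated assumptions: the feature vectors (GIST or CNN) are assumed to be already extracted into an N x D array, `near_duplicate_pairs` is a hypothetical helper name, and the threshold value has to be tuned for the chosen features. The sketch builds the full N x N distance matrix, so it suits modest N; a very large dataset would need to be processed in blocks.

```python
import numpy as np


def near_duplicate_pairs(features, threshold):
    """Return index pairs (i, j), i < j, whose feature vectors lie
    within `threshold` Euclidean distance of each other.

    `features` is an (N, D) array with one feature vector per image.
    """
    features = np.asarray(features, dtype=np.float64)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (features @ features.T)
    # Clamp tiny negative values caused by floating-point round-off.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    dists = np.sqrt(sq_dists)
    # Keep only the strict upper triangle so each pair appears once
    # and no image is paired with itself.
    i, j = np.where(np.triu(dists <= threshold, k=1))
    return list(zip(i.tolist(), j.tolist()))
```

For example, with four toy 2-D "features" where images 0 and 1 are close together and images 2 and 3 are close together, calling `near_duplicate_pairs(feats, 0.2)` reports those two pairs and nothing else.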