Web-scale Content Based Image Retrieval
Do you remember the concept of Computer Blindness? It is about people intuitively expecting results from computer vision algorithms that are unreachable by state-of-the-art methods. Believe it or not, I recently fell for that trap myself.
I was looking through Navneet Dalal's slides on Histograms of Oriented Gradients. They contained a lot of frames captured from movies as examples, among them the following one:
It seemed familiar to me. I endeavoured to remember the movie, but failed.1 So I decided to check out some web-sites that offer reverse image retrieval.
Naturally, I first turned to St. Google. Its similar-image service disappointed me, since it could not find anything similar to my image. It operates on a pre-built index (not a very large one), and one can find similar images only within that index: if one somehow finds an image from the index (e.g. via a text query), (s)he is shown a button that triggers similar-image retrieval.
There are other web-sites for this purpose. TinEye positions itself as a reverse image search engine; however, it failed to find anything and honestly admitted that nothing similar was found. GazoPa did find something, but it was far from what I had expected, though the returned images were similar in saturation. That was strange, because such methods usually work on greyscale images to be robust to colour level shifts.
Then I decided to check whether the situation was typical and tried a different image, which I found on my desktop:2
I wanted to find the painter's name and the title3. The result was almost the same: TinEye found nothing, while GazoPa returned a lot of pictures of different girls.
Why is web-scale CBIR still out of reach? Because there are simply too many images on the web, and search over such a large index is intractable. The common workflow is feature extraction followed by feature matching. Hundreds of features can be extracted from each image, and each feature is a high-dimensional vector (e.g. 128-D). Suppose we have N images in the index. If we extract 100 features from each (which is close to a lower bound), then to handle a given query picture we have to compute 100 * 100 * N feature distances in 128-D space. That is hard to do instantly even for small N. In 128 dimensions, indexing structures like kd-trees do not improve on exhaustive search, because branch-and-bound fails to prune the search space in practice (approximate methods partly address this). For example, it took 13 hours on 496 processor cores to match 150,000 photos for the Building Rome in a Day project. This also tells us that there are 150K photos of Rome on the web, so try to imagine the total number of pictures!
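To put that arithmetic in code, here is a minimal NumPy sketch of exhaustive descriptor matching; the random vectors stand in for real 128-D descriptors such as SIFT, and even with a toy index of N = 1,000 images the distance matrix already has ten million entries.

```python
import numpy as np

# Toy stand-in for the numbers above: N images, 100 descriptors each, 128-D.
# A real web index is many orders of magnitude larger.
N_IMAGES, FEATS_PER_IMAGE, DIM = 1000, 100, 128

rng = np.random.default_rng(0)
query = rng.standard_normal((FEATS_PER_IMAGE, DIM))            # one query image
index = rng.standard_normal((N_IMAGES * FEATS_PER_IMAGE, DIM)) # whole index

# Exhaustive matching: 100 * 100 * N squared distances, computed via the
# identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b (one big matrix product).
d2 = ((query ** 2).sum(1)[:, None]
      + (index ** 2).sum(1)[None, :]
      - 2.0 * query @ index.T)
nearest = d2.argmin(axis=1)  # best indexed descriptor per query descriptor
print(nearest.shape)         # (100,) -- and this was only N = 1000 images
```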
But there is certain hope for success in web-scale CBIR. First, when we deal with billions of photos (and trillions of features), kd-trees are likely to give sublinear performance after all. Second, one could use the web context of an image to weed out some obviously wrong matches.
In the long run, we need to develop more descriptive image representations and learn how to combine textual context with visual content. A prior on user interest could also help: the engine may start the search from the most popular photos and ignore the least popular ones. The task could thus be formulated as a sequential test, where the predictor decides whether the match found so far is good enough or the search should continue.
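A sketch of what that sequential formulation might look like; everything here is hypothetical (the popularity-sorted index, the match_score similarity and the thresholds are assumed interfaces, not any real engine's API):

```python
def sequential_search(query_feats, ranked_index, match_score,
                      good_enough=0.9, budget=100_000):
    """Scan candidates from most to least popular, stopping early.

    `ranked_index` yields (image_id, features) pairs pre-sorted by a
    popularity prior; `match_score` returns a similarity in [0, 1].
    Both are illustrative assumptions.
    """
    best_id, best_score = None, 0.0
    for examined, (image_id, feats) in enumerate(ranked_index):
        if examined >= budget:
            break                        # ignore the unpopular long tail
        score = match_score(query_feats, feats)
        if score > best_score:
            best_id, best_score = image_id, score
        if score >= good_enough:         # the sequential test says "stop"
            break
    return best_id, best_score
```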
UPD (Apr 7, 2010). Here is a rather broad list of visual image search engines.
1 My first thought was Roman Polanski's "The Pianist", which is surely wrong. If you recognize the movie, please let me know!
2 I saw the picture in the Hermitage Museum and downloaded it after returning from St. Petersburg, because the girl had reminded me of my friend Sveta.
19 March 2010 at 13:54
Some low-level image features, such as a (spatial) histogram of an image, can be used to quickly reject very dissimilar images. The same low-level features can also be used to cluster images; after the images are clustered, you look for images similar to yours only within its cluster.
I see you're experimenting with blog design, huh?
20 March 2010 at 13:31
Sure, clustering is used nowadays.
What do you mean by spatial histogram?
>> I see you're experimenting with blog design, huh?
Yeah, kind of. The previous one seemed boring.
20 March 2010 at 15:23
By spatial histogram I mean one that takes the spatial distribution of colors in the image into account. For example, you can split the image into 16 large square subimages and calculate a histogram in each subimage separately.
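For concreteness, a minimal sketch of such a descriptor, assuming a greyscale image (per the greyscale remark in the post; a colour version would histogram each channel instead):

```python
import numpy as np

def spatial_histogram(image, grid=4, bins=8):
    """Concatenated per-cell grey-level histograms over a grid x grid split.

    `image` is assumed to be a 2-D greyscale array with values in [0, 255];
    grid=4 gives the 16 square subimages mentioned above.
    """
    h, w = image.shape
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            hist, _ = np.histogram(cell, bins=bins, range=(0, 256))
            cells.append(hist / max(cell.size, 1))  # per-cell normalisation
    return np.concatenate(cells)  # a grid*grid*bins-dimensional descriptor

# Rejection test: a large L1 distance between two descriptors means the
# images are almost certainly dissimilar, so full matching can be skipped.
```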
21 March 2010 at 14:37
Well, it really depends on the target.
First, colours are not robust: the hue channel shifts with the type and settings of the camera and with lighting conditions. So, if you want to find natural photos, you should be careful when quantizing that distribution. There is no such problem in, for example, logo recognition (for the logos themselves, not photos of a logo).
Second, one usually wants to find similar images, not identical ones. There could be images where the Eiffel Tower is either on the left or on the right; in this case only local features will do.
21 March 2010 at 14:56
I agree completely. It's good when you're looking for photos of the same kind of scenery (like a sea coast or steppe), but really bad when you wanna find a specific object like the Eiffel Tower you've mentioned.
Wtf is the guy who posts those strange comments to your blog, btw?
23 March 2010 at 21:37
That guy's Ivan Tolstosheyev.