Comic strip: Nerd's confession

I continue sharing my Prague comic strips, since I have not posted for a while. In fact, I was busy writing a paper, so this one may seem somewhat related. Disclaimer: there is nothing personal in the strip. Kids under 18 are not allowed. :)

The next post is going to be about multimedia technologies for art. Stay tuned!


Happy Thanksgiving!

Here is a comic strip by Ryan Lake that illustrates the concept of Russell's turkey. The idea is simply that induction does not always work the way you expect: there can be factors that had no influence on the training data, yet change the learned concept dramatically. We should thus keep in mind that machine learning (which is based on induction) is not a panacea.



Bertrand Russell is well known for his metaphors. Probably the most famous one is Russell's teapot, which aims to show how pointless it is to believe in god.

Happy Thanksgiving!

PS. I am reading Animal Farm by George Orwell now, it is a great animal metaphor too.


Nice paper acknowledgement

It turns out that Russia is not the only place where PhD students can be so underpaid that they cannot afford food:

This is from Olga Veksler's paper "Graph cut based optimization for MRFs with truncated convex priors" from CVPR 2007. 

Andrew seems to be a nice person. I should definitely ask Anton Osokin to introduce us on occasion. =)


Papers, citations, co-authorship, and genealogy

This summer I accidentally found out that there are two papers citing our CMRT 2009 paper. I was a bit excited about that, since they were the first actual citations of my work, so I even read those papers.

The first of them [Димашова, 2010] was published in Russian by OpenCV developers from Nizhny Novgorod. They implemented a cascade-based face detection algorithm that uses either Local Binary Patterns (LBP) or Haar features. The algorithm was released within OpenCV 2.0. They cite our paper as an example of using Random Forests at the stages of the cascade. However, they implemented the classical variant with a cascade over AdaBoost.

The other citation [Sikiric et al., 2010] is more relevant, since it comes from the road mapping community. They address the problem of recovering a road appearance mosaic from views orthogonal to the surface. They contrast their approach with ours in that theirs does not employ human interaction. In fact, we need human input only for recognizing road defects and lane markings; rectification is done in a previous stage, which is fully automatic.

The rest of the post is devoted to some interesting metrics and structures concerning citations, co-authorship and supervising students.

Citations

The number of citations is a weak measure of paper quality. We can also go further and estimate the impact of a journal or a particular researcher based on the number of citations. A generally accepted measure is the journal impact factor, which is simply the mean number of citations per paper published in the journal over some period of time. Individual researchers can be evaluated by the impact factor of the journals they publish in, though this is considered bad practice. Another, not-so-bad practice is the h-index. By definition, one's h-index equals H if each of one's top H papers has at least H citations1. It has also been criticised widely.
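For concreteness, here is a minimal sketch of how an h-index could be computed from a list of per-paper citation counts (the function name and the Python setting are my own illustration, not anything standardized):

```python
def h_index(citations):
    """Return the largest H such that at least H papers have >= H citations each."""
    # Sort citation counts in decreasing order and take the last (1-based) rank
    # at which the count is still at least the rank.
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers have at least 4 citations each
print(h_index([25, 8, 5, 3, 3]))  # 3
```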

So, citation-based scoring has a lot of flaws. But what can we use instead? Another interesting approach is introduced by ReaderMeter. They collect the information about the number of people who have added some paper to their Mendeley collections and compute something like h-index. Unfortunately, they recently excluded some papers from their database, so the statistics became less representative but more accurate.

Co-authorship and Erdős number

Paul Erdős was a Hungarian mathematician who published about 1,400 research papers with 511 different co-authors. That's why he has a special role in bibliometrics. The Erdős number is defined as the collaborative distance from a researcher to Erdős. More strictly, Wikipedia defines it as follows:
Paul Erdős is the one person having an Erdős number of zero. For any author other than Erdős, if the lowest Erdős number of all of his coauthors is k, then the author's Erdős number is k + 1.
My Erdős number is at most 7 (via Olga Barinova, Pushmeet Kohli, Philip H.S. Torr, Bernhard Schölkopf, John Shawe-Taylor, David Godsil). To be honest, the Erdős number system is primarily used for math papers. Even if we use the wider definition, it is not that beautiful, because there are not too many gates: e.g. the whole vision community will probably connect to Erdős via the machine learning gate, so researchers who work on that edge will be closer. To illustrate this, David Marr probably did not have a finite Erdős number during his lifetime (now he definitely does). So we can introduce, say, a Zisserman number for computer vision.2 According to DBLP, Andrew Zisserman has 165 direct co-authors so far. My Zisserman number is 3, and David Marr's is 4 (via Tomaso Poggio, Lior Wolf, Yonatan Wexler).
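Computing any such collaboration distance is just a breadth-first search over the co-authorship graph. Here is a minimal sketch; the toy graph at the bottom is made up purely for illustration:

```python
from collections import deque

def collaboration_distance(coauthors, source, target):
    """BFS over an undirected co-authorship graph given as {author: set of co-authors}.
    Returns the number of co-authorship hops, or None if the authors are not connected."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        author, dist = queue.popleft()
        for coauthor in coauthors.get(author, ()):
            if coauthor == target:
                return dist + 1
            if coauthor not in visited:
                visited.add(coauthor)
                queue.append((coauthor, dist + 1))
    return None

# toy example, not real co-authorship data
graph = {"Erdos": {"A"}, "A": {"Erdos", "B"}, "B": {"A", "C"}, "C": {"B"}}
print(collaboration_distance(graph, "C", "Erdos"))  # 3
```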

The movie industry has its own metric, the Bacon number (after Kevin Bacon). Two actors are at distance 1 if they have appeared in the same movie. Someone tried to combine the two into the Erdős-Bacon number: for a given person, it is just the sum of her Erdős and Bacon numbers. Of course, very few people have a finite Erdős-Bacon number, since one should both appear in a movie and publish a research paper (and even that is not sufficient). Often such numbers are possessed by researchers who consulted a film crew and accidentally got filmed. =) Erdős himself has an Erdős-Bacon number of 3 or 4 (depending on the details of the definition), since he starred in N is a Number (1993) and his Bacon number is thus 3 or 4.

The person with probably the lowest Erdős-Bacon number is Daniel Kleitman, an MIT mathematician, who appeared in one of my favourite movies, Good Will Hunting (1997), along with Minnie Driver, who collaborated with Bacon in Sleepers (1996). Since Kleitman has 6 joint papers with Erdős, his Erdős-Bacon number is 3.

Surprisingly, Marvin Minsky's Erdős number (4) is greater than his Bacon number (2), which he obtained via The Revenge of the Dead Indians (1993) and Yoko Ono. Another strange example is the paedophile's dream Natalie Portman. She graduated from Harvard, saying she would rather "be smart than a movie star". That's my kind of girl! A neuroscience paper [Baird et al., 2002] brought Natalie Hershlag (her real name) an Erdős number of 5, and then she appeared in New York, I Love You (2009) along with Kevin Bacon, so she reached the same Erdős-Bacon number as Minsky, i.e. 6.

UPD (March 13, 2011). There is a totally relevant xkcd strip.

Scientific genealogy

Finally, let me tell you about what is known as scientific genealogy. Every grown-up researcher has a PhD advisor, usually one, while an advisor can have many students. Using the analogy with parents and children, we get a tree (or a forest) representing the historical structure of science. The Mathematics Genealogy Project aims to recover this structure.

I tried to trace my genealogy back. Unfortunately, I failed to find out who Yury M. Bayakovsky's PhD advisor was. But if I consider Olga Barinova my advisor, I am an 11th-generation descendant of Carl Friedrich Gauß. Nice ancestry, huh?

A similar project exists for computer vision. There are 290 people in the database, though there are duplicates (I've found five Vittorio Ferraris :). It seems strange that some trees are as deep as 5 levels (e.g. Kristen Grauman and Adriana Quattoni are the 5th generation after David Marr), though vision is a relatively young field.

1 Okay, take the supremum of the set if you want to stay formal
2 It seems that Philip Torr has already used that number.


LidarK 2.0 released

The second major release of GML LidarK is now available. It reflects our three years of experience in 3D data processing. The description from the project page:

The LidarK library provides an open-source framework for processing multidimensional point data such as 3D LIDAR scans. It allows building a spatial index for performing fast search queries of different kinds. Although it is intended to be used for LIDAR scans, it can be helpful for a wide range of problems that require spatial data processing.

The API has been enriched with various features in this release, and it became more consistent and logical. New ways to access data (i.e. various iterators) are implemented. One can now find the k nearest neighbours of any point, not just of one that belongs to the index. Since the data structure is a container, we've decided to parametrize it with a template parameter. This decision is controversial: one does not need to cast tuples any more, but the code became clumsier and less portable, unfortunately.
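For readers who just want to try this kind of query without building the library, here is roughly what k-nearest-neighbour and radius search over a 3D point cloud look like. The sketch below uses SciPy's cKDTree, not the LidarK API:

```python
import numpy as np
from scipy.spatial import cKDTree

# a random 3D point cloud standing in for a LIDAR scan
points = np.random.rand(10000, 3) * 100.0

tree = cKDTree(points)                   # build the spatial index once
query = np.array([50.0, 50.0, 50.0])     # need not belong to the cloud

dists, idx = tree.query(query, k=5)      # 5 nearest neighbours
print(idx, dists)

neighbours = tree.query_ball_point(query, r=2.0)   # all points within a radius
print(len(neighbours))
```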

The C++/MATLAB code is licensed free of charge for academic use. You can download it here.


ECCV 2010 highlights

Today is the last day of ECCV 2010, so the best papers have already been announced. Although I was not there, a lot of papers are available via cvpapers, so I eventually ran into some of them.

Best papers

The full list of the conference awards is available here. The best paper is "Graph Cut based Inference with Co-occurrence Statistics" by Lubor Ladicky, Chris Russell, Pushmeet Kohli, and Philip Torr. The second best is "Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics" by Abhinav Gupta, Alyosha Efros, and Martial Hebert. Two thoughts before reviewing them. First, the papers come from two institutions which are now (arguably) considered the leading ones in the vision community: Microsoft Research and the Carnegie Mellon Robotics Institute. Second, both papers are about semantic segmentation (although the latter couples it with implicit geometry reconstruction); Vidit Jain already noted the acceptance bias in favour of recognition papers.

Okay, the papers now. Ladicky et al. addressed the problem of global terms in energy minimization for semantic segmentation. Specifically, their global term deals only with the occurrence of object classes and is invariant to how many connected components (i.e. objects) or individual pixels represent a class. Therefore, one cow in an image gives the same contribution to the global term as two cows, or as one accidental pixel of a cow. The global term penalizes a large number of different categories in a single image (an MDL prior), which is helpful when we are given a large set of possible class labels, and also penalizes the co-occurrence of classes that are unlikely to appear together, like cows and sheep. Statistics are collected from the training set and define whether the co-occurrence of a certain pair of classes should be encouraged or penalized. Although the idea of incorporating co-occurrence into the energy function is not new [Torralba et al., 2003; Rabinovich et al., 2007], the authors claim that their method is the first one that simultaneously satisfies four conditions: global energy minimization (an implicit global term rather than a multi-stage heuristic process), invariance to the structure of classes (see above), efficiency (the model should not become an order of magnitude larger) and parsimony (the MDL prior, see above).
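To make the structure of such an energy concrete, here is a schematic sketch of my own (a simplification, not the authors' exact formulation): the global term depends only on which labels occur in the labelling, not on how many pixels or components carry each of them.

```python
def energy(labels, unary, pairwise, edges, cooccurrence_cost):
    """Schematic segmentation energy with a label-set global term.

    labels: a list with one class label per pixel/superpixel
    unary[i][l]: cost of assigning label l to node i
    pairwise(l1, l2): cost for an edge whose endpoints take labels l1 and l2
    edges: list of (i, j) node pairs
    cooccurrence_cost(label_set): a cost that depends only on WHICH labels occur
    """
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += sum(pairwise(labels[i], labels[j]) for i, j in edges)
    e += cooccurrence_cost(frozenset(labels))   # one cow or two cows: same contribution
    return e
```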

How do the authors minimize the energy? They restrict the global term to a function of the set of classes represented in the image that is monotonic with respect to set inclusion (more classes, more penalty). Then the authors introduce fake nodes for the αβ-swap or α-expansion procedure, so that the optimized energy remains submodular. It is very similar to how they applied graph-cut based techniques to minimizing energies with higher-order cliques [Kohli, Ladicky and Torr, 2009]. So when you face some non-local terms in an energy, you can try something similar.

What are the shortcomings of the method? It would be great to penalize objects in addition to classes. First, local interactions are taken into account, as well as global ones, but what about the medium level? The shape of an object, its size, colour consistency etc. are great cues. Second, on the global level only inter-class co-occurrences play a role, but what about the intra-class ones? It is impossible to have two suns in a photo, but one is likely to meet several pedestrians walking along a street. This is actually done by Desai et al. [2009] for object detection.

The second paper is by Gupta et al., who have remembered the romantic period of computer vision, when scenes composed of perfect geometrical shapes were reconstructed successfully. They address the problem of 3D reconstruction from a single image, as in auto pop-up [Hoiem, Efros and Hebert, 2005]. They compare the result of auto pop-up with Potemkin villages: "there is nothing behind the pretty façade" (I believe this comparison is the contribution of the second author). Instead of surfaces, they fit boxes into the image, which allows them to impose a wider range of constraints on the 3D structure, including:
  • static equilibrium: it seems that the only property they check here is that the centroid projects onto the supporting region;
  • sufficient supporting force: they estimate density (light -- vegetation, medium -- humans, heavy -- buildings) and note that it is unlikely that a building stands on top of a tree;
  • volume constraint: boxes cannot intersect;
  • depth ordering: backprojecting the result to the image plane should correspond to what we see in the image.
This is a great paper that exploits Newtonian mechanics as well as human intuition; however, there are still some heuristics (like the density of a human) which could probably be generalized away. This approach seems to have big potential, so it might become the seminal paper of a new direction. Composing recognition with geometry reconstruction is quite trendy now, and this method is ideologically simple but effective. There are a lot of examples of how the algorithm works on the project page.

Funny papers

There are a couple of ECCV papers with fancy titles. The first one is "Being John Malkovich" by Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steve Seitz from the University of Washington GRAIL. If you've seen the movie, you can guess what the article is about. Given a video of someone pulling faces, the algorithm transforms it into a video of John Malkovich making similar faces. "Ever wanted to be someone else? Now you can." In contrast to the movie, in the paper it is not necessarily John Malkovich who plays himself: it could be George Bush, Cameron Diaz, George Clooney, or any person for whom you can find a sufficient video or photo database! You can see the video of the real-time puppetry on the project page, although there are obvious lags and the result is still far from perfect.

Another fancy title is "Building Rome on a Cloudless Day". There are 11 (eleven) authors contributing to the paper, including Marc Pollefeys. This summer I spent one cloudless day in Rome, and, to be honest, it was not that pleasant. So why is the paper called this way? It refers to another paper, "Building Rome in a Day" from ICCV 2009 by the guys from Washington again, which itself refers to the proverb "Rome was not built in a day." In that paper the authors built a dense 3D model of some Rome sights from a set of Flickr photos tagged "Rome" or "Roma". Returning to this monument of collective intelligence at ECCV 2010: they did the same, but without cloud computing, which is why the day is cloudless now. S.P.Q.R.

I cannot avoid mentioning the following papers here, although they are not from ECCV. Probably the most popular CVPR 2010 paper is "Food Recognition Using Statistics of Pairwise Local Features" by Shulin Yang, Mei Chen, Dean Pomerleau, and Rahul Sukthankar. The first page of the paper contains a motivation picture with a hamburger, and it looks pretty funny. They insist that the stuff from McDonald's is very different from that from Burger King, and it is really important to recognize them to keep track of the calories. Well, the authors don't look overweight, so the method should work.

The last paper in this section is "Paper Gestalt" by the imaginary Carven von Bearnensquash, published in the Secret Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010. The authors (presumably from UCSD) make fun of the way we usually write computer vision papers, assuming that certain features might convince a reviewer to accept or reject the paper: mathematical formulas that create an illusion of author qualification (even if they are irrelevant), ROC curves, etc. It also derides attempts to apply black-box machine-learning techniques without appropriate analysis of the possible features. Now I am trying to subscribe to the Journal of Machine Learning Gossip.

Colleagues

There was only one paper from our lab at the conference: "Geometric Image Parsing in Man-Made Environments" by Olga Barinova and Elena Tretiak (in co-authorship with Victor Lempitsky and Pushmeet Kohli). A scheme similar to the image parsing framework [Tu et al., 2005] is utilized, i.e. top-down analysis is performed. They detect parallel lines (like edges of buildings and windows), their vanishing points and the zenith jointly, using a witty graphical model. The approach is claimed to be robust to clutter in the edge map.

Indeed, this paper would not have been possible without me. =) It was me who convinced Lena to join the lab two years ago (actually, it was more like convincing her not to apply to the other lab). So the lab will remember me at least as a decent talent scout...


Comic strip: 10%

Last summer, during my internship at CMP in Prague I drew some xkcd-style comic strips, and I want to share some of them here. I considered starting a different blog for that, but there are too few of them, and they do suck (at least in comparison to Randall's work). However, most of them are programming/research related, so they are appropriate here.

I planned to use them when I would have nothing to post about. This is not the case now (I am planning a couple more posts in a week or so), but this one is topical. It was originally drawn on September 1, 2009, when GMail was not operating for a few hours. From the recent news: Google is shutting Google Wave down, admitting the failure predicted by some sceptics (see the image title text).


Theorem 8.10. P ≠ NP.

Nearly two weeks ago the Indian-American mathematician Vinay Deolalikar, an employee of HP Labs, sent a letter to a group of mathematicians asking them to proof-read his attempt to prove P ≠ NP. The pre-print was eventually distributed over the Internet, so a lot of people became aware of it. Later, two major flaws were found in the proof, so it is now considered wrong. Nevertheless, Vinay has removed the paper from his page and commented that he has fixed all the issues and is going to submit the paper for journal review. You can download the original paper from my web-site [Deolalikar, 2010].

We'll see if he is bluffing or not. In any case, it is very pleasant to me that the main role in the proof is played by... the graphical probabilistic model framework. Yeah, those graphical models we use in computer vision! It was surprising indeed, since the problems in computational complexity are usually discrete and well-posed. So, this post is devoted to understanding why a graphical model emerges in the proof.

First, I am going to review some key points of the proof. It is far from being exhaustive, just some rough sketches that are critical for understanding. Then I report on the flaws found in the proof, and then I end up with a methodological note on the whole P vs. NP problem.

k-SAT and d1RSB Phase

In order to prove P ≠ NP, it is enough to show the absence of a polynomial algorithm for some problem from the class NP, e.g. for some NP-complete problem (the "hardest" problems in NP). The boolean satisfiability problem (SAT) asks whether there is at least one assignment of the boolean variables of a formula, represented in conjunctive normal form (CNF), which makes it TRUE. The k-satisfiability problem (k-SAT) is a particular case of SAT where all the clauses in the CNF have order k. For example, (x˅y)&(¬x˅y)&(x˅¬y)&(¬x˅¬y) is an instance of 2-SAT which has the answer 'NO', since every boolean assignment of (x, y) makes the formula false. (x˅y˅z)&(¬x˅u˅v)&(¬x˅u˅¬z) is a 3-SAT instance over 5 variables (x, y, z, u, v), which is satisfiable, e.g. on the input (1,0,0,1,0). 2-SAT is in P, but k-SAT is NP-complete whenever k > 2. Deolalikar's idea is to show that k-SAT (with k > 8) is outside of P, which (if true) proves that there is a separation between P and NP.
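A brute-force check of the examples above takes a few lines (exponential in the number of variables, of course, which is exactly the point). A small sketch, with a literal k > 0 standing for x_k and k < 0 for its negation:

```python
from itertools import product

def satisfiable(clauses, n_vars):
    """Check a CNF by trying all 2^n assignments."""
    for assignment in product([False, True], repeat=n_vars):
        if all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# (x|y)&(~x|y)&(x|~y)&(~x|~y) -- the 2-SAT example, unsatisfiable
print(satisfiable([[1, 2], [-1, 2], [1, -2], [-1, -2]], 2))   # False
# (x|y|z)&(~x|u|v)&(~x|u|~z) -- the 3-SAT example, satisfiable
print(satisfiable([[1, 2, 3], [-1, 4, 5], [-1, 4, -3]], 5))   # True
```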

The space of possible k-SAT solutions is analysed. Let m be the number of clauses in the CNF and n the number of variables; consider the ratio α = m/n. In the border case m = 1 (α is small), the formula is always satisfiable, and there are usually a lot of inputs that satisfy the CNF. Consider an ensemble of random k-SAT instances (the following is statistical reasoning). As α grows, the probability that the CNF is satisfiable decreases. When a certain threshold αd is reached, the set of solutions breaks into clusters. This stage is known in statistical mechanics as the dynamical one-step replica symmetry breaking (d1RSB) phase. Moving further, we make the problem completely unsatisfiable. It turns out that the d1RSB-stage subproblems are the most challenging of all k-SAT problems. The effect can be observed only if k > 8; that's why such problems are used in the proof. [Deolalikar, 2010, Chapter 6]


FO(PFP) and FO(LFP)

In finite model theory there are recursive extensions of first-order logic. The predicate P_i(x) is evaluated as some second-order function φ(P_{i-1}, x), where x is a k-element vector and P_0(x) ≡ false. In the general case, either there is a fixed point, or P_i loops. For example, if φ(P, x) ≡ ¬P(x), then P_0(x) ≡ false, P_1(x) ≡ true, P_2(x) ≡ false, etc. Here x is meaningless, but that is not always the case. Consider the following definition: φ(P, x) ≡ max(x) ˅ P(x) (recall that x is a vector; in the boolean case 'max' is the same as 'or'). If x contains at least one non-zero element, P_0(x) = false, P_1(x) = true, P_2(x) = true, etc. Otherwise, P_i(0) = false for all i. In the case of looping, let's redefine the fixed point to be constantly false. FO(PFP) is the class of problems of checking whether, for some x, there will be a loop or a fixed point. These are actually very difficult problems: FO(PFP) is equal to the whole of PSPACE. Suppose now that φ is monotone in P. This means that P(x) only appears in the formula behind an even number of negations (or zero, as in the second example), and therefore once P_i(x) is true, P_j(x) will be true for any j > i. This reduces the class to FO(LFP), which is proven to be equal to the class P. So the global problem now is to show that k-SAT is outside of the class FO(LFP).
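The monotone example above can be simulated directly. A toy sketch of the least-fixed-point iteration over a finite domain of tuples (the predicate is represented as the set of tuples on which it is true):

```python
def lfp(phi, domain):
    """Iterate P_{i+1} = {x : phi(P_i, x)} starting from the all-false predicate.
    If phi is monotone in P, the iteration reaches the least fixed point."""
    P = set()                                    # P_0 is false everywhere
    while True:
        P_next = {x for x in domain if phi(P, x)}
        if P_next == P:
            return P
        P = P_next

# phi(P, x) = max(x) OR P(x): a tuple enters P as soon as it has a non-zero element
domain = [(0, 0), (0, 1), (1, 1)]
phi = lambda P, x: max(x) == 1 or x in P
print(lfp(phi, domain))   # {(0, 1), (1, 1)}; (0, 0) never enters
```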

ENSP: the graphical model for proving P ≠ NP

So how does a graphical model emerge here? A graphical model is a way to describe a multivariate probability distribution, i.e. the dependencies of the covariates in the distribution. Recall the definition of NP: if we somehow guess the certificate, we are done (i.e. we have a polynomial algorithm). If the space of certificates (possible solutions, in Deolalikar's terms) is simple, we can probably get a polynomial algorithm. What does simple mean? It means that the distribution is parametrizable with a small number of parameters (cf. intrinsic dimensionality), which allows us to traverse the solution space efficiently. This is twofold. First, the distribution is simple if it has limited support, i.e. all the probability measure is distributed among a limited number of points of the probability space. Second, it is simple if the covariates are as conditionally independent as possible. In the ideal case, if all groups of covariates are independent (recall that pairwise and mutual independence do not subsume each other!), we need only n parameters to describe the distribution (n is the number of covariates), while in the general case this number is exponential. See [Deolalikar, 2010, p. 6] for examples.

How do we measure the degree of conditional independence? Right, by building a graphical model. It factorizes over cliques according to the Hammersley-Clifford theorem: the larger the cliques, the stronger the dependencies. When the largest cliques are k-interactions, the distribution can be parametrized with about n·2^k independent parameters. Finally, Vinay shows that FO(LFP) can deal only with distributions parametrizable with 2^poly(log n) parameters, which is not the case for d1RSB k-SAT (its solution space is too complicated).

In order to show this strictly, Deolalikar introduces a tailored graphical model describing the FO(LFP) iterative process, the Element-Neighborhood-Stage Product model (ENSP):

There are two types of sites: those corresponding to the variables at the stages of the LFP (small circles) and those corresponding to the elements of a witty neighbourhood system (some closure over the Gaifman graph; big blue circles). When a variable is assigned 0 or 1, it is painted red or green respectively. The last column corresponds to the fixed point, where all the variables are assigned. Thus, this model easily factorizes into small cliques, and it cannot describe the hypothetical LFP process for such a k-SAT instance. See [Deolalikar, 2010, Chapter 8] for the details.

So what's the problem?

Neil Immerman, an expert in finite model theory, noticed two major flaws. The first one is that Vinay actually models only monadic LFP, which is not equal to P, as assumed. He thus proved that there is a gap between NP and some subclass of P, which does not necessarily equal P. The second major flaw concerns modelling k-SAT as a logical structure. I do not fully understand this step, but some order-invariance assumptions are wrong. Here is a discussion in Lipton's blog. According to it, the flaws are really fatal.

The third way

It seems that it is really difficult to prove P ≠ NP. Most scientists think that P = NP is improbable, and that's why the hypothesis P ≠ NP is generally adopted now. But there is one more option: neither P = NP nor P ≠ NP is provable. Like every mathematical theory, computational complexity is a set of axioms and statements, and statements usually can be proved correct or not. But sometimes there are statements, formulated in terms of the theory, which cannot. Moreover, according to Gödel's first incompleteness theorem, in any mathematical theory based on our natural understanding of natural numbers, there is either a statement that can be proved both true and false, or an unprovable one. I know of no actual reason why this could not be the case for P ≠ NP (although I have a feeling there is one, since this possibility is not usually taken into account).

Suppose it is really so. This would mean that the whole theory should be reconsidered: some axioms or definitions would change to make this fact provable. But then maybe something else would become unprovable. Who knows...


Image Parsing: Unifying Segmentation, Detection, and Recognition

Recently I posted about the connection between object category detection and semantic image segmentation. While delving into this problem more deeply, I found the paper named in the title of this post.

The ICCV 2003 paper by Tu et al. was among the three Marr prize winners. And it is really a prominent piece of work! In the introduction to the special Marr prize IJCV issue, Bill Triggs featured the paper as one that "would have been particularly pleasing to David Marr", because of "its bold attack on the central problem of perceptual organization and whole scene understanding". Here is the journal version of the paper.

The authors combined discriminative and generative models, which resulted in a unified framework for image parsing. For each image, a corresponding parsing tree is found. The root of the tree represents the whole scene. On the next level, nodes represent semantic fragments, such as human faces, text, or textured regions. The leaves of the tree correspond to the pixels of the image.

The approach differs drastically from the vanilla CRF framework in that the structure of the parsing tree is dynamic, while the CRF structure remains constant and only the interaction models between pairs of sites may change. The goal is to obtain the tree that maximizes the posterior probability given the image. A directed search in the space of valid trees is performed by means of Markov chain Monte-Carlo (MCMC). Possible tree changes, like splitting and merging regions or varying the border between regions, are defined. Such changes are described in terms of learnable Markov chain transition kernels.

What they actually did is effectively combine top-down and bottom-up approaches. In a nutshell, the bottom-up (discriminative) method generates hypotheses of pixel labels and object positions using the local neighbourhood, and the top-down generative model is then built making use of those hypotheses. The latter guarantees the consistency of the output and the optimality of its posterior probability.

Okay, what does it mean for the issue of detection and segmentation convergence? Because the shapes of objects are determined implicitly during the recognition, and stuff regions are detected by their colour and texture, the problem of semantic image segmentation is actually solved. Moreover, multi-scale semantic segmentation is performed during the parsing. Therefore, image parsing seems to be the most general formulation of image recognition problem.

The nodes of a parsing tree which belong to the same depth level are not connected (right, because it is a tree). This means that spatial interactions cannot be modelled directly. A totally different approach was introduced in another Marr prize winning paper, by Desai et al. [2009]. They also use bottom-up tests to generate hypotheses on the object locations and then use a CRF over those locations to take spatial context into account. They model different kinds of inter-class and intra-class interactions. Thus, they ensure that the objects in the image (described by their bounding boxes) are arranged correctly, e.g. a cup is likely to be on a table (spatial arrangement), while a train cannot be close to a ship (mutual exclusion).

It seems that the two papers exploit the two different forms of information (aggregation vs. spatial interactions), which are mutually redundant to a certain extent, but do not completely exhaust each other. It seems that the group of researchers who will successfully combine those approaches will receive some 201x Marr prize.


ICVSS 2010: some trivia

If you think that summer schools are only about lectures and posters, you are certainly wrong. Well, I also did not expect that I would get drunk every night (it was not in Russia after all!). On the first night I convinced people to go to the beach for night swimming, but on the next day I realized that it was not a really good idea, because lectures started at 9. Eventually, we did not go to bed early for the rest of the week.

I would like to thank "Marsa Sicla" people (Fig. 1), you were the great company!

Fig. 1. The Marsa Sicla company. Left to right: Richard, Ramin,
Nicolas, Xi, Vijay, Clément, Aish, me. Sandra is behind the camera.

Actually, there was a reason why we had such a close-knit company. In Fig. 2 you can see the map of the region where the school took place. The school was hosted by the Baia Samuele hotel village (inside the blue frame on the map). The cheapest accommodation at Baia Samuele cost €100 per night, so some pragmatic people scared up different accommodation. It was in a neighbouring hotel village, Marsa Sicla.


Fig. 2. ICVSS location map.

On the map it looked pretty close, but in practice it turned out that there were a lot of fences (blue line)! Officially, we had to walk along the red route (~2.5 km, partly along a highway) through the official entrance. Moreover, we had to leave the village for lunch (since it was a buffet), otherwise we would have been charged €30 per meal. To control this, they wanted us to leave our documents at the control post (bottom blue tick in Fig. 2). Thus, we were supposed to walk 2.5 km four times a day. But we found a better solution.

Although the fence was equipped with barbed wire, we managed to find two shortcuts (green ticks). One of them led through an ever-closed gate (Fig. 3), which was suitable for hopping over. So we usually used the green path; it made the way twice as short. Moreover, we did not have to leave Baia Samuele for lunch, since we weren't officially there. Sometimes we even had a free lunch, which finally proved that the no-free-lunch theorem is wrong (joke by Ramin).

Fig. 3. Hopping over the fence.

Well, the lunch was really great. But it was not really a question of food: the main reason not to leave Baia Samuele was of course socializing, which was most active during lunch (and also the Internet access at the conference centre :). If we had known about this situation before booking the apartments, we would probably have followed the official accommodation recommendations. However, we had a great company at Marsa Sicla too. But because of the "unofficial night programme" I had to (almost) sleep during lectures. Most of the guys were in the last year of their PhD, so they considered the school a vacation, but I really wanted to learn a lot of stuff, and it is really difficult to take in any information when you have not slept enough. Surprisingly, I somehow managed to pass the final exam, but I feel I need to look through the lecture slides again and read up on the papers they refer to, otherwise the scientific part of the school will be useless for me.

Thus, I finish posting about ICVSS. To conclude, I want to recommend that everyone attend such summer schools, because they are the best way to enter the community.


ICVSS 2010: Have you cited Lagrange?

Yeah, I have. :) Before the beginning of the summer school we received a homework assignment from Prof. Stefano Soatto. We had to read three papers and discover the roots of the ideas exploited by the authors. Such connections are not necessarily expressed by references. As a result, one was supposed to get a tree (or maybe a lattice) rooted in the given paper. Since the task was not accurately formulated, most of us chose just one paper and made a tree for it. Later it turned out that all three problems (optical flow estimation, multi-view stereo and volume registration) have the same roots, because they lead to similar optimization problems, and we were supposed to establish that connection.

I (like most of the students) had limited time to prepare the work. All modern research in optical flow is based on two seminal papers, by MIT's Horn & Schunck and CMU's Lucas & Kanade, both from 1981. During the last 30 years a lot of work has been done. It was summarized by Sun et al. (2010), who discovered that only a small fraction of the ideas gives significant improvement and combined them into a quite simple method that found its place at the top of the Middlebury table. Since I had no time to read a lot of optical flow papers, I traced the roots of a lot of the math (like PDEs and the calculus of variations) used in the Horn-Schunck paper. Just as a joke, I added references to the works of Newton, Lagrange and d'Alembert to my report. Surprisingly, the joke was not really understood.

There were only 21 submissions, one of them 120 pages long (the author did not show up at the seminar =), and the tree depth varied from 1 to 20. I was not the only one who cited "pre-historic" papers; someone traced ideas back to Aristotle through Occam. Questions about the horizon arose: it is really ridiculous to find the roots of computer vision in Aristotle's works. Since prizes were promised for rare references (the prize policy was left unclear too), some argument took place. Soatto did not give any additional explanation of the formulation of the task, but let the audience decide which references were legitimate in controversial cases. Eventually, I was among the 5 people whose references were considered controversial, so I needed to defend them. Well, I suppose I looked pretty pathetic talking about Newton's finite differences, which were used to approximate derivatives in the Horn-Schunck paper. Surprisingly, almost half of the audience voted for me. =) Also, my reference to Thomas Reid's essay was tentatively approved.

Finally, there were no bonuses for papers that could not be found with Google, and the whole $1000 prize went to the Cambridge group. To conclude, Soatto (among other things) said that nobody had read any German or Russian papers. After the seminar I told him how I had tried to dig into the library (it is described here in Russian), and he answered something like "funny, it is hard to find them even for you!"

One evening during dinner we talked about Russia, and Vijay told a lot of interesting stuff (like the fact that the guy who developed ChatRoulette is now working in Silicon Valley). He remembered Stephen Boyd, who is known for his jokes about Russia. I said that I had watched the recordings of his amazing course on Convex Optimization. It turned out that Vijay (who is now a PhD student at Stanford) took that course, and he promised to tell Boyd that he has a fan in Russia. (Or, maybe, in Soviet Russia fans have Boyd =).

continued here


ICVSS 2010: lectures and posters

I returned from my eurotrip yesterday and now I am ready to start a series of posts about the International Computer Vision Summer School (ICVSS 2010). Generally, I enjoyed the school. That week gave me (I hope) a lot of new knowledge and new friends.

The scientific programme of the school included lectures, tutorials, a student poster session and a reading group. Lectures occupied most of the official programme time and were given by a great team of professors including Richard Szeliski, the renowned Tomaso Poggio and the enchanting Kristen Grauman. I am not going to describe all the talks, but will feature the ones that are close to my interests. You can find the complete programme here. Unfortunately, no video was recorded, but I have access to all the slides and posters, so if you are interested in anything, I can send it to you (I believe it does not violate any copyrights).

Wednesday was the day of Recognition. Kristen Grauman gave a talk about visual search. She covered the topics of specific object search using local descriptors and bags of words, object category search with pyramid matching, and also discussed the state of the art in the challenging problem of web-scale image retrieval. Mark Everingham continued with a talk about category recognition and localization using machine learning techniques. Localization is achieved using bounding boxes, segments, or object part search (like finding eyes and a mouth to find a face). Of course, he could not avoid mentioning the importance of context. He also explained the PASCAL VOC evaluation protocol.

The tutorials covered some applications and did not really impress me. The poster session was a great opportunity to meet people. Some posters were really decent, for example Michael Bleyer's poster on dense stereo estimation using soft segmentation, which won half of the school's best presentation prize. The work was done with Microsoft Research Cambridge; they formulated a really complicated energy function based on surface plane estimates and minimized it with Lempitsky's graph-cut based fusion-move algorithm (2009). More details can be found in their recent CVPR paper.

The problem with the poster session was that the lecturers did not attend it, although they could have given great feedback. My presentation was on the last day of the session, after the not-really-popular reading group and in the room upstairs (whose existence was not a well-known fact :), so many people preferred to spend that evening on the beach. The school audience was quite heterogeneous; there were a lot of people from medical imaging, video compression etc., so I had to explain some basics (like MRFs) to some of the students interested in my poster. There were also really smart guys (like those from the Cambridge group). There was a bit of useful feedback: someone recommended that I use QPBO for inference, and I should probably consider it.

Since the speakers did not attend the poster session, students had to communicate with them informally. A roommate of mine, Ramin, who does crowd analysis, caught Richard Szeliski and asked him about local descriptors in video that operate in 3D image-time space. Szeliski said that it is a promising field and even remembered Ivan Laptev's name. During our tour to Ragusa Ibla I asked Mark Everingham (who is probably the closest of all the speakers to my topic) about 3D point cloud classification. He said it was not really his field, but that it should be fruitful to analyse clouds not only at the local level, and that stuff like multi-layer CRFs could be useful. In the last two years a few papers have appeared that exploit that simple intuitive idea and combine shape detectors with CRFs, but it usually looks awkward. Well, maybe smartly designed multi-layer CRFs might be really useful. It is funny: when I told Everingham that I'm from Moscow, he replied that it was great, that Moscow has a great math school, and remembered Vladimir Kolmogorov. So our education seems to be not so terrible. :D

continued here


Mendeley news

A few months ago I posted about Mendeley, the reference management system. I've been using it all the time and I am really satisfied. It is much more convenient to have your papers on a server than on a flash drive. Also, keeping meta-information about the papers makes dealing with them (exploring, reading, citing etc.) more pleasant.

This month a major update was released and a monetization scheme was introduced. There are now 3 paid plans in addition to the free one, which allows one to use up to 1 GB of server disk space. There are also restrictions on the maximum number of shared collections and the number of users per collection. It does not look critical after all: there are no features you cannot use in the free version, so the company seems not evil. However, on the sister project last.fm the radio feature became paid after years of free service, which caused many users to drift to Grooveshark.

The update has affected the design of the paper-info right-hand bar, and the references bar was eliminated. They promise to bring the bar back in future releases. It was really unusable before. When you read a paper, it is handy to have the list of indexed references in the bar, especially if numbered citations are used (that is the major problem with reading papers on screen: what the hell is [42] referring to?). I hope the people at Mendeley realize this.

It would be great to be able to find papers without leaving the application. In theory, you could drag a citation from the references bar to a collection, find its (more) precise details via an Internet search, and you are likely to get a link; if it is direct, you can add the PDF with the Add File dialog using the found URL. In practice, the found URLs are usually not direct: they refer to IEEE/ACM/Springer pages where the paper can be downloaded (usually not freely). At the same time, the paper is likely to be available for free through a direct link on the author's homepage. Moreover, Google Scholar often finds those too, but Mendeley chooses the indirect links.

Mendeley needs to be more "semantic" in working with authors and conferences. It stores the author name as it appears in the paper. When you want to filter your papers by author, you can see "Shapovalov, R", "Shapovalov, R.", "Shapovalov, Roman", "Shapovalov, Roman V." etc. in the list. There are databases like DBLP, and Mendeley can build its own index, so the author should be identifiable. If we know the author (and not only how her name appears in the paper), we are able to filter Google Scholar results and retrieve the direct link to the paper hosted on the author's or university site. Finally, it is feasible to create a system where a human does not need to search for papers at all: if I see a reference, I can automatically retrieve the PDF in a few seconds. I want Mendeley to develop in this direction.


Object detection vs. Semantic segmentation

Recently I realized that object class detection and semantic segmentation are two different ways to approach the recognition task. Although they look very similar, the methods vary significantly at the higher level (and sometimes at the lower level too). Let me first state the problem formulations.

Semantic segmentation (or pixel classification) associates one of the pre-defined class labels with each pixel. The input image is divided into regions which correspond to the objects of the scene or to "stuff" (in terms of Heitz and Koller (2008)). In the simplest case pixels are classified w.r.t. their local features, such as colour and/or texture (Shotton et al., 2006). Markov Random Fields can be used to incorporate inter-pixel relations.

Object detection addresses the problem of localization of objects of certain classes. Minimum bounding rectangles (MBRs) of the objects are the ideal output. The simplest approach here is to use a sliding window of varying size and classify the sub-images defined by the window. Usually, neighbouring windows have similar features, so each object is likely to trigger detections in several windows. Since multiple/wrong detections are not desirable, non-maximum suppression (NMS) is used. In the PASCAL VOC contest an object is considered detected if the intersection of the true and found rectangles covers at least half of their union area. In the Marr prize winning paper by Desai et al. (2009) a more intelligent scheme for NMS and incorporation of context is proposed. In a recent paper by Alexe et al. an objectness measure for a sliding window is presented.
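The PASCAL criterion is simple to state in code. A minimal sketch with boxes given as (xmin, ymin, xmax, ymax):

```python
def is_detected(true_box, found_box, threshold=0.5):
    """PASCAL VOC criterion: intersection over union of the two rectangles >= 0.5."""
    ax1, ay1, ax2, ay2 = true_box
    bx1, by1, bx2, by2 = found_box
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union >= threshold

print(is_detected((0, 0, 10, 10), (5, 0, 15, 10)))   # IoU = 50/150 -> False
print(is_detected((0, 0, 10, 10), (2, 0, 10, 10)))   # IoU = 80/100 -> True
```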

In theory, the two problems are almost equivalent. Object detection reduces easily to semantic segmentation: if we have a segmentation output, we just need to retain the object classes (or discard the "stuff" classes) and take the MBRs of the regions. The converse is more difficult. All the stuff turns into the background class, and all the found objects within the rectangles should be segmented; but this is a solvable issue, since foreground extraction techniques like GrabCut can be applied. So there are technical difficulties which can be overcome, and the two problems can be considered equivalent; in practice, however, the approaches are different.

There arise two questions:
1. Which task has more applications? I think we do not generally need to classify background into e.g. ground and sky (unless we are programming an autonomous robot); we are more interested in finding objects. Do we often need to obtain the exact object boundary?
2. Which task is sufficient for the "retrieval" stage of the intelligent vision system in the philosophical sense? I.e. which task is more suitable for solving the global problem of exhaustive scene analysis?

Thoughts?


Max-product optimization methods and software

I hoped I would never post a sorry-for-not-posting message, but here it is (and now I feel like I owe you a long post =). In fact, I have been really busy, particularly with putting my master's thesis down on paper. However, it has a positive outcome: I have systematized my knowledge about inference techniques in MRFs, and I'd like to share it here.

Please don't consider it a survey. While it aims to be quite comprehensive, it is not deep or verbose enough. You are not going to see a lot of formulas, but rather links to papers and implementations, along with recommendations on the applicability of the methods and the software. It could be useful for anybody who wants to quickly try energy minimization on her task, but doesn't want to spend a lot of time digging into the literature and implementing algorithms. So let's call this form of posting an unsurvey. You can always find more details on the topic in Stan Li's book (not to be confused with Stan Lee, who recently appeared in The Big Bang Theory).

The max-product energy minimization problem arises in a number of fields such as statistical physics, social network analysis, and, of course, computer vision, where problems like image denoising, dense stereo reconstruction and semantic segmentation lead to a single discrete optimization problem: maximize the product of some functions defined over the nodes and edges of a graph G=(N,E), typically a grid over image pixels in low-level vision. By taking logarithms, it is equivalent to maximizing the sum

E(Y) = Σi∈N φi(Yi) + Σ(i,j)∈E φij(Yi, Yj),

where each Yi corresponds to a node of the graph and takes on a value from a fixed discrete range. It could be a class label in semantic segmentation, or a disparity value of the pixel in dense stereo. The so-called potential functions φ (unary in the first term and pairwise in the second one) define the plausibility of assigning given class labels to the nodes and edges respectively. They can be conditioned on the data. For example, unary potentials can reflect colour information in the pixel, and pairwise potentials can encourage the assignment of similar labels to the endpoints of an edge, providing smoothness of the resulting labelling. If the graph contains cliques of order 3 or greater, one should consider more terms (as I explained earlier), but in practice higher-order cliques are often ignored. Also, an important particular case, the 4-connected grid, contains only second-order cliques; that's why the research community has focused on this "pairwise" problem.

In general, this maximization problem is NP-hard. That's why in the 1980s it was solved using genetic algorithms or simulated annealing. Fortunately, specific methods have since been developed.

First of all, there are two cases when the exact global optimum can be found in polynomial time. The first is when the graph has a tree structure. Actually, I don't know of an application where this case is useful; it is probably good for the analysis of parse trees in natural language processing. In this case the maximum is found using the dynamic-programming based belief propagation algorithm: one node is selected as the root, messages are passed towards the root and then back from it, and the optimal assignment is read off.
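Here is a toy sketch of max-sum dynamic programming on the simplest tree, a chain: a backward pass of messages followed by forward backtracking (a Viterbi-style recursion). The numbers at the bottom are made up for illustration:

```python
import numpy as np

def max_sum_chain(unary, pairwise):
    """unary: (n_nodes, n_labels); pairwise: (n_labels, n_labels), shared by all edges.
    Returns the labelling maximizing the sum of unary plus pairwise terms on a chain."""
    n, k = unary.shape
    msg = np.zeros((n, k))              # msg[i, l]: best score of nodes i..n-1 given label l at node i
    back = np.zeros((n, k), dtype=int)  # best label of node i+1 given label l at node i
    msg[n - 1] = unary[n - 1]
    for i in range(n - 2, -1, -1):
        scores = pairwise + msg[i + 1]  # scores[l, l_next]
        back[i] = np.argmax(scores, axis=1)
        msg[i] = unary[i] + np.max(scores, axis=1)
    labels = [int(np.argmax(msg[0]))]
    for i in range(1, n):
        labels.append(int(back[i - 1][labels[-1]]))
    return labels

# 3 nodes, 2 labels; the pairwise term rewards equal neighbouring labels
unary = np.array([[2.0, 0.0], [0.0, 0.5], [2.0, 0.0]])
pairwise = np.array([[1.0, 0.0], [0.0, 1.0]])
print(max_sum_chain(unary, pairwise))   # [0, 0, 0]
```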

Another polynomial case allows an arbitrary graph topology, but restricts the number of labels and the form of the pairwise potentials. Only one of two labels can be assigned to each node, and the pairwise potentials should be submodular, i.e. (in this maximization setting) φ(0,0) + φ(1,1) >= φ(0,1) + φ(1,0). The binary assignment means that all the nodes should be divided into two groups according to their labels. This division is found by computing a minimum cut on an auxiliary graph: two fake nodes are added and declared the source and the sink, and both are connected to all the nodes of the "old" graph. Using your favourite min-cut/max-flow algorithm you can separate the nodes into two sets, which gives you the resulting labelling.
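Here is a toy sketch of that construction for the binary case, written in the equivalent energy-minimization form and with a Potts-style pairwise term: unary costs become edges to the source and sink, a penalty λ for disagreeing neighbours becomes a pair of edges between the nodes, and a minimum s-t cut gives the optimal labelling. The example numbers are made up; networkx is used only for the min-cut itself.

```python
import networkx as nx

def binary_graphcut(unary, edges, lam):
    """unary[i] = (cost of label 0, cost of label 1); edges: list of (i, j) pairs;
    lam: penalty when two neighbours take different labels (a submodular pairwise term)."""
    g = nx.DiGraph()
    for i, (c0, c1) in enumerate(unary):
        g.add_edge("s", i, capacity=c1)    # cutting this edge = paying for label 1
        g.add_edge(i, "t", capacity=c0)    # cutting this edge = paying for label 0
    for i, j in edges:
        g.add_edge(i, j, capacity=lam)
        g.add_edge(j, i, capacity=lam)
    cut_value, (source_side, _) = nx.minimum_cut(g, "s", "t")
    labels = [0 if i in source_side else 1 for i in range(len(unary))]
    return labels, cut_value

# 1D "denoising": observations prefer [0, 1, 0, 0], smoothing pulls node 1 back to 0
unary = [(0.0, 2.0), (1.0, 0.0), (0.0, 2.0), (0.0, 2.0)]
edges = [(0, 1), (1, 2), (2, 3)]
print(binary_graphcut(unary, edges, lam=1.5))   # ([0, 0, 0, 0], 1.0)
```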

Unfortunately, the binary labelling problem is not that common either. In practice people want to choose one of multiple labels for each node. This problem is NP-hard, but the iterative procedures of α-expansion and αβ-swap give good approximations with certain theoretical guarantees [Boykov et al., 2001]. During each iteration the min-cut algorithm is run, so there is an important constraint: each projection of the pairwise potentials onto any two labels should be submodular. If this holds for your energy, I recommend using this method, since it is really fast and accurate. The implementation by Olga Veksler is available.
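A schematic sketch of the α-expansion outer loop (my own toy version, not Veksler's implementation): each move lets every node either keep its label or switch to α. The move is solved here by brute force for clarity; in practice it is solved exactly with a min-cut, as described above.

```python
from itertools import product

def best_expansion_move(labels, alpha, energy):
    """Best move in which each node either keeps its label or switches to alpha."""
    best_labels, best_e = list(labels), energy(labels)
    for switches in product([False, True], repeat=len(labels)):
        candidate = [alpha if s else l for l, s in zip(labels, switches)]
        e = energy(candidate)
        if e < best_e:
            best_labels, best_e = candidate, e
    return best_labels, best_e

def alpha_expansion(init_labels, n_labels, energy):
    labels, e = list(init_labels), energy(init_labels)
    improved = True
    while improved:                      # sweep over labels until no move helps
        improved = False
        for alpha in range(n_labels):
            new_labels, new_e = best_expansion_move(labels, alpha, energy)
            if new_e < e:
                labels, e, improved = new_labels, new_e, True
    return labels, e

# toy example: 3 nodes on a chain, 3 labels, Potts pairwise term with weight 0.5
unary = [[0, 2, 2], [2, 2, 0], [0, 2, 2]]
energy = lambda L: (sum(unary[i][l] for i, l in enumerate(L))
                    + sum(0.5 for i in range(len(L) - 1) if L[i] != L[i + 1]))
print(alpha_expansion([0, 0, 0], 3, energy))    # ([0, 2, 0], 1.0)
```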

What should we do if the energy is not submodular and the graph contains cycles? There are methods for this case too, but they are not as fast and typically not as effective. First of all, the stated integer programming problem can be relaxed to linear programming. You can use any LP solver, but it will eventually take days to find the optimum. Moreover, it won't be exact, because the relaxed solution can be rounded back in multiple ways. No surprise here, because the possibility of getting the global optimum would prove that P = NP.

One of the methods for approaching the general problem is Loopy Belief Propagation [Kschischang, 2001]. No doubt, it is really loopy: dynamic programming is used on a graph with loops, and it can end up passing the same messages around a loop endlessly. It is now considered outdated, but you can still try it; it is implemented in libDAI. The input is a factor graph, which is inconvenient in most cases, but really flexible (you can create cliques of arbitrary order).

The state-of-the-art approach in the general case, tree-reweighted message passing (TRW), has been developed by Wainwright and Kolmogorov [Kolmogorov, 2006]. It is based on a decomposition of the original graph into graphs where exact message-passing inference is possible, i.e. trees. Message passing is done iteratively on the trees, and the trees are enforced to agree on the values of shared nodes. The procedure is guaranteed to converge, but unfortunately not always to the global maximum. However, it is better than Loopy BP. Kolmogorov's implementation contains LBP and TRW-S under a common interface, so you can compare their performance. To understand the decomposition technique better, you can read Komodakis's paper [2007], which describes a general decomposition framework.

Once, a company of world-leading MRF researchers gathered, implemented LBP, graph cuts and TRW-S under a common interface, investigated their performance and even shared their code on a Middlebury College page. It would be great to use those implementations interchangeably, but unfortunately the interface is tailored to the stereo problem, i.e. it is impossible to use a general form of the potentials. My latest research is about using MRFs for laser scan classification, where the topology of the graph is far from a rectangular grid and the potentials are more complex than the ones in the Potts model. So I ended up writing my own wrappers.

Another interesting method was re-discovered by Kolmogorov and Rother [2007]. It is known as quadratic pseudo-boolean optimization (QPBO). This graph-cut based method can either assign a label to a node or refuse to classify it. The consistency property is guaranteed: if a label has been assigned, that exact label is assigned in the global optimum. An equivalent solution can be obtained using a specific LP relaxation where 1/2 probabilities for class labels are allowed. I have not used the technique, so I cannot recommend an implementation.

To summarize, the algorithm for choosing an optimization method is the following. If your graph is actually a tree, use belief propagation. Otherwise, if your energy is submodular, use graph-cut based inference; it is both effective and efficient (even exact in the binary case). If not, use TRW-S: it often yields a good approximation.


On Fundamental Artificial Intelligence

Today I attended Mikhail Burtsev's lecture on the evolutionary approach to AI. The lecture was hosted in the Moscow Polytechnic Museum's lecture hall. First of all, the organization was great, and the quality of the presentation was high. Unsurprisingly, the hall was almost totally filled with listeners. I recommend you to attend their lectures.

Mikhail talked a lot about the history and foundations of AI, which was not new to me. He did not bore the audience with the psychology-related stuff, but featured two approaches to AI that have been developed simultaneously ever since the 1950s: artificial neural networks and symbolic computational AI. ANNs attempt to model human brain functioning by means of artificial neurons and axons. The problem here is that even if we build a human-scale artificial brain (which was actually done at the Allen Institute for Brain Science in 2005), it will be isolated from the world and cannot be trained like a real human brain. Symbolic AI uses formal logic to solve problems. It also suffers from certain problems, like the inability to handle uncertainty. So, vanilla AI is in a period of stagnation now (it dates back to the late 1980s, when Rodney Brooks started to criticize it).

Mikhail's point was that most researchers attempt to model human intelligence head-on. He suggested looking at the process of evolution instead: start by understanding how the nervous system of an amoeba operates, and then track its evolution to find out how the human mind works. Evolutionary algorithms might be useful there.

I cannot agree with that point. The last 50 years witnessed the failure of fundamental AI research. Its common problem was trying to model cognitive mechanisms instead of tackling actual problems, while those problems were being solved by other techniques. Consider the autonomous robot control problem (e.g. a robotic vacuum cleaner). It can be formulated in terms of reinforcement learning and solved by methods not connected to classical AI (dynamic programming based, I suppose; see the sketch below). Mikhail, again, does not start from a problem, so I do not really believe in the success of his approach. I am also sceptical of computational methods inspired by biology. They can be a good starting point, but they tend to be supplanted by mathematically founded ones.
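To be concrete about the "dynamic programming based" remark, here is a minimal value-iteration sketch for a toy room-cleaning MDP; the states, actions, rewards and transition model are made up purely for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Value iteration for a finite MDP.

    P: (A, S, S) transition probabilities, R: (S, A) immediate rewards.
    Returns the optimal state values and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new

# two states (0 = dirty, 1 = clean) and two actions (0 = clean, 1 = idle);
# cleaning a dirty room gives reward 1 and makes it clean, nothing else matters
P = np.array([[[0.0, 1.0], [0.0, 1.0]],    # action "clean"
              [[1.0, 0.0], [0.0, 1.0]]])   # action "idle"
R = np.array([[1.0, 0.0],                  # rewards in state "dirty"
              [0.0, 0.0]])                 # rewards in state "clean"
values, policy = value_iteration(P, R)
print(values, policy)                      # the greedy policy cleans when dirty
```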

Another issue raised in the discussion after the lecture was AI safety, which is probably the most popular philosophical question connected to AI. As far as I understood, there are two main concerns: the AI could start to self-replicate (or self-improve) uncontrollably, and it could become self-motivated. If the AI is self-replicating, it is not so dangerous unless it is also self-motivated. If it is motivated by a human, it is bad in the same way a nuclear weapon is: scary only if an evil party possesses it (so nothing brand new here). If the AI becomes motivated to gain world domination, it could be harmful. But where could such self-motivation come from? In my opinion, the only option is the biological drive to propagate (imagine some joint AI system with a biological motivator), but such a disaster has already been overcome once. Do you remember those rabbits in Australia? I hope an uncontrolled AI will not be a bigger problem.

Read Users' Comments (0)

ICVSS 2010

This summer I'm going to take part in the International Computer Vision Summer School. I have already been accepted and registered. I'm really looking forward to this event and hope it will not be too hot in Sicily in July. =)

I am now looking for a room-mate, because accommodation there is quite expensive. If you are a participant too and have the same problem, please feel free to contact me!

One of the activities of the school is a reading group led by Prof. Stefano Soatto. As homework, we need to investigate the reference trees of three given subjects. Moreover, there will be prizes for students who discover papers that cannot be found via standard search engines. I suppose I have an advantage here: Russian science has been isolated from, well, science due to that Iron Curtain thing (and it is still quite isolated, by inertia). So I just need to dig into a library and find something appropriate in old Soviet journals.

Read Users' Comments (2)

Web-scale Content Based Image Retrieval


Do you remember the concept of Computer Blindness? It is about people intuitively expecting from computer vision algorithms results that are unreachable by state-of-the-art methods. Believe it or not, I recently fell for that trick too.

I was looking through Navneet Dalal's slides on Histograms of Oriented Gradients. They contain a lot of frames captured from movies as examples, among them the following one:



It seemed familiar to me. I tried hard to remember the movie, but failed.1 So I decided to check out some web-sites that offer reverse image retrieval.

Sure, first I turned to St. Google. Its similar-images service disappointed me, since it was not able to find images similar to the one I wanted. Actually, it has an indexed base (not a very large one), and one can find similar images only within that base. If one somehow finds an image from the base (e.g. using a text query), (s)he is shown a button that triggers similar-image retrieval.

There are other web-sites for this purpose. TinEye positions itself as a reverse image search engine; however, it failed to find anything and honestly admitted that nothing similar was found. GazoPa found something, but it was different from what I had expected, although the retrieved images were similar in saturation. That was strange, because such methods usually work on greyscale images to be robust to colour-level shifts.

Then I decided to check whether the situation is common and tried a different image, which I found on my desktop:2


I wanted to find the name of its author and the title3. The result was almost the same: TinEye found nothing, and GazoPa found a lot of pictures of different girls.

Why is web-scale CBIR not possible to date? Because there are simply too many images on the web, and it is intractable to perform search over such a large index. The common workflow involves feature extraction followed by feature matching. Hundreds of features can be extracted from each image, and each feature is a high-dimensional vector (e.g. 128-D). Suppose we have N images in the index. If we extract 100 features from each (which is rather a lower bound), then to handle a query picture we have to match its 100 features against the 100·N indexed ones in 128-D space. It is really hard to do this instantly even when N is small. In 128 dimensions, indexing structures like kd-trees do not improve performance over exhaustive search, because branch and bound is unable to prune the search space in practice (approximate methods partly solve the problem). For example, it took 13 hours on 496 processor cores to match 150,000 photos for the Building Rome in a Day project. This also tells us that there are 150K photos of Rome on the web, so try to imagine the total number of pictures!
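To get a feeling for the numbers, here is a small matching sketch with a kd-tree (scipy's cKDTree); the descriptors are random and the counts are tiny compared to the web-scale case, and this is exactly the regime where the kd-tree degenerates towards exhaustive search.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
index_desc = rng.random((1000 * 100, 128))   # 1000 indexed images x 100 features each
query_desc = rng.random((100, 128))          # one query image

tree = cKDTree(index_desc)
dist, idx = tree.query(query_desc, k=1)      # nearest indexed feature per query feature
matched_images = idx // 100                  # which indexed image each match came from
print(np.bincount(matched_images).argmax())  # the image with the most feature matches
```

Even this toy index holds only 10^5 descriptors; multiply by a few orders of magnitude and approximate methods become the only option.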

But there is a certain hope for success in web-scale CBIR. First, when we deal with billions of photos (and trillions of features), kd-trees are likely to give sublinear performance. Second, one could use the web context of an image to weed out some obviously wrong matches.
In the long run, we need to develop more descriptive image representations and learn how to combine textual context with visual content. Using a prior on user interest could also help: the engine may start the search from the most popular photos and ignore the least popular ones. The task could thus be formulated as a sequential test, where the predictor should be able to say whether the found match is good enough or the search should continue (a sketch follows below).
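A sketch of that sequential idea (the scoring function and the acceptance threshold are placeholders; in a real system the score would come from feature matching like above):

```python
def sequential_search(query, candidates, score, threshold):
    """Scan candidates ordered by decreasing popularity and stop early.

    candidates: iterable of (image_id, popularity), sorted by popularity.
    score:      a function (query, image_id) -> similarity in [0, 1].
    """
    best_id, best_score = None, 0.0
    for image_id, _popularity in candidates:
        s = score(query, image_id)
        if s > best_score:
            best_id, best_score = image_id, s
        if best_score >= threshold:          # good enough -> stop the sequential test
            break
    return best_id, best_score

# toy usage with a fake scorer: the search never reaches the unpopular image "c"
print(sequential_search("q", [("a", 900), ("b", 500), ("c", 10)],
                        score=lambda q, i: {"a": 0.2, "b": 0.95, "c": 0.99}[i],
                        threshold=0.9))      # -> ('b', 0.95)
```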

UPD (Apr 7, 2010). Here is a rather broad list of visual image search engines.

1 My first thought was Roman Polanski's "The Pianist", which is surely wrong. If you recognize the movie, please let me know!

2 I saw the picture in the Hermitage Museum and downloaded it after returning from St. Petersburg, because the girl had reminded me of my friend Sveta.

3 It is actually Franz von Lenbach's Portrait of Marion Lenbach, his daughter.

Read Users' Comments (6)