Last summer, during my internship at CMP in Prague I drew some xkcd-style comic strips, and I want to share some of them here. I considered starting a different blog for that, but there are too few of them, and they do suck (at least in comparison to Randall's work). However, most of them are programming/research related, so they are appropriate here.
I planned to use them when I would have nothing to post about. This is not the case now (I am planning a couple more posts in a week or so), but this one is topical. It was originally drawn in September 1, 2009 when GMail was not operating for a few hours. From the recent news: Google shuts Google Wave down, admitting their failure predicted by some sceptics (see the image title text).
Nearly two weeks ago Indian-American mathematician Vinay Deolalikar, the employee of HP Labs, sent a letter to a group of mathematicians asking them to proof-read his attempt to prove P ≠ NP. The pre-print was eventually distributed over the Internet, so a lot of people became aware. Later two major flaws were found in the proof, so it is now considered wrong. Nevertheless, Vinay has removed the paper from his page and commented that he fixed all the issues and is going to submit a paper to the journal review. You can download the original paper from my web-site [Deolalikar, 2010].
We'll see if he is bluffing or not. In any case, it is very pleasant to me that the main role in the proof is played by... the graphical probabilistic model framework. Yeah, those graphical models we use in computer vision! It was surprising indeed, since the problems in computational complexity are usually discrete and well-posed. So, this post is devoted to understanding why a graphical model emerges in the proof.
First, I am going to review some key points of the proof. It is far from being exhaustive, just some rough sketches that are critical for understanding. Then I report on the flaws found in the proof, and then I end up with a methodological note on the whole P vs. NP problem.
k-SAT and d1RSB Phase
In order to prove P ≠ NP, it is enough to show the absence of a polynomial algorithm for some problem from the class NP, e.g. for some NP-complete problem (which are the "hardest" problems in NP). The boolean satisfiability problem (SAT) addresses the issue of checking if there is at least one assignment of the boolean variables in the formula, represented in conjunctive normal form (CNF), which makes it TRUE. The k-satisfiability problem (k-SAT) is a particular case of SAT, where all the clauses in the CNF have order k. For example, (x˅y)&(¬x˅y)&(x˅¬y)&(¬x˅¬y) is an instance of 2-SAT which has the answer 'NO', since any boolean values of (x, y) makes the formula false. (x˅y˅z)&(¬x˅u˅v)&(¬x˅u˅¬z) is a 3-SAT over 5 variables (x, y, z, u, v), which is satisfiable, e.g. on the input (1,0,0,1,0). 2-SAT is in P, but k-SAT is NP-complete whenever k > 2. The Deolalikar's idea is to show that k-SAT (with k > 8) is outside of P, which (if true) proves that there is a separation between P and NP.
The space of possible k-SAT solutions is analysed. Let m be the number of clauses in the CNF, n be the number of variables. Consider the ratio α = m/n. In the border case m = 1 (α is small), k-SAT is always true, there are usually a lot of inputs that satisfy the CNF. Consider an ensemble of random k-SATs (the following is statistical reasoning). With the growth of α, the probability of the CNF to be satisfiable decreases. When a certain threshold αdis reached, the "true" solutions set breaks into clusters. This stage is known in statistical mechanics as dynamical one step replica symmetry breaking (d1RSB) phase. Moving further, we make the problem completely unsatisfiable. It turns out that d1RSB stage subproblems are the most challenging of the all k-SAT problems. The effect could be observed only if k > 8, that's why such problems are used in the proof. [Deolalikar, 2010, Chapter 6]
FO(PFP) and FO(LFP)
In the finite model theory there are recurrent extensions of the first-order logic. The predicate Pi(x) is evaluated as some second-order function φ(Pi-1, x), where x is a k-element vector, P0(x) ≡ false. In the general case, either there is a fixed point, or is Pi looping. For example, if φ(P, x) ≡ ¬P(x), then P0(x) ≡ false, P1(x) ≡ true, P2(x) ≡ false, etc. Here x is meaningless, but it is not always the case. Consider the following definition: φ(P, x) ≡ max(x) ˅ P(x) // recall x is a vector, in the boolean case 'max' is the same as 'or'. If x contains at least one non-zero element, P0(x) = false, P1(x) = true, P2(x) = true, etc. Otherwise, Pi(0) = false for all i. In the case of looping, let's redefine the fixed point to be constantly false. FO(PFP) is a class of problems of checking if there will be a loop for some x, or a fixed point. They are actually very difficult problems. FO(PFP) is equal to the whole PSPACE. Suppose now that φ is monotonic on P. It means that P(x) only appears in the formula with even number of negations before it (or zero, as in the second example). This means that once Pi(x) is true, Pj(x) will be true for any j > i. This reduces the class to FO(LFP), which is proven to be equal to the class P. So, the global problem now is to show that k-SAT is outside of the class FO(LFP).
ENSP: the graphical model for proving P ≠ NP
So how a graphical model emerges here? Graphical model is a way to describe a multivariate probability distribution, i.e. dependencies of covariates in the distribution. Recall now the definition of NP. If we somehow guess the certificate, we are done (i.e. have a polynomial algorithm). If the space of certificates (possible solutions, in terms of Deolalikar) is simple, we can probably get a polynomial algorithm. What is simple? This means that the distribution is parametrizable with a small amount of parameters (c.f. intrinsic dimensionality), which allows us to traverse the solution space efficiently. This is twofold. First, the distribution is simple if it has a limited support, i.e. all the probability measure is distributed among a limited number of points of the probabilistic space. Second, it is simple if the covariates are as much conditionally independent as possible. In the ideal case, if all the groups of covariates are independent (recall that pairwise and mutual independence do not subsume each other!), we need only n parameters to describe the distribution (n is the number of covariates), while in general case this number is exponential. See [Deolalikar, 2010, p. 6] for examples.
How to measure the degree of conditional independence? Yes, to build a graphical model. It is factorized to cliques according to Hammersley-Clifford theorem. Larger cliques you get, stronger dependency is. When the largest cliques are k-interactions, the distribution can be parametrized with n2k independent parameters. Finally, Vinay shows that FO(LFP) can deal only with the distributions parametrizable with 2poly(log n) parameters, which is not the case for d1RSB k-SAT (its solution space is too complicated).
In order to show it strictly, Deolalikar introduces a tailored graphical model describing LO(LPF) iterative process, the Element-Neighborhood-Stage Product model (ENSP):
There are two types of sites: corresponding to the variables on the stages of LFP (small circles) and corresponding to the elements of witty neighbourhood system (some closure over Gaifman graph; big blue circles). When a variable is assigned 0 or 1, it is painted with red or green respectively. The last column corresponds to the fixed point, all the variables are assigned. Thus, this model is easily factorizable to the small cliques, and it cannot describe the imaginable LFP process for some k-SAT. See [Deolalikar, 2010, Chapter 8] for the details.
So what's the problem?
Neil Immerman, an expert in the finite model theory, noticed two major flaws. The first one is that Vinay actually model only monadic LFP, which is not equal to P, as assumed. He thus proved that there is a gap between NP and some subclass of P, which is not obligatory equals P. The second major flow deals with modelling k-SAT as a logical structure. I do not fully understand this step, but some order-invariance assumptions are wrong. Here is a discussion in Lipton's blog. According to it, the flaws are really fatal.
The third way
It seems that it is really difficult to prove P ≠ NP. The most of scientists think that P = NP is improbable, and that's why the hypothesis P ≠ NP is generally adopted now. But there is one more option: neither P = NP nor P ≠ NP is provable. As every mathematical theory, computational complexity is a set of axioms and statements, which are usually could be proved to be correct or not. But there are sometimes some statements formulated in terms of the theory, which could not. Moreover, according to the first Gödel's incompleteness theorem, in any mathematical theory based on our natural understanding of natural numbers, there is either a statement that could be proved both true or false, or an unprovable one. I know no actual reasons why this could not be the case for P ≠ NP (although I have feelings there are some since it is not usually taken into account).
Suppose it is really so. This would mean that the whole theory should be reconsidered. Some axioms or definitions will change to make this fact provable. But may be anything else will become unprovable. Who knows...
Recently I have posted about the connection between object category detection and semantic image segmentation. While delving into this problem more deeply, I have found the paper denoted in the title of this post.
The ICCV 2003 paper by Tu et al. was among the three Marr prize winners. And it is really a prominent piece of work! In the introduction to the special Marr prize IJCV issue, Bill Triggs featured the paper as one that "would have been particularly pleasing to David Marr", because of "its bold attack on the central problem of perceptual organization and whole scene understanding". Here is the journal version of the paper.
The authors combined discriminative and generative models, which resulted to the unified framework for image parsing. For each image the corresponding parsing tree could be found. The root of the tree represents the whole scene. On the next level, nodes represent semantic fragments, such as human faces, text, or textured regions. The leaves of the tree correspond to the pixels of the image.
The approach differs drastically from the vanilla CRF framework in the way that the structure of the parsing tree is dynamic while the CRF structure remains constant and just interaction models between the pairs of sites may change. The goal is to obtain the tree that maximizes the posterior probability given the image. The directed search in the space of valid trees is performed by means of Markov chain Monte-Carlo (MCMC). The possible tree changes like split and merge of regions, varying the border between regions, are defined. Such changes are described in terms of learnable Markov chain transition kernels.
What they actually did is they effectively combined top-down and bottom-up approaches. In a nutshell, the bottom-up (discriminative) method generates the hypotheses of pixel labels and object positions using the local neighbourhood, and the top-down generative model is then built making use of those hypotheses. The latest guarantees the consistency of the output and the optimum of its posterior probability.
Okay, what does it mean for the issue of detection and segmentation convergence? Because the shapes of objects are determined implicitly during the recognition, and stuff regions are detected by their colour and texture, the problem of semantic image segmentation is actually solved. Moreover, multi-scale semantic segmentation is performed during the parsing. Therefore, image parsing seems to be the most general formulation of image recognition problem.
The nodes of a parsing tree, which belong to the same depth level, are not connected (right, because it is a tree). This means that spatial interactions could not be modelled directly. A totally different approach was introduced in another Marr prize winning paper by Desai et al . They also use bottom-up tests to generate hypotheses on the object locations and then use CRF over those location to take spatial context into account. They model different kinds of inter-class and intra-class interactions. Thus, they ensure that the objects in the image (described by their bounding boxes) are arranged correctly, e.g. a cup is likely to be on a table (spatial arrangement), while a train could not be close to a ship (mutual exclusion).
It seems that the two papers exploit the two different forms of information (aggregation vs. spatial interactions), which are mutually redundant to a certain extent, but do not completely exhaust each other. It seems that the group of researchers who will successfully combine those approaches will receive some 201x Marr prize.