Recently I have posted about the connection between object category detection and semantic image segmentation. While delving into this problem more deeply, I have found the paper denoted in the title of this post.
The ICCV 2003 paper by Tu et al. was among the three Marr prize winners. And it is really a prominent piece of work! In the introduction to the special Marr prize IJCV issue, Bill Triggs featured the paper as one that "would have been particularly pleasing to David Marr", because of "its bold attack on the central problem of perceptual organization and whole scene understanding". Here is the journal version of the paper.
The authors combined discriminative and generative models, which resulted to the unified framework for image parsing. For each image the corresponding parsing tree could be found. The root of the tree represents the whole scene. On the next level, nodes represent semantic fragments, such as human faces, text, or textured regions. The leaves of the tree correspond to the pixels of the image.
The approach differs drastically from the vanilla CRF framework in the way that the structure of the parsing tree is dynamic while the CRF structure remains constant and just interaction models between the pairs of sites may change. The goal is to obtain the tree that maximizes the posterior probability given the image. The directed search in the space of valid trees is performed by means of Markov chain Monte-Carlo (MCMC). The possible tree changes like split and merge of regions, varying the border between regions, are defined. Such changes are described in terms of learnable Markov chain transition kernels.
What they actually did is they effectively combined top-down and bottom-up approaches. In a nutshell, the bottom-up (discriminative) method generates the hypotheses of pixel labels and object positions using the local neighbourhood, and the top-down generative model is then built making use of those hypotheses. The latest guarantees the consistency of the output and the optimum of its posterior probability.
Okay, what does it mean for the issue of detection and segmentation convergence? Because the shapes of objects are determined implicitly during the recognition, and stuff regions are detected by their colour and texture, the problem of semantic image segmentation is actually solved. Moreover, multi-scale semantic segmentation is performed during the parsing. Therefore, image parsing seems to be the most general formulation of image recognition problem.
The nodes of a parsing tree, which belong to the same depth level, are not connected (right, because it is a tree). This means that spatial interactions could not be modelled directly. A totally different approach was introduced in another Marr prize winning paper by Desai et al . They also use bottom-up tests to generate hypotheses on the object locations and then use CRF over those location to take spatial context into account. They model different kinds of inter-class and intra-class interactions. Thus, they ensure that the objects in the image (described by their bounding boxes) are arranged correctly, e.g. a cup is likely to be on a table (spatial arrangement), while a train could not be close to a ship (mutual exclusion).
It seems that the two papers exploit the two different forms of information (aggregation vs. spatial interactions), which are mutually redundant to a certain extent, but do not completely exhaust each other. It seems that the group of researchers who will successfully combine those approaches will receive some 201x Marr prize.