Recently I realized that object class detection and semantic segmentation are the two different ways to solve the recognition task. Although the approaches look very similar, methods vary significantly on the higher level (and sometimes on the lower level too). Let me first state the problem formulations.
Semantic segmentation (or pixel classification) associates one of the pre-defined class labels to each pixel. The input image is divided into the regions, which correspond to the objects of the scene or "stuff" (in terms of Heitz and Koller (2008)). In the simplest case pixels are classified w.r.t. their local features, such as colour and/or texture features (Shotton et al., 2006). Markov Random Fields could be used to incorporate inter-pixel relations.
Object detection addresses the problem of localization of objects of the certain classes. Minimum bounding rectangles (MBRs) of the objects are the ideal output. The simplest approach here is to use a sliding window of varying size and classify sub-images defined by the window. Usually, neighbouring windows have similar features, so each object is likely to be alarmed by several windows. Since multiple/wrong detections are not desirable, non-maximum suppression (NMS) is used. In PASCAL VOC contest an object is considered detected, if the true and found rectangles are intersected on at least half of their union area. In the Marr prize winning paper by Desai et al. (2009) more intelligent scheme for NMS and incorporation of context is proposed. In the recent paper by Alexe the objectness measure for a sliding window is presented.
In theory, the two problems are almost equivalent. Object detection reduces easily to semantic segmentation. If we have a segmentation output, we just need to retain object classes (or discard the "stuff" classes) and take MBRs of regions. The contrary is more difficult. Actually, all the stuff turns into the background class. All the found objects within the rectangles should be segmented, but it is a solvable issue since foreground extraction techniques like GrabCut could be applied. So, there are technical difficulties which could be overcome and the two problems could be considered equivalent, however, in practice the approaches are different.
There arise two questions:
1. Which task has more applications? I think we do not generally need to classify background into e.g. ground and sky (unless we are programming an autonomous robot), we are interested in finding objects more. Do we often need to obtain the exact object boundary?
2. Which task is sufficient for the "retrieval" stage of the intelligent vision system in the philosophical sense? I.e. which task is more suitable for solving the global problem of exhaustive scene analysis?