Object detection vs. Semantic segmentation

Recently I realized that object class detection and semantic segmentation are two different ways to solve the recognition task. Although the two look very similar, the methods differ significantly at the higher level (and sometimes at the lower level too). Let me first state the problem formulations.

Semantic segmentation (or pixel classification) assigns one of the pre-defined class labels to each pixel. The input image is divided into regions, which correspond to the objects of the scene or to "stuff" (in terms of Heitz and Koller (2008)). In the simplest case, pixels are classified w.r.t. their local features, such as colour and/or texture features (Shotton et al., 2006). Markov Random Fields can be used to incorporate inter-pixel relations.
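To make "classifying pixels by local features" concrete, here is a toy sketch that labels each pixel with the class whose mean colour is nearest. This is my own illustration, not the method of Shotton et al.; the function name `classify_pixels` and the class means are hypothetical, and a real system would use richer features and a learned classifier.

```python
from typing import Dict, List, Tuple

Colour = Tuple[float, float, float]  # (R, G, B)


def classify_pixels(image: List[List[Colour]],
                    class_means: Dict[str, Colour]) -> List[List[str]]:
    """Label every pixel with the class whose mean colour is closest.

    `class_means` is a hypothetical, hand-specified model: one mean
    colour per class. Each pixel is classified independently, so no
    inter-pixel relations are used here.
    """
    def nearest(pixel: Colour) -> str:
        # Squared Euclidean distance in colour space.
        return min(
            class_means,
            key=lambda name: sum((p - m) ** 2
                                 for p, m in zip(pixel, class_means[name])),
        )

    return [[nearest(pixel) for pixel in row] for row in image]
```

Since every pixel is labelled on its own, the output tends to be noisy; that is the gap a Markov Random Field is meant to fill by encouraging neighbouring pixels to agree.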

Object detection addresses the problem of localizing objects of certain classes. Minimum bounding rectangles (MBRs) of the objects are the ideal output. The simplest approach here is to use a sliding window of varying size and classify the sub-images defined by the window. Neighbouring windows usually have similar features, so each object is likely to trigger several windows. Since multiple or wrong detections are not desirable, non-maximum suppression (NMS) is used. In the PASCAL VOC contest an object is considered detected if the area of intersection of the true and found rectangles is at least half the area of their union. The Marr Prize winning paper by Desai et al. (2009) proposes a more intelligent scheme for NMS and for incorporating context. A recent paper by Alexe presents an objectness measure for a sliding window.
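The overlap criterion and the basic idea of NMS can be sketched in a few lines. This is a minimal illustration under my own assumptions: boxes are `(x_min, y_min, x_max, y_max)` tuples, the names `iou`, `is_detected` and `greedy_nms` are hypothetical, and the suppression is the plain greedy variant, not the scheme of Desai et al.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Area of intersection divided by area of union of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_detected(truth: Box, found: Box) -> bool:
    """Overlap test as stated in the text: intersection is at least
    half of the union."""
    return iou(truth, found) >= 0.5


def greedy_nms(boxes: List[Box], scores: List[float],
               overlap: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes from the highest score down and keep a
    box only if it overlaps every already kept box by less than
    `overlap`. Returns indices into `boxes`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < overlap for k in kept):
            kept.append(i)
    return kept
```

For example, two windows shifted by one pixel over the same object overlap heavily, so only the higher scoring one survives, while a window over a distant object is untouched. The 0.5 default for `overlap` is an arbitrary choice for the sketch.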

In theory, the two problems are almost equivalent. Object detection reduces easily to semantic segmentation: if we have a segmentation output, we just need to retain the object classes (i.e. discard the "stuff" classes) and take the MBRs of the regions. The converse is more difficult. First, all the stuff turns into a single background class. Second, the objects found within the rectangles still have to be segmented, but this is solvable, since foreground extraction techniques like GrabCut can be applied. So there are technical difficulties, but they can be overcome, and the two problems can be considered equivalent. In practice, however, the approaches are different.
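The easy direction (segmentation output to bounding rectangles) can be sketched as follows. The assumptions are mine: the segmentation is a 2D list of integer class ids, regions are 4-connected, and `boxes_from_label_map` is a hypothetical helper name. Note the caveat that two touching objects of the same class merge into one region, and hence into one rectangle.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# (row_min, col_min, row_max, col_max), all inclusive
Box = Tuple[int, int, int, int]


def boxes_from_label_map(labels: List[List[int]],
                         object_classes: Set[int]) -> Dict[int, List[Box]]:
    """Return the MBR of every 4-connected region whose class id is in
    `object_classes`; pixels of all other ("stuff") classes are ignored."""
    rows = len(labels)
    cols = len(labels[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    result: Dict[int, List[Box]] = {}
    for r in range(rows):
        for c in range(cols):
            cls = labels[r][c]
            if seen[r][c] or cls not in object_classes:
                continue
            # Breadth-first flood fill over one region, tracking its extent.
            seen[r][c] = True
            queue = deque([(r, c)])
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and labels[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            result.setdefault(cls, []).append((r0, c0, r1, c1))
    return result
```

The other direction would need a foreground extraction step inside each rectangle and has no comparably short sketch, which is one way to see why the two reductions are not symmetric in practice.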

Two questions arise:
1. Which task has more applications? I think we do not generally need to classify the background into e.g. ground and sky (unless we are programming an autonomous robot); we are more interested in finding objects. Do we often need to obtain the exact object boundary?
2. Which task is sufficient for the "retrieval" stage of an intelligent vision system in the philosophical sense? I.e. which task is more suitable for solving the global problem of exhaustive scene analysis?

Thoughts?


4 Responses to "Object detection vs. Semantic segmentation"

  1. hr0nix says:
    23 June 2010 at 00:19

    "Semantic segmentation reduces easily to object detection" means "semantic segmentation can be solved if you have access to an oracle for the object detection task". You meant the converse here.

    http://en.wikipedia.org/wiki/Reduction_(complexity)

  2. Roman V. Shapovalov says:
    27 June 2010 at 11:30

    Fixed.
    BTW, thank you for the feedback!

  3. 승완 says:
    3 August 2010 at 16:50

    I think that, for a robot or a baby, 'semantic segmentation' would be used first, for learning objects, and 'object detection' would then be used for recognizing them. What do you think?
    I'm a student from Korea.

  4. Roman V. Shapovalov says:
    16 August 2010 at 14:29

    I see your point. During learning, the background clutter adds noise to the learned distribution if a bounding box is used. It might not be a problem at the detection stage if, e.g., a bag-of-features approach is used, because it is enough to match only a fairly small fraction of the object's features. Fair enough. But I don't think it is always true. If we want to use any shape cues, we definitely need to segment objects during detection. Also, the rule of thumb in ML is that the distribution of classes should be the same at training and test time.

    From the neuroscientific point of view, borders are detected at early stages, even before the brain, in the retina. That's why this information is critical for detection in the case of humans.

    You might also check out my follow-up post on this topic: http://computerblindness.blogspot.com/2010/08/image-parsing-unifying-segmentation.html
