Max-product optimization methods and software

I hoped I would never have to post a sorry-for-not-posting message, but here I am (and now I feel like I owe you a long post =). I have in fact been really busy, particularly with getting my master's thesis down on paper. There is a positive outcome, though: I have systematized my knowledge about inference techniques in MRFs, and I'd like to share it here.

Please don't consider it a survey. While it aims to be fairly comprehensive, it is neither deep nor verbose. You are not going to see a lot of formulas, but rather links to papers and implementations, along with recommendations on the applicability of the methods and the software. It could be useful for anybody who wants to quickly try energy minimization on her task, but doesn't want to spend a lot of time digging into the literature and implementing algorithms. So let's call this form of posting an unsurvey. You can always find more details on the topic in Stan Li's book (not to be confused with Stan Lee, who recently appeared in The Big Bang Theory).

The max-product optimization problem, also known as energy minimization (maximizing a product of non-negative factors is the same as minimizing the sum of their negated logarithms), arises in a number of fields such as statistical physics, social network analysis, and, of course, computer vision, where problems like image denoising, dense stereo reconstruction, and semantic segmentation all lead to the same discrete optimization problem: maximize the product of some functions defined over the nodes and edges of a graph G = (N, E), typically a grid over image pixels in low-level vision. By taking logarithms, it is equivalent to maximizing the sum

Σ_{i ∈ N} φ_i(Yi) + Σ_{(i,j) ∈ E} φ_{ij}(Yi, Yj),
where each Yi corresponds to a node of the graph and takes a value from a fixed discrete range. It could be a class label in semantic segmentation, or a disparity value of a pixel in dense stereo. The so-called potential functions φ (unary in the first term and pairwise in the second) define the possibility of assigning the given class labels to the nodes and edges respectively. They can be conditioned on the data: for example, unary potentials could reflect the colour information of a pixel, and pairwise potentials could encourage assigning similar labels to the two ends of an edge, making the resulting labelling smooth. If the graph contains cliques of order 3 or greater, one should consider more terms (as I explained earlier), but in practice higher-order cliques are often ignored. Also, an important particular case, the 4-connected grid, contains only second-order cliques, which is why the research community has focused on this "pairwise" problem.
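
To make the notation concrete, here is a minimal Python sketch that evaluates this objective for a given labelling. The container layout (unary, pairwise, edges) is my own made-up convention, not any particular library's interface:

    def energy(labels, unary, pairwise, edges):
        """Log-domain objective: the sum of unary and pairwise potentials.

        labels:   dict node -> label
        unary:    dict node -> dict label -> phi_i(label)
        pairwise: dict (i, j) -> dict (label_i, label_j) -> phi_ij value
        edges:    list of (i, j) pairs, the edge set E
        """
        total = sum(unary[i][labels[i]] for i in labels)
        total += sum(pairwise[(i, j)][(labels[i], labels[j])] for (i, j) in edges)
        return total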

In general, this maximization problem is NP-hard. That's why in the 1980s it was attacked with general-purpose techniques like genetic algorithms and simulated annealing. Fortunately, more specialized and efficient methods have been developed since.

First of all, there are two cases where the exact global optimum can be found in polynomial time. The first is when the graph has a tree structure. Honestly, I don't know of a vision application where this case is useful; it is probably good for the analysis of parse trees in natural language processing. In this case the maximum is found by the dynamic-programming-based belief propagation algorithm: one node is selected as the root, messages are passed from the leaves to the root and then back from the root to the leaves, and the optimal assignment is read off.
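
For intuition, here is a bare-bones max-sum dynamic program on a chain, the simplest tree, written with numpy. The array conventions are my own assumptions, not the interface of any BP library:

    import numpy as np

    def chain_map(unary, pairwise):
        """unary: (n, k) log-potentials; pairwise: (k, k) table shared by all
        edges. Returns the labelling maximizing the total potential."""
        n, k = unary.shape
        msg = np.zeros((n, k))            # best score of the prefix ending at node i
        back = np.zeros((n, k), dtype=int)
        msg[0] = unary[0]
        for i in range(1, n):
            # for every label at node i, maximize over the label at node i - 1
            scores = msg[i - 1][:, None] + pairwise   # (prev label) x (cur label)
            back[i] = scores.argmax(axis=0)
            msg[i] = scores.max(axis=0) + unary[i]
        labels = np.empty(n, dtype=int)
        labels[-1] = msg[-1].argmax()
        for i in range(n - 1, 0, -1):     # backtrack from the root to the leaves
            labels[i - 1] = back[i, labels[i]]
        return labels

On a general tree the same two passes work, with messages gathered from all children instead of a single predecessor.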

Another polynomial case allows an arbitrary graph topology, but restricts the number of labels and the form of the pairwise potentials. Only one of two labels can be assigned to each node, and the pairwise potentials should be submodular, i.e. φ(0,0) + φ(1,1) >= φ(0,1) + φ(1,0). The binary assignment means that all the nodes should be divided into two groups according to their labels. This division is found by computing a minimum cut on an auxiliary graph: two extra nodes are added and declared the source and the sink, both adjacent to all the nodes of the original graph. Running your favourite min-cut/max-flow algorithm then separates the nodes into two sets, which gives you the resulting labelling.
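
Here is a hedged sketch of this construction with networkx, written for minimization of negated log-potentials (which is equivalent); all names are illustrative:

    import networkx as nx

    def add_cap(G, u, v, c):
        """Accumulate a non-negative capacity c on the directed edge (u, v)."""
        if c <= 0:
            return
        if G.has_edge(u, v):
            G[u][v]["capacity"] += c
        else:
            G.add_edge(u, v, capacity=c)

    def binary_mincut(unary, pairwise):
        """Exact minimization of a binary submodular energy via min-cut.

        unary:    dict i -> (cost of label 0, cost of label 1)
        pairwise: dict (i, j) -> 2x2 cost table t (t[a][b] = cost of labels a, b)
                  with t[0][0] + t[1][1] <= t[0][1] + t[1][0]
        Costs are negated log-potentials. Returns dict i -> 0/1.
        """
        G = nx.DiGraph()
        G.add_nodes_from(["s", "t"])
        # net extra cost of label 1 at each node, after reparameterization
        delta = {i: c1 - c0 for i, (c0, c1) in unary.items()}
        for (i, j), t in pairwise.items():
            delta[i] += t[1][0] - t[0][0]
            delta[j] += t[1][1] - t[1][0]
            add_cap(G, i, j, t[0][1] + t[1][0] - t[0][0] - t[1][1])
        for i, d in delta.items():
            if d >= 0:
                add_cap(G, "s", i, d)   # paid iff i ends on the sink side (label 1)
            else:
                add_cap(G, i, "t", -d)  # paid iff i ends on the source side (label 0)
        _, (_, sink_side) = nx.minimum_cut(G, "s", "t")
        return {i: int(i in sink_side) for i in unary}

For production use you would want a specialized solver such as the Boykov-Kolmogorov max-flow code; networkx is only convenient for small experiments.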

Unfortunately, the binary labelling problem is not that common either. In practice people want to choose one of multiple labels for each node. This problem is NP-hard, but the iterative procedures of α-expansion and αβ-swap give good approximations with certain theoretical guarantees [Boykov et al., 2001]. During each iteration a min-cut algorithm is run, so there is an important constraint: each projection of the pairwise potentials onto any two labels should be submodular. If this holds for your energy, I recommend using this method, since it is really fast and accurate. The implementation by Olga Veksler is available.
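
To show how the pieces fit together, here is a sketch of the α-expansion loop for the Potts model (pairwise cost λ when the endpoints of an edge disagree), reusing binary_mincut from the previous snippet; the sweep count and data layout are my own choices, not Veksler's interface:

    import numpy as np

    def alpha_expansion(unary, edges, lam, n_labels, n_sweeps=5):
        """unary: dict i -> list of per-label costs (negated log-potentials);
        edges: list of (i, j); lam: Potts disagreement penalty."""
        labels = {i: int(np.argmin(c)) for i, c in unary.items()}  # greedy init
        for _ in range(n_sweeps):
            changed = False
            for alpha in range(n_labels):
                # binary move: x_i = 0 keeps the current label, x_i = 1 switches
                u = {i: (unary[i][labels[i]], unary[i][alpha]) for i in unary}
                p = {}
                for (i, j) in edges:
                    a, b = labels[i], labels[j]
                    p[(i, j)] = [[lam * (a != b), lam * (a != alpha)],
                                 [lam * (b != alpha), 0.0]]
                move = binary_mincut(u, p)  # exact, since the move energy is submodular
                for i, x in move.items():
                    if x and labels[i] != alpha:
                        labels[i] = alpha
                        changed = True
            if not changed:
                break   # no expansion move improves the energy any more
        return labels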

What should we do if the energy is not submodular and the graph contains cycles? There are methods for this case too, but they are not as fast and typically not as effective. First of all, the stated integer programming problem can be relaxed to linear programming. You can feed it to any LP solver, but it may take days to find the optimum, and the answer won't be exact, because the relaxed solution can be rounded back to an integer labelling in multiple ways. No surprise here: a guaranteed way of getting the global optimum would prove that P = NP.
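
As an illustration, here is a toy local-polytope LP for two nodes, two labels and one edge, solved with scipy; the potentials are made up:

    import numpy as np
    from scipy.optimize import linprog

    # Variables: mu_i(l) per node/label and mu_01(l, l') for the edge, ordered
    # [mu_0(0), mu_0(1), mu_1(0), mu_1(1),
    #  mu_01(0,0), mu_01(0,1), mu_01(1,0), mu_01(1,1)]
    theta_0 = [1.0, 0.0]             # unary potentials of node 0
    theta_1 = [0.0, 1.0]             # unary potentials of node 1
    theta_01 = [2.0, 0.0, 0.0, 2.0]  # pairwise, row-major: rewards equal labels

    c = -np.array(theta_0 + theta_1 + theta_01)  # linprog minimizes
    A_eq = np.array([
        [1, 1, 0, 0,  0,  0,  0,  0],   # mu_0 sums to 1
        [0, 0, 1, 1,  0,  0,  0,  0],   # mu_1 sums to 1
        [1, 0, 0, 0, -1, -1,  0,  0],   # sum_l' mu_01(0, l') = mu_0(0)
        [0, 1, 0, 0,  0,  0, -1, -1],   # sum_l' mu_01(1, l') = mu_0(1)
        [0, 0, 1, 0, -1,  0, -1,  0],   # sum_l  mu_01(l, 0) = mu_1(0)
        [0, 0, 0, 1,  0, -1,  0, -1],   # sum_l  mu_01(l, 1) = mu_1(1)
    ])
    b_eq = np.array([1, 1, 0, 0, 0, 0])

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
    print(res.x)  # fractional values are where rounding would be needed

Whenever the solver returns fractional values, some rounding scheme is needed, and different schemes generally produce different labellings.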

One of the methods for approaching the general problem is loopy belief propagation [Kschischang et al., 2001]. No doubt, it is really loopy: dynamic programming is applied to a graph with cycles, so messages can circulate around a loop endlessly and convergence is not guaranteed. It is now considered outdated, but you can still try it; it is implemented in libDAI. The input is a factor graph, which is inconvenient in most cases but really flexible (you can create cliques of arbitrary order).
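
For reference, here is a bare-bones loopy max-sum sketch on a pairwise graph (log domain, so products become sums) with a simple sequential schedule; the conventions are mine, not libDAI's factor-graph interface:

    import numpy as np

    def loopy_max_sum(unary, pairwise, edges, n_iters=50):
        """Approximate MAP by loopy max-sum message passing.

        unary:    (n, k) array of log-potentials
        pairwise: dict (i, j) -> (k, k) array of log-potentials
        edges:    list of (i, j) pairs
        No convergence guarantee on graphs with cycles.
        """
        n, k = unary.shape
        msgs = {(i, j): np.zeros(k) for (i, j) in edges}
        msgs.update({(j, i): np.zeros(k) for (i, j) in edges})
        for _ in range(n_iters):
            for (i, j) in list(msgs):
                pot = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
                # all messages flowing into i, except the one coming back from j
                incoming = sum((m for (a, b), m in msgs.items()
                                if b == i and a != j), np.zeros(k))
                new = (unary[i][:, None] + incoming[:, None] + pot).max(axis=0)
                msgs[(i, j)] = new - new.max()  # normalize to keep values bounded
        beliefs = np.array([unary[i] + sum((m for (a, b), m in msgs.items()
                                            if b == i), np.zeros(k))
                            for i in range(n)])
        return beliefs.argmax(axis=1)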

The state-of-the-art approach in the general case, tree-reweighted message passing (TRW), has been developed by Wainwright and Kolmogorov [Kolmogorov, 2006]. It is based on decomposing the original graph into subgraphs where exact message-passing inference is possible, i.e. trees. Message passing is performed iteratively on the trees, which are pushed to agree on the values of the shared nodes. The sequential procedure (TRW-S) is guaranteed to converge, but unfortunately not always to the global maximum. Still, it is better than loopy BP. Kolmogorov's implementation contains LBP and TRW-S under a common interface, so you can compare their performance. To understand the decomposition technique better, you can read Komodakis's paper [Komodakis, 2007], which describes a general dual-decomposition framework.
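
The decomposition idea itself fits in a toy script. The sketch below splits a 2x2 grid into horizontal and vertical chains, solves each chain exactly by brute force, and takes subgradient steps on the dual until the two copies agree; the problem size, random potentials, and step size are all made-up assumptions:

    import numpy as np

    # Nodes 0,1 form the top row, 2,3 the bottom row. The horizontal copy owns
    # edges (0,1), (2,3); the vertical copy owns (0,2), (1,3). We minimize
    # costs (negated log-potentials).
    k = 3                                   # number of labels (arbitrary)
    rng = np.random.default_rng(0)
    unary = rng.normal(size=(4, k))         # per-node label costs
    pot = rng.normal(size=(k, k))           # one pairwise table for all edges

    def solve_pair(ui, uj, pot):
        """Exact minimum of a two-node chain by brute force over label pairs."""
        scores = ui[:, None] + uj[None, :] + pot
        return np.unravel_index(scores.argmin(), scores.shape)

    lam = np.zeros((4, k))                  # dual variables, one vector per node
    step = 0.1
    for _ in range(200):
        # the horizontal copy sees unary/2 + lam, the vertical copy unary/2 - lam
        uh, uv = unary / 2 + lam, unary / 2 - lam
        h, v = {}, {}
        h[0], h[1] = solve_pair(uh[0], uh[1], pot)
        h[2], h[3] = solve_pair(uh[2], uh[3], pot)
        v[0], v[2] = solve_pair(uv[0], uv[2], pot)
        v[1], v[3] = solve_pair(uv[1], uv[3], pot)
        if all(h[i] == v[i] for i in range(4)):
            break                           # the copies agree: certified optimum
        for i in range(4):                  # subgradient step on the dual
            lam[i, h[i]] += step            # the horizontal copy pays more for its pick
            lam[i, v[i]] -= step            # the vertical copy (seeing -lam) likewise

    # if the loop ran out without agreement, this is only a heuristic solution
    print({i: int(h[i]) for i in range(4)})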

At some point a group of the world's leading MRF researchers got together, implemented LBP, graph cuts and TRW-S under a common interface, investigated their performance, and even shared the code on a Middlebury College page. It would be great to use those implementations interchangeably, but unfortunately the interface is tailored to the stereo problem, i.e. it is impossible to use a general form of potentials. My latest research is on using MRFs for laser scan classification, where the topology of the graph is far from a rectangular grid and the potentials are more complex than the ones in the Potts model, so I ended up writing my own wrappers.

Another interesting method was re-discovered by Kolmogorov and Rother [2007]. It is known as quadratic pseudo-boolean optimization (QPBO). This graph-cut-based method can either assign a label to a node or decline to classify it. A persistency property is guaranteed: if a label has been assigned, the same label is assigned in a global optimum. An equivalent solution can be obtained using a specific LP relaxation in which probabilities of 1/2 for class labels are allowed. I have not used the technique myself, so I cannot recommend an implementation.

To summarize, the algorithm for choosing an optimization method is as follows. If your graph is actually a tree, use belief propagation. Otherwise, if your energy is submodular, use graph-cut-based inference; it is both effective and efficient (and even exact in the binary case). If not, use TRW-S; it often yields a good approximation.
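
The same recipe as a tiny dispatch helper; the boolean arguments are placeholders for whatever checks fit your application:

    def choose_method(is_tree, is_binary, is_submodular):
        """A sketch of the decision procedure above; returns a method name."""
        if is_tree:
            return "belief propagation (exact)"
        if is_submodular:
            return "min-cut (exact)" if is_binary else "alpha-expansion"
        return "TRW-S (or loopy BP / dual decomposition)"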


5 Responses to "Max-product optimization methods and software"

  1. hr0nix says:
    1 June 2010 at 00:30

    >> So-called potential functions φ (unary in the first term and pairwise in the second one) define likelihood of the assignment given class labels to the nodes and edges correspondingly. They could be conditioned by the data.

    We may have a misunderstanding here. As I have always thought, potential functions correspond to the factors of the joint probability density function, not to some "likelihood". Besides, a likelihood function appears when the data is conditioned on something, not when something is conditioned on the data.

  2. Unknown says:
    2 June 2010 at 02:42

    You are right: in theory the potential functions correspond to the factors, but in practice the only constraint is non-negativity (before taking logarithms; exp(φ) > 0 in my notation), so they can even be greater than one. Probabilistic reasoning is desirable, but not always practical.
    You are also right about the likelihood: it is not a likelihood in the mathematical sense. Since I am only trying to give the intuition, I have replaced it with the neutral "possibility".

  3. Yaroslav Bulatov says:
    29 August 2011 at 13:24

    Has anyone in vision tried semidefinite programming for finding the lowest-energy labeling? SDP relaxation is more powerful than LP relaxation, and it's straightforward to set up given access to an SDP solver like CVXOPT -- http://mathematica-bits.blogspot.com/2011/03/semidefinite-programming-in-mathematica.html

  4. Unknown says:
    30 August 2011 at 02:36

    I have not seen anything like this so far.
    From a quick look at the Wikipedia article, it is not clear to me how to use it.
    LP relaxation is not generally used directly, since it is slow, but methods like TRW and submodular decomposition optimize the same function.
    So what do you think: is it enough to formulate the problem in terms of SDP, or do we also need to invent new optimization techniques?

  5. Yaroslav Bulatov says:
    9 September 2011 at 03:11

    The best explanation of SDP for integer programming is Chapter 6 of http://www.designofapproxalgs.com/book.pdf

    Maybe you would need a specialized SDP solver for this to be practical. For instance, Weinberger et al., in "Distance Metric Learning for Large Margin Nearest Neighbor Classification", developed their own SDP optimizer. Also, I was recommended the following algorithm for cheaply solving SDPs approximately -- http://www.cs.berkeley.edu/~satishr/cs294/scribe/2-17.pdf
