Max-product optimization methods and software

I hoped I would never post a sorry-for-not-posting message, but I do (and now I feel like I owe you a long post =). In fact, I have been being really busy, particularly with putting my masters thesis down to paper. However, it has a positive outcome: I systematized my knowledge about inference techniques in MRFs, and I'd like to share them here.

Please don't consider it a survey. While it claims to be quite comprehensive, it is not enough deep and verbose. You are not gonna see a lot of formulas, but the links to papers and implementations along with the recommendations on applicability of the methods and the software. It could be useful for anybody who wants to quickly try energy minimization in her task, but don't wanna spend a lot of time digging into the literature and implementing algorithms. So let's call this form of posting unsurvey. You can always find more details on the topic in Stan Li's book (not to be confused with Stan Lee recently appeared in The Big Bang Theory).

Max-product energy minimization problem arises in a number of fields such as statistical physics, social network analysis, and, of course, computer vision, where the problems like image denoising, dense stereo reconstruction, semantic segmentation lead to the single discrete optimization problem: maximize the product of some functions defined over nodes and edges of the graph G=(N,E), typically a grid over image pixels in low-level vision. Due to using logarithms, it is equivalent to maximizing the sum


where each Yi corresponds to a node of the graph and takes on a value from a fixed discrete range. It could be a class label in semantic segmentation, or a disparity value in the pixel in dense stereo. So-called potential functions φ (unary in the first term and pairwise in the second one) define possibility of the assignment given class labels to the nodes and edges correspondingly. They could be conditioned by the data. For example, unary potentials could reflect colour information in the pixel, and pairwise potentials could enforce assignment of the similar labels to the ends of an edge to provide smoothness in the resulting labelling. If the graph contains cliques of the order 3 and greater, one should consider more terms (as I explained earlier), but in practice higher order cliques are often ignored. Also, an important particular case, 4-connected grid, contains only second order cliques, that's why the research community focused on this "pairwise" problem.

In general, this maximization problem is NP-hard. That's why in the 1980s they solved it using genetic algorithms or simulated annealing. Fortunately, specific methods have been developed.

First of all, there are two cases when the exact global optimum could be found in polynomial time. The first is when the graph has a tree structure. Actually, I don't know application where this case is useful. Probably, it is good for analysis of parsing trees in natural language processing. In this case the maximum is found using the dynamic programming based belief propagation algorithm. One node is selected as the root, and messages are passed from the root and then back to the root. The optimal assignment is thus found.

Another polynomial case allows an arbitrary graph topology, but restricts the number of labels and the form of pairwise potentials. Only one of the two labels could be assigned to each node. Pairwise potentials should be submodular, i.e. φ(0,0) + φ(1,1) >= φ(0,1) + φ(1,0). Thus, the binary assignment implies that all the nodes should be divided into the two groups according to their labels. This division is found using the algorithm for finding minimum cut on the supplementary graph. Two fake nodes are added to the graph and declared as the source and the sink. Both fake nodes are adjacent to all the nodes of the "old" graph. Using your favourite min-cut/max-flow algorithm you can separate the nodes into two sets, which gives you the resulting labelling.

Unfortunately, the binary labelling problem is not so common too. In practice people want to choose one of the multiple labels for the nodes. This problem is NP-hard, but the iterative procedures of α-expansion and αβ-swap give good approximations with certain theoretical guarantees. [Boykov et al., 2001] During each iteration the min-cut algorithm is run, so there is an important constraint. Each projection to any two variables of pairwise potentials should be submodular. If it is hold for your energy, I recommend you to use this method since it is really fast and accurate. The implementation by Olga Veksler is available.

What should we do if the energy is not submodular and the graph contains cycles? There are some methods too, but they are not so fast and typically not so effective. First of all, the stated integer programming problem could be relaxed to linear programming. You can use any LP solver, but it will eventually take days to find the optimum. Moreover, it won't be exact, because the relaxed problem solution could be rounded back in multiple ways. No surprise here, because the possibility of getting the global optimum would prove that P = NP.

One of the methods for approaching the general problem is Loopy Belief Propagation [Kschischang, 2001]. No doubt, it is really loopy. Dynamic programming is used on the graph with loops, and it can eventually fall into the endless loop passing some messages along the loop. It is now considered outdated, but you can still try it, it is implemented in libDAI. The input is a factor graph, which is inconvenient in most cases, but really flexible (you could create cliques of arbitrary order).

State-of-the-art approach in general case, tree-reweighted message passing (TRW), has been developed by Wainwright and Kolmogorov. [Kolmogorov, 2006] It is based on the decomposition of the original graph to some graphs where exact message-passing inference is possible, i.e. trees. Message passing is done iteratively on the trees, and trees are enforced to assign the same values in particular nodes. The procedure is guaranteed to converge, but unfortunately not always to the global maximum. However, it is better than Loopy BP. Kolmogorov's implementation contains LBP and TRW-S under a common interface, so you can compare their performance. To understand the decomposition technique better, you could read Komodakis's paper [2007]. It describes a general decomposition framework.

Once a company of the world-leading MRF researchers gathered and implemented LBP, graph-cuts and TRW-S under common interface, investigated performance and even shared their code on a Middlebury college page. It would be great to use those implementations interchangeably, but unfortunately the interface is tailored for the stereo problem, i.e. it is impossible to use the general form of potentials. My latest research is about usage MRFs for laser scan classification, where the topology of the graph is far from a rectangular grid, and the potentials are more complex than the ones in the Potts model. So I ended up in writing my own wrappers.

Another interesting method was re-discovered by Kolmogorov and Rother [2007]. It is known as quadratic pseudo-boolean optimization (QPBO). This graph-cut based method can either assign a label to a node, or reject to classify it. The consistency property is guaranteed: if the label has been assigned, this exact label is assigned in the global optimum. The equivalent solution could be obtained using a specific LP relaxation with 1/2 probabilities for class labels allowed. I have not used the technique, so I cannot recommend an implementation.

To summarize, the algorithm for choosing an optimization method is the following. If your graph is actually a tree, use belief propagation. Otherwise, if your energy is submodular, use graph-cut based inference; it is both effective and efficient (even exact in the binary case). If not, use TRW-S, it often yields a good approximation.

Read Users' Comments (5)