## PGM-class and MRF parameter learning

I’m taking Stanford CS 228 (a.k.a. pgm-class) on Coursera. The class is great, I guess it provides close to the maximum one can do under the constraints of remoteness and bulkness. The thing I miss is theoretical problems, which were taken aside from the on-line version because they could not be graded automatically.

There is an important thing about graphical models I fully realized only recently (partly due to the class). This thing should be articulated clearly in every introductory course, but is often just mentioned, probably because lecturers consider it obvious. The thing is

**there is no probabilistic meaning of MRF potentials whatsoever**. The partition function is there not only for amenity: in contrast to Bayesian networks, there is no general way to assign potentials of an**undirected**graphical model to avoid normalization. The loops make it impossible. The implication is one should not assign potentials by estimating frequencies of assignments to factors (possibly conditioned on features) like I did earlier. This is quite a bad heuristic because it is susceptible to overcounting. Let me give an example.
For the third week programming assignment we needed to implement a Markov network for handwriting OCR. The unary and pairwise potentials are somewhat obvious, but there was also a task to add ternary factors. The accuracy of the pairwise model is 26%. Mihaly Barasz tried to add ternary factors with values proportional to trigram frequencies in English, which decreased performance to 21% (link for those who have access). After removing pairwise factors, the performance rose to 38%. Why has the joint model failed? The reason is overcounting evidence: different factor types enforce the same co-occurrences, thus creating bias towards more frequent assignments, and this shows it can be significant. Therefore, we should train models with cycles discriminatively.

One more thought I’d like to share: graphical model design is similar to software engineering in the way that the crucial thing for the both is eliminating insignificant dependencies on the architecture design stage.

Yaroslav Bulatovsays:20 April 2012 at 10:41

Here's an interesting side point -- if you could assign potentials to avoid global normalization then computing the partition function could be done efficiently with dynamic programming. And vica versa, if you can compute the partition function efficiently, then you could rewrite your model as a product of factors which can be estimated using simple counting.