## On MRF factorization

Recently, I figured out I didn't understand two simple things about MRFs:

1. When we talk about MRF factorization, should the formula for the likelihood include all the cliques in the graph, or only the maximal ones?
2. More generally, why are we allowed to factorize the likelihood this way at all?
I asked some folks, and it turned out that nobody understood this basic property clearly (even though Dmitry Vetrov assured me he had explained it in his courses). Finally, I found the answers in Christopher Bishop's book. If you are not familiar with Markov random fields or graphical models, I recommend reading it; it is relatively accessible.

So, let's start with the second question. For those of you who like names, this is exactly the necessity part of the Hammersley-Clifford theorem. As Bishop puts it:

> If we consider two nodes x_i and x_j that are not connected by a link, then these variables must be conditionally independent given all other nodes in the graph. This follows from the fact that there is no direct path between the two nodes, and all other paths pass through nodes that are observed, and hence those paths are blocked. This conditional independence property can be expressed as
>
> p(x_i, x_j | x_\{i,j}) = p(x_i | x_\{i,j}) p(x_j | x_\{i,j}),
>
> where x_\{i,j} denotes the set x of all variables with x_i and x_j removed. The factorization of the joint distribution must therefore be such that x_i and x_j do not appear in the same factor in order for the conditional independence property to hold for all possible distributions belonging to the graph.
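To make the quoted property concrete, here is a tiny numeric check (my own toy example, not from the book): a 3-node chain x1 - x2 - x3 with random binary pairwise potentials. Since x1 and x3 are not linked, they must be independent given x2.

```python
import numpy as np

rng = np.random.default_rng(0)
psi_12 = rng.uniform(0.1, 1, (2, 2))  # potential on clique {x1, x2}
psi_23 = rng.uniform(0.1, 1, (2, 2))  # potential on clique {x2, x3}

# joint over the chain x1 - x2 - x3 (normalized)
p = np.einsum('ab,bc->abc', psi_12, psi_23)
p /= p.sum()

# condition on x2 = 0 and check p(x1, x3 | x2) = p(x1 | x2) p(x3 | x2)
cond = p[:, 0, :] / p[:, 0, :].sum()
assert np.allclose(cond, np.outer(cond.sum(axis=1), cond.sum(axis=0)))
```

The conditional over (x1, x3) is a rank-one table, i.e. a product of its marginals, exactly because no factor contains both x1 and x3.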
The answer to the first question can be found in Bishop's book as well. In general, we should include all the cliques in the product. However, every function of a set of variables X is also a function of any superset Y of X. Consequently, a potential defined on a non-maximal clique can be merged into the potential of any maximal clique that contains it. Thus the two formulations (with all cliques, and with maximal cliques only) are equally expressive.
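Here is a small numpy sketch of that merging argument (a toy example of mine, not from Bishop): a chain with maximal cliques {x1, x2} and {x2, x3}, plus a non-maximal singleton potential on {x2}. Absorbing the singleton into a maximal-clique potential leaves the normalized joint unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
psi_12 = rng.uniform(0.1, 1, (2, 2))  # potential on maximal clique {x1, x2}
psi_23 = rng.uniform(0.1, 1, (2, 2))  # potential on maximal clique {x2, x3}
psi_2 = rng.uniform(0.1, 1, 2)        # potential on non-maximal clique {x2}

# joint over (x1, x2, x3) using all three potentials
joint_all = np.einsum('ab,bc,b->abc', psi_12, psi_23, psi_2)

# merge psi_2 into psi_12 (any maximal clique containing {x2} would do)
joint_merged = np.einsum('ab,bc->abc', psi_12 * psi_2[None, :], psi_23)

# both factorizations define the same distribution after normalization
assert np.allclose(joint_all / joint_all.sum(),
                   joint_merged / joint_merged.sum())
```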

## On image labelling

Labelling data is a laborious side task that arises in most computer vision projects. Since developers usually don't want to spend their time on such dumb work, a number of workarounds exist. Let me enumerate some I've heard of:

1. In academia, the task of labelling usually lands on [PhD] students' broad shoulders. The funny part is that the students are not always enrolled in the relevant project. At Graphics & Media Lab, students who have not attended enough seminars by the time of revision have to label some data sets for the lab projects.
2. One could also hire people to label the data. Since developers/researchers are relatively well-paid, it is economical to hire other folks (sometimes students as well). UPDATE: hr0nix mentioned in the comments that there is the Mechanical Turk service, which helps requesters find contractors.
3. A wittier way is to use applied psychology. For example, Google turned the labelling process into a game. During the gameplay, you and a randomly chosen partner tag images; the sooner you both tag an image with the same tag, the more points you get. A brilliant idea! Believe it or not, when I first saw it, I was carried away and could not stop playing until my friends dragged me out for a pizza!
4. The most revolutionary approach was introduced by Desney Tan. Here is a popular explanation of what he has done. The idea is to capture labels straight from one's brain using EEG/fMRI/whatnot. For now they can only perform 2- or 3-class labelling, but (I hope) it is only the beginning.
The last point reminds me of my old thoughts about the future of machine learning (or at least ML applied to vision). Nowadays we deal with ensembles of weak classifiers, such as decision trees, stumps, etc. One could use guinea pigs as weak classifiers! I suppose their brain is developed enough to understand the 3D structure of a scene the way the human brain does, while modern computer vision systems lack this ability. The animals could be trained, for example, by an electric shock after each wrong answer. Then it is no longer necessary to train "experts": it is sufficient to analyse their brain activity. Isn't it a breakthrough? :)

## OpenCV bindings

If you are a computer vision researcher or engineer, you cannot have missed the OpenCV library. Even if you are not, it could still be useful to you. For example, here is a funny application: a camera shoots the programmer's face whenever a merge fails. If you are not familiar with the library, I recommend looking through the list of its features on Wikipedia.

OpenCV is written in C to be extremely portable (for example, to DSPs). But C is not very popular nowadays. The recent 2.0 release contains (besides other decent stuff) C++ and Python wrappers. What about other languages?

There are a number of C# wrappers. The best known is EmguCV, which is reported to be the only C# wrapper that supports OpenCV 2.0 (honestly, I don't know what that means exactly, but I suppose the API corresponds to the C++ interface). It is distributed under the GPL or a "commercial licence with a small fee".

As for Java, JavaCV seems to be the only viable wrapper. It also contains wrappers for other popular libraries like FFmpeg. It is distributed under the GPL as well, but the author has promised to discuss relaxing it if needed.

OpenCV used to be supported by Intel, but it recently became a FOSS project. They are also going to participate in Google Summer of Code. If they succeed, you might try to apply. I think it would be a nice experience to develop such a popular library and be paid for it. :)

One of the fields they want to develop is augmented reality support for the Android operating system. When I learned that there is an AR API in Android, I decided to try it. So this is going to be a good opportunity!

UPD (Aug 6, 2011). A Haskell (!) wrapper for OpenCV by Noam Lewis has appeared.
Also, the OpenCV folks are developing an official Java wrapper. Looking forward to using it!

## Computer Vision: Fact & Fiction

I was surfing the web today and came upon the Stanford CS 223B course (Introduction to Computer Vision), which is said to be fucking hard. The first course homework is to watch the film series "Computer Vision: Fact & Fiction", where computer vision stars (like David Forsyth and Andrew Zisserman) analyse computer vision technologies featured in Hollywood movies. The videos require no background in vision and might be interesting to everyone. To me, it is also interesting to see how the famous vision folks look and talk.

My friend Tolya Yudanov said that talking about realism in The Terminator is like "arguing about the physical correctness of anime. The Terminator is a complex AI of the future, and it is stupid to apply modern computer vision criteria to it." So it is a good illustration of the concept of computer blindness. I encourage you to watch the videos; they are worth it.

## The disruption of scholarly publishing [in Russia]

You know, I am relatively new to big science. The first 2-column paper I read was Antonin Guttman's paper about R-Trees.1 That was about 2.5 years ago, so I've never used printed journals or conference proceedings. I have been wondering why they spend money on printing journals when researchers usually publish their papers on their home pages. Recently I read an article about the disruption of scholarly publishing by Michael Clarke. The article is quite long, so I summarize its ideas in the following paragraphs.

When Tim Berners-Lee came up with the idea of WWW in 1991, he thought of its purpose as some kind of scientific media, which would replace conferences and journals. Nowadays, the Internet is used for social networking, illegal distribution of music and pornography, but we still have journals and conferences off-line. Why?

The author indicates 5 main points:
• Dissemination - journals distribute papers all over the world
• Registration of discovery - to know who was first, Popov or Marconi
• Validation - to ensure that the results are correct; now provided by peer review
• Filtration - you are more likely to read a paper from a journal with a higher impact factor
• Designation - if you have publications at top conferences, you might be a cool scientist
The author argues that only the last point is critical for the modern system of journals and conferences. The others could be handled better by on-line services; we already have plenty of examples.

But in Russia, we don't rank scientists by citation index!2 (because we don't have one :) ) So the designation problem is not solved here, and Russia is ready for a new publishing system. Why do we still have journals? I don't know; probably the sluggishness of minds...

2 I am being a bit cunning here: some journals do serve the designation purpose. I mean the journals from the VAK list; one should have at least one publication in such a journal to get a Candidate of Science degree (the Russian PhD equivalent). But the review process there is usually lousy; there was also the Rooter case with one of them.

## Testing machine learning algorithms

Tests are the only way to estimate the quality of a machine learning algorithm in practice. To prove your algorithm is usable, you need to design good tests. To design a test, you should collect data and split it into a train set and a test set.

Machine learning theorists say that the train and test sets should come from a single probability distribution (unless they are talking about transfer learning or some kinds of on-line learning). But in practice it is a bit more complicated. We are currently working on the problem of classifying laser scan points, and designing tests for it is not trivial! We have scans from different domains (aerial scans, scans from moving vehicles, stationary terrestrial scans), and for each domain we would like to have a more or less universal benchmark. A wide range of algorithms is supposed to be tested against it, so the test must not encourage overfitting.

So, how can we split the data? To satisfy the single-distribution requirement, we could add the odd points of a cloud to the train set and the even points to the test set. This is a bad idea. Suppose your classifier uses the 3D coordinates of a point as features. For each point in the test set there is a very similar point in the train set, so we get nearly 100% precision even with such a primitive learner. Such a benchmark is not challenging enough.
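A quick simulation of why the odd/even split fails. The grid-shaped "scan" and the height-based labels below are invented for illustration; the point is that consecutive points in an ordered scan are spatially adjacent, so each test point has a near-twin in the train set.

```python
import numpy as np

# mimic an ordered laser scan: points come in scan lines, so
# consecutive points in the cloud are spatially adjacent
xs, ys = np.meshgrid(np.arange(0, 10, 0.2), np.arange(0, 10, 0.2))
zs = np.sin(xs) + np.cos(ys)
points = np.column_stack([xs.ravel(), ys.ravel(), zs.ravel()])
labels = (zs.ravel() > 0).astype(int)  # toy "class" derived from height

# the naive split: odd points to the train set, even ones to the test set
train_X, train_y = points[0::2], labels[0::2]
test_X, test_y = points[1::2], labels[1::2]

# 1-nearest-neighbour on raw coordinates, as primitive as a learner gets
preds = np.array([train_y[np.linalg.norm(train_X - x, axis=1).argmin()]
                  for x in test_X])
acc = (preds == test_y).mean()
assert acc > 0.9  # nearly perfect, because every test point has a twin nearby
```

The same learner would do far worse on a genuinely held-out scan, which is exactly the over-optimism the post warns about.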

Well, let's split every scan into a few pieces then. If we compose the test set from different sub-scans, does that solve the problem? Not at all. Say we have a number of aerial scans, retrieved from different heights, with different scanners, in different weather. If we put pieces of a single scan both into the test set and into the train set, we again get a non-challenging test. The rule is: pieces of a single scan must not appear in both the test set and the train set, if we want to train the classifier once for the whole domain. Do the test set and the train set come from a single distribution then? No! But here we have to neglect the theory in favour of practice.

One could say that it is reasonable to use cross-validation here. Well, it makes sense. According to Wikipedia, there are three types of cross-validation:
• Repeated random sub-sampling validation
• K-fold cross-validation
• Leave-one-out cross-validation
According to our rule, only k-fold cross-validation can be used, and each fold should contain points from its own scans only. But labelling scans is very laborious: it takes more than 20 hours to label a standard million-point scan, so we cannot have many scans labelled.
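The rule above amounts to doing k-fold splits over whole scans rather than over points (what is sometimes called group-wise cross-validation). A minimal numpy-only sketch; the function name, the scan ids, and the fold count are all invented for illustration:

```python
import numpy as np

def scan_kfold(scan_ids, k, seed=0):
    """Yield (train_idx, test_idx) pairs where each test fold consists
    of whole scans, so no scan is split across train and test."""
    scans = np.unique(scan_ids)
    np.random.default_rng(seed).shuffle(scans)
    for fold in np.array_split(scans, k):
        in_test = np.isin(scan_ids, fold)
        yield np.where(~in_test)[0], np.where(in_test)[0]

# four toy scans of four points each, identified by their scan id
scan_ids = np.array([0] * 4 + [1] * 4 + [2] * 4 + [3] * 4)
for train_idx, test_idx in scan_kfold(scan_ids, k=2):
    # no scan id ever appears on both sides of the split
    assert not set(scan_ids[train_idx]) & set(scan_ids[test_idx])
```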

This is not the only problem with testing point classification. Since a single point does not tell us anything, we should consider some neighbourhood of it and approximate that neighbourhood with a surface. So every point in either set needs its neighbourhood to be present as well. This problem disappears if you put whole scans into each set.
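For illustration, here is one common way to turn such a neighbourhood into a feature: fit a plane to the k nearest neighbours by PCA and take its normal. This is a standard trick, not necessarily what we use in the project; the function and its parameters are my own sketch.

```python
import numpy as np

def normal_at(cloud, i, k=10):
    """Estimate the surface normal at point i: the eigenvector of the
    neighbourhood covariance with the smallest eigenvalue."""
    d = np.linalg.norm(cloud - cloud[i], axis=1)
    nbrs = cloud[np.argsort(d)[:k]]           # k nearest points (incl. i)
    cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
    w, v = np.linalg.eigh(cov)                # eigenvalues in ascending order
    return v[:, 0]

# sanity check: for points on the z = 0 plane the normal is (0, 0, +-1)
xy = np.random.default_rng(0).uniform(0, 1, (50, 2))
cloud = np.column_stack([xy, np.zeros(50)])
n = normal_at(cloud, 0)
assert abs(n[2]) > 0.99
```

Note that the estimate breaks near scan borders, where the neighbourhood is truncated, which is precisely why whole scans should go into each set.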