tag:blogger.com,1999:blog-22936118408573697202024-03-18T14:10:35.744+03:00Computer Blindnessand Computer BlondnessAnonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.comBlogger46125tag:blogger.com,1999:blog-2293611840857369720.post-52870151806629999432016-12-16T15:00:00.000+03:002016-12-16T15:00:38.279+03:00NIPS 2016<div dir="ltr" style="text-align: left;" trbidi="on">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Arial; color: #232323; -webkit-text-stroke: #232323}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Arial; color: #232323; -webkit-text-stroke: #232323; min-height: 15.0px}
span.s1 {font-kerning: none}
span.s2 {text-decoration: underline ; font-kerning: none; color: #1255cc; -webkit-text-stroke: 0px #1255cc}
</style>
<br />
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">I have just returned from NIPS (technically, I am still travelling, in the Canary Islands). Surprisingly, this was my first NIPS. I enjoyed presentations of great work, as well as conversations with machine-learning folks and friends. Here is an edited high-level write-up I prepared for my co-workers at BlippAR, hence the focus on computer vision and deep learning; the scope of the conference is, of course, much broader.</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif; font-size: x-small;"><br /></span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">At the beginning of the conference, Yann LeCun gave a keynote that may be considered the programme speech for the whole event. Besides starting the layered-cake meme, he named the two <b>ideas he considers the brightest</b> of the last decade: adversarial training and external memory. My impression is that the most frequent buzzwords at the conference were:</span></span></div>
<div class="p1">
</div>
<ul style="text-align: left;">
<li><span style="font-family: "georgia" , "times new roman" , serif;">generative adversarial networks (GAN),</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;">recurrent neural networks (RNN),</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;">reinforcement learning (RL),</span></li>
<li><span style="font-family: "georgia" , "times new roman" , serif;">robotics (done with or without RL).</span></li>
</ul>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://cdn-images-1.medium.com/max/1600/1*KDvA9Fq3lm-eQOyGlcKAKg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="255" src="https://cdn-images-1.medium.com/max/1600/1*KDvA9Fq3lm-eQOyGlcKAKg.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: "times" , "times new roman" , serif;">Great engineers are good at defining useful abstractions. LeCun’s layered-cake metaphor was heavily reused throughout the conference.</span></td></tr>
</tbody></table>
<div>
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="-webkit-text-stroke-width: initial;">Andrew Ng gave a keynote on the</span><b style="-webkit-text-stroke-width: initial;"> workflow of applied DL research</b><span style="-webkit-text-stroke-width: initial;">. Most of the ideas are simple, yet often neglected. Here are a few of them. The most important metrics to track are human error, training-set error, and validation error. The gap between human and training-set error is the <i>bias</i>; it can be reduced by a bigger model or longer training. The validation/training gap is due to the <i>variance</i>; once the bias problem is addressed, you can add more data or increase regularisation to reduce it.</span></span></div>
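Ng's recipe boils down to a short decision rule. A minimal sketch in Python (the function name and the comparison logic are mine, not from the talk):

```python
def diagnose(human_error, train_error, val_error):
    """Suggest the next step following Ng's bias/variance workflow.

    The human/training gap is the (avoidable) bias; the
    validation/training gap is the variance.
    """
    bias = train_error - human_error
    variance = val_error - train_error
    if bias >= variance:
        # Under-fitting: make the model more expressive or train longer.
        return "high bias: try a bigger model or longer training"
    # Over-fitting: the model does not generalise beyond the training set.
    return "high variance: try more data or stronger regularisation"

print(diagnose(human_error=0.01, train_error=0.10, val_error=0.11))
```

In the example call, the 0.09 bias dominates the 0.01 variance, so the rule tells you to address bias first.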
<br />
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">In practice, however, the test data often comes from a different distribution than the training data. In that case, you may want two validation sets: one drawn from the training distribution and one from the test distribution. If you do not have the latter, you will be solving the wrong problem. Domain-adaptation methods may also be helpful, but they are not quite practical at the moment.</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">He also strongly advocates a centralised data-storage infrastructure that is easily accessible to all team members: it speeds up progress a lot.</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">Currently, supervised learning dominates the landscape, while unsupervised and reinforcement learning have been around for some time but still have not taken off. Ng predicts that the next-biggest areas (after supervised learning) will be neither of those, but rather transfer learning and multi-task learning: practically important areas that academic researchers largely ignore.</span></span></div>
<div class="p2">
<span style="font-family: "georgia" , "times new roman" , serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">There were seemingly few papers on <b>architectural design</b> (for supervised learning)<b>.</b> Daniel Sedra gave a great talk on <a href="https://arxiv.org/abs/1603.09382"><span class="s2">Neural networks of stochastic depth</span></a>. The idea is to drop entire layers during training in ResNet-like architectures, which allows growing them to 1000+ layers and gives a significant improvement in accuracy on CIFAR (a less significant one on ImageNet). Unfortunately, at test time all the layers are present, so inference time may be a problem. Another improvement is to add identity connections between each pair of layers, which also helps a bit.</span></span></div>
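The trick behind stochastic depth is a linearly decaying survival probability: the l-th of L residual blocks is kept with probability p_l = 1 - (l/L)(1 - p_L). A toy sketch of the forward pass (pure Python with scalar "features" and trivial residual functions; the real method operates on ResNet blocks):

```python
import random

def survival_prob(l, num_blocks, p_last=0.5):
    """Linear decay from nearly 1 at the first block down to p_last at the last."""
    return 1.0 - (l / num_blocks) * (1.0 - p_last)

def stochastic_depth_forward(x, blocks, p_last=0.5, training=True):
    """blocks: list of residual functions f, so a full layer computes x + f(x)."""
    num_blocks = len(blocks)
    for l, f in enumerate(blocks, start=1):
        p = survival_prob(l, num_blocks, p_last)
        if training and random.random() > p:
            continue  # drop the whole block: only the identity shortcut remains
        scale = 1.0 if training else p  # at test time, rescale by survival prob
        x = x + scale * f(x)
    return x

# Toy check: four identical residual functions on a scalar input.
blocks = [lambda v: 0.1 * v] * 4
print(stochastic_depth_forward(1.0, blocks, training=False))
```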
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">Michael Figurnov, my academic sibling, presented work on <b>speeding up inference</b> using so-called <a href="https://arxiv.org/abs/1504.08362" target="_blank">perforated convolutions</a>. The idea is to skip some of the multiplications that do not affect the result of the operation much. In the <a href="https://arxiv.org/abs/1612.02297" target="_blank">follow-up work</a>, he and his co-authors suggest skipping whole parts of the computation graph once they are estimated to be unimportant. This can potentially lead to a significant speed-up, but for most practitioners the question is support in libraries such as cuDNN. Hopefully, we will be able to use it soon.</span></span></div>
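To illustrate the gist of perforation (this is only the core idea, not the authors' implementation): evaluate the expensive operation at a subset of positions and fill the skipped ones from the nearest computed neighbour.

```python
def perforated_map(f, xs, mask):
    """Evaluate the expensive position-wise function f only where mask is
    True; fill the holes with the nearest computed value (1-D for brevity)."""
    computed = [i for i, m in enumerate(mask) if m]
    out = [None] * len(xs)
    for i in computed:
        out[i] = f(xs[i])          # the only calls to the expensive f
    for i, m in enumerate(mask):
        if not m:
            nearest = min(computed, key=lambda j: abs(j - i))
            out[i] = out[nearest]  # cheap nearest-neighbour fill-in
    return out

# f is called 4 times instead of 8; half the positions are interpolated.
print(perforated_map(lambda v: v * v, list(range(8)), [True, False] * 4))
```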
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg48Ht588lcVZRmxBAjTa9rNkYe3gOkRwUiCSmje-o7udw9fcKQRsJO2sFkXsk6LK-KbLn10gIMMFFNccZnFzQqwdFcc7V7-H4KBjjXRbRtchKwpg72MuztJ7pJSosy8VHf3Y3XzbYw_z7b/s1600/Screen+Shot+2016-12-16+at+11.17.09.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="100" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg48Ht588lcVZRmxBAjTa9rNkYe3gOkRwUiCSmje-o7udw9fcKQRsJO2sFkXsk6LK-KbLn10gIMMFFNccZnFzQqwdFcc7V7-H4KBjjXRbRtchKwpg72MuztJ7pJSosy8VHf3Y3XzbYw_z7b/s320/Screen+Shot+2016-12-16+at+11.17.09.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The method suggests not wasting computation on the background, evaluating more ResNet layers where they can pay off more.</td></tr>
</tbody></table>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">There was a demo I missed that demonstrated real-time </span><b style="font-family: Georgia, "Times New Roman", serif;">object detection on a mobile device</b><span style="font-family: "georgia" , "times new roman" , serif;"> (if you have a link, please share it with me!)</span><span style="font-family: "georgia" , "times new roman" , serif;">. It is based on </span><a href="http://pjreddie.com/darknet/yolo/" style="font-family: Georgia, "Times New Roman", serif;" target="_blank">YOLO</a><span style="font-family: "georgia" , "times new roman" , serif;">; the idea is to pre-compute and memoize some parts of the computational graph, then find the best approximation at inference time. There is some loss in accuracy, but real-time inference on the device looks impressive! </span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;"><br /></span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">The most interesting improvement in <b>deep learning optimisation</b> seems to be <a href="https://arxiv.org/abs/1602.07868"><span class="s2">Weight normalisation</span></a>. The paper builds on the idea of batch normalisation: it tries to explain its positive effect (i.e. staying on the unit sphere) and uses this intuition to propose an alternative way to stabilise the training process.</span></span></div>
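The reparametrisation at the heart of the paper is one line: w = g · v / ‖v‖, which decouples the direction of a weight vector from its scale so the two can be optimised separately. A minimal sketch:

```python
import math

def weight_norm(v, g):
    """Reparametrise a weight vector as w = g * v / ||v||: the direction
    comes from v and the length is exactly g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]

w = weight_norm([3.0, 4.0], g=2.0)
print(w)  # points along v, with length g
```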
<div class="p2">
<span style="font-family: "georgia" , "times new roman" , serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "georgia" , "times new roman" , serif;">As I said above, <b>GAN</b>s were a big (and controversial) topic here. They seem to be the most popular generative model for images, although variational autoencoders are still around. The training process is quite unstable, but the community keeps finding ways to stabilise it, most of them hacks born of a trial-and-error process. Soumith Chintala presented a list of 15 tricks in his talk <i>How to train a GAN?</i> He promised to publish the slides later, but here is <a href="https://github.com/soumith/ganhacks"><span class="s2">the write-up</span></a>. If you ever have to train GANs, make sure you read it first: there is no way you could guess this combination yourself (and papers typically do not mention all these hacks).</span></span></div>
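For a taste of what these hacks look like, here is one of them, soft (smoothed) labels for the discriminator, sketched as a plain-Python cross-entropy loss; the constants and the function shape are illustrative, not from the talk:

```python
import math

def d_loss(real_scores, fake_scores, real_label=0.9):
    """Discriminator cross-entropy with one-sided label smoothing: real
    targets are 0.9 instead of 1.0, which keeps the discriminator from
    becoming over-confident and starving the generator of gradients."""
    eps = 1e-7  # numerical guard for log(0)
    loss = 0.0
    for p in real_scores:  # want p close to real_label, not to 1.0
        loss -= real_label * math.log(p + eps) + (1 - real_label) * math.log(1 - p + eps)
    for p in fake_scores:  # fake targets stay at 0.0
        loss -= math.log(1 - p + eps)
    return loss / (len(real_scores) + len(fake_scores))

print(d_loss(real_scores=[0.8, 0.95], fake_scores=[0.1, 0.2]))
```

With smoothing, the loss on real examples is minimised at a score of 0.9, so pushing the score all the way to 1.0 is actually penalised.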
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://pbs.twimg.com/media/CyxUpy-XUAYdX0t.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://pbs.twimg.com/media/CyxUpy-XUAYdX0t.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Poster by <a href="https://twitter.com/soumithchintala/status/805111562503589889" target="_blank">Soumith</a>. TL;DR: use magic.</td></tr>
</tbody></table>
<div class="p1">
<span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;">In the </span><b style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;">visual search </b><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;">field, there is the </span><a href="https://arxiv.org/abs/1606.03558" style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;"><span class="s2">Universal correspondence network</span></a><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;"> paper. Instead of matching features, it suggests running a single pass through the network to find dense correspondences. The authors claim that, compared to extracting deep features for patches, the number of distance computations drops from quadratic to linear (although a spatial index can also mitigate the issue). The network appears to learn some notion of geometry/topology: one can train it to do rigid matching, as well as to look more at semantic similarity. The results they showed were on very similar images, so they did not impress me much, although that may have been done only to ease visualisation.</span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8ugmSbmWv9Y51do44RHmj25aBwZJNc1ODJvsO4tY9c_ERIjufXdLbyu23MwgVktVDc5KxnPXqvDTpDeOTpKTLzdrOMOQVQaba8x3PIu9qpSFteV1fPjbjmXHSafEtz6Zraup6CU7TOBZX/s1600/Screen+Shot+2016-12-16+at+11.31.14.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8ugmSbmWv9Y51do44RHmj25aBwZJNc1ODJvsO4tY9c_ERIjufXdLbyu23MwgVktVDc5KxnPXqvDTpDeOTpKTLzdrOMOQVQaba8x3PIu9qpSFteV1fPjbjmXHSafEtz6Zraup6CU7TOBZX/s640/Screen+Shot+2016-12-16+at+11.31.14.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Deep learning continues to disrupt traditional computer-vision pipelines.</td></tr>
</tbody></table>
<div class="p1">
<span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;">For </span><b style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;">reinforcement learning</b><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;">, there is no single paper I can pick as the brightest; the biggest impact seems to come from the release of </span><a href="https://universe.openai.com/" style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;" target="_blank">OpenAI Universe</a><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;"> and </span><a href="https://github.com/deepmind/lab" style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;" target="_blank">DeepMind Lab</a><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;">, both training environments for reinforcement-learning algorithms built on computer-generated imagery (i.e. video games). This seems a valid step towards building intelligent agents that operate in the real world, although there is scepticism that the final step will be difficult: the methods are data-hungry, and it is difficult to get a lot of data from anything other than a simulated environment, let alone obtain a frequent reward signal. This is one of the reasons researchers study </span><a href="https://arxiv.org/abs/1603.00448v3" style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;" target="_blank">imitation learning</a><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;">, where a robot learns the cost function from demonstrations rather than taking it for granted. With this approach, we may be able to teach robots to perform tasks like pouring water into a glass by having a person or another robot show how it is done.</span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://i.ytimg.com/vi/GgkmJDjeJtw/maxresdefault.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://i.ytimg.com/vi/GgkmJDjeJtw/maxresdefault.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">DeepMind collaborated with Blizzard to provide StarCraft API to facilitate training of RL agents. Now you have an excuse to play games at work.</td></tr>
</tbody></table>
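Both environments expose the by-now-standard observe-act-reward loop. A toy mock of that Gym-style interface (the environment below is a stand-in for illustration only, not the real Universe or Lab API):

```python
import random

class ToyEnv:
    """A stand-in environment with the Gym-style reset/step interface.
    Reward is 1 while the scalar state stays within bounds."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action is -1 or +1
        self.state += action
        reward = 1.0 if abs(self.state) <= 3 else 0.0
        done = abs(self.state) > 3
        return self.state, reward, done, {}

def run_episode(env, policy, max_steps=20):
    """The universal RL loop: observe, act, collect reward, repeat."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

random.seed(0)
print(run_episode(ToyEnv(), policy=lambda obs: random.choice([-1, 1])))
```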
<div class="p1">
<b style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;">Bayesian methods</b><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;"> seem to be quite popular and apparently can be successfully combined with neural networks (variational autoencoders and their variants are used a lot). The workshops on approximate variational inference and Bayesian deep learning attracted a large number of papers. The tutorial on variational inference (</span><a href="http://www.cs.columbia.edu/~blei/talks/2016_NIPS_VI_tutorial.pdf" style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;" target="_blank">slides-pdf-1</a><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;">, </span><a href="http://shakirm.com/papers/VITutorial.pdf" style="-webkit-text-stroke-width: initial; font-family: Georgia, "Times New Roman", serif;" target="_blank">slides-pdf-2</a><span style="-webkit-text-stroke-width: initial; font-family: "georgia" , "times new roman" , serif;">) reflected the significant progress of recent years: things like the reparametrisation trick and stochastic optimisation have made it possible to estimate quite sophisticated posterior distributions.</span></div>
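The reparametrisation trick mentioned in the tutorial is simple to state: instead of sampling z ~ N(mu, sigma²) directly, draw eps ~ N(0, 1) and set z = mu + sigma * eps, so the sample becomes a differentiable function of mu and sigma. A sketch (the sanity check at the end is mine):

```python
import random

def reparam_sample(mu, sigma, rng=random):
    """z = mu + sigma * eps with eps ~ N(0, 1): all randomness moves into
    eps, so gradients can flow through mu and sigma."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

# Sanity check: the samples have the intended mean and spread.
random.seed(0)
samples = [reparam_sample(mu=2.0, sigma=0.5) for _ in range(10000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # close to mu = 2.0
```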
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtj6Ldm-AZTCi8QU170tyxGY5h1722SZJEOJ2gDG5TmVgyGkPQtAPmHf4tyEToLjpN1XTgkSWLPpV8HsXKyffZ5LJFO8GwVEKJcAn98uUrneON5apb5NkIk66p0kJY4cpPzkOzFiW-2S_Q/s1600/Screen+Shot+2016-12-16+at+11.50.39.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtj6Ldm-AZTCi8QU170tyxGY5h1722SZJEOJ2gDG5TmVgyGkPQtAPmHf4tyEToLjpN1XTgkSWLPpV8HsXKyffZ5LJFO8GwVEKJcAn98uUrneON5apb5NkIk66p0kJY4cpPzkOzFiW-2S_Q/s320/Screen+Shot+2016-12-16+at+11.50.39.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Slide by David Blei. Bayesian models have long been used successfully to estimate uncertainty, in particular to glue models into complex pipelines while avoiding information loss. It is great that we can use them with modern models as well.</td></tr>
</tbody></table>
<div class="p1">
<span style="font-family: "georgia" , "times new roman" , serif;"><span style="-webkit-text-stroke-width: initial;">Surely, an important part of the conference is </span>socialisation<span style="-webkit-text-stroke-width: initial;">. Thanks to everybody I had a chance to meet at the conference and have a productive (or just fun) conversation with. See you at future conferences!</span></span></div>
</div>
Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com8tag:blogger.com,1999:blog-2293611840857369720.post-537789717029977612016-08-24T11:00:00.000+03:002016-08-24T11:00:20.359+03:00Launching Jupyter notebook server on startup<div dir="ltr" style="text-align: left;" trbidi="on">
<i>In this post, I show how to automatically run the Jupyter notebook server on an OSX machine on startup, without using extra terminal windows.</i><br />
<h4 style="text-align: left;">
Background</h4>
<a href="http://jupyter.org/" target="_blank">Jupyter notebooks</a> provide a useful interactive environment (think of a better alternative to the console) for scripting languages like Python, Julia, and Lua/Torch, and have become a favourite tool of many data scientists for handling datasets that fit into a laptop’s memory. If you have not used them, try them next time you do data analysis or prototype an algorithm.<br />
<h4 style="text-align: left;">
The Pain</h4>
There is a slight technical difficulty, though. The way Jupyter works, you run a server process on your machine (or somewhere on the network) and then access it through a web client. I have noticed a popular usage pattern in the local-server case: a user runs the command that starts the server process in a terminal, then keeps the terminal window open while she is using the notebook. This method has several flaws:<br />
<br />
<ul style="text-align: left;">
<li>you need to run the process manually each time you need a notebook (after reboot);</li>
<li>you have to remember the command line arguments;</li>
<li>the process takes up a terminal window/tab, or, if you run it in the background, you can accidentally close the tab and lose the state stored in memory.</li>
</ul>
<h4 style="text-align: left;">
The Solution</h4>
<div>
If your perfectionist’s feelings are hurt by the nuisances above, my setup may remedy them. It uses the <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">launchd</span> framework, which starts, stops, and manages background processes. </div>
<div>
<br /></div>
<div>
First, create a property list file in your <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">~/Library/LaunchAgents </span>directory: </div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">~/Library/LaunchAgents/com.blippar.jupyter-ipython.plist</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: inherit;">and put the following text there. This is Apple’s property-list format: nominally XML, but with its own conventions (e.g. the order of child nodes matters, since each key must be immediately followed by its value). I have highlighted in red the paths and other variables you will have to change:</span></div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"></span></div>
<div>
<pre style="font-family: 'courier new', courier, monospace; font-size: x-small;">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&gt;
&lt;plist version="1.0"&gt;
&lt;dict&gt;
    &lt;key&gt;KeepAlive&lt;/key&gt;
    &lt;true/&gt;
    &lt;key&gt;Label&lt;/key&gt;
    &lt;string&gt;com.blippar.jupyter-ipython&lt;/string&gt;
    &lt;key&gt;ProgramArguments&lt;/key&gt;
    &lt;array&gt;
        &lt;string&gt;zsh&lt;/string&gt;
        &lt;string&gt;-c&lt;/string&gt;
        &lt;string&gt;source ~/.zshrc;workon <span style="color: red;">py3</span>;jupyter notebook --notebook-dir <span style="color: red;">/Users/roman/jupyter-nb</span> --no-browser&lt;/string&gt;
    &lt;/array&gt;
    &lt;key&gt;RunAtLoad&lt;/key&gt;
    &lt;true/&gt;
    &lt;key&gt;StandardErrorPath&lt;/key&gt;
    &lt;string&gt;<span style="color: red;">/Users/roman/Library/LaunchAgents/jupyter-notebook.stderr</span>&lt;/string&gt;
    &lt;key&gt;StandardOutPath&lt;/key&gt;
    &lt;string&gt;<span style="color: red;">/Users/roman/Library/LaunchAgents/jupyter-notebook.stdout</span>&lt;/string&gt;
&lt;/dict&gt;
&lt;/plist&gt;</pre>
</div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> </span></div>
<div>
The most interesting part is the <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">ProgramArguments </span>entry. Since it is impossible to specify several commands in a single argument, I had to run a separate shell instance and pass it the commands separated by semicolons. If you do not need to activate a virtual environment (<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">workon </span><span style="color: red; font-family: "courier new" , "courier" , monospace; font-size: x-small;">py3</span>), you might get away without calling the shell, although I recommend always using virtualenv for Python on a local machine. Also, I use <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">zsh</span>; replace it with your shell of choice.</div>
<div>
<br /></div>
<div>
To check if everything works, run</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">launchctl load ~/Library/LaunchAgents/com.blippar.jupyter-ipython.plist</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">launchctl start</span><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> com.blippar.jupyter-ipython</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
and open <a href="http://localhost:8888/tree">http://localhost:8888/tree</a> in your browser. Hopefully, you see the list of available notebooks.</div>
<div>
<br /></div>
<div>
Linux users, you should be able to replicate this with <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">systemd </span>(although, chances are you already know that =).</div>
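For reference, a roughly equivalent systemd user unit might look like the sketch below (the paths and the virtualenv name are illustrative and must match your setup); save it as ~/.config/systemd/user/jupyter.service and run systemctl --user enable --now jupyter:

```ini
[Unit]
Description=Jupyter notebook server

[Service]
# Adjust the shell, virtualenv, and notebook directory to your setup.
ExecStart=/bin/zsh -c 'source ~/.zshrc; workon py3; jupyter notebook --notebook-dir /home/roman/jupyter-nb --no-browser'
Restart=always

[Install]
WantedBy=default.target
```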
</div>
Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com6tag:blogger.com,1999:blog-2293611840857369720.post-63741246954088471852015-10-09T12:28:00.000+03:002015-10-09T12:34:56.813+03:00Asynchronous Fitter design pattern for training in IPython notebooks<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="border: 0px; margin: 0px 0px 1em; padding: 0px; text-align: left; word-wrap: break-word;">
[Cross-posted from <a href="http://codereview.stackexchange.com/questions/106943/asynchronous-model-fitting-that-allows-termination-in-python">CodeReview@SX</a>]<br />
<br />
<h3 style="text-align: left;">
The problem</h3>
When you work with Python interactively (e.g. in an IPython shell or notebook) and run a computationally intensive operation, such as fitting a machine-learning model implemented in native code, you cannot interrupt it, since the native code does not return execution control to the Python interpreter until the operation ends. The problem is not specific to machine learning, although it is typical to run a training process whose duration you cannot predict. If it takes longer than you expected, the only way to stop training is to kill the kernel, losing the pre-processed features and other variables stored in memory; in other words, you cannot interrupt only that particular operation to check a simpler model, which would presumably fit faster.<br />
<br />
<h3 style="text-align: left;">
The solution</h3>
I propose an <i>Asynchronous Fitter </i>design pattern that runs fitting in a separate process and communicates the results back when they are available. It allows stopping training gracefully by killing the spawned process and then training a simpler model. It also allows training several models simultaneously and working in the IPython notebook while the models train. Note that multithreading is probably not an option, since we cannot stop a thread that runs uncontrolled native code.<br />
<br />
Here is a draft implementation:</div>
<pre class="lang-py prettyprint prettyprinted" style="background-color: #eeeeee; border: 0px; color: #393318; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, sans-serif; font-size: 13px; margin-bottom: 1em; max-height: 600px; overflow: auto; padding: 5px; width: auto; word-wrap: normal;"><code style="border: 0px; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, sans-serif; margin: 0px; padding: 0px; white-space: inherit;"><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">from</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> multiprocessing </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">import</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="typ" style="border: 0px; color: #2b91af; margin: 0px; padding: 0px;">Process</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="typ" style="border: 0px; color: #2b91af; margin: 0px; padding: 0px;">Queue</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">import</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> time
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">class</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="typ" style="border: 0px; color: #2b91af; margin: 0px; padding: 0px;">AsyncFitter</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">object</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">):</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">def</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> __init__</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> model</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">):</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">queue </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="typ" style="border: 0px; color: #2b91af; margin: 0px; padding: 0px;">Queue</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">()</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">model </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> model
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">None</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">start_time </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">None</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">def</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> fit</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> x_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> y_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">):</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">terminate</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">()</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="typ" style="border: 0px; color: #2b91af; margin: 0px; padding: 0px;">Process</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">target</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="typ" style="border: 0px; color: #2b91af; margin: 0px; padding: 0px;">AsyncFitter</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">async_fit_</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
args</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">model</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> x_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> y_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">queue</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">))</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">start_time </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> time</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">time</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">()</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">start</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">()</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">def</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> try_get_result</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">):</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">if</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">queue</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">empty</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">():</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">return</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">None</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">return</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">queue</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">get</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">()</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">def</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> is_alive</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">):</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">return</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">is</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">not</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">None</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">and</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">is_alive</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">()</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">def</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> terminate</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">):</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">if</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">is</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">not</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">None</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">and</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">is_alive</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">():</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">terminate</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">()</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">proc </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">None</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">def</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> time_elapsed</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">):</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">if</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">not</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">start_time</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">:</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">return</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="lit" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">0</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">return</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> time</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">time</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">()</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">-</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> self</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">start_time
</span><span class="lit" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">@staticmethod</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">def</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> async_fit_</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">model</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> x_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> y_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> queue</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">):</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
model</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">fit</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">x_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> y_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">)</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
queue</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">put</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">model</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">)</span></code></pre>
<h3 style="text-align: left;">
Usage</h3>
It is easy to modify code that uses scikit-learn to adopt this pattern. Here is an example:<br />
<pre class="lang-py prettyprint prettyprinted" style="background-color: #eeeeee; border: 0px; color: #393318; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, sans-serif; font-size: 13px; margin-bottom: 1em; max-height: 600px; overflow: auto; padding: 5px; width: auto; word-wrap: normal;"><code style="border: 0px; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, sans-serif; margin: 0px; padding: 0px; white-space: inherit;"><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">import</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> numpy </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">as</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> np
</span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">from</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> sklearn</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">svm </span><span class="kwd" style="border: 0px; color: darkblue; margin: 0px; padding: 0px;">import</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> SVC
model </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> SVC</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">C </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="lit" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">1e3</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> kernel</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="str" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">'linear'</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">)</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
fitter </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="typ" style="border: 0px; color: #2b91af; margin: 0px; padding: 0px;">AsyncFitter</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">model</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">)</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
x_train </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> np</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">random</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">rand</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="lit" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">500</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="lit" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">30</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">)</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
y_train </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> np</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">random</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">randint</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="lit" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">0</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="lit" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">2</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> size</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">=(</span><span class="lit" style="border: 0px; color: maroon; margin: 0px; padding: 0px;">500</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,))</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
fitter</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">.</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">fit</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">(</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">x_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">,</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> y_train</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">)</span></code></pre>
<br />
You can check whether training is still running by calling <span style="font-family: Courier New, Courier, monospace;">fitter.is_alive()</span> and see how much time has elapsed by calling <span style="font-family: Courier New, Courier, monospace;">fitter.time_elapsed()</span>. At any point you can <span style="font-family: Courier New, Courier, monospace;">terminate()</span> the process, or simply start fitting another model, which terminates the previous one. Finally, you can obtain the trained model via <span style="font-family: Courier New, Courier, monospace;">try_get_result()</span>, which returns <span style="font-family: Courier New, Courier, monospace;">None</span> while training is still in progress.<br />
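The same fit-in-a-child-process pattern can be exercised in isolation. Below is a self-contained sketch where a toy model stands in for an sklearn estimator (the <span style="font-family: Courier New, Courier, monospace;">SlowModel</span> class and the half-second "training" are illustrative assumptions, not part of the recipe above):

```python
import time
from multiprocessing import Process, Queue

# Toy stand-in for an sklearn estimator: anything with a fit() method works.
class SlowModel(object):
    def fit(self, x, y):
        time.sleep(0.5)  # pretend training takes a while
        self.fitted_ = True

def fit_in_background(model, x, y, queue):
    # Runs in the child process; the fitted model travels back via the queue.
    model.fit(x, y)
    queue.put(model)

queue = Queue()
proc = Process(target=fit_in_background, args=(SlowModel(), None, None, queue))
proc.start()

# Poll for the result, doing other work (here: just sleeping) in between.
model = None
while model is None:
    time.sleep(0.1)
    if not queue.empty():
        model = queue.get()
proc.join()
```

Note that the model comes back through the queue by pickling, so its fitted attributes survive the round trip.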
<div>
<br />
<h3 style="text-align: left;">
The issues</h3>
As far as I understand, the training set is pickled and copied to the child process, which may be a problem if it is large. Is there an easy way to avoid that? Note that training needs only read-only access to the training set.<br />
<br />
What happens if someone loses a reference to an AsyncFitter instance that wraps a running process? Is there a way to implement an asynchronous delayed resource cleanup?</div>
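One possible answer to the second question (again an assumption of mine, not something from the recipe above) is <span style="font-family: Courier New, Courier, monospace;">weakref.finalize</span>: register a callback that terminates the child when the wrapper object is garbage-collected. The callback must capture the <span style="font-family: Courier New, Courier, monospace;">Process</span> object, never the wrapper itself, or the wrapper would be kept alive forever:

```python
import time
import weakref
from multiprocessing import Process

class OwnedProcess(object):
    """Wrapper whose child process is cleaned up on garbage collection."""
    def __init__(self, target, args=()):
        self.proc = Process(target=target, args=args)
        self.proc.start()
        # finalize() holds only a weak reference to `self`, so it does not
        # keep the wrapper alive; it captures the Process object strongly.
        self._finalizer = weakref.finalize(self, OwnedProcess._reap, self.proc)

    @staticmethod
    def _reap(proc):
        if proc.is_alive():
            proc.terminate()
        proc.join()

def forever():
    time.sleep(3600)

owner = OwnedProcess(forever)
child = owner.proc
del owner  # under CPython, dropping the last reference runs the finalizer now
```

Under CPython the finalizer fires as soon as the last reference is dropped; on other interpreters it runs whenever the garbage collector gets to the wrapper, so the cleanup is delayed but still guaranteed before interpreter shutdown.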
</div>
<h3 style="text-align: left;">
X Bridges of Königsberg (2013-02-15)</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
The recent <a href="http://spikedmath.com/">Spiked Math</a> comic strip proposes some cheats to build an <a href="http://en.wikipedia.org/wiki/Eulerian_path">Eulerian path</a> over the famous <a href="http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg">bridges of Königsberg</a>, which was originally impossible. Here is the strip:</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://spikedmath.com/541.html" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="http://img.spikedmath.com/comics/541-solutions-to-the-seven-bridges-of-konigsberg.png" width="482" /></a></div>
<br />
<div style="text-align: justify;">
It turns out that in practice yet another solution worked (in fact, a mix of the engineer’s and the mythbuster’s solutions): ruin the city during a world war, then hand it over to the Russians, who build new bridges instead. I visited Kaliningrad (the city has been renamed, too) two years ago, but did not try to solve Euler’s problem in the field. Let’s find out whether it is possible now. Look at the modern map:</div>
<div class="separator" style="clear: both; text-align: center;">
<iframe frameborder="0" height="350" marginheight="0" marginwidth="0" scrolling="no" src="https://maps.google.com/maps/ms?msa=0&msid=205148145045254464247.0004d5c49f1ee9c7ac540&ie=UTF8&t=h&ll=54.704441,20.506582&spn=0.017357,0.036478&z=14&output=embed" width="425"></iframe><br /></div>
<div style="text-align: center;">
<small>View <a href="https://maps.google.com/maps/ms?msa=0&msid=205148145045254464247.0004d5c49f1ee9c7ac540&ie=UTF8&t=h&ll=54.704441,20.506582&spn=0.017357,0.036478&z=14&source=embed" style="color: blue; text-align: left;">Königsberg bridges</a> in a larger map</small></div>
<br />
<div style="text-align: justify;">
So, the modified Spiked Math scheme looks as follows (the new bridges are shown in light blue):</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0Zirbdxshyphenhyphendbet-pVKiihlNUgnqWORV9-h5kuRO-CUdd-XTmscKQArynAWP3j5VlsjCYpDGL__aNZcbFouzEYwS3kYDouo4TxHLLgjuQO643hxC1rE0JM4fPWoFO1623hTkPiMlSIrbn9/s1600/sm-koenig-new.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="146" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0Zirbdxshyphenhyphendbet-pVKiihlNUgnqWORV9-h5kuRO-CUdd-XTmscKQArynAWP3j5VlsjCYpDGL__aNZcbFouzEYwS3kYDouo4TxHLLgjuQO643hxC1rE0JM4fPWoFO1623hTkPiMlSIrbn9/s320/sm-koenig-new.png" width="320" /></a></div>
<br />
<div style="text-align: justify;">
Is there an Eulerian path nowadays, i.e. can you walk through the city and cross each bridge once and only once? Recall the theorem:</div>
<blockquote class="tr_bq" style="text-align: justify;">
<i>A connected graph allows an Eulerian path if and only if it has no more than two vertices of an odd degree.</i></blockquote>
<div style="text-align: justify;">
In the picture above, two pieces of land have an even number of incident bridges, while the Kneiphof island in the center has degree 3, and the left bank (at the bottom) has degree 5. This means that Eulerian paths now exist. Any such path must begin on the left bank and end on the island, or vice versa. It may be a good idea to finish your journey near the grave of Immanuel Kant. :) However, there is no Eulerian <i>circuit</i>, i.e. a closed Eulerian path.</div>
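The parity argument is easy to check mechanically. In the sketch below, the two odd degrees come from the analysis above, while the even degrees of the two remaining land masses are illustrative placeholders (the theorem only cares about parity):

```python
# Degrees of the four land masses in the modern bridge graph. The odd
# degrees (Kneiphof: 3, left bank: 5) come from the text above; the even
# values for the other two pieces are placeholders.
degree = {"Kneiphof": 3, "left bank": 5, "right bank": 4, "Lomse": 4}

odd = sorted(v for v, d in degree.items() if d % 2 == 1)

# Eulerian path: exactly 0 or 2 vertices of odd degree (the path must
# start and end at the odd ones). Eulerian circuit: no odd vertices.
has_path = len(odd) in (0, 2)
has_circuit = len(odd) == 0
```

With two odd vertices the path exists but the circuit does not, matching the conclusion above.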
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: x-small;">PS. 1. If you zoom out the map above, you’ll see one more piece of land and two more bridges to the east. However, it only adds a vertex of degree 2 and swaps the parity of the two banks, so the properties of the new graph are similar; you just need to start/finish your journey on the right bank instead of the left one (and walk 10 km more).</span></div>
<div style="text-align: justify;">
<span style="font-size: x-small;">2. When I was preparing this article, I found </span><a href="http://people.engr.ncsu.edu/mfms/SevenBridges/" style="font-size: small;">a similar analysis</a><span style="font-size: x-small;">. It used an old map, so the Jubilee bridge and the 2nd Estacade bridge were not marked (they were built in 2005 and 2012, respectively), and it also ignored the Reichsbahnbrücke (a railroad bridge, but it seems pedestrian-accessible).</span></div>
<div style="text-align: justify;">
<span style="font-size: x-small;">3. It would be more interesting to analyze the bridge graphs of other cities with more islands, like St. Petersburg (several hundred bridges) or Amsterdam (numerous bridges/canals, but with a regular structure).</span><br />
<span style="font-size: x-small;">4. Here is </span><a href="http://www.math.dartmouth.edu/~euler/docs/originals/E053.pdf" style="font-size: small;">Euler’s original paper</a><span style="font-size: x-small;"> (in Latin) with his drawings of the above schemes.</span></div>
<br /></div>
Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com34tag:blogger.com,1999:blog-2293611840857369720.post-85209948400965607472012-12-04T16:16:00.000+04:002012-12-04T16:16:01.066+04:00Machine learning in facebook<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Last week <a href="http://www.facebook.com/max.gubin">Max Gubin</a> gave a talk on how Facebook exploits machine learning. The talk was more technical than one might have expected, so I will share some interesting facts in this post. I hope it doesn’t disclose any secret information. :)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Max started with a screenshot of the Facebook interface, where each element that benefits from machine learning was highlighted. Just to name some, they use learning to predict the order of stories in the newsfeed, the groups/ads/chat contacts to show you, and even occasions when your account is supposedly hacked, so that you need to verify your identity. For most of the tasks the learning is probably trivial, but at least two of them involve complicated algorithms (see below).</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Facebook engineers face a number of difficulties. A user expects a page to load almost instantly, and the network infrastructure already imposes some lag. To avoid adding further delay, prediction should be done in tens of microseconds. Moreover, half a billion daily active users send a lot of queries, so massively-parallel implementations would be too expensive. They have no choice other than sticking to linear models. For example, they train a linear fitness function to rank the stories for the newsfeed (using e.g. hinge loss or logistic loss). It should be trained to satisfy multiple, often contradictory criteria. For example, maximizing personal user experience (showing the most interesting stories) might hurt the experience of other users (if one has few friends, they are the only users who can read his/her posts) or degrade the system as a whole (certain types of news might not be really interesting to anyone, while still necessary to improve connectivity of the social network). Those criteria should be balanced in the learning objective, and the coefficients change over time. Even the personal user experience cannot be measured easily. The obvious thing to try is to ask users to label interesting stories (or use their <i>Likes</i>). However, such tests are always biased: Facebook tried to use this subjective labelling three times, and all three attempts were unsuccessful. Users just don’t tell you what they <i>really</i> like.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Another challenge is the quickly-changing environment. For example, interest in specific ads may be seasonal. In advertising, one of the strategies is to maximize the click-through rate (CTR). The model for personalized ads should be able to learn online to adapt to changes efficiently. They use probit regression, where online updates can be written in closed form, unlike logistic regression (note the linear model again!). It is based on Microsoft’s <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=74419">TrueSkill™</a> method for learning the ranks of players to find good matches, and seems similar to what Bing uses for CTR maximization [<a href="http://www.icml2010.org/papers/901.pdf">Graepel et al., 2010</a>].</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Finally, Max mentioned the problem of evaluating new features. The common practice in the industry is <i><a href="http://en.wikipedia.org/wiki/Ab_testing">A/B testing</a></i>, where a group of users is randomly selected to test some feature, and the rest of the users are treated as the control group. Then they compare the indicators for the two groups (e.g. average time spent on the website, or clicks made on newsfeed stories) and apply statistical tests. As usual, the relevant samples are typically small. For example, if they want to test a feature for search in Chinese, they take <i>a small group</i> of 10 million users and hope that some of them will query in Chinese (recall that Facebook is unavailable in China). It is typically hard to prove a statistically significant improvement.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It was partially a hiring event. If you are looking for an internship or a full-time job, you may contact their HR specialist for Eastern Europe, <a href="http://www.facebook.com/moreno.marina">Marina</a>. Facebook also keeps in touch with <a href="http://www.facebook.com/careers/university">universities</a>, e.g. by inviting professors to give talks in their offices or to develop joint courses. Professors may apply, but I don’t know a contact for that.</div>
<div style="text-align: justify;">
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com18tag:blogger.com,1999:blog-2293611840857369720.post-37415247892193820992012-04-16T02:42:00.002+04:002012-04-16T02:42:20.888+04:00PGM-class and MRF parameter learning<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
I’m taking Stanford CS 228 (a.k.a. <a href="https://class.coursera.org/pgm/class/index">pgm-class</a>) on Coursera. The class is great; I guess it provides close to the maximum one can achieve under the constraints of remoteness and scale. The thing I miss is theoretical problems, which were left out of the online version because they could not be graded automatically.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
There is an important thing about graphical models that I fully realized only recently (partly due to the class). This thing should be articulated clearly in every introductory course, but is often just mentioned, probably because lecturers consider it obvious. The thing is that <b>there is no probabilistic meaning of MRF potentials whatsoever</b>. The partition function is there not merely for convenience: in contrast to Bayesian networks, there is no general way to assign the potentials of an <b>undirected</b> graphical model so as to avoid normalization. The loops make it impossible. The implication is that one should not assign potentials by estimating frequencies of assignments to factors (possibly conditioned on features) like <a href="http://shapovalov.ro/papers/Shapovalov-et-al-PCV2010.pdf">I did earlier</a>. This is quite a bad heuristic because it is susceptible to overcounting. Let me give an example.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For the third week's programming assignment we needed to implement a Markov network for handwriting OCR. The unary and pairwise potentials are somewhat obvious, but there was also a task to add ternary factors. The accuracy of the pairwise model was 26%. <a href="http://www.cs.elte.hu/~klao/">Mihaly Barasz</a> tried to add ternary factors with values proportional to <a href="http://en.wikipedia.org/wiki/Trigram">trigram</a> frequencies in English, which decreased performance to 21% (<a href="https://class.coursera.org/pgm/forum/thread?thread_id=876">link for those who have access</a>). After removing the pairwise factors, the performance rose to 38%. Why did the joint model fail? The reason is overcounting of evidence: different factor types enforce the same co-occurrences, thus creating a bias towards more frequent assignments, and this example shows the bias can be significant. Therefore, we should train models with cycles discriminatively.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
One more thought I’d like to share: graphical model design is similar to software engineering in that the crucial thing for both is eliminating insignificant dependencies at the architecture design stage.</div>
<div style="text-align: justify;">
<br /></div>
</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com8tag:blogger.com,1999:blog-2293611840857369720.post-81028043002264527922012-01-05T20:05:00.000+04:002012-01-05T20:05:11.239+04:00When and why it is safe to cast floor/ceil result to integer<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Happy new year, ladies and gents! I am back, and the following couple of posts are going to be programming-related.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The C/C++ standard library declares the <a href="http://cplusplus.com/reference/clibrary/cmath/floor/">floor function</a> as returning a floating-point number:</div>
<pre style="color: green; font-size: 12px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: -webkit-auto;"> double floor ( double x );
float floor ( float x ); // C++ only</pre>
<div style="text-align: justify;">
I used to be tortured by two questions:</div>
<div style="text-align: justify;">
<br /></div>
<ol style="text-align: left;">
<li style="text-align: justify;">Why would the floor function return anything other than integer?</li>
<li style="text-align: justify;">What is the most correct way to cast the output to integer?</li>
</ol>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
At last I figured out the answer to both questions. Note that the rest of the post applies to the <a href="http://cplusplus.com/reference/clibrary/cmath/ceil/">ceil</a> function as well.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For the second point, I knew the common idiom was just type-casting the output:</div>
<pre style="color: green; font-size: 12px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: -webkit-auto;">double a;</pre>
<pre style="color: green; font-size: 12px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: -webkit-auto;">int intRes = int(floor(a)); // use static_cast if you don't like brevity</pre>
<div style="text-align: justify;">
If you’ve heard <a href="http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/">anything</a> about the peculiarities of floating-point arithmetic, you might start worrying that floor could return not the exact integer value but $\lfloor a \rfloor - \epsilon$, so that type-casting would be incorrect (assume $a$ is positive). However, this is not the case. Consider the following statement.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
If $a$ is not an integer and is representable as an <a href="http://en.wikipedia.org/wiki/IEEE_754-2008">IEEE-754</a> floating-point number, then both $\lfloor a \rfloor$ and $\lceil a \rceil$ are representable within that domain. This means that for any float/double value, floor can return a number that is an integer and fits in the float/double format.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: x-small;">The proof is easy but requires understanding of the <a href="http://en.wikipedia.org/wiki/Floating_point">representation of floating-point numbers</a>. Suppose $a = c \times b^q, c \in [1/b, 1)$ has $k$ significant digits, i.e. the $k$-th digit after the point in $c$ is not zero, but all the further ones are. Since $a$ is representable, $k$ is less than the maximum width of the significand. Since $a$ is not an integer, both $\lfloor a \rfloor$ and $\lceil a \rceil$ have significands with fewer than $k$ significant digits. Rounding, however, might increase the order of magnitude, but only by one. So the rounded number’s significand fits in $k$ digits, Q.E.D.</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
However, this does not mean one can always type-cast the output safely since, for example, not every integer stored in a double can be represented as a 4-byte int (and this is the answer to question #1). Double-precision numbers have 53-bit significands (52 explicitly stored bits plus an implicit leading one), so any integer up to $2^{53}$ can be stored exactly. For the int type, the limit is $2^{31}-1$. So if there is a possibility of overflow, check before the cast or use the int64 type.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
On the other hand, a lot of ints cannot be represented as floats (the same holds for int64 and double). Consider the following code:</div>
<pre style="color: green; font-size: 12px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: -webkit-auto;">std::cout << std::showbase << std::hex </pre>
<pre style="color: green; font-size: 12px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: -webkit-auto;"> << int(float(1 << 23)) << " " << int(float(1 << 24)) << std::endl</pre>
<pre style="color: green; font-size: 12px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: -webkit-auto;"> << int(float(1 << 23) + 1) << " " << int(float(1 << 24) + 1) << std::endl;
</pre>
The output is:<br />
<pre style="color: green; font-size: 12px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: -webkit-auto;">0x800000 0x1000000
0x800001 0x1000000
</pre>
<div style="text-align: justify;">
$2^{24}+1$ cannot be represented as a float exactly (it is represented as $2^{24}$), so one should not use the float version of floor for numbers greater than a few million.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
There are <a href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">classes of numbers</a> that are guaranteed to be represented exactly in floating point format. See also <a href="http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html">the comprehensive tutorial</a> on floating-point arithmetic by David Goldberg.</div>
</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com3tag:blogger.com,1999:blog-2293611840857369720.post-71588740292627214352011-11-25T20:54:00.001+04:002011-11-25T21:51:32.699+04:00Kinect Apps Challenge<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<a href="http://graphics.cs.msu.ru/en">Graphics & Media Lab</a> and <a href="http://research.microsoft.com/en-us/labs/cambridge/default.aspx">Microsoft Research Cambridge</a> have announced a contest for applications that use the Kinect sensor. The authors of the five brightest apps will be funded to attend the 2012 Microsoft Research PhD summer school. I blogged about last year's event in <a href="http://computerblindness.blogspot.com/2011/08/all-summer-schools.html">my previous post</a>. Details of the contest are <a href="http://summerschool2011.graphicon.ru/en/contest">here</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I guess there's no need to explain what <a href="http://en.wikipedia.org/wiki/Kinect">Kinect</a> is. It is an extremely successful combined colour and depth sensor by Microsoft. It is being distributed for a killer price (≈$150) as an add-on for the Xbox, although it is hard to even imagine the full range of its possible applications. Controlling a computer by gestures alone is considered a prime example of a <a href="http://en.wikipedia.org/wiki/Natural_user_interface">natural user interface</a>. To name some applications beyond gaming, this kind of NUI helps surgeons keep their hands clean:</div>
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/f5Ep3oqicVU?feature=player_embedded' frameborder='0'></iframe></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Kinect helps blind people to navigate through buildings:</div>
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/l6QY-eb6NoQ?feature=player_embedded' frameborder='0'></iframe></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For more ideas look at the winners of <a href="http://kinecthacks.net/winners-of-openni-2011-developer-challenge/">OpenNI challenge</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
OpenNI is an open-source alternative to the Kinect SDK. Its strong point is that it is integrated with <a href="http://pointclouds.org/">PCL</a>, but beware that you cannot use it for the contest. Only the <a href="http://kinectforwindows.org/">Kinect for Windows</a> SDK is allowed.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Kinect is probably the most successful <i>commercial</i> outcome from a Microsoft Research lab. MSR is unique because they do a lot of theoretical research there, and it is unclear whether it is profitable for Microsoft to fund it. But projects like Kinect dispel those doubts. I will write about how MSR is organized in comparison to other industrial labs in a later post.</div>
</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com2tag:blogger.com,1999:blog-2293611840857369720.post-36301968909012898432011-08-16T03:35:00.005+04:002011-08-16T03:40:04.876+04:00All the summer schools<div style="text-align: justify;">This summer I attended two conferences (<a href="http://www.3dimpvt.org/">3DIMPVT</a>, <a href="http://emmcvpr11.csd.uwo.ca/">EMMCVPR</a>) and two summer schools. I know my latency is somewhat annoying, but it's better to review them now than never. :) This post is about the summer schools, and the following one is going to be about the conferences.</div><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">
<br /></div><div style="text-align: justify;"><b><span class="Apple-style-span">PhD Summer School in Cambridge</span></b></div><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">Both schools were organized by <a href="http://research.microsoft.com/en-us/">Microsoft Research</a>. The first one, the <a href="http://research.microsoft.com/en-us/events/2011summerschool/">PhD Summer School</a>, was in Cambridge, UK. The lectures covered some general issues for computer science PhD students (like using cloud computing for research, and career perspectives) as well as some recent technical results by Microsoft Research. From the computer vision side, there were several talks:</div><div style="text-align: justify;"><ul><li><a href="http://research.microsoft.com/en-us/people/antcrim/">Antonio Criminisi</a> described their <a href="http://research.microsoft.com/en-us/projects/MedicalImageAnalysis/">InnerEye</a> system for retrieval of similar body-part scans, which is useful for diagnosis based on the medical history of similar cases. He also featured the basics of Random Forests as an advertisement for his <a href="http://research.microsoft.com/en-us/people/antcrim/decisionforeststutorial.aspx">ICCV 2011 tutorial</a>. The new thing was using peculiar weak classifiers (like 2nd-order separation surfaces). 
Antonio argued they perform much better than trees in some cases.</li><li><a href="http://research.microsoft.com/en-us/um/people/awf/">Andrew Fitzgibbon</a> gave a brilliant lecture about pose estimation for <a href="http://en.wikipedia.org/wiki/Kinect">Kinect</a> (MSR Cambridge is really proud of that algorithm [<a href="http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf">Shotton, 2011</a>], this is the topic for another post).</li><li><a href="http://graphics.cs.msu.ru/people/staff/obarinova">Olga Barinova</a> talked about the modern methods of image analysis and her work for the past 2 years (graphical models for <a href="http://graphics.cs.msu.ru/en/science/research/machinelearning/hough">non-maxima suppression for object detection</a> and <a href="http://graphics.cs.msu.ru/en/science/research/3dreconstruction/geometricparsing">urban scene parsing</a>). </li></ul>The other great talks were about <a href="http://www.netmf.com/gadgeteer/">.NET Gadgeteer</a>, a system for modelling and even deployment of electronic gadgets (yes, hardware!), and <a href="http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/">F#</a>, Microsoft's alternative to <a href="http://en.wikipedia.org/wiki/Scala_(programming_language)">Scala</a>, a language that combines the object-oriented paradigm with the functional one. <a href="http://en.wikipedia.org/wiki/Tony_Hoare">Sir Tony Hoare</a> also gave a lecture, so I had a chance to ask him how he ended up at Moscow State University in the 60s. It turns out he studied statistics, and <a href="http://en.wikipedia.org/wiki/Andrey_Nikolayevich_Kolmogorov">Andrey Kolmogorov</a> was one of the leaders of the field at that time, so the internship was a great opportunity for him. He said he had liked his time in Moscow. :) There were also magnificent lectures by <a href="http://research.microsoft.com/en-us/people/simonpj/">Simon Peyton-Jones</a> about giving talks and writing papers. 
That advice is <i>a must</i> for everyone who does research; you can find the slides <a href="http://research.microsoft.com/en-us/um/people/simonpj/papers/giving-a-talk/giving-a-talk.htm">here</a>. Slides for some of the lectures are available from <a href="http://research.microsoft.com/en-us/events/2011summerschool/#Abstracts">the school page</a>.</div><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">The school talks did not take all the time. Every night was occupied by some social event (go-karting, <a href="http://en.wikipedia.org/wiki/Punt_(boat)">punting</a> etc.) as well as unofficial after-parties in Cambridge pubs. It is definitely the most fun school/conference I've attended so far. The karting was especially great, with a quality track, pit stops, stats and prizes, so special thanks to Microsoft for including it in the program! </div><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">
<br /></div><div style="text-align: justify;"><b><span class="Apple-style-span">Microsoft Computer Vision School in Moscow</span></b></div><div style="text-align: justify;"><b><span class="Apple-style-span">
<br /></span></b></div><div style="text-align: justify;"><span class="Apple-style-span">This year, the <a href="http://summerschool2011.graphicon.ru/en">Microsoft Research summer school</a> in Russia was devoted to computer vision and organized in cooperation with our lab. The school started before its official opening with a <a href="http://courses.graphicon.ru/school/mscvs2011/assign"><b>homework assignment</b></a> we authored (I was one of four student volunteers). The task was to develop an image classification method capable of distinguishing two indoor and two outdoor classes. The results were rated according to the performance on a hidden test set. Artem Konev won the challenge with 95.5% accuracy and was awarded a prize consisting of an Xbox and a Kinect. Two years ago we used those data for the projects on the <i>Introduction to Computer Vision</i> course, where nobody reached even 90%. It reflects not just the skill of the participants, but also the progress of computer vision: all the top methods used PHOW descriptors and a linear SVM with an approximate decomposed <i>χ<sup>2</sup></i> kernel [<a href="http://www.robots.ox.ac.uk/~vgg/publications/papers/vedaldi10.pdf">Vedaldi and Zisserman, 2010</a>], which were unavailable at that time!</span></div><div style="text-align: justify;"><span class="Apple-style-span">
<br /></span></div><div style="text-align: justify;">In fact, <b><a href="http://www.robots.ox.ac.uk/~az/">Andrew Zisserman</a> </b>was one of the speakers. Andrew is the most cited computer vision researcher and the only person whose <a href="http://computerblindness.blogspot.com/2010/10/papers-citations-co-authorship-and.html">Zisserman number</a> is zero. :) His course was on <a href="http://summerschool2011.graphicon.ru/en/courses/az">Visual Search and Recognition</a>, including instance-level and category-level recognition. The ideas that were relatively new:</div><div style="text-align: justify;"><div><ul><li>when computing visual words, sometimes it is fruitful to use soft assignments to clusters, or more advanced methods like <i>Locality-constrained linear coding</i> [<a href="http://www.robots.ox.ac.uk/~vgg/rg/papers/wang_etal_CVPR10.pdf">Wang et al., 2010</a>];</li><li>for instance-level recognition it is possible to use <i>query expansion</i> to overcome occlusions [<a href="http://marcade.robots.ox.ac.uk:8080/~vgg/publications/2007/Chum07b/chum07b.pdf">Chum et al., 2007</a>]: the idea is to use the best matched images from the base as new queries;</li><li>object detection is traditionally done with sliding window, the problems here are: various aspect ratio, partial occlusions, multiple responses and background clutter for substantially non-convex objects;</li><li>for object detection use bootstrapped sequential classification: on the next stage take the false negative detections from the previous stage as negative examples and retrain the classifier;</li><li><i>multiple kernel learning</i> [<a href="http://www.vision.ee.ethz.ch/publications/papers/proceedings/eth_biwi_00649.pdf">Gehler and Nowozin, 2009</a>] is a hot tool that is used to find the ideal linear combination of SVM kernels: combining different features is fruitful, but learning the combination is not much better than just averaging (Lampert: “Never use MKL without comparison to simple 
baselines!”); </li><li>movies are common datasets, since there are a lot of repeated objects/people/environments, and the privacy issues are easy to overcome. Movies like <i><a href="http://www.imdb.com/title/tt0107048/">Groundhog Day</a></i> and <i><a href="http://www.imdb.com/title/tt0130827/">Run Lola Run</a></i> are especially good since they contain repeated episodes. You can try to find <a href="http://arthur.robots.ox.ac.uk:8081/cgi-bin/shot_viewer/vg_single_image.py?movie+name=lola&frame+nb=25901">the clocks on the Video Google Demo</a>.</li></ul><div>Zisserman talked about the <a href="http://pascallin.ecs.soton.ac.uk/challenges/VOC/">PASCAL challenge</a> a lot. During a break he mentioned that he annotated some images himself since “it is fun”. One problem with the challenge is that we don't know whether the progress over the years really reflects the increased quality of methods, or is just due to the growth of the training set (though, it is easy to check).</div></div><div>
<br /></div><div><b><a href="http://research.microsoft.com/en-us/um/people/awf/">Andrew Fitzgibbon</a></b> gave two more great lectures, one about Kinect (with slightly different motivation than in Cambridge) and another about continuous optimization. He talked a lot about reconciling theory and practice:</div><div><ul><li>the life-cycle of a research project is: 1) chase the high-hanging fruit (theoretically-sound model), 2) try to make stuff really work, 3) look for the things that confuse/annoy you and fix them;</li><li>for Kinect pose estimation, the <i>good</i> top-down method based on tracking did not work, so they ended up classifying body parts discriminatively, temporal smoothing is used on the late stage;</li><li>“don't be obsessed with theoretical guarantees: they are either weak or trivial”;</li><li>on the simplest optimization method: “How many people have invented [coordinate] alternation at some point of their life?”. Indeed, the method is guaranteed to converge, but the problems arise when the valleys are not axis-aligned;</li><li>gradient descent is not a panacea: in some cases it does small steps too, <span class="Apple-style-span" style="line-height: 19px; "><span class="Apple-style-span"><a href="http://en.wikipedia.org/wiki/Conjugate_gradient_method">conjugate gradient method</a> is better (it uses 1st order derivatives only);</span></span></li><li><span class="Apple-style-span" style="line-height: 19px; "><span class="Apple-style-span">when possible, use second derivatives to determine step size, but estimating them is hard in general;</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px;">one almost never needs to take the matrix inverse; in MATLAB, to solve the system <i>Hd = −g,</i> use backslash: <i>d = −H\g</i>;</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px; "><span class="Apple-style-span">the Friday evening method is to 
try MATLAB </span><span class="Apple-style-span"><a href="http://www.mathworks.com/help/techdoc/ref/fminsearch.html"><span class="Apple-style-span">fminsearch</span></a> </span><span class="Apple-style-span">(implementing the derivative-free </span><span class="Apple-style-span" style="line-height: normal; "><span class="Apple-style-span"><a href="http://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method">Nelder-Mead method</a></span></span><span class="Apple-style-span">).</span></span></span></li></ul><div><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px;">Dr. Fitzgibbon asked the audience what the first rule of machine learning is. I could hardly help replying “Never talk about machine learning”, but he expected a different answer: “Always try the nearest neighbour first!” </span></span></div></div>
<br /></span></span></div><div><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px;"><b><a href="http://pub.ist.ac.at/~chl/">Christoph Lampert</a></b> gave lectures on kernel methods, and structured learning, and kernel methods for structured learning. Some notes on the kernel methods talk:</span></span></div><div><ul><li><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px;">(obvious) don't rely on the error on a train set, and (less obvious) don't even report about it in your papers;</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px;">for SVM kernels, in order to be legitimate, a kernel should be an inner product; it is often hard to prove it directly, but there are workarounds: a kernel can be drawn from a conditionally positive-definite matrix; sum, product and exponent of a kernel(s) is a kernel too etc. (thus, important for multiple-kernel learning, linear combination of kernels is a kernel);</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px;">since training (and running) non-linear SVMs is computationally hard, explicit feature maps are popular now: try to decompose the kernel back to conventional dot product of modified features; typically the features should be transformed to infinite sums, so take first few terms <span class="Apple-style-span" style="font-family: Georgia, serif; font-size: 16px; line-height: normal; ">[<a href="http://www.robots.ox.ac.uk/~vgg/publications/papers/vedaldi10.pdf">Vedaldi and Zisserman, 2010</a>]</span>;</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px;">if the kernel can be expressed as a sum over vector components (e.g. 
<span class="Apple-style-span" style="font-family: Georgia, serif; font-size: 16px; line-height: normal; "><i>χ<sup>2</sup></i></span> kernel $\sum_d x_d x'_d / (x_d + x'_d)$), it is easy to decompose; the <i>radial basis function </i>(RBF) kernel ($\exp (-\|x-x'\|^2 / 2\sigma^2)$) is the exponential of a sum, so it is hardly decomposable (stricter conditions are given in the paper);</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="line-height: 19px;">when using the RBF kernel, you have another parameter <i>σ</i> to tune; the rule of thumb is to take <i>σ</i>² equal to the median distance between training vectors (thus, cross-validation becomes one-dimensional).</span></span></li></ul><div><span class="Apple-style-span" style="line-height: 19px;">Christoph also told a motivating story about why one should always use cross-validation (so just forget the previous point :). Sebastian Nowozin was working on his [<a href="http://www.nowozin.net/sebastian/papers/nowozin2007actionclassification.pdf">ICCV 2007</a>] paper on action classification. He used the method by <a href="http://zimmer.csufresno.edu/~yucao/csci226/Projects/Project_6/Behavior%20Recognition%20via%20Sparse%20Spatio-Temporal%20Features.pdf">Dollár et al. [2005]</a> as a baseline. The paper reported 80.6% accuracy on the KTH dataset. He outperformed the method by a couple of per cent and then decided to reproduce Dollár's results. Imagine his surprise when simple cross-validation (with the same features and kernels) yielded 85.2%! So, Sebastian had to improve his method to beat the baseline. </span></div><div><span class="Apple-style-span" style="line-height: 19px;">
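The σ rule of thumb can be sketched in a few lines of Python/NumPy (my own illustration of the heuristic exactly as stated above; note that some authors set σ² to the median <i>squared</i> distance instead):

```python
import numpy as np

def rbf_kernel_median(X):
    """Gaussian/RBF kernel matrix exp(-||x - x'||^2 / (2 sigma^2)),
    with sigma^2 set to the median pairwise distance between the
    training vectors, following the rule of thumb above."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # median over distinct pairs (upper triangle, diagonal excluded)
    sigma2 = np.median(np.sqrt(d2[np.triu_indices_from(d2, k=1)]))
    return np.exp(-d2 / (2.0 * sigma2))
```

With σ fixed this way, only the SVM regularization constant is left for cross-validation.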
<br /></span></div><div><span class="Apple-style-span" style="line-height: 19px;">I feel I should stop writing about the talks now since the post is growing enormously long. Another of Lampert's lectures and <a href="http://research.microsoft.com/en-us/um/people/carrot/">Carsten Rother</a>'s course on CRFs were close to my topic, so they deserve separate posts (I already reviewed the basics of <a href="http://computerblindness.blogspot.com/2011/04/on-structured-learning.html">structured learning</a> and <a href="http://computerblindness.blogspot.com/2010/05/max-product-optimization-methods-and.html">max-product optimization</a> in this blog). <a href="http://www.ais.uni-bonn.de/~amueller/">Andreas Müller</a> <a href="http://peekaboo-vision.blogspot.com/2011/08/cvml-ivan-laptev.html">blogged about</a> <a href="http://www.irisa.fr/vista/Equipe/People/Ivan.Laptev.html">Ivan Laptev</a>'s recent action recognition talk at CVML, which was pretty similar to ours. <a href="http://summerschool2011.graphicon.ru/en/schedule">The slides are available for all MSCVS talks</a>, and <b>videos will be shared in September</b>.</span></div></div><div><span class="Apple-style-span" style="line-height: 19px;">
<br /></span></div><div><span class="Apple-style-span" style="line-height: 19px;">There were also several <b>practical sessions</b>, but I personally consider them not that useful, because one hardly ever can <i>feel</i> the essence of a method in 1.5 hours changing the code according to some verbose instruction. It is more of an art to design such tutorials, and no one can really master it. :) Even if the task is well-designed, one may not succeed performing it due to technical reasons: during Carsten Rother's tutorial, <a href="https://plus.google.com/115888484810854823852/about?hl=en">Tanya</a> and me spent half an hour to spot the bug caused by confusing <span class="Apple-style-span" >input</span> and <span class="Apple-style-span">index</span> variable names (MATLAB is still dynamically typed). <a href="http://cmp.felk.cvut.cz/~chum/">Ondrej Chum</a> once mentioned how his tutorial was doomed since half of the students did not know how to work with sparse matrices.<span class="Apple-style-span"> So, practical sessions are hard.</span></span></div><div><span class="Apple-style-span" style="line-height: 19px;"><span class="Apple-style-span">
<br /></span></span></div><div><span class="Apple-style-span" style="line-height: 19px;"><span class="Apple-style-span">There was also a <b>poster session</b>, but I cannot recall many bright works, unfortunately. <a href="http://www.cvc.uab.es/personal2.asp?idioma=en&id=720&shapovalovanataliya">Nataliya Shapovalova</a>, who won the best poster award, presented quite an interesting work on action recognition, which I liked as well (and it is not just last-name bias! :) My congratulations to Natasha!</span></span></div><div><span class="Apple-style-span" style="line-height: 19px;"><span class="Apple-style-span">
<br /></span></span></div><div><span class="Apple-style-span" style="line-height: 19px;"><span class="Apple-style-span">The planned <b>social events</b> were not as extensive as in Cambridge, but self-organization worked out. The most prominent example was our overnight walk around Moscow, in which a substantial part of the school participants took part. It included catching the last subway train, drinking whiskey and gin, a game of <s>guessing</s> hallucinating each other's names, and moving a car off the tram rails to let a tram pass in the morning. :) I also met some of the <a href="http://opencv.willowgarage.com/">OpenCV</a> developers from Nizhny Novgorod there.</span></span></div><div>
<br /></div></div><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">MSCVS is a one-time event, unfortunately. There are at least three <b>annual computer vision summer schools</b> in Europe: <a href="http://www.dmi.unict.it/icvss/">ICVSS</a> (the most mature one, I attended it <a href="http://computerblindness.blogspot.com/search/label/ICVSS">last year</a>), <a href="http://www.di.ens.fr/willow/events/cvml2011/">CVML</a> (held in France by INRIA) and <a href="http://www.vision.ee.ethz.ch/summerschool2011/">VSSS</a> (includes sport sessions besides the lectures, held in Zürich). If you are a PhD student in vision (especially at the beginning of your program), it is worth attending one of them each year to keep up with current trends in the vision community, especially if you don't go to the major conferences. The sets of topics (and even speakers!) usually have a large intersection, so pick just one of them. ICVSS has arguably the most competitive participant selection, but the application deadline and acceptance notification are in March, so one can apply to the other schools if rejected.</div>
It found <a href="http://images.google.com/search?tbs=sbi:AMhZZisjgaicgygO1CzML9zQMv0h6l5RMPu8mv-Ko-pBo67FdV1HyUQGg-Csuw_1oVKZrjvHMsVU3ZEncoye9kXrjIHC9-SQGmeBnrREih40b4MTShpwIjWFz6mduatSA2MFBnBdP7j8gkfC8foLcHl8v41s3STsCywI3rQJ9ZXlV_1uxr6H1j6Cnf3N-jHsIuvb6EyZgCAZpTAGvuRAYdDwe9noW4OLZpiTSaXoOeY-ikVaTVY9XV2LBYdn8karWEm2zhNcY5axPvdBmYoujTf6o0VugqkC6A3kG-BxJ3MEuVhL_1-nIzc66_1u3RLQlIYMLWeE06jtY8mlO4g03FXYhqQAFbvP0jv7BhGfsY-nSqNsFP9kYlC4Ea0iEgGOeOID5ofUtCy8TTu7SRkxOdY8KIjV4ZQuCSb799kNrwIiuhatpdT0j11FlvYZd0dS076OoMHvd3w2fAguLk3JvTCYfJvKOulwPso2t52mrqNepTOSJdVKF2jlpkevwPq0RSK4g29BOxfJSbvB5RFmWNVBgRX1ulRiuqepseIU-i6EXCbF3E4H-zdRSh8bGnjiaIy2KQQYuoNofz16L4sFqAlRjCaaf17lSMBngxR-nuZY4Bp9NKcS5jL3ORO00Y9YG8gbxfa9QKtRcTUOJY4IDuAuJ7qOZu8VpDOkLw7Bd2te1DNMpO_1HKJ3gkbmS1oA0AugmgRYOOAOBJ7jTrC1GEiOu0D7WpoX4KcpVAeRF-XLDklY0RTPRJoI-Zi6Hzp_1zvq_168ZLo_1wWBQ1otN3mdsxLOam8N5_1rZurENOia6ckito5rETreWGrpjjdPG-Z3orRbaYIUCYY5TO7BeJpoQ1PtCnY698uqrMkWrQfCCB2QCQB4u8-5IGn9QOpHjJ_1-Vn88ER1Mzt7Sm_1pM0kvsquCrSJPvY3gsGZSyz5dxdhvUG1XYiV3Tgz0AwcJ3MNs71DC0IVB4_1H0UFhK4KDqT77LONhL_1E0ItjZP13a7DsnSbDLpyD4wHC1NTi8Lg7aza728LpKxtB_1IKGcV3deKKWot1PgJvCjbGJ6JK0x8Oc6A11XSnB_1znUnYZtGKzH2YsdGLx9wINFVcbxrx7JMO3MkYTsd6gYxs0Yh78YSgOpiQ4fwo-ENjwkHzK4yw7X6TyddBenX-W5hNpokhX1BItO7xQ0FcJDjpZwDo1uWN8Gd4MEWk9FoDGTyZZ13oDiYGrImkhIra6hgvtVpQFyQQ3JaU8syEhKECkhaIIzmT97nHlHMVYtwn0FAUoGXu-gh6BeW6yPgH3Z9wUCsHu_1CcQa7rIVtqQG47-cpJfAjMZ6eE-UHx8Fn3slx0X45AY&hl=en&biw=1166&bih=878&gbv=2">few instances of Marion Lenbach </a>(one of them from this blog, which means the coverage is large!), and I finally remembered <a 
href="http://images.google.com/search?tbs=sbi:AMhZZitnFJPZ0Ir9xs08V-1aMnmiMLJFDC9WK1sZ2cI2OSlipQ1dhdWY1L4QfV9_1k3J4wvpLzDDJfDx5INwhM_1C34yw16744c9CRaWwsy4tUTjPCSKdGDK9Xb_1Q7zXcvSOG14Ljvc8KbPoNW8IumbhTDCrBOaaRi0tC5a-7OkgHuEXpijtTCgGGk2XJu_1e4QwLvxgGw31uQIjuDRc0btU_1zZDJPRDfNxYh--MstB4xcAm1NVY6H6MH8r-9n_1qpI3O3oOFt-e6ePDoFz3bvhUZz-1fQxueAj_1MlxA0EuWHUEPcIo4AWUW8XD_12bZZP1i-PgvJ3AwCQYqT6qqgvpDAMFUHsQ3qSqAvgQba67zl7Dh1fuL_1d1BOplyvHtvtIuHb_1ZKeXmKt5KBGnlcRYWLUdYbGGSkQiG9j07x2NFiJyt92dKKT8uzydiQh4L84LW9j0Cto6EvH5lfVcldTn4ZpYNYVHcSqeRY_1VpGY9qkSBRWp3FA7qxj67XyHL6kBY7Zd3V0875BbfWUFMjBo3kQyuZ3goZbW7NYdjk_1miVzGE7qDOqHHVBQN3Pxzso_1gkfdQDzfLtewh-KtKpS7AhQJplUnzEch_19Wb28yhkd68nC9YdySr-8irvsfnhbmK96vdmZ4vwdC9Hn3bW4JnV5Z9GNWLdnCfYfPB7GwdC7pl-MESjA_1e1C_1aJ2lYe3-HSg_1PKbe1bxEmvIoc54YtWjveNcuvYbzHujTndSvjEvrLpm5RPcIigvdbK7Ra7VuSx1b_14sZ8-5KsKxLISyc-GxWXm5DWKV1wDnCPhamAuUhk2c4x6_1Imcrk-s4nNejNCW8cF9D3-yJP2XFWV4d5ybNDPH4pJDQCBAKKZPAhgnpL2dLF2g3FT3YGNUgaAz8rE_15QG06wLY2mxkBZj51ujh7EM-ITR4au_13GcgjRq4D6LEO6KV8dSFvF4MXrlPUbDPHCdIHtGDRkX1BCTuzDq4VxBhZdMhubNet5J6c6JhtL56zNAFdL1qzo-stTqxBgGkgTU8mikeoEzky_19N-Y6bWH_1eb00-p3M5LXN7eJxBz4exLQ53B9RlzqnYcM3vXDnLvoxOR_1vQkBBZTuR2fZ1oS3Xx_1n-DfxO2EuMPGbuzNwPjq5H2u6_1w0SKVHjvdifvQjaFq2IaG3HYM7wYYbD8Iim9KzlBQMOmhpuSeYl9jr03fnD44OpHurTsN5zg_1BVppd4gEdvs2G8hYUJEWkpyeiMA6Jjyez_16_1WEbkWw9Dr_10s3xxAOQTds_1rn30tlH4iYNhv8w771sv6J_1qAOuxRrCfMGgIU9C9Y4ao0_1LZ7DzpLWBaKyHkyT4mfK9lu0&hl=en&biw=1166&bih=878&gbv=2">the movie</a> from the HOG paper: <a href="http://www.imdb.com/title/tt0134119/">The Talented Mr. Ripley</a>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">So, at a quick glance, Google has finally accomplished what the others could not do for ages. Why did they succeed? There are two possible reasons: large facilities that allow building and storing a large index efficiently, and a unique technology. The former is surely the case: it seems the engine indexed a large portion of the photos on the web. 
I cannot say anything about the technology: there is nothing about it among <a href="http://googleresearch.blogspot.com/2011/06/google-at-cvpr-2011.html">Google's CVPR papers</a>, so one needs to do black-box testing to see which transformations and modifications are allowed.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Google seems to be expanding into the areas of the multimedia web, even where the niche is already occupied. Recently <a href="http://googleresearch.blogspot.com/2011/06/instant-mix-for-music-beta-by-google.html">they announced</a> their alternative to <a href="http://grooveshark.com/">grooveshark</a>, the recommendation system for <a href="http://music.google.com">Music beta</a> (the service is unfortunately available on an invitation basis in the US only). The system is not based on <a href="http://en.wikipedia.org/wiki/Collaborative_filtering">collaborative filtering</a> (only): they (also) analyse the content. I planned to investigate this area too, but it is becoming mature and thus not so alluring. I am eager to see if the service will succeed. After all, <a href="http://www.google.com/buzz">Buzz</a> did not replace <a href="http://twitter.com/">Twitter</a>, and <a href="http://www.orkut.com/">Orkut</a> did not replace <a href="http://www.facebook.com/">Facebook</a>.</div>
I am going to give an introductory talk on structured learning for Dmitry Vetrov's <a href="http://machinelearning.ru/wiki/index.php?title=Smais">graphical models class</a> this Friday at 4:20 pm, so please feel free to drop by if you are in Moscow (the talk is to be in Russian).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Structured learning is basically a very general <a href="http://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a> setting. Consider the <a href="http://en.wikipedia.org/wiki/Statistical_classification">classification</a> setting as a basic problem. One needs to find the parameters of a function that maps a feature vector to one of the pre-defined class labels $\mathbb{R}^m \to \{c_1, c_2, \dots, c_K\}$. The fundamental property is that the classes are unordered and <i>orthogonal</i>. The latter means that the probabilities of objects classified with different labels are uncorrelated (some negative correlation is normal, since confidently classifying an object as $c_i$ decreases the probability of the rest of the labels). </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Now consider <a href="http://en.wikipedia.org/wiki/Regression_analysis">regression</a>, which is another supervised learning problem where feature vectors map to real values: $\mathbb{R}^m \to \mathbb{R}$. It might be considered a classification problem with an infinite number of classes. There are two obvious flaws in this reduction. First, a training set is unlikely to have at least one example for each class. To overcome this, one can quantize the codomain and train a classifier over a finite set of class labels. However, it leaves us with the second flaw: the method does not take into account the correlation between possible outcomes. The bins are ordered: the training features that correspond to neighbouring bins should be handled differently from those that correspond to distant bins. 
That's why some global methods (like <a href="http://en.wikipedia.org/wiki/Linear_regression">linear regression</a>) are used.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">A similar situation arises in the case of a <i>structured </i>outcome. In this case, one usually has plenty of possible outcomes, but they are not independent. The techniques of structured learning are applicable when the outcomes have some underlying structure, and the elements of the structure have similar sets of features. Also, it should be possible to estimate how bad a prediction is (often in terms of incorrectly predicted elements); this measure is called the <i>structured loss</i>. The methods allow small deviations from the ideal training outcome, which are possible e.g. because of noise, but penalize the substantial ones (just like regression!). The prediction function (whose parameters are tuned) can thus be formalized as a mapping $\mathbb{R}^{m \times l} \to \{c_1, c_2, \dots, c_K\}^l$.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">One possible example is <a href="http://en.wikipedia.org/wiki/Hidden_Markov_model">hidden Markov model</a> learning, where the elements are emission potentials and transition potentials, and the loss can be defined as the number of wrong HMM outputs. In the formalization above, there are $l$ elements, each represented by $m$ features (in practice, $m$ can vary for different types of elements). Since the labels of transition elements are strictly defined by emission outputs, not every outcome is possible.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Another example of a structured prediction problem is <a href="http://en.wikipedia.org/wiki/Statistical_parsing">natural language parsing</a>. Given a sentence of a natural language, the corresponding parsing tree is to be constructed. 
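To make the chain-structured case concrete, here is a toy Python/NumPy sketch (my own illustration, not from any paper): an element-wise Hamming structured loss, and exact decoding of the highest-scoring label sequence under a scoring function with emission and transition terms, as in the HMM example above.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Structured loss: fraction of incorrectly predicted elements."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def viterbi_decode(emission, transition):
    """Argmax over label sequences of the sum of emission and
    transition scores (Viterbi dynamic programming).
    emission: (l, K) per-position label scores; transition: (K, K)."""
    l, K = emission.shape
    score = emission[0].copy()          # best score ending in each label
    back = np.zeros((l, K), dtype=int)  # backpointers
    for t in range(1, l):
        total = score[:, None] + transition + emission[t][None, :]
        back[t] = np.argmax(total, axis=0)
        score = np.max(total, axis=0)
    path = [int(np.argmax(score))]
    for t in range(l - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Structured learning then amounts to tuning the emission/transition parameters so that the decoded sequence has a low structured loss on the training set.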
Surprisingly, parsing trees can also be represented as high-dimensional vectors, with constraints applied. To summarize, structured prediction has an outcome that is <b>multivariate</b>, <b>correlated </b>and <b>constrained</b>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Ideally, the parameters of structured prediction should be tuned via likelihood maximization, but this turns out to be intractable due to the need to compute the partition function at each gradient optimization step. That's why the L<sup>2</sup>-regularized hinge loss is usually minimized. The prediction algorithm is represented as a scoring function over possible outcomes. The task of the learning algorithm is to find the parameters of the scoring function that make it return maximum values for the true outcomes on the training set, and low ones for instances that are far from optimal. In other words, the margin between good and bad possible outcomes should be maximized (in terms of the scoring function value).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">You can learn more about structured learning from <a href="http://www.seas.upenn.edu/~taskar/">Ben Taskar</a>'s <a href="http://nips.cc/Conferences/2007/Program/event.php?ID=573">NIPS 2007 tutorial</a>. See also our recent <a href="http://shapovalov.ro/papers/Shapovalov-Velizhev-3dimpvt2011.pdf">3DIMPVT paper</a> for an up-to-date survey of MRF/CRF training methods and their applications.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">PS. I've installed the <a href="http://www.mathjax.org/">MathJax </a>equation engine into the blog. 
Seems to work, huh?</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com2tag:blogger.com,1999:blog-2293611840857369720.post-87805120944439280192011-03-31T00:36:00.005+04:002011-03-31T01:27:18.539+04:00[CFP] GraphiCon-2011 and MSCVSS<div style="text-align: justify;"><a href="http://graphics.cs.msu.ru/en">Our lab</a> traditionally organizes <a href="http://gc2011.graphicon.ru/en">GraphiCon</a>, the major (and maybe the only) conference in Russia specializing in computer graphics and computer vision. The conference is not very selective, but it has a decent community of professionals behind it. So, if you want to get quality feedback on your preliminary work, or always wanted to visit Russia, consider submitting a paper by May 16.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><b>Special offer</b> for young readers of my blog: If this is the place where you learned about the conference and decided to submit a paper, I'll buy you a beer during the conference. :) Consider it another social micro-event.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Another piece of news is primarily for undergrads and PhD students from Russia. Our lab and Microsoft Research are organizing another event this summer: <a href="http://summerschool2011.graphicon.ru/en">Computer vision summer school</a>. There are such lecturers as <a href="http://pub.ist.ac.at/~chl/">Christoph Lampert</a> and <a href="http://www.robots.ox.ac.uk/~az/">Andrew Zisserman</a>, among others. Participation is free of charge, and accommodation is provided. The deadline is April 30. 
You should not miss it!</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com2tag:blogger.com,1999:blog-2293611840857369720.post-67621180732568306262011-03-07T23:01:00.007+03:002011-04-25T23:26:48.384+04:00IEEE goes evil?<div style="text-align: justify;">In November, the IEEE released <a href="http://www.ieee.org/publications_standards/publications/rights/rights_policies.html">a new copyright agreement</a> that is to be signed by the authors of all papers published by the organization since January 2011. The main novelty is that the authors are not allowed to publish final versions of their papers on-line. Fortunately, we may still post the versions accepted for print (<i>with </i>post-review corrections), which is still great, indeed. You can find the FAQ concerning the new policy <a href="http://www.ieee.org/documents/authorversionfaq.pdf">here</a>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">It seems the IEEE realizes that nowadays there is almost no need for their printed materials and digital library, as search engines manage to find papers posted by their authors. No doubt they do <i>not </i>want to restrict dissemination of scientific results, but this is a question of their survival as a major publishing organization. The worst part is that they use the drug-dealer strategy: first they organize great conferences and journals (the majority of top venues in my field are run by IEEE), and allow authors to re-publish anything on-line ("the first hit for free"), then cut it off. Yes, one may still post an accepted version, but who knows what their next step is?</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The IEEE motivates the decision by preserving the value of their database: "<i>the IEEE is better able to track usage of articles for the benefit of authors and journals</i>". 
One can object that there are big and growing open-access databases like <a href="http://mendeley.com/">Mendeley</a>, where all the statistics are open and more refined (the personal info of users is known). Also, not all institutions have access to the library. For example, my university (the largest one in Russia) does not. And I cannot afford to pay $25 per paper. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">What can we do about that? Sure, it is impossible to stop publishing in IEEE venues, although the community can find a way around, as was demonstrated by the founders of <a href="http://jmlr.csail.mit.edu/">JMLR</a>, now the major journal in machine learning, which was <a href="http://yaroslavvb.blogspot.com/2010/11/springer-temporarily-opens-machine.html">created to cancel the overhead of the commercial publishing model</a>. Matt Blaze encourages researchers <a href="http://www.crypto.com/blog/copywrongs/">not to serve as reviewers for IEEE</a>. As for me, I'm too young (as a researcher) to be a reviewer. <s>But it also has an effect on me: I am now preparing a final version for proceedings published by the IEEE, and I doubt there is any use in improving my paper (even given a quality review), assuming that the majority of researchers will use the accepted version and will never see the revised one.</s></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">UPD (March 14, 2011). I must have misunderstood the new policy at first: Blaze's post and the IEEE terminology somehow misled me. An <i>accepted</i> paper means one <b>accepted for print</b> by the IEEE, not the version submitted for review. So, you still <b>can use review results</b> to correct the paper and post it on-line, then submit it for publishing, where the formatting will probably be adjusted. So, <b>any meaningful part can be reflected in the on-line version</b>. 
I have corrected the post now, so it may seem weird.</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com3tag:blogger.com,1999:blog-2293611840857369720.post-16017827064243806882011-02-28T23:02:00.000+03:002011-03-01T02:46:59.878+03:00The one about CERN<div style="text-align: justify;">About a month ago I visited <a href="http://en.wikipedia.org/wiki/CERN">CERN</a>, arguably the most famous research organization in Europe. It is the place where <a href="http://en.wikipedia.org/wiki/World_Wide_Web">the World Wide Web</a> was invented, and the home of <a href="http://en.wikipedia.org/wiki/Large_Hadron_Collider">the Large Hadron Collider</a>. I was very excited about the trip, <a href="http://www.youtube.com/watch?v=510oEykrQiE">much like Sheldon Cooper</a>. Here are some facts which were new to me:</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">... about CERN:</div><div><ul><li style="text-align: justify;">CERN was founded after WWII by a bunch of European countries. They were exhausted by the war, and the only way to catch up with the USA and USSR in fundamental science was to join forces.</li><li style="text-align: justify;">The original name was <i>Conseil Européen pour la Recherche Nucléaire</i> (European Council for Nuclear Research), which was abbreviated to CERN. Later the name was officially changed to <i>European laboratory for particle physics</i>, which is both more relevant and less frightening for the locals; however, the brand CERN is now used even in official documents.</li><li style="text-align: justify;">There are almost 3,000 full-time employees, but most of them are engineers and not scientists. 
There are a lot of visiting researchers though.</li><li style="text-align: justify;">There are 20 member states now (primarily EU states), and 6 observer states (such as Russia and the USA).</li><li style="text-align: justify;">CERN's annual budget is about € 1 billion; it is funded by the member states in proportion to their economic power, e.g. Germany gives 20% of the money.</li><li style="text-align: justify;">The budget money is spent on infrastructure and support; all the individual experiments are funded by research groups and their universities. </li><li style="text-align: justify;">Although the USA is not a member state, it leads in the number of researchers who work on CERN projects (more than a thousand); second is Germany, third is Russia (yes, we are still in good shape in particle physics). It turned out that everybody at CERN spoke Russian, even the janitor. :)</li></ul></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">... about the LHC:</div><div style="text-align: justify;"><ul><li>It is in fact a circular tunnel of 27 km in circumference lying 175 m beneath the ground.</li><li>The tunnel was used before the LHC: it was built in 1983 for <a href="http://en.wikipedia.org/wiki/Large_Electron%E2%80%93Positron_Collider">the Large Electron-Positron Collider</a>. In 2008 it was upgraded to be able to accelerate heavy particles like protons and so became the Large <i>Hadron </i>Collider (recall that protons and neutrons are almost two thousand times heavier than electrons).</li><li>The tunnel is about 4 meters in diameter; one can walk there or ride a bike. </li><li>The tunnel encapsulates two small pipes for the particles that intersect at four points (to make the tracks' lengths equal, like in speed skating arenas). There are more than a thousand dipole magnets along the pipe. 
They are not as big as I had imagined.</li><li>It is nearly a vacuum, at close to zero temperature, inside the tubes.</li><li>Proton beams are not generated inside the LHC. First, they are accelerated in a linear accelerator and almost reach the speed of light <i>c</i>. As the speed rises, it becomes harder and harder to increase, since it cannot exceed the speed of light. Then they are accelerated in a small circular accelerator, and only after that are they injected into the LHC, where over 40 minutes the beams are accelerated to a speed as close to <i>c</i> as possible.</li><li>Once accelerated, the beams are observed for about 10 hours. They undergo about 10,000 collisions per second, with about 20 pairs of protons colliding each time. Since the speed is large, the energy of these collisions is enormous.</li><li>The collisions take place within special locations called <i>detectors</i>. We visited the control centre of one of them, <a href="http://en.wikipedia.org/wiki/ATLAS_experiment">ATLAS</a>. A detector has multiple layers, each able to register a certain kind of particle, like photons.</li><li>Ten thousand collisions per second would yield a really big amount of data, so only a few of them are selected to be logged. I don't know how they select those collisions; machine learning might be used. :)</li><li>All the collected data are spread across servers all over the world. An authorized researcher may log in to the grid network and execute her script to analyse the data.</li></ul></div><div style="text-align: justify;">... about the Higgs boson:</div><div style="text-align: justify;"><ul><li>The <a href="http://en.wikipedia.org/wiki/Higgs_boson">Higgs boson</a> is a hypothetical particle, the existence of which would prove the standard model of particle physics. </li><li>The Higgs boson appears as a result of a collision of two protons with a large amount of energy. The protons in the LHC are accelerated enough to produce theoretically sufficient energy. 
</li><li>The Higgs boson is very heavy and thus unstable. In theory, it decays into either four muons or two photons. So, if two counter-directed photon beams are registered by the detector, this will be evidence of the boson. See the picture below for an example of likely detector output in case the boson shows up.</li><li>Scientists say that if the Higgs boson is not detected, all modern knowledge of particle physics will collapse. They will be obliged to develop a new theory from scratch.</li><li>There are no published results that report on Higgs boson detection so far...</li></ul></div><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://upload.wikimedia.org/wikipedia/commons/thumb/1/1c/CMS_Higgs-event.jpg/650px-CMS_Higgs-event.jpg"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 450px; height: 417px;" src="http://upload.wikimedia.org/wikipedia/commons/thumb/1/1c/CMS_Higgs-event.jpg/650px-CMS_Higgs-event.jpg" border="0" alt="" /></a><br /><div style="text-align: justify;">Don't you want to be a theoretical physicist now? =)</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com3tag:blogger.com,1999:blog-2293611840857369720.post-91686049943538590292011-01-25T09:04:00.000+03:002011-01-25T09:04:28.728+03:00Art + Multimedia<div style="text-align: justify;">In December we visited a lecture about the interaction between arts and sciences, namely between (primarily) visual arts and (again, primarily) multimedia studies (<a href="http://theoryandpractice.ru/seminars/11878-multimedia-mezhdistsiplinarnoe-vzaimodeystvie-sovremennogo-iskusstva-i-nauki-19-12">Russian page</a>). The lecture was given by <a href="http://www.ablogina.com/scroll.php?lg=en">Asya Ablogina</a>, who turned out to be a nice girl. Although she lacked any technical background, she was doing well. 
Surprisingly, there is a lot of artwork that exploits technology in a witty manner, which I was unaware of, so the lousy stuff exhibited in <a href="http://www.mmoma.ru/en/">the Moscow Museum of Modern Art</a> is not everything one can do in this field!</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The most interesting part was Asya's exposition of some masterpieces. Most of them were presented during the 2009<span class="Apple-style-span"> </span><span class="Apple-style-span" style="line-height: 18px; "><span class="Apple-style-span" style="font-size: medium; "><span class="Apple-style-span"><i>Science as Suspense</i> exhibit</span></span></span> in Moscow (<a href="http://www.winzavod.ru/scienceartfest/index.html">Russian</a>). <a href="http://www.fondation-langlois.org/html/e/page.php?NumPage=101">Nicolas Reeves</a> is a famous Canadian architect, also known for his work on modelling biological systems. He used the output of biologically-inspired computer algorithms (such as <a href="http://en.wikipedia.org/wiki/Genetic_algorithms">genetic algorithms</a>) to draw pictures. In Moscow he presented a project called <strike><i>Marching</i></strike><i> Floating Cubes</i>: massive but light cubes float in the air. Their movements are controlled by tiny fans, although any little gust of wind can affect them. Each cube is equipped with an on-board computer, which helps to avoid collisions. The implemented algorithms are simple but stochastic, hence the behaviour is unpredictable. They are said to move like living creatures. 
Here is the video:</div><div style="text-align: justify;"><br /><iframe title="YouTube video player" class="youtube-player" type="text/html" width="480" height="390" src="http://www.youtube.com/embed/qR7dPW73nCk" frameborder="0" allowfullscreen=""></iframe><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">A similar project was developed by <a href="http://www.zprod.org/PG/home.htm">Paul Granjon</a>. He also tries to endow robots with animal behaviour. In the video below, the robots are sexed, i.e. they are able to locate robots of the opposite sex and eventually end up in coitus. Another of Paul's robots (the <i>Smartbot</i>, <a href="http://www.winzavod.ru/lj/sienceart/DSCF9887.JPG">photo</a>) creeps around a confined space, grumbling all the while (just like <a href="http://en.wikipedia.org/wiki/Marvin_the_Paranoid_Android">Marvin</a>!). I also enjoy the way Paul speaks:</div><br /><iframe title="YouTube video player" class="youtube-player" type="text/html" width="480" height="390" src="http://www.youtube.com/embed/a-DTcw8EQBU" frameborder="0" allowfullscreen=""></iframe><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The real idol of the sci-art community is <a href="http://en.wikipedia.org/wiki/Stelarc">Stelarc</a> (see also <a href="http://web.stelarc.org/">his homepage</a>, although it takes balls to get through the welcome page :). His talent is recognized by both scientists and artists (suffice it to mention that he is an Honorary Professor of Arts and Robotics at <a href="http://cmu.edu/">Carnegie Mellon</a>). He is best known for his experiments on his own body. For example, during one performance he allowed his body to be controlled remotely over the Internet via muscle stimulation. Probably his favourite project is <a href="http://web.stelarc.org/projects/prosthetichead/index.html"><i>Prosthetic Head</i></a>, which is in fact a 3D model of his own head. 
The head is trained on Stelarc's behaviour and speech, so it is able to communicate with people using colloquial language as well as non-verbal cues. It would be interesting to try it in practice, but here is a non-interactive demo video:</div><div style="text-align: justify;"><br /><iframe title="YouTube video player" class="youtube-player" type="text/html" width="480" height="390" src="http://www.youtube.com/embed/Nym8hfNI9Gg" frameborder="0" allowfullscreen=""></iframe><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Stelarc is probably the only person on Earth who has three ears. In 2007 surgeons <a href="http://web.stelarc.org/projects/earonarm/index.html">implanted an ear into his arm</a>! It is not functioning now (just a piece of skin), but he plans to install an audio receiver into the ear and broadcast everything he hears over the Internet. I definitely recommend looking through <a href="http://web.stelarc.org/projects.html">his projects</a>; there are some decent ones.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Asya also presented <a href="http://www.ablogina.com/scroll.php?lg=en">her own works</a>. She focuses on various photographic techniques such as overexposure. Here are photos from her project <span style="font-style:italic;">Canvas of the Road</span>, compiled into a video. There is a play on words in Russian: a single word means both canvas and road surface. In these photos, tyre tracks look like painter's brushstrokes:</div><br /><iframe title="YouTube video player" class="youtube-player" type="text/html" width="480" height="390" src="http://www.youtube.com/embed/-eTVsP_R-w4" frameborder="0" allowfullscreen=""></iframe><div><br /></div><div style="text-align: justify;">Another of Asya's projects is a short film series called <i>Habitat</i>. She showed only one episode, titled <i>Habitat: MSU Main Building</i>. 
The film was about a girl, a PhD student at <a href="http://msu.ru/">Moscow State University</a>, and described her place: a standard 8 sq.m. dorm room. The narrative was full of expressive epithets and metaphors about that awful room: the girl felt like she was in a cage, she also disliked the shared bathroom, etc. It's funny, because I live in a similar single PhD-student room, and I'm quite happy with it, since I had lived in a shared room for 5 years before. :)</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The second lecturer, Vladimir Vishnyakov, presented his project called <i>The Museum of Revived Photography</i>. He used 19th-century photographs to create image-based animation. The idea is technically simple: first segment the people from the background, then use <a href="http://en.wikipedia.org/wiki/Inpainting">inpainting</a> techniques to restore the background behind them, and finally animate the movements; still, Vladimir referred to the programmer he works with as a real genius. In fact, everything I post here ranges from easy to moderately complex (except for Stelarc's projects), so you could do something similar too. The main problem is coming up with the concept. Come on then! ;)</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com1tag:blogger.com,1999:blog-2293611840857369720.post-26362480242752556032010-12-22T22:21:00.003+03:002010-12-22T23:00:29.049+03:00Comic strip: Nerd's confessionI continue sharing my <a href="http://computerblindness.blogspot.com/2010/08/comic-strip-10.html">Prague comic strips</a>, since I have not been posting for a while. In fact, I was busy writing a paper, so this one may seem somewhat related. <b>Disclaimer:</b> There's nothing personal in the strip. Kids under 18 are not allowed. 
:)<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://shapovalov.ro/images/comics/LaTeX.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 500px; " src="http://shapovalov.ro/images/comics/LaTeX_preview.png" border="0" title="In comparison to condoms, LaTeX is free, reusable, more than 98% effective, and not condemned by the church. Are you still in doubt?" /></a><div>The next post is going to be about multimedia technologies for art. Stay tuned!</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com10tag:blogger.com,1999:blog-2293611840857369720.post-60748386247296243522010-11-26T10:30:00.004+03:002010-11-26T10:53:38.712+03:00Happy Thanksgiving!<p><a href="http://chaospet.com/2008/11/27/115-russells-turkey/">Here</a> is a comic strip by Ryan Lake that illustrates the concept of Russell's turkey. The idea is simply that induction does not always work the way you expect: there can be factors that had no influence on the training data yet change the learned concept dramatically. 
We should thus keep in mind that machine learning (which is based on induction) is not a panacea.</p><p><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiupnmK2Rvmw5nhi31STdYTeicOqv8ItUObspj0ixnTqER3y_SPdOngEaVkeY5ElSi62XJ9-IhpBINLR2cqabpwuwQr2DALp6E4MXA36xqMpP-oh7uzU-jJquGfSQoayOG9PXtTuW91Wcyg/s1600/russels-turkey.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 343px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiupnmK2Rvmw5nhi31STdYTeicOqv8ItUObspj0ixnTqER3y_SPdOngEaVkeY5ElSi62XJ9-IhpBINLR2cqabpwuwQr2DALp6E4MXA36xqMpP-oh7uzU-jJquGfSQoayOG9PXtTuW91Wcyg/s400/russels-turkey.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5543759780032129906" /></a><br /></p><p><a href="http://en.wikipedia.org/wiki/Bertrand_Russell">Bertrand Russell</a> is well known for his metaphors. Probably the most famous one is <a href="http://en.wikipedia.org/wiki/Russell's_teapot">Russell's teapot</a>, which aims to show how pointless it is to believe in god.</p><p>Happy Thanksgiving!</p><p>PS. 
I am reading <a href="http://en.wikipedia.org/wiki/Animal_Farm"><em>Animal Farm</em></a> by George Orwell now; it is another great animal metaphor.</p>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com2tag:blogger.com,1999:blog-2293611840857369720.post-17671383124648657212010-11-10T21:13:00.002+03:002010-11-10T21:31:41.876+03:00Nice paper acknowledgement<p>It turns out that Russia is not the only place where PhD students can be so underpaid that they cannot afford food:</p><p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVmyZ8nMWI1UXA4dG4MmM3D5b0_WTBleIyFMT5zjZ02BZM3Oqm7U3Cf_cYbqpbbCKJ-irTX4UdORwr09_PExnAjDBgoA6bcXeAm1trD0jwr0aaCX0nX3E_pwE4ShmT-cKqtL2GDzMZTQcZ/s1600/delong.PNG"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 213px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVmyZ8nMWI1UXA4dG4MmM3D5b0_WTBleIyFMT5zjZ02BZM3Oqm7U3Cf_cYbqpbbCKJ-irTX4UdORwr09_PExnAjDBgoA6bcXeAm1trD0jwr0aaCX0nX3E_pwE4ShmT-cKqtL2GDzMZTQcZ/s400/delong.PNG" border="0" alt="" id="BLOGGER_PHOTO_ID_5537986287777967922" /></a></p><p>This is from <a href="http://www.csd.uwo.ca/~olga/">Olga Veksler</a>'s paper "<a href="http://www.csd.uwo.ca/faculty/olga/Papers/cvprFinal07.pdf">Graph cut based optimization for MRFs with truncated convex priors</a>" from CVPR 2007. </p><p><a href="http://www.csd.uwo.ca/~adelong3/">Andrew</a> seems to be a nice person. I should definitely ask Anton Osokin to introduce us on occasion. 
=)</p>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com3tag:blogger.com,1999:blog-2293611840857369720.post-32167762244352669942010-10-10T10:10:00.009+04:002011-03-14T01:01:35.327+03:00Papers, citations, co-authorship, and genealogy<div style="text-align: justify;">This summer I accidentally found out that there are two papers citing <a href="http://graphics.cs.msu.ru/sites/default/files/download/CMRT09_Barinova_et_al.pdf">our CMRT 2009 paper</a>. I was a bit excited about that, since they were the first actual citations of my work, so I even read those papers.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The first of them [<a href="http://www.itlab.unn.ru/archive/MSConf10/msconf-2010_book.pdf#page=60">Димашова, 2010</a>] was published in Russian by <a href="http://opencv.willowgarage.com/wiki/">OpenCV</a> developers from Nizhny Novgorod. They implemented a cascade-based face-detection algorithm that uses either Local Binary Patterns (LBP) or Haar features. The algorithm was released as part of OpenCV 2.0. They cite our paper as an example of using a Random Forest in the stages of the cascade. However, they implemented the classical variant, a cascade over AdaBoost.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The other citation [<a href="http://www.zemris.fer.hr/~ssegvic/pubs/sikiric10mipro.pdf">Sikiric et al., 2010</a>] is more relevant, since it comes from the road-mapping community. They address the problem of recovering a road-appearance mosaic from views orthogonal to the surface. They contrast their approach with ours by noting that theirs does not employ human interaction. 
In fact, we need human input only for recognizing road defects and lane markings; rectification is done in a previous stage, which is fully automatic.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The rest of the post is devoted to some interesting metrics and structures concerning citations, co-authorship, and the supervision of students.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><b><span class="Apple-style-span" style="font-size: large;">Citations</span></b></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The number of citations is a weak measure of paper quality. We can also go further and estimate the impact of a journal or a particular researcher based on citation counts. A generally accepted measure is the journal <a href="http://en.wikipedia.org/wiki/Impact_factor">impact factor</a>, which is simply the mean number of citations per paper published in the journal over some period of time. Individual researchers can be evaluated by the impact factor of the journals they publish in, though this is considered bad practice. A not-so-bad alternative is the <a href="http://en.wikipedia.org/wiki/H-index">h-index</a>. By definition, one's h-index equals <i>H</i> if each of one's top <i>H</i> papers has at least <i>H</i> citations<sup><a href="http://www.blogger.com/post-edit.g?blogID=2293611840857369720&postID=3216776224435266994#supremum">1</a></sup>. It has also been <a href="http://en.wikipedia.org/wiki/H-index#Criticism">criticised</a> widely.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">So, citation-based scoring has a lot of flaws. But what can we use instead? Another interesting approach is taken by <a href="http://readermeter.org/">ReaderMeter</a>. 
It collects information about the number of people who have added a paper to their Mendeley collections and computes something like an h-index. Unfortunately, they recently excluded some papers from the database, so the statistics became less representative but more accurate. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-size: large;"><b>Co-authorship and Erdős number</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><a href="http://en.wikipedia.org/wiki/Paul_Erd%C5%91s">Paul Erdős</a> was a Hungarian mathematician who published about 1,400 research papers with 511 different co-authors. That is why he has a special role in bibliometrics. The <a href="http://en.wikipedia.org/wiki/Erd%C5%91s_number">Erdős number</a> is defined as the collaborative distance from a researcher to Erdős. More strictly, Wikipedia defines:</div><div style="text-align: justify;"><blockquote>Paul Erdős is the one person having an Erdős number of zero. For any author other than Erdős, if the lowest Erdős number of all of his coauthors is <i>k</i>, then the author's Erdős number is <i>k</i> + 1.</blockquote></div><div style="text-align: justify;">My Erdős number is at most 7 (via Olga Barinova, Pushmeet Kohli, Philip H.S. Torr, Bernhard Schölkopf, John Shawe-Taylor, David Godsil). To be honest, the Erdős number system is primarily used for mathematics papers. Even if we use the wider definition, it is not that beautiful, because there are not many gateways: e.g. the whole vision community will probably connect to Erdős via the machine-learning gateway, so researchers working on the boundary between fields will be closer. To illustrate this, David Marr probably did not have a finite Erdős number during his lifetime (now he definitely does). 
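Both numbers discussed in this post are easy to compute: the h-index is a sort-and-scan over citation counts, and an Erdős-style number is a breadth-first search over the co-authorship graph. A minimal Python sketch (all data below is made up for illustration):

```python
from collections import deque

def h_index(citations):
    """Largest H such that at least H papers have >= H citations each."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
    return h

def collaboration_number(coauthors, source, target):
    """BFS distance in a co-authorship graph (an Erdős-style number).

    `coauthors` maps each author to a list of co-authors.
    Returns None if there is no finite number.
    """
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        person, dist = queue.popleft()
        for coauthor in coauthors.get(person, ()):
            if coauthor == target:
                return dist + 1
            if coauthor not in seen:
                seen.add(coauthor)
                queue.append((coauthor, dist + 1))
    return None

# Toy data, invented for the example:
print(h_index([10, 8, 5, 4, 3]))                  # -> 4
graph = {"Erdos": ["A"], "A": ["Erdos", "B"], "B": ["A"]}
print(collaboration_number(graph, "B", "Erdos"))  # -> 2
```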
So we can introduce, say, a Zisserman number for computer vision.<sup><a href="http://www.blogger.com/post-edit.g?blogID=2293611840857369720&postID=3216776224435266994#Zisserman">2</a></sup> <a href="http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/z/Zisserman:Andrew.html">According to DBLP</a>, Andrew Zisserman has 165 direct co-authors so far. My Zisserman number is 3, and David Marr's is 4 (via Tomaso Poggio, Lior Wolf, Yonatan Wexler).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The movie industry has its own metric, the <a href="http://en.wikipedia.org/wiki/Bacon_number#Bacon_numbers">Bacon number</a> (after <a href="http://en.wikipedia.org/wiki/Kevin_Bacon">Kevin Bacon</a>): two actors are at distance 1 if they have appeared in the same movie. Someone tried to combine the two into the <a href="http://en.wikipedia.org/wiki/Erd%C5%91s-Bacon_number">Erdős-Bacon number</a>, which for a given person is simply the sum of her Erdős and Bacon numbers. Of course, very few people have a finite Erdős-Bacon number, since one must both appear in a movie and publish a research paper (and even that is not sufficient). Often it is possessed by researchers who consulted for a film crew and were accidentally filmed. =) Erdős himself has this number equal to 3 or 4 (depending on the details of the definition), since he starred in <a href="http://en.wikipedia.org/wiki/N_is_a_Number">N is a Number (1993)</a>, and his Bacon number is thus 3 or 4. 
</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The person with probably the lowest Erdős-Bacon number is <a href="http://en.wikipedia.org/wiki/Daniel_Kleitman">Daniel Kleitman</a>, an MIT mathematician, who appeared in one of my favourite movies, <a href="http://en.wikipedia.org/wiki/Good_Will_Hunting">Good Will Hunting (1997)</a>, along with Minnie Driver, who starred with Bacon in <a href="http://en.wikipedia.org/wiki/Sleepers_(film)">Sleepers (1996)</a>. Since Kleitman has 6 joint papers with Erdős, his Erdős-Bacon number is 3.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Surprisingly, Marvin Minsky's Erdős number (4) is greater than his Bacon number (2), which he obtained via <a href="http://www.imdb.com/title/tt0190708/">The Revenge of the Dead Indians (1993)</a> and <a href="http://en.wikipedia.org/wiki/Yoko_Ono">Yoko Ono</a>. Another strange example is <strike>the paedophile's dream</strike> <a href="http://en.wikipedia.org/wiki/Natalie_Portman">Natalie Portman</a>. She graduated from Harvard, saying she would rather "be smart than a movie star". That's my kind of girl! A neuroscience paper [<a href="http://www.nmr.mgh.harvard.edu/DOT/PDF/Baird_NeuroImage_16_1120_2002.pdf">Baird et al., 2002</a>] brought Natalie Hershlag (her real name) an Erdős number of 5, and then she appeared in <a href="http://www.imdb.com/title/tt0808399/">New York, I Love You (2009)</a> along with Kevin Bacon, so she reached the same Erdős-Bacon number as Minsky, i.e. 6.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">UPD (March 13, 2011). 
There is a totally relevant <a href="http://xkcd.com/599/">xkcd strip</a>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-size: large;"><b>Scientific genealogy</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Finally, let me tell you about what is known as scientific genealogy. Every grown-up researcher has a PhD advisor (usually one), while an advisor can have many students. Using the analogy with parents and children, we get a tree (or a forest) representing the historical structure of science. <a href="http://www.genealogy.ams.org/">The Mathematics Genealogy Project</a> aims to recover this structure.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I tried to trace my genealogy back. Unfortunately, I failed to find out who <a href="http://www.keldysh.ru/pages/cgraph/people/bayakovski.htm">Yury M. Bayakovsky</a>'s PhD advisor was. But if I consider <a href="http://graphics.cs.msu.ru/en/people/staff/obarinova">Olga Barinova</a> my advisor, I am an 11th-generation descendant of <a href="http://en.wikipedia.org/wiki/Gauss">Carl Friedrich Gauß</a>. Nice ancestry, huh?</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><a href="http://www.visiongenealogy.summerhost.info/">A similar project exists for computer vision</a>. There are 290 people in the database, though there are duplicates (I've found five <a href="http://www.vision.ee.ethz.ch/~vferrari/">Vittorio Ferrari</a>s :). It seems strange that some trees are as deep as 5 generations (e.g. 
<a href="http://userweb.cs.utexas.edu/~grauman/">Kristen Grauman</a> and <a href="http://www.lsi.upc.edu/~aquattoni/">Adriana Quattoni</a> are the 5th generation after David Marr), though vision is a relatively young field.</div><br /><sup><a name="supremum">1</a></sup> Okay, take the supremum of the set if you want to stay formal.<br /><sup><a name="Zisserman">2</a></sup> It seems that Philip Torr <a href="http://cms.brookes.ac.uk/staff/PhilipTorr/">has already used</a> that number.Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com1tag:blogger.com,1999:blog-2293611840857369720.post-23723723262663817002010-09-16T16:39:00.002+04:002010-09-16T17:01:21.550+04:00LidarK 2.0 released<p align="justify">The second major release of GML LidarK is now available. It reflects our three years of experience in 3D data processing. The description from <a href="http://graphics.cs.msu.ru/en/science/research/3dpoint/lidark">the project page</a>:</p><p align="justify"><cite>The LidarK library provides an open-source framework for processing multidimensional point data such as 3D LIDAR scans. It allows building a spatial index for performing fast search queries of different kinds. Although it is intended to be used for LIDAR scans, it can be helpful for a wide range of problems that require spatial data processing.</cite></p><p align="justify">The API has been enriched with various features in this release and has become more consistent and logical. New ways to access the data (i.e. various iterators) are implemented. One can now find the k nearest neighbours of any point, not just of a point that belongs to the index. Since the data structure is a container, we've decided to parametrize it with a template parameter. This decision is controversial: one no longer needs to cast tuples, but the code became clumsier and less portable, unfortunately.</p><p align="justify">The C++/MATLAB code is licensed free of charge for academic use. 
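As a toy illustration of the kind of query such an index accelerates (this is not LidarK's actual API), here is a brute-force k-nearest-neighbour search in Python; a spatial index answers the same query without scanning every point:

```python
import heapq

def k_nearest(points, query, k):
    """Return the k points closest to `query` in Euclidean distance.

    Brute force over all n points, O(n log k); a spatial index
    (k-d tree, octree, ...) avoids touching most of the cloud.
    """
    def dist2(p):
        # Squared distance is enough for ranking; no sqrt needed.
        return sum((a - b) ** 2 for a, b in zip(p, query))
    return heapq.nsmallest(k, points, key=dist2)

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0), (0.5, 0.5, 0.0)]
print(k_nearest(cloud, (0.0, 0.0, 0.0), 2))  # -> [(0.0, 0.0, 0.0), (0.5, 0.5, 0.0)]
```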
You can download it <a href="http://graphics.cs.msu.ru/en/science/research/3dpoint/lidark">here</a>.</p>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com0tag:blogger.com,1999:blog-2293611840857369720.post-39804658891315852862010-09-11T22:51:00.002+04:002010-10-10T22:06:43.078+04:00ECCV 2010 highlights<div style="text-align: justify;">Today is the last day of <a href="http://www.ics.forth.gr/eccv2010/intro.php">ECCV 2010</a>, so the best papers have already been announced. Although I was not there, a lot of the papers are available via <a href="http://cvpapers.com/eccv2010.html">cvpapers</a>, so I have eventually run into some of them.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><b><span class="Apple-style-span" style="font-size:large;">Best papers</span></b></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The full list of the conference awards is available <a href="http://www.ics.forth.gr/eccv2010/awards.php">here</a>. The best paper is "<a href="http://research.microsoft.com/en-us/um/people/pkohli/papers/lrkt_eccv2010.pdf">Graph Cut based Inference with Co-occurrence Statistics</a>" by <a href="http://sots.brookes.ac.uk/lubor/">Lubor Ladicky</a>, Chris Russell, <a href="http://research.microsoft.com/en-us/um/people/pkohli/">Pushmeet Kohli</a>, and <a href="http://cms.brookes.ac.uk/staff/PhilipTorr/">Philip Torr</a>. The second best is "<a href="http://www.cs.cmu.edu/~abhinavg/blocksworld/blocksworld.pdf">Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics</a>" by <a href="http://www.cs.cmu.edu/~abhinavg/Home.html">Abhinav Gupta</a>, <a href="http://www.cs.cmu.edu/~efros/">Alyosha Efros</a>, and <a href="http://www.cs.cmu.edu/~hebert/">Martial Hebert</a>. Two thoughts before reviewing them. 
First, the papers come from two institutions that are now (arguably) considered the leading ones in the vision community: <a href="http://research.microsoft.com/en-us/">Microsoft Research</a> and <a href="http://www.ri.cmu.edu/">the Carnegie Mellon Robotics Institute</a>. Second, both papers are about semantic segmentation (although the latter couples it with implicit geometry reconstruction); Vidit Jain has already <a href="http://vimsu99.blogspot.com/2010/06/acceptance-bias-at-computer-vision.html">noted the acceptance bias</a> in favour of recognition papers.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Now to the papers. Ladicky et al. address the problem of global terms in energy minimization for semantic segmentation. Specifically, their global term deals only with the <i>occurrence</i> of object classes and is invariant to how many connected components (i.e. objects) or individual pixels represent each class. Therefore, one cow in an image contributes the same to the global term as two cows, or even a single stray cow pixel. The global term penalizes a large number of different categories in a single image (<a href="http://en.wikipedia.org/wiki/Minimum_description_length">the MDL prior</a>), which is helpful when we are given a large set of possible class labels, and also penalizes the co-occurrence of classes that are unlikely to appear together, like cows and sheep. Statistics collected from the training set determine whether the co-occurrence of a certain pair of classes should be encouraged or penalized. 
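Schematically, such an energy combines the usual unary and pairwise terms with a cost that depends only on the <i>set</i> of labels present in the image. Here is a made-up illustration (not the authors' actual potentials), on a 1D chain of pixels for brevity:

```python
def energy(labels, unary, smoothness, cooccurrence_cost):
    """Unary + Potts pairwise terms on a 1D chain, plus a global term
    that depends only on WHICH classes appear -- not on how many
    pixels or connected components represent each of them."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += sum(smoothness for a, b in zip(labels, labels[1:]) if a != b)
    e += cooccurrence_cost(frozenset(labels))
    return e

# Invented potentials: 3 pixels, 2 classes; MDL-style cost of 2.0 per class.
unary = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
mdl = lambda classes: 2.0 * len(classes)
print(energy([0, 0, 1], unary, 0.5, mdl))  # -> 4.5
print(energy([0, 0, 0], unary, 0.5, mdl))  # -> 3.0 (one class: cheaper globally)
```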
Although the idea of incorporating co-occurrence into the energy function is not new [<a href="http://www.csail.mit.edu/~murphyk/Papers/AImemo03.pdf">Torralba et al, 2003</a>; <a href="http://vision.cs.ucla.edu/papers/rabinovichVGWB07.pdf">Rabinovich et al., 2007</a>], the authors claim that their method is the first one that simultaneously satisfies four conditions: <i>global energy minimization</i> (an implicit global term rather than a multi-stage heuristic process), <i>invariance to the structure of classes</i> (see above), <i>efficiency </i>(the model does not become an order of magnitude larger), and <i>parsimony </i>(the MDL prior, see above).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">How do the authors minimize the energy? They restrict the global term to a function of the set of classes represented in the image that is monotonic w.r.t. set inclusion (more classes, more penalty). Then they introduce auxiliary nodes for the αβ-swap or α-expansion procedure so that the optimized energy remains submodular. It is very similar to how they applied graph-cut-based techniques for minimizing energies with higher-order cliques [<a href="http://www.springerlink.com/index/U44182U53Q41078Q.pdf">Kohli, Ladicky and Torr, 2009</a>]. So when you face non-local terms in an energy, you can try something similar. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">What are the shortcomings of the method? It would be great to penalize <i>objects </i>in addition to classes. First, local interactions are taken into account, as well as global ones, but what about the medium level? The shape of an object, its size, colour consistency, etc. are great cues. Second, at the global level only inter-class co-occurrences play a role, but what about intra-class ones? It is impossible to have two suns in a photo, but quite likely to meet several pedestrians walking along a street. 
This is actually done by <a href="http://www.cse.wustl.edu/~mgeorg/readPapers/byVenue/iccv2009/desai2009_iccv_discriminativeModelsMulticlassObjectLayout.pdf">Desai et al. [2009]</a> for object detection.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The second paper, by Gupta et al., harks back to the romantic period of computer vision, when scenes composed of perfect geometric shapes were reconstructed successfully. They address the problem of 3D reconstruction from a single image, as in <a href="http://www.cs.uiuc.edu/homes/dhoiem/projects/popup/index.html">auto pop-up</a> [<a href="http://www.cs.uiuc.edu/homes/dhoiem/publications/popup.pdf">Hoiem, Efros and Hebert, 2005</a>]. They compare the results of auto pop-up to <a href="http://en.wikipedia.org/wiki/Potemkin_village">Potemkin villages</a>: "there is nothing behind the pretty façade" (I believe this comparison is the contribution of <a href="http://www.cs.cmu.edu/~efros/">the second author</a>). Instead of surfaces, they fit boxes to the image, which allows them to impose a wider range of constraints on the 3D structure, including: </div><div style="text-align: justify;"><ul><li><i>static equilibrium</i>: it seems that the only property they check here is that the centroid projects into the support region;</li><li><i>sufficient support force</i>: they estimate density (light for vegetation, medium for humans, heavy for buildings) and note, for example, that a building is unlikely to rest on a tree;</li><li><i>volume constraint</i>: boxes cannot intersect;</li><li><i>depth ordering</i>: reprojecting the result onto the image plane should correspond to what we see in the image.</li></ul></div><div style="text-align: justify;">This is a great paper that exploits Newtonian mechanics as well as human intuition; however, there are still some heuristics (like the density of a human) that could probably be generalized away. 
It seems that this approach has big potential, so this might become the seminal paper of a new direction. Combining recognition with geometry reconstruction is quite trendy now, and this method is conceptually simple but effective. There are many examples of how the algorithm works on <a href="http://www.cs.cmu.edu/~abhinavg/blocksworld/">the project page</a>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-size:large;"><b>Funny papers</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">A couple of ECCV papers have fancy titles. The first one is "<a href="http://grail.cs.washington.edu/malkovich/eccv10paper.pdf">Being John Malkovich</a>" by <a href="http://www.cs.washington.edu/homes/kemelmi/">Ira Kemelmacher-Shlizerman</a>, <a href="http://www.cs.washington.edu/homes/aditya/">Aditya Sankar</a>, <a href="http://www.adobe.com/technology/people/seattle/shechtman.html">Eli Shechtman</a>, and <a href="http://www.cs.washington.edu/homes/seitz/">Steve Seitz</a> from <a href="http://grail.cs.washington.edu/">the University of Washington GRAIL</a>. If you've seen <a href="http://www.imdb.com/title/tt0120601/">the movie</a>, you can guess what the article is about. Given a video of someone pulling faces, the algorithm transforms it into a video of <a href="http://en.wikipedia.org/wiki/John_Malkovich">John Malkovich</a> making similar faces. "Ever wanted to be someone else? Now you can." In contrast to the movie, in the paper the target need not be John Malkovich: it could be George Bush, Cameron Diaz, George Clooney, or indeed any person for whom you can find a sufficiently large video or photo database! 
You can see a video of the real-time puppetry on <a href="http://grail.cs.washington.edu/malkovich/">the project page</a>, although there are obvious lags and the result is still far from perfect.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Another fancy title is "<a href="http://cs.unc.edu/~jmf/publications/Frahm_et_al_ReconstructionFromPhotoCollection.pdf">Building Rome on a Cloudless Day</a>". There are 11 (eleven) authors contributing to the paper, including <a href="http://www.cs.unc.edu/~marc/">Marc Pollefeys</a>. This summer I spent one cloudless day in Rome, and, to be honest, it was not that pleasant. So why is the paper called that? It refers to another paper, "<a href="http://grail.cs.washington.edu/rome/rome_paper.pdf">Building Rome in a Day</a>" from ICCV 2009, by the guys from Washington again, which itself refers to the proverb "<a href="http://en.wiktionary.org/wiki/Rome_wasn't_built_in_a_day">Rome was not built in a day</a>." In that paper the authors build dense 3D models of some Roman sights using a set of Flickr photos tagged "Rome" or "Roma". Returning to the ECCV 2010 monument of collective intelligence, its authors did the same, but without cloud computing; that's why the day is now cloudless. S.P.Q.R.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I cannot avoid mentioning the following papers here, although they are not from ECCV. Probably the most popular CVPR 2010 paper is "<a href="http://www.cs.cmu.edu/~rahuls/pub/cvpr2010-food-rahuls.pdf">Food Recognition Using Statistics of Pairwise Local Features</a>" by <a href="http://www.cs.washington.edu/homes/yang/">Shulin Yang</a>, <a href="http://www.cs.cmu.edu/~meichen/">Mei Chen</a>, <a href="http://www.ri.cmu.edu/person.html?person_id=241">Dean Pomerleau</a>, and <a href="http://www.cs.cmu.edu/~rahuls/">Rahul Sukthankar</a>. 
The first page of the paper contains a motivational picture with a hamburger, and it looks pretty funny. They insist that the stuff from McDonald's is very different from that from Burger King, and that it is really important to recognize them to keep track of the calories. Well, the authors don't look overweight, so the method should work.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The last paper in this section is "<a href="http://vision.ucsd.edu/sites/default/files/gestalt.pdf">Paper Gestalt</a>" by the imaginary Carven von Bearnensquash, published in the Secret Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010. The authors (presumably from UCSD) make fun of the way we usually write computer vision papers, assuming that certain features might convince a reviewer to accept or reject a paper: mathematical formulas that create an illusion of author qualification (even if they are irrelevant), ROC curves, etc. It also derides attempts to apply black-box machine-learning techniques without appropriate analysis of the possible features. Now I am trying to subscribe to the Journal of Machine Learning Gossip.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-size: large;"><b>Colleagues</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">There was only one paper from our lab at the conference: "<a href="http://research.microsoft.com/en-us/um/people/pkohli/papers/bltk_eccv2010.pdf">Geometric Image Parsing in Man-Made Environments</a>" by <a href="http://graphics.cs.msu.ru/en/people/staff/obarinova">Olga Barinova</a> and Elena Tretiak (in co-authorship with <a href="http://www.robots.ox.ac.uk/~vilem/">Victor Lempitsky</a> and Pushmeet Kohli). 
A scheme similar to the image parsing framework [<a href="http://www.loni.ucla.edu/~ztu/publication/IJCV_parsing.pdf">Tu et al., 2005</a>] is used, i.e. top-down analysis is performed. They detect parallel lines (like the edges of buildings and windows), their vanishing points, and the zenith jointly, using a witty graphical model. The approach is claimed to be robust to clutter in the edge map.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Indeed, this paper would not have been possible without me. =) It was me who convinced Lena to join the lab two years ago (actually, it was more like convincing her not to apply for <a href="http://lvk.cs.msu.su/">the other lab</a>). So, the lab will remember me at least as a decent talent scout...</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com1tag:blogger.com,1999:blog-2293611840857369720.post-85407492752120972672010-08-27T22:49:00.006+04:002010-08-28T00:14:43.853+04:00Comic strip: 10%<div>Last summer, during my internship at <a href="http://cmp.felk.cvut.cz/">CMP</a> in Prague, I drew some <a href="http://xkcd.com/">xkcd</a>-style comic strips, and I want to share some of them here. I considered starting a separate blog for them, but there are too few of them, and they do suck (at least in comparison to Randall's work). However, most of them are programming/research related, so they are appropriate here.</div><div><br /></div><div>I planned to use them when I had nothing to post about. This is not the case now (I am planning a couple more posts in a week or so), but this one is topical. It was originally drawn on September 1, 2009, when <a href="http://gmailblog.blogspot.com/2009/09/todays-gmail-problems.html">GMail was not operating</a> for a few hours. 
In recent news: <a href="http://googleblog.blogspot.com/2010/08/update-on-google-wave.html">Google shuts Google Wave down</a>, admitting the failure that some sceptics predicted (see the image title text).</div><div><br /></div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://shapovalov.ro/images/comics/GMail.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 500px; " title="Those employees who were working on Google Wave as their main project used to waste up to 80% of their working time" src="http://shapovalov.ro/images/comics/GMail_preview.png" border="0" /></a>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com2tag:blogger.com,1999:blog-2293611840857369720.post-70792922728143697042010-08-19T01:49:00.008+04:002010-08-21T15:18:14.674+04:00Theorem 8.10. P ≠ NP.<div style="text-align: justify;">Nearly two weeks ago the Indian-American mathematician <a href="http://www.hpl.hp.com/personal/Vinay_Deolalikar/">Vinay Deolalikar</a>, an employee of HP Labs, sent a letter to a group of mathematicians asking them to proof-read his attempt to prove <a href="http://en.wikipedia.org/wiki/P_versus_NP_problem">P ≠ NP</a>. The pre-print was eventually distributed over the Internet, so a lot of people became aware of it. Later, two major flaws were found in the proof, so it is now considered wrong. Nevertheless, Vinay has removed the paper from his page and commented that he has fixed all the issues and is going to submit the paper for journal review. You can download the original paper from my web-site [<a href="http://shapovalov.ro/botva/pnp12pt.pdf">Deolalikar, 2010</a>]. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">We'll see whether he is bluffing or not. In any case, it is very pleasant to me that the main role in the proof is played by... the graphical probabilistic model framework. 
Yeah, those graphical models we use in computer vision! It was surprising indeed, since problems in computational complexity are usually discrete and well-posed. So, this post is devoted to understanding why a graphical model emerges in the proof.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">First, I am going to review some key points of the proof. The review is far from exhaustive, just some rough sketches that are critical for understanding. Then I report on the flaws found in the proof, and I end with a methodological note on the whole P vs. NP problem. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-size:large;"><b>k-SAT and d1RSB Phase</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In order to prove P ≠ NP, it is enough to show the absence of a polynomial algorithm for some problem from <a href="http://en.wikipedia.org/wiki/NP_(complexity)">the class NP</a>, e.g. for some <a href="http://en.wikipedia.org/wiki/NP-complete">NP-complete</a> problem (the "hardest" problems in NP). <a href="http://en.wikipedia.org/wiki/Boolean_satisfiability_problem">The boolean satisfiability problem</a> (SAT) asks whether there is at least one assignment of the boolean variables that makes a formula, represented in conjunctive normal form (CNF), TRUE. The k-satisfiability problem (k-SAT) is a particular case of SAT where all the clauses in the CNF have order <i>k</i>. For example, (<i>x</i>˅<i>y</i>)&(¬<i>x</i>˅<i>y</i>)&(<i>x</i>˅¬<i>y</i>)&(¬<i>x</i>˅¬<i>y</i>) is an instance of 2-SAT which has the answer 'NO', since every assignment of (<i>x, y</i>) makes the formula false. (<i>x</i>˅<i>y</i>˅<i>z</i>)&(¬<i>x</i>˅<i>u</i>˅<i>v</i>)&(¬<i>x</i>˅<i>u</i>˅¬<i>z</i>) is a 3-SAT instance over 5 variables (x, y, z, u, v), which is satisfiable, e.g. 
on the input (1,0,0,1,0). 2-SAT is in P, but k-SAT is NP-complete whenever <i>k</i> > 2. Deolalikar's idea is to show that k-SAT (with <i>k</i> > 8) is outside of P, which (if true) proves that there is a separation between P and NP.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The space of possible k-SAT solutions is analysed. Let <i>m</i> be the number of clauses in the CNF and <i>n</i> the number of variables. Consider the ratio <i>α</i> = <i>m</i>/<i>n</i>. In the border case <i>m</i> = 1 (<i>α</i> is small), the CNF is always satisfiable, and there are usually a lot of inputs that satisfy it. Consider an ensemble of random k-SAT instances (the following reasoning is statistical). As <i>α</i> grows, the probability of the CNF being satisfiable decreases. When a certain threshold <i>α<sub>d</sub></i> is reached, the set of "true" solutions breaks into clusters. This stage is known in statistical mechanics as the <i>dynamical one step replica symmetry breaking</i> (d1RSB) phase. Moving further makes the problem completely unsatisfiable. It turns out that d1RSB-stage subproblems are the most challenging of all k-SAT problems. The effect can be observed only if <i>k</i> > 8; that is why such problems are used in the proof. 
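The two toy CNF formulas earlier can be verified by brute force. Here is a minimal sketch (my own illustration, not from the paper): clauses are lists of signed 1-based literals, and the enumeration is of course exponential in the number of variables, which is exactly what a polynomial algorithm would have to avoid.

```python
from itertools import product

def satisfiable(clauses, n):
    """Brute-force SAT check. Clauses are lists of signed 1-based
    literals, e.g. [1, -2] encodes the clause (x1 or not x2)."""
    for bits in product([False, True], repeat=n):
        # a clause is satisfied if some literal matches its variable's value
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False

# (x or y) & (not x or y) & (x or not y) & (not x or not y): unsatisfiable
print(satisfiable([[1, 2], [-1, 2], [1, -2], [-1, -2]], 2))   # False

# (x or y or z) & (not x or u or v) & (not x or u or not z):
# satisfied, e.g., by (x, y, z, u, v) = (1, 0, 0, 1, 0)
print(satisfiable([[1, 2, 3], [-1, 4, 5], [-1, 4, -3]], 5))   # True
```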
[<a href="http://shapovalov.ro/botva/pnp12pt.pdf">Deolalikar, 2010</a>, Chapter 6]</div><div style="text-align: justify;"><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2zkYAsZVgHZfaRucRwAA8ziIHpCaXA18l1mxwhYIKA_Isd6Wgtvp70P5TFkEd3TAOB4pAxtGATw1gd_uTdKFlQY6jMi5uzuJna6QQK6vaowjpDnS8xHdQDvxuER6tz19VZbiKtS0-oZdp/s1600/d1RSB.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 144px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2zkYAsZVgHZfaRucRwAA8ziIHpCaXA18l1mxwhYIKA_Isd6Wgtvp70P5TFkEd3TAOB4pAxtGATw1gd_uTdKFlQY6jMi5uzuJna6QQK6vaowjpDnS8xHdQDvxuER6tz19VZbiKtS0-oZdp/s400/d1RSB.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5506898457490973010" /></a><br /></div><div style="text-align: justify;"><b><span class="Apple-style-span" style="font-size:large;">FO(PFP) and FO(LFP)</span></b></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In finite model theory there are recursive extensions of first-order logic. The predicate P<sub><i>i</i></sub>(<i>x</i>) is evaluated as some second-order function <i>φ</i>(P<sub><i>i-1</i></sub><i>, x</i>), where x is a <i>k</i>-element vector and P<sub><i>0</i></sub>(<i>x</i>) ≡ false. In the general case, either there is a fixed point, or P<sub><i>i</i></sub> is looping. For example, if <i>φ</i>(P<i>, x</i>) ≡ ¬P(<i>x</i>), then P<sub><i>0</i></sub>(<i>x</i>) ≡ false, P<sub><i>1</i></sub>(<i>x</i>) ≡ true, P<sub><i>2</i></sub>(<i>x</i>) ≡ false, etc. Here <i>x</i> is irrelevant, but that is not always the case. Consider the following definition: <i>φ</i>(P<i>, x</i>) ≡ max(<i>x</i>) ˅ P(<i>x</i>) // recall that <i>x</i> is a vector; in the boolean case 'max' is the same as 'or'. 
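Iterating such an operator to its fixed point can be sketched in a few lines (my own toy illustration; the brute-force enumeration over all 2^k boolean vectors is only for demonstration):

```python
from itertools import product

def lfp(phi, k):
    """Iterate P_{i+1}(x) = phi(P_i, x) from P_0(x) = false over all
    boolean k-vectors until the predicate stops changing."""
    inputs = list(product([False, True], repeat=k))
    P = {x: False for x in inputs}                 # P_0(x) = false
    while True:
        P_next = {x: phi(P, x) for x in inputs}
        if P_next == P:                            # fixed point reached
            return P
        P = P_next

# phi(P, x) = max(x) or P(x): 'max' acts as 'or' on a boolean vector
fixed = lfp(lambda P, x: max(x) or P[x], 2)
print(fixed[(False, False)])   # False: P_i(0) stays false forever
print(fixed[(True, False)])    # True after the first iteration
```

For a monotone φ such as this one the iterates only grow, so the loop terminates; for the looping example φ(P, x) ≡ ¬P(x) it would never converge, which is exactly the distinction between partial and least fixed points.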
If <i>x</i> contains at least one non-zero element, P<sub><i>0</i></sub>(<i>x</i>) = false, P<sub><i>1</i></sub>(<i>x</i>) = true, P<sub><i>2</i></sub>(<i>x</i>) = true, etc. Otherwise, P<sub><i>i</i></sub>(<b>0</b>) = false for all <i>i</i>. In the case of looping, let us redefine the fixed point to be constantly false. <a href="http://en.wikipedia.org/wiki/FO_(complexity)#Partial_fixed_point_is_PSPACE"><i>FO(PFP)</i></a> is the class of problems of checking whether the iteration loops for some <i>x</i> or reaches a fixed point. These are actually very difficult problems: FO(PFP) is equal to the whole of <a href="http://en.wikipedia.org/wiki/PSPACE">PSPACE</a>. Suppose now that <i>φ</i> is monotonic in P. This means that P(<i>x</i>) only appears in the formula with an even number of negations before it (or none, as in the second example). As a consequence, once P<sub><i>i</i></sub>(<i>x</i>) is true, P<sub><i>j</i></sub>(<i>x</i>) will be true for any <i>j</i> > <i>i</i>. This restriction reduces the class to <a href="http://en.wikipedia.org/wiki/FO_(complexity)#Least_Fixed_Point_is_PTIME"><i>FO(LFP)</i></a>, which is proven to be equal to the class <a href="http://en.wikipedia.org/wiki/P_(complexity)">P</a>. So, the global problem now is to show that k-SAT is outside of the class FO(LFP).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><b><span class="Apple-style-span" style="font-size:large;">ENSP: the graphical model for proving P ≠ NP</span></b></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">So how does a graphical model emerge here? A graphical model is a way to describe a multivariate probability distribution, i.e. the dependencies of the covariates in the distribution. Recall now <a href="http://en.wikipedia.org/wiki/NP_(complexity)#Verifier-based_definition">the definition of NP</a>. If we somehow guess the certificate, we are done (i.e. we have a polynomial algorithm). 
If the space of certificates (possible solutions, in Deolalikar's terms) is simple, we can probably get a polynomial algorithm. What does simple mean? It means that the distribution is parametrizable with a small number of parameters (cf. <a href="http://en.wikipedia.org/wiki/Intrinsic_dimension">intrinsic dimensionality</a>), which allows us to traverse the solution space efficiently. This is twofold. First, the distribution is simple if it has limited support, i.e. all the probability measure is distributed among a limited number of points of the probability space. Second, it is simple if the covariates are as conditionally independent as possible. In the ideal case, if all the groups of covariates are independent (recall that <a href="http://www.morris.umn.edu/~sungurea/introstat/stat2611/independence.html">pairwise and mutual independence do not subsume each other!</a>), we need only <i>n</i> parameters to describe the distribution (<i>n</i> is the number of covariates), while in the general case this number is exponential. See [<a href="http://shapovalov.ro/botva/pnp12pt.pdf">Deolalikar, 2010</a>, p. 6] for examples.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">How does one measure the degree of conditional independence? Right, by building a graphical model. It factorizes over cliques according to the <a href="http://computerblindness.blogspot.com/2010/01/on-mrf-factorization.html">Hammersley-Clifford theorem</a>. The larger the cliques, the stronger the dependency. When the largest cliques are k-interactions, the distribution can be parametrized with <i>n</i>2<sup><i>k</i></sup> independent parameters. Finally, Vinay shows that FO(LFP) can deal only with distributions parametrizable with 2<sup>poly(log <i>n</i>)</sup> parameters, which is not the case for d1RSB k-SAT (its solution space is too complicated). 
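To get a feel for the gap between these parameter counts, here is a back-of-the-envelope sketch (my own illustration; the function names are made up):

```python
def general_params(n):
    """Free parameters of an arbitrary distribution over n binary
    covariates: one probability per joint state, minus normalization."""
    return 2 ** n - 1

def clique_params(n, k):
    """Rough parameter count when the largest cliques are k-interactions:
    about n potential tables, each over 2^k joint states."""
    return n * 2 ** k

# n = 20 binary covariates: over a million free parameters in general,
# but only 160 when the largest cliques contain k = 3 variables.
print(general_params(20), clique_params(20, 3))
```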
</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In order to show this strictly, Deolalikar introduces a tailored graphical model describing the FO(LFP) iterative process, the <i>Element-Neighborhood-Stage Product</i> model (ENSP):</div><div style="text-align: justify;"><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLyRk_YaHfh27f8iSPHXhtBdyTBJ9ATQMykl9rpLJPKQLVtZ2HstpN9WDPLt6r5YqPOUcYgO27Qh2LxxOYXdx26K0vyTYUh2wQg4IycKGk8nI4rm1P25YXjSxYGKOmrFOLbEY8lS8l-r0E/s1600/ENSP.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 300px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLyRk_YaHfh27f8iSPHXhtBdyTBJ9ATQMykl9rpLJPKQLVtZ2HstpN9WDPLt6r5YqPOUcYgO27Qh2LxxOYXdx26K0vyTYUh2wQg4IycKGk8nI4rm1P25YXjSxYGKOmrFOLbEY8lS8l-r0E/s400/ENSP.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5507040396001248802" /></a>There are two types of sites: those corresponding to the variables at the stages of the LFP computation (small circles), and those corresponding to the elements of a witty neighbourhood system (some closure over <a href="http://en.wikipedia.org/wiki/Gaifman_graph">the Gaifman graph</a>; big blue circles). When a variable is assigned 0 or 1, it is painted red or green, respectively. The last column corresponds to the fixed point, where all the variables are assigned. Thus, this model factorizes easily into small cliques, and it cannot describe a hypothetical LFP process for such a k-SAT instance. 
See [<a href="http://shapovalov.ro/botva/pnp12pt.pdf">Deolalikar, 2010</a>, Chapter 8] for the details.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-size:large;"><b>So what's the problem?</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><a href="http://www.cs.umass.edu/~immerman/">Neil Immerman</a>, an expert in finite model theory, noticed two major flaws. The first one is that Vinay actually models only monadic LFP, which is not equal to P, as assumed. He thus proved that there is a gap between NP and some subclass of P, which does not necessarily equal P itself. The second major flaw deals with modelling k-SAT as a logical structure. I do not fully understand this step, but some order-invariance assumptions are wrong. Here is a discussion in <a href="http://rjlipton.wordpress.com/2010/08/15/the-p%E2%89%A0np-proof-is-one-week-old/">Lipton's blog</a>. According to it, the flaws are really fatal.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-size:large;"><b>The third way</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">It seems that it is really difficult to prove P ≠ NP. Most scientists think that P = NP is improbable, and that is why the hypothesis P ≠ NP is generally adopted now. But there is one more option: neither P = NP nor P ≠ NP is provable. Like every mathematical theory, computational complexity is a set of axioms and statements, which usually can be proved correct or incorrect. But sometimes there are statements, formulated in terms of the theory, which cannot. 
Moreover, according to <a href="http://en.wikipedia.org/wiki/Goedel's_theorem#First_incompleteness_theorem">Gödel's first incompleteness theorem</a>, in any mathematical theory based on our natural understanding of natural numbers, there is either a statement that can be proved both true and false, or an unprovable one. I know of no actual reason why this could not be the case for P ≠ NP (although I have a feeling there is one, since this possibility is not usually taken into account). </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Suppose it is really so. This would mean that the whole theory should be reconsidered. Some axioms or definitions would change to make this fact provable. But maybe something else would become unprovable. Who knows...</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com3tag:blogger.com,1999:blog-2293611840857369720.post-89065014239183361642010-08-15T22:38:00.002+04:002010-08-16T09:35:32.513+04:00Image Parsing: Unifying Segmentation, Detection, and Recognition<div style="text-align: justify;"><a href="http://computerblindness.blogspot.com/2010/06/object-detection-vs-semantic.html">Recently I have posted</a> about the connection between object category detection and semantic image segmentation. While delving into this problem more deeply, I found the paper named in the title of this post.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The ICCV 2003 paper by Tu et al. was among the three <a href="http://en.wikipedia.org/wiki/Marr_Prize#9th_ICCV.2C_2003.2C_Nice.2C_France">Marr prize</a> winners. And it is really a prominent piece of work! 
In <a href="http://resources.metapress.com/pdf-preview.axd?code=t5r052375jr10776&size=largest">the introduction</a> to <a href="http://www.springerlink.com/content/0920-5691/63/2/">the special Marr prize IJCV issue</a>, <a href="http://ljk.imag.fr/membres/Bill.Triggs/////////">Bill Triggs</a> featured the paper as one that "would have been particularly pleasing to <a href="http://en.wikipedia.org/wiki/David_Marr_(neuroscientist)">David Marr</a>", because of "its bold attack on the central problem of perceptual organization and whole scene understanding". <a href="http://www.loni.ucla.edu/~ztu/publication/IJCV_parsing.pdf">Here</a> is the journal version of the paper.</div><div style="text-align: justify;"><br /></div><div><div style="text-align: justify;">The authors combined discriminative and generative models, which resulted in a unified framework for image parsing. For each image, a corresponding parsing tree is found. The root of the tree represents the whole scene. On the next level, nodes represent semantic fragments, such as human faces, text, or textured regions. The leaves of the tree correspond to the pixels of the image. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The approach differs drastically from the vanilla CRF framework in that the structure of the parsing tree is dynamic, while the CRF structure remains constant and only the interaction models between pairs of sites may change. The goal is to obtain the tree that maximizes the posterior probability given the image. A directed search in the space of valid trees is performed by means of <a href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte-Carlo</a> (MCMC). Possible tree changes, such as splitting and merging regions and varying the border between regions, are defined. 
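The MCMC search over trees can be sketched with a generic Metropolis-Hastings step. Everything below is my own schematic stand-in: a toy integer state plays the role of a parsing tree, and a hand-made symmetric proposal stands in for the paper's transition kernels.

```python
import math
import random

def mh_step(state, propose, log_posterior, rng):
    """One Metropolis-Hastings step. `propose(state)` returns a candidate
    and the log proposal ratio log q(state | cand) - log q(cand | state)."""
    cand, log_q_ratio = propose(state)
    log_alpha = log_posterior(cand) - log_posterior(state) + log_q_ratio
    if math.log(rng.random()) < min(0.0, log_alpha):
        return cand   # accept the move (think: a split or merge of regions)
    return state      # reject: keep the current tree

# Toy stand-in for a parsing tree: an integer "number of regions", with a
# posterior peaked at 3 regions; the proposal is a symmetric split/merge
# (+1 or -1), so its log proposal ratio is 0.
rng = random.Random(0)
state = 10
for _ in range(1000):
    state = mh_step(state,
                    lambda s: (s + rng.choice([-1, 1]), 0.0),
                    lambda s: -abs(s - 3),
                    rng)
print(state)  # typically close to the posterior mode 3
```

The real algorithm replaces the toy proposal with learnable, data-driven kernels, but the accept/reject skeleton is the same.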
Such changes are described in terms of learnable Markov chain transition kernels.</div></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">What they actually did is effectively combine top-down and bottom-up approaches. In a nutshell, the bottom-up (discriminative) method generates hypotheses of pixel labels and object positions using the local neighbourhood, and the top-down generative model is then built making use of those hypotheses. The latter guarantees the consistency of the output and the optimality of its posterior probability.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Okay, what does this mean for the issue of detection and segmentation convergence? Because the shapes of objects are determined implicitly during the recognition, and stuff regions are detected by their colour and texture, the problem of semantic image segmentation is actually solved. Moreover, <i>multi-scale semantic segmentation</i> is performed during the parsing. Therefore, image parsing seems to be the most general formulation of the image recognition problem.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The nodes of a parsing tree that belong to the same depth level are not connected (naturally, because it is a tree). This means that spatial interactions cannot be modelled directly. A totally different approach was introduced in another Marr prize winning paper, by <a href="http://www.ics.uci.edu/~fowlkes/papers/drf-iccv09.pdf">Desai et al [2009]</a>. They also use bottom-up tests to generate hypotheses on the object locations and then use a CRF over those locations to take spatial context into account. They model different kinds of inter-class and intra-class interactions. Thus, they ensure that the objects in the image (described by their bounding boxes) are arranged correctly, e.g. 
a cup is likely to be on a table (spatial arrangement), while a train cannot be close to a ship (mutual exclusion).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">It seems that the two papers exploit two different forms of information (aggregation vs. spatial interactions), which are mutually redundant to a certain extent but do not completely exhaust each other. I suspect that the group of researchers who successfully combines those approaches will receive some 201x Marr prize.</div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com1tag:blogger.com,1999:blog-2293611840857369720.post-67542306779915328762010-07-28T00:41:00.010+04:002010-07-29T01:51:43.193+04:00ICVSS 2010: some trivia<div style="text-align: justify;">If you think that summer schools are only about lectures and posters, you are certainly wrong. Well, I also did not expect that I would get drunk every night (it was not in Russia after all!). On the first night I convinced people to go to the beach for night swimming, but the next day I realized that it was not a really good idea, because lectures started at 9. Eventually, we did not go to bed early for the rest of the week.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I would like to thank the "Marsa Sicla" people (Fig. 1), you were great company!</div><div style="text-align: center;"><br /></div><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8PsiyXinujvBWpU0YifRxfnsEXVzCykxSB3Y8lstOr66XK7xQUbjaqTGfJ0L4R882Ef2u5mysLKCGVsQgvVoqKrG7tn5t0wiJbLkZCVtbws_ooRdWHduWBMedFt0sFqqNHdmZkGUKY08_/s400/MSpeople.jpg" style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 235px;" border="0" alt="" id="BLOGGER_PHOTO_ID_5499051300018626354" /><div style="text-align: center;"><span class="Apple-style-span" style="font-size:small;">Fig. 1. The Marsa Sicla company. 
Left to right: </span><a href="http://flavors.me/robotblackdebbie"><span class="Apple-style-span" style="font-size:small;">Richard</span></a><span class="Apple-style-span" style="font-size:small;">, </span><a href="http://www.cs.ucf.edu/~ramin/"><span class="Apple-style-span" style="font-size:small;">Ramin</span></a><span class="Apple-style-span" style="font-size:small;">,<br /></span><a href="http://www.lmt.ei.tum.de/team/alt/"><span class="Apple-style-span" style="font-size:small;">Nicolas</span></a><span class="Apple-style-span" style="font-size:small;">, </span><a href="http://www.facebook.com/selinachenxi?ref=ts#!/selinachenxi"><span class="Apple-style-span" style="font-size:small;">Xi</span></a><span class="Apple-style-span" style="font-size:small;">, </span><a href="http://www.stanford.edu/~vijayc/"><span class="Apple-style-span" style="font-size:small;">Vijay</span></a><span class="Apple-style-span" style="font-size:small;">, </span><a href="http://www.limsi.fr/Annuaire/infos?id=cchastag"><span class="Apple-style-span" style="font-size:small;">Clément</span></a><span class="Apple-style-span" style="font-size:small;">, </span><a href="http://www.linkedin.com/pub/aishwarya-natesh/13/845/103"><span class="Apple-style-span" style="font-size:small;">Aish</span></a><span class="Apple-style-span" style="font-size:small;">, </span><a href="http://shapovalov.ro/"><span class="Apple-style-span" style="font-size:small;">me</span></a><span class="Apple-style-span" style="font-size:small;">. </span><a href="http://www.ktl.elf.stuba.sk/portal/?a=17&s=222"><span class="Apple-style-span" style="font-size:small;">Sandra</span></a><span class="Apple-style-span" style="font-size:small;"> is behind the camera.</span></div><div style="text-align: center;"><br /></div><div style="text-align: justify;">Actually, there was a reason why we had such a close-knit company. In Fig. 2 you can see a map of the region where the school took place. 
The school was hosted by the Baia Samuele hotel village (inside the blue frame on the map). The cheapest accommodation at Baia Samuele cost €100 per night, so some pragmatic people scared up different accommodation. It was in a neighbouring hotel village, Marsa Sicla. </div><div style="text-align: justify;"><br /></div><div style="text-align: center;"><iframe width="550" height="700" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com/maps/ms?ie=UTF8&hl=en&t=k&msa=0&msid=114318388347498250868.00048c655613735765d52&ll=36.725023,14.753973&spn=0.012039,0.011802&z=16&output=embed"></iframe><br /><small>Fig. 2. ICVSS location map. View <a href="http://maps.google.com/maps/ms?ie=UTF8&hl=en&t=k&msa=0&msid=114318388347498250868.00048c655613735765d52&ll=36.725023,14.753973&spn=0.012039,0.011802&z=16&source=embed" style="color:#0000FF;text-align:left">ICVSS_2010_MSvsBS</a> in a larger map</small></div><div style="text-align: justify;"></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">On a map it looked pretty close, but in practice it turned out that there were a lot of fences (blue line)! Officially, we had to walk along the red route (~2.5 km, partly along a highway) through the official entrance. Moreover, we had to leave the village for lunch (since there was a buffet), otherwise we would have been charged €30 per meal. To enforce this, they wanted us to leave our documents at the control post (bottom blue tick in Fig. 2). Thus, we were supposed to walk 2.5 km four times a day. But we found a better solution.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"></div><div style="text-align: justify;">Although the fence was equipped with barbwire, we managed to find two shortcuts (green ticks). One of them led through an ever-closed gate (Fig. 3), which was suitable for hopping over. So, we usually used the green path. It made the way half as long. 
Moreover, we did not have to leave Baia Samuele for lunch, since we weren't officially there. Sometimes we even had a free lunch, which finally proved that <a href="http://en.wikipedia.org/wiki/No_free_lunch_theorem">the no free lunch theorem</a> is wrong (joke by Ramin).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFB-awuQB3r1Tepp2RscwBon9xJBbXQnbeaOF9eLdWqp2ghUGivoANBKUev-qzrQ076hyDbafajILAv903rUIC8fLchbO96BsCjcPRgRkhvIzhFpZEQj1bPvKy5UoBLmUTqJQZ1s_dRRgX/s1600/IMG_0271.jpg"><img style="text-align: justify;display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; cursor: pointer; width: 400px; height: 300px; " src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFB-awuQB3r1Tepp2RscwBon9xJBbXQnbeaOF9eLdWqp2ghUGivoANBKUev-qzrQ076hyDbafajILAv903rUIC8fLchbO96BsCjcPRgRkhvIzhFpZEQj1bPvKy5UoBLmUTqJQZ1s_dRRgX/s400/IMG_0271.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5499063929594125762" /></a></div><div style="text-align: center;"><span class="Apple-style-span" style="font-size:small;">Fig. 3. Hopping over the fence. </span></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-size:medium;"><br /></span></div><div style="text-align: justify;">Well, the lunch was really great. But it was not really a question of food: the main reason not to leave Baia Samuele was, of course, socialization, which was most active during lunch (and also the Internet access at the conference centre :). If we had known about this situation before booking the apartments, we would probably have followed the official accommodation recommendations. However, we had great company at Marsa Sicla too. But because of the "unofficial night programme" I had to (almost) sleep during lectures. 
Most of the guys were in the last year of their PhD, so they treated the school as a vacation. But I really wanted to learn a lot of stuff. It was really difficult to take in any information without enough sleep. Surprisingly, I somehow managed to pass the final exam, but I feel I need to look through the lecture slides again, reading up on some papers they refer to; otherwise the scientific part of the school will be useless for me. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"></div><div style="text-align: justify;">Thus, I finish posting about ICVSS. To conclude, I recommend that everyone visit such summer schools, because they are the best way to enter the community. </div>Anonymoushttp://www.blogger.com/profile/01422787855469629259noreply@blogger.com2