## Launching Jupyter notebook server on startup

In this post, I show how to automatically run the Jupyter notebook server on an OSX machine on startup, without using extra terminal windows.

#### Background

Jupyter notebooks provide a useful interactive environment (think better alternative for console) for scripting languages like Python, Julia, Lua/torch, which has become a favourite tool of many data scientists for handling datasets that fit into the laptop’s memory. If you have not used it, try it next time you do data analysis or prototype an algorithm.

#### The Pain

There is a slight technical difficulty, though. The way Jupyter works is you run a server process on your machine (or somewhere in the network), and then access it through a web client. I have noticed a popular pattern of usage in the local-server case: a user runs the command to start the server process in the terminal then keeps the terminal window open while she is using the notebook. This method has several flaws:

• you need to run the process manually each time you need a notebook (after reboot);
• you have to remember the command line arguments;
• the process takes up a terminal window/tab, or, if you run it in the background, you can accidentally close the tab and your RAM state will be lost.

#### The Solution

If your perfectionist’s feelings are hurt by those nuisances above, my setup may remedy it. It uses launchd framework that starts, stops, and manages background processes.

First, create a property list file in your ~/Library/LaunchAgents directory:

~/Library/LaunchAgents/com.blippar.jupyter-ipython.plist

and put the following text there. This is a custom format, which is not quite XML (e.g. order of children nodes apparently matters). I highlighted red the paths and other variables you will have to change:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>KeepAlive</key>
<true/>
<key>Label</key>
<string>com.blippar.jupyter-ipython</string>
<key>ProgramArguments</key>
<array>
<string>zsh</string>
<string>-c</string>
<string>source ~/.zshrc;workon py3;jupyter notebook --notebook-dir /Users/roman/jupyter-nb --no-browser</string>
</array>
<true/>
<key>StandardErrorPath</key>
<string>/Users/roman/Library/LaunchAgents/jupyter-notebook.stderr</string>
<key>StandardOutPath</key>
<string>/Users/roman/Library/LaunchAgents/jupyter-notebook.stdout</string>
</dict>
</plist>

The most interesting part is the ProgramArguments entry. As it is impossible to specify several commands as an argument, I had to run a separate shell instance and pass the commands separated by semicolons. If you do not need to set the virtual environment (workon py3), you might get away without calling the shell, although I recommend always using virtualenv for python on a local machine. Also, I use zsh; replace it with your shell of choice.

To check if everything works, run
launchctl start com.blippar.jupyter-ipython

and open http://localhost:8888/tree in your browser. Hopefully, you see the list of available notebooks.

Linux users, you should be able to replicate this with systemd (although, chances are you already know that =).

## Asynchronous Fitter design pattern for training in IPython notebooks

[Cross-posted from CodeReview@SX]

### The problem

When you work with Python interactively (e.g. in an IPython shell or notebook) and run a computationally intensive operation like fitting a machine-learning model that is implemented in a native code, you cannot interrupt the operation since the native code does not return execution control to the Python interpreter until the end of the operation. The problem is not specific to machine learning, although it is typical to run a training process for which you cannot predict the training time. In case it takes longer that you expected, to stop training you need to stop the kernel and thus lose the pre-processed features and other variables stored in the memory, i.e. you cannot interrupt only a particular operation to check a simpler model, which allegedly would be fit faster.

### The solution

I propose an Asynchronous Fitter design pattern that runs fitting in a separate process and communicates the results back when they are available. It allows to stop training gracefully by killing the spawned process and then run training of a simpler model. It also allows to train several models simultaneously and work in the IPython notebook during model training. Note that multithreading is probably not an option, since we cannot stop a thread that runs an uncontrolled native code.

Here is a draft implementation:
from multiprocessing import Process, Queue
import time

class AsyncFitter(object):
def __init__(self, model):
self.queue = Queue()
self.model = model
self.proc = None
self.start_time = None

def fit(self, x_train, y_train):
self.terminate()
self.proc = Process(target=AsyncFitter.async_fit_,
args=(self.model, x_train, y_train, self.queue))
self.start_time = time.time()
self.proc.start()

def try_get_result(self):
if self.queue.empty():
return None

return self.queue.get()

def is_alive(self):
return self.proc is not None and self.proc.is_alive()

def terminate(self):
if self.proc is not None and self.proc.is_alive():
self.proc.terminate()
self.proc = None

def time_elapsed(self):
if not self.start_time:
return 0

return time.time() - self.start_time

@staticmethod
def async_fit_(model, x_train, y_train, queue):
model.fit(x_train, y_train)
queue.put(model)

### Usage

It is easy to modify a code that uses scikit-learn to adopt the pattern. Here is an example:
import numpy as np
from sklearn.svm import SVC

model = SVC(C = 1e3, kernel='linear')
fitter = AsyncFitter(model)
x_train = np.random.rand(500, 30)
y_train = np.random.randint(0, 2, size=(500,))
fitter.fit(x_train, y_train)

You can check if training is still running by calling fitter.is_alive() and check the time currently elapsed by calling fitter.time_elapsed(). Whenever you want, you can terminate() the process or just train another model that will terminate the previous one. Finally, you can obtain the model by try_get_result(), which returns None when training is in progress.

### The issues

As far as I understand, the training set is being pickled and copied, which may be a problem if it is large. Is there an easy way to avoid that? Note that training needs only read-only access to the training set.

What happens if someone loses a reference to an AsyncFitter instance that wraps a running process? Is there a way to implement an asynchronous delayed resource cleanup?

## X Bridges of Königsberg

The recent Spiked Math comic strip proposes some cheats to build an Eulerian path over the famous bridges of Königsberg, which was originally impossible. Here is the strip:

It turns out that in practice another solution worked out (in fact, the mix of the engineer’s and mythbuster’s solutions). You can ruin the city during a world war and then transfer it to Russians, who will build the new bridges instead. I’ve been to Kaliningrad (the city has been renamed too) two years ago, but did not try to solve the Euler’s problem in field. Let’s find out if it is possible now. Look at the modern map:

View Königsberg bridges in a larger map

So, the modified Spiked Math’s scheme looks as follows (the new bridges are shown in light blue):

Is there an Eulerian path nowadays, i.e. can you walk through the city and cross each bridge once and only once? Recall the theorem:
A connected graph allows an Eulerian path if and only if it has no more than two vertices of an odd degree.
In the picture above, two pieces of land have even number of incident bridges, while the Kneiphof island in the center has degree 3, and the left bank (in the bottom) has degree 5. This means that now Eulerian paths exist. Any such path should begin in the left bank and end in the island, or vice-versa. It may be a good idea to finish your journey near the grave of Immanuel Kant. :) However, there is no Eulerian conduit, i.e. closed Eulerian path.

PS. 1. If you zoom out the above map, you’ll see one more piece of land and two more bridges on the East. However, it only adds a vertex of degree 2, and swaps parity of the banks, so the properties of the new graph are similar, you just need to start/finish your journey in the right bank, not the left one (and walk 10 km more).
2. When I was preparing this article, I found a similar analysis. It used an old map, so the Jubilee bridge and the 2nd Estacade bridges were not marked (they were built in 2005 and 2012, respectively), and also ignored the Reichsbahnbrüke (railroad bridge, but seems pedestrian-accessible).
3. It would be more interesting to analyze the bridge graphs of other cities with more islands like St. Petersburg (several hundred bridges) or Amsterdam (numerous bridges/canals, but have structure).
4. Here is Euler’s original paper (in Latin) with his drawings of the above schemes.

Last week Max Gubin gave a talk on how Facebook exploits machine learning. The talk was more technical then one might had expected, so I share some interesting facts in this posts. I hope it doesn’t disclose any secret information. :)

Max started with a screenshot of the Facebook interface, where each element was highlighted as beneficial of machine learning. Just to name some, they use learning to predict the order of stories in the newsfeed, groups/ads/chat contacts to show you, and even occasions when your account is supposedly hacked, so you need to verify your identity. For most of the tasks learning is probably trivial, but at least two of them involve complicated algorithms (see below).

Facebook engineers face number of difficulties. A user expects to load a page almost instantly, though network infrastructure already imposes some lag. To avoid further delaying, prediction should be done in tens of microseconds. Moreover, half a billion of daily-active users send a lot of queries, so massively-parallel implementations would be too expensive. They have no choice other than sticking to linear models. For example, they train a linear fitness function to rank the stories for the newsfeed (using e.g. hinge loss or logistic loss). It should be trained to satisfy multiple criteria, often contradicting. For example, maximizing personal user experience (showing most interesting stories) might hurt experience of other users (if one has few friends, they are the only users who can read his/her posts) or degrade the system as a whole (showing certain types of news might be not really interesting to anyone, while necessary to improve connectivity of the social network). Those criteria should be balanced in the learning objective, and the coefficients are changing over time. Even the personal user experience cannot be measured easily. The obvious thing to try is to ask users to label interesting stories (or use their Likes). However, such tests are always biased: Facebook tried to use this subjective labelling three times, and all of them were unsuccessful. Users just don’t tell what they really like.

Another challenge is the quickly-changing environment. For example, interest to specific ads may be seasoned. In advertising, one of the strategies is to maximize the click-through rate (CTR). The model for personalized ads should be able to learn online to adapt to changes efficiently. They use probit regression, where online updates can be written in a closed form, unlike to logistic regression (note the linear model again!). It is based on Microsoft’s TrueSkill™ method for learning ranks of players to find good matches and seems similar to what Bing uses for CTR maximization [Graepel et al., 2010].

Finally, Max mentioned the problem of estimating new features. The common practice in the industry is A/B testing, where a group of users is randomly selected to test some feature, and the rest of users are treated as the control group. Then they compare the indicators for those two groups (e.g. average time spent on the website, or clicks made on the newsfeed stories) and apply statistical tests. As usual, samples are typically small. For example, if they want to test a feature for search in Chinese, they take a small group of 10 million users, and hope that some of them will query in Chinese (recall that Facebook is unavailable in China). It is typically hard to prove a statistically significant improvement.

It was partially a hiring event. If you are looking for an internship or a full-time job, you may contact to their HR specialist in Eastern Europe Marina. Facebook also keeps in touch with universities, e.g. invites professors to give talks in their office or develop joint courses. Professors may apply, but I don’t know a contact for that.

## PGM-class and MRF parameter learning

I’m taking Stanford CS 228 (a.k.a. pgm-class) on Coursera. The class is great, I guess it provides close to the maximum one can do under the constraints of remoteness and bulkness. The thing I miss is theoretical problems, which were taken aside from the on-line version because they could not be graded automatically.

There is an important thing about graphical models I fully realized only recently (partly due to the class). This thing should be articulated clearly in every introductory course, but is often just mentioned, probably because lecturers consider it obvious. The thing is there is no probabilistic meaning of MRF potentials whatsoever. The partition function is there not only for amenity: in contrast to Bayesian networks, there is no general way to assign potentials of an undirected graphical model to avoid normalization. The loops make it impossible. The implication is one should not assign potentials by estimating frequencies of assignments to factors (possibly conditioned on features) like I did earlier. This is quite a bad heuristic because it is susceptible to overcounting. Let me give an example.

For the third week programming assignment we needed to implement a Markov network for handwriting OCR. The unary and pairwise potentials are somewhat obvious, but there was also a task to add ternary factors. The accuracy of the pairwise model is 26%. Mihaly Barasz tried to add ternary factors with values proportional to trigram frequencies in English, which decreased performance to 21% (link for those who have access). After removing pairwise factors, the performance rose to 38%. Why has the joint model failed? The reason is overcounting evidence: different factor types enforce the same co-occurrences, thus creating bias towards more frequent assignments, and this shows it can be significant. Therefore, we should train models with cycles discriminatively.

One more thought I’d like to share: graphical model design is similar to software engineering in the way that the crucial thing for the both is eliminating insignificant dependencies on the architecture design stage.

## When and why it is safe to cast floor/ceil result to integer

Happy new year, ladies and gents! I am back, and the following couple of posts are going to be programming-related.

The C/C++ standard library declares floor function such that it returns a floating-point number:
     double floor (      double x );
float floor (       float x );  // C++ only
I used to be tortured by two questions:

1. Why would the floor function return anything other than integer?
2. What is the most correct way to cast the output to integer?

At last I figured out the answer to both questions. Note that the rest of the post applies to the ceil function as well.

For the second point, I knew the common idiom was just type-casting the output:

double a;
int intRes = int(floor(a));  // use static_cast if you don't like brevity
If you’ve heard anything about peculiarities of floating-point arithmetic, you might start worrying that floor might return not exact integer value but  $\lfloor a \rfloor - \epsilon$, so that type-casting is incorrect (assume $a$ is positive). However, this is not the case. Consider the following statement.

If $a$ is not integer and is representable as IEEE-754 floating-point number, than both $\lfloor a \rfloor$ and $\lceil a \rceil$ are representable within that domain. This means that for any float/double value floor can return a number that is integer and fits in float/double format.

The proof is easy but requires understanding of representation of floating-point numbers. Suppose $a = c \times b^q, c \in [0.1, 1)$ has $k$ significant digits, i.e. $k$-th digit after the point in $c$ is not null, but all the further ones are. Since it is representable, $k$ is less then the maximum width of significand. Since $a$ is not integer, both $\lfloor a \rfloor$ and $\lceil a \rceil$ have significands with the number of significant digits less then $k$. Rounding however might increase the order of magnitude but only by one. So, the rounded number’s significand fits in $k$ digits, Q.E.D.

However, this does not mean one can always type-cast the output safely, since, for example, not every integer stored in double can be represented as 4-byte int (and this is the answer to the question #1). Double precision numbers have 52 digit significands, so any integer number up to $2^{52}$ can be stored exactly. For the int type, it is only $2^{31}$. So if there is possibility of overflow, check before the cast or use the int64 type.

On the other hand, a lot of int’s cannot be represented as floats (also true for int64 and double). Consider the following code:
std::cout << std::showbase << std::hex
      << int(float(1 << 23)) << " " << int(float(1 << 24)) << std::endl
      << int(float(1 << 23) + 1) << " " << int(float(1 << 24) + 1) << std::endl;

The output is:

0x800000 0x1000000
0x800001 0x1000000


$2^{24}+1$ cannot be represented as float exactly (it is represented as $2^{24}$), so one should not use the float version of floor for numbers greater than few millions.

There are classes of numbers that are guaranteed to be represented exactly in floating point format. See also the comprehensive tutorial on floating-point arithmetic by David Goldberg.

## Kinect Apps Challenge

Graphics & Media Lab and Microsoft Research Cambridge have announced a contest for applications that use Kinect sensor. The authors of the five brightest apps will be funded to attend the 2012 Microsoft Research PhD summer school. I blogged about the last year's event in my previous post. Details of the contest are here.

I guess there's no need to explain what Kinect is. It is extremely successful combined colour and depth sensor by Microsoft. It is being distributed for killer price (≈150) as an add-on for Xbox, although it is hard to imagine the range of possible applications. Controlling a computer only by gestures is considered as a primer of natural user interface. To name some applications beyond gaming, this kind of NUI is useful to help surgeons to keep their hands clean: Kinect helps blind people to navigate through buildings: For more ideas look at the winners of OpenNI challenge. OpenNI is an open-source alternative to Kinect SDK. Its strong point is it is integrated with PCL, but beware that you cannot use it for the contest. Only Kinect for Windows SDK is allowed. Kinect is probably the most successful commercial outcome from the Microsoft Research lab. MSR is unique because they do a lot of theoretical research there, and it is unclear if it is profitable for Microsoft to fund it. But the projects like Kinect reveal the doubts. I will post about MSR organization in comparison to other industrial labs in one of the later posts. ## All the summer schools This summer I attended two conferences (3DIMPVT, EMMCVPR) and two summer schools. I know my latency is somewhat annoying, but it's better to review them now then never. :) This post is about the summer schools, and the following is going to be about the conferences. PhD Summer School in Cambridge Both schools were organized by Microsoft Research. The first one, PhD Summer School was in Cambridge, UK. The lectures covered some general issues for computer science PhD students (like using cloud computing for research and career perspectives) as well as some recent technical results by Microsoft Research. From the computer vision side, there were several talks: • Antonio Criminisi described their InnerEye system for retrieval of similar body part scans, which is useful for diagnosis based on similar cases' medical history. He also featured the basics of Random Forests as an advertisement to his ICCV 2011 tutorial. The new thing was using peculiar weak classifiers (like 2nd order separation surfaces). Antonio argued they perform much better then trees in some cases. • Andrew Fitzgibbon gave a brilliant lecture about pose estimation for Kinect (MSR Cambridge is really proud of that algorithm [Shotton, 2011], this is the topic for another post). • Olga Barinova talked about the modern methods of image analysis and her work for the past 2 years (graphical models for non-maxima suppression for object detection and urban scene parsing). The other great talks were about .NET Gadgeteer, the system for modelling and even deployment of electronic gadgets (yes, hardware!), and F#, Microsoft's alternative to Scala, the language that combines object-oriented paradigm with functional. Sir Tony Hoare also gave a lecture, so I had a chance to ask him how he ended up in Moscow State University in the 60s. It turns out he studied statistics, and Andrey Kolmogorov was one of the leaders of the field that time, so that internship was a great opportunity for him. He said he had liked the time in Moscow. :) There were also magnificent lectures by Simon Peyton-Jones about giving talks and writing papers. Those advices are the must for everyone who does research, you can find the slides here. Slides for some of the lectures are available from the school page. The school talks did not take all the time. Every night was occupied by some social event (go-karting, punting etc.) as well as unofficial after-parties in Cambridge pubs. Definitely it is the most fun school/conference I've attended so far. Karting was especially great, with the quality track, pit-stops, stats and prizes, so special thanks to Microsoft for including it to the program! Microsoft Computer Vision School in Moscow This year, Microsoft Research summer school in Russia was devoted to computer vision and organized in cooperation with our lab. The school started before its official opening with a homework assignment we authored (I was one of four student volunteers). The task was to develop an image classification method capable to distinguish two indoor and two outdoor classes. The results were rated according to the performance on the hidden test set. Artem Konev won the challenge with 95.5% accuracy and was awarded a prize consisted of an xBox and Kinect. Two years ago we used those data for the projects on Introduction to Computer Vision course, where nobody reached even 90%. It reflects not just the lore of participants, but also the progress of computer vision: all the top methods used PHOW descriptors and linear SVM with approximate decomposed χ2 kernel [Vedaldi and Zisserman, 2010], which were unavailable that time! In fact, Andrew Zisserman was one of the speakers. Andrew is the most cited computer vision researcher and the only person whose Zisserman number is zero. :) His course was on Visual Search and Recognition, including instance-level and category-level recognition. The ideas that were relatively new: • when computing visual words, sometimes it is fruitful to use soft assignments to clusters, or more advanced methods like Locality-constrained linear coding [Wang et al., 2010]; • for instance-level recognition it is possible to use query expansion to overcome occlusions [Chum et al., 2007]: the idea is to use the best matched images from the base as new queries; • object detection is traditionally done with sliding window, the problems here are: various aspect ratio, partial occlusions, multiple responses and background clutter for substantially non-convex objects; • for object detection use bootstrapped sequential classification: on the next stage take the false negative detections from the previous stage as negative examples and retrain the classifier; • multiple kernel learning [Gehler and Nowozin, 2009] is a hot tool that is used to find the ideal linear combination of SVM kernels: combining different features is fruitful, but learning the combination is not much better than just averaging (Lampert: “Never use MKL without comparison to simple baselines!”); • movies are common datasets, since there are a lot of repeated objects/people/environments, and the privacy issues are easy to overcome. The movies like Groundhog Day and Run Lola Run are especially good since they contain repeated episodes. You can try to find the clocks on Video Google Demo. Zisserman talked about PASCAL challenge a lot. During a break he mentioned that he annotated some images himself since “it is fun”. One problem with the challenge is we don't know if the progress over years really reflects the increased quality of methods, or is just because of growth of the training set (though, it is easy to check). Andrew Fitzgibbon gave two more great lectures, one about Kinect (with slightly different motivation than in Cambridge) and another about continuous optimization. He talked a lot about reconciling theory and practice: • the life-cycle of a research project is: 1) chase the high-hanging fruit (theoretically-sound model), 2) try to make stuff really work, 3) look for the things that confuse/annoy you and fix them; • for Kinect pose estimation, the good top-down method based on tracking did not work, so they ended up classifying body parts discriminatively, temporal smoothing is used on the late stage; • “don't be obsessed with theoretical guarantees: they are either weak or trivial”; • on the simplest optimization method: “How many people have invented [coordinate] alternation at some point of their life?”. Indeed, the method is guaranteed to converge, but the problems arise when the valleys are not axis-aligned; • gradient descent is not a panacea: in some cases it does small steps too, conjugate gradient method is better (it uses 1st order derivatives only); • when possible, use second derivatives to determine step size, but estimating them is hard in general; • one almost never needs to take the matrix inverse; in MATLAB, to solve the system Hd = −g, use backslash: d = −H\g; • the Friday evening method is to try MATLAB (implementing the derivative-free Nelder-Mead method). Dr. Fitzgibbon asked the audience what the first rule of machine learning is. I hardly helped over replying “Never talk about machine learning”, but he expected the different answer: “Always try the nearest neighbour first!” Christoph Lampert gave lectures on kernel methods, and structured learning, and kernel methods for structured learning. Some notes on the kernel methods talk: • (obvious) don't rely on the error on a train set, and (less obvious) don't even report about it in your papers; • for SVM kernels, in order to be legitimate, a kernel should be an inner product; it is often hard to prove it directly, but there are workarounds: a kernel can be drawn from a conditionally positive-definite matrix; sum, product and exponent of a kernel(s) is a kernel too etc. (thus, important for multiple-kernel learning, linear combination of kernels is a kernel); • since training (and running) non-linear SVMs is computationally hard, explicit feature maps are popular now: try to decompose the kernel back to conventional dot product of modified features; typically the features should be transformed to infinite sums, so take first few terms ; • if the kernel can be expressed as a sum over vector components (e.g. χ2 kernel\sum_d x_d x'_d / (x_d + x'_d)$), it is easy to decompose; radial basis function (RBF) kernel ($\exp (\|x-x'\|^2 / 2\sigma^2)$) is the exponent of a sum, so it is hardly decomposable (more strict conditions are in the paper); • when using RBF kernel, you have another parameter σ to tune; the rule of thumb is to take σ² equal to the median distance between training vectors (thus, cross-validation becomes one-dimensional). Christoph also told a motivating story why one should always use cross-validation (so just forget the previous point :). Sebastian Nowozin was working on his [ICCV 2007] paper on action classification. He used the method by Dollár et al. [2005] as a baseline. The paper reported 80.6% accuracy on the KTH dataset. He outperformed the method by a couple of per cents and then decided to reproduce Dollár's results. Imagine his wonder when simple cross-validation (with same features and kernels) yielded 85.2%! So, Sebastian had to improve his method to beat the baseline. I feel I should stop writing about the talks now since the post grows enormously long. Another Lampert's lecture and Carsten Rother's course on CRFs were close to my topic, so they deserve separate posts (I already reviewed basics of structured learning and max-product optimization in this blog). Andreas Müller blogged about the recent Ivan Laptev's action recognition talk on CVML, which was pretty similar to ours. The slides are available for all MSCVS talks, and videos will be shared in September. There were also several practical sessions, but I personally consider them not that useful, because one hardly ever can feel the essence of a method in 1.5 hours changing the code according to some verbose instruction. It is more of an art to design such tutorials, and no one can really master it. :) Even if the task is well-designed, one may not succeed performing it due to technical reasons: during Carsten Rother's tutorial, Tanya and me spent half an hour to spot the bug caused by confusing input and index variable names (MATLAB is still dynamically typed). Ondrej Chum once mentioned how his tutorial was doomed since half of the students did not know how to work with sparse matrices. So, practical sessions are hard. There was also a poster session, but I cannot remember a lot of bright works, unfortunately. Nataliya Shapovalova who won the best poster award, presented quite interesting work on action recognition, which I liked as well (and it is not the last name bias! :) My congratulations to Natasha! The planned social events were not so exhaustive as in Cambridge, but self-organization worked out. The most prominent example was our overnight walk around Moscow, in which a substantial part of school participants took part. It included catching last subway train, drinking whiskey and gin, a game of guessing hallucinating names of each other, and moving a car from the tram rail to let the tram go in the morning. :) I also met some of OpenCV developers from Nizhny Novgorod there. MSCVS is a one-time event, unfortunately. There are at least three annual computer vision summer schools in Europe: ICVSS (the most mature one, I attended it last year), CVML (held in France by INRIA) and VSSS (includes sport sessions besides the lectures, held in Zürich). If you are a PhD student in vision (especially in the beginning of your program), it is worth attending one of them each year to keep up with current trends in the vision community, especially if you don't go to the major conferences. The sets of topics (and even speakers!) have usually large intersection, so pick one of them. ICVSS has arguably the most competitive participant selection, but the application deadline and acceptance notification are in March, so one can apply to the other schools if rejected. ## Google search by image Last week Google introduced Search by Image feature. There were a handful of web-sites that suggested content-based image retrieval in the Internet, but the quality was low, as I blogged earlier. I repeated the queries TinEye failed at, and Google image search found them both! It found few instances of Marion Lenbach (one of them from this blog, which means the coverage is large!), and I finally remembered the movie from the HOG paper: The Talanted Mr. Ripley. So, from the quick glance Google finally accomplished what the others could not do for ages. Why did they succeed? There are two possible reasons: large facilities that allow building and storing large index efficiently, and a unique technology. The former is surely the case: it seems the engine indexed a large portion of the photos in the web. I cannot say anything about the technology: there is nothing about that among the Google's CVPR papers, so one need to do black-box testing to see which transformations and modifications are allowed. Google seems to expand to the areas of multimedia web, even where the niche is already occupied. Recently they announced their alternative to grooveshark, the recommendation system for Music beta (the service is unfortunately available on invitation basis in the US only). The system is not based on collaborative filtering (only), they (also) analyse the content. I planned to investigate this area too, but given that it is becoming mature and thus not so alluring. I am eager to see if the service will succeed. After all, Buzz did not replace Twitter, and Orkut did not replace Facebook. ## On structured learning This post is devoted to structured learning, a novel machine learning paradigm which may be used for solving different computer vision problems. For example, finding optimal parameters of graphical models is essentially a structured learning problem. I am going to give an introductory talk on structured learning for Dmitry Vetrov's graphical models class this Friday at 4:20 pm, so please feel free to drop by if you are in Moscow (the talk is to be in Russian). Structured learning is basically a very general supervised learning setting. Consider the classification setting as a basic problem. One needs to find parameters of a function that maps a feature vector to one of the pre-defined class labels$\mathbb{R}^m \to \{c_1, c_2, \dots, c_K\}$. The fundamental property is the classes are unordered and orthogonal. The latest means the probabilities of objects classified as different labels are uncorrelated (some negative correlation is normal since strong classifying of an object as$c_i$decreases the probability for rest of the labels). Now consider regression, which is another supervised learning problem where feature vectors map to real values:$\mathbb{R}^m \to \mathbb{R}$. It might be considered a classification problem with infinite number of classes. There are two obvious flaws in this reduction. First, a training set is unlikely to have at least one example for each class. To overcome this one can quantize the codomain and train a classifier over a finite set of class labels. However, it leaves us with the second flaw: the method does not take into account correlation between the possible outcomes. The bins are ordered: the training features that correspond to neighbouring bins should be handled differently from those that correspond to distant bins. That's why some global methods (like linear regression) are used. The similar situation is in the case of a structured outcome. In this case, one usually has a plenty of possible outcomes, but they are not independent. The techniques of structured learning are applicable when the outcomes have some underlying structure, and the elements of the structure have similar sets of features. Also, it should be possible to estimate how bad the prediction is (often in terms of incorrectly predicted elements), which is called structured loss. The methods allow small deviations from the ideal training outcome, which are possible e.g. because of noise, but penalize the substantial ones (just like regression!). The prediction function (parameters of which are tuned) can thus be formalized as a mapping$\mathbb{R}^{m \times l} \to \{c_1, c_2, \dots, c_K\}^l$. The possible example is hidden Markov model learning, where the elements are emission potentials and transition potentials, and the loss can be defined as the number of wrong HMM outputs. According to the upper formalization, there are$l$elements, each represented by$m$features (in practice,$m\$ can vary for different types of elements). Since the labels of transition elements are strictly defined by emission outputs, not every outcome is possible.

Another example of a structured prediction problem is natural language parsing. Given a sentence of a natural language, the corresponding parsing tree is to be constructed. Surprisingly, parsing trees could also be represented as high-dimensional vectors, with constraints applied. To summarize, structure prediction has outcome that is multivariate, correlated and constrained.

Ideally, the parameters of structured prediction should be tuned via likelihood maximization, but this turns out to be intractable due to the need of computing the partition function on each gradient optimization step. That's why the L2-regularized hinge loss is usually minimized. The prediction algorithm is represented as a scoring function over possible outcomes. The task of the learning algorithm is to find the parameters of the scoring function to make it return maximum values for the true outcomes on the training set, and low ones for the instances that are far from optimal. Given that, the margin between good and bad possible outcomes should be maximized (in terms of scoring function value).