Machine learning at Facebook

Last week Max Gubin gave a talk on how Facebook exploits machine learning. The talk was more technical than one might have expected, so I am sharing some interesting facts in this post. I hope it doesn’t disclose any secret information. :)

Max started with a screenshot of the Facebook interface, where each element that benefits from machine learning was highlighted. To name just a few, they use learning to predict the order of stories in the newsfeed, which groups/ads/chat contacts to show you, and even the occasions when your account has supposedly been hacked, so that you need to verify your identity. For most of the tasks learning is probably trivial, but at least two of them involve complicated algorithms (see below).

Facebook engineers face a number of difficulties. A user expects a page to load almost instantly, yet the network infrastructure already imposes some lag. To avoid adding further delay, prediction should be done in tens of microseconds. Moreover, half a billion daily active users send a lot of queries, so massively parallel implementations would be too expensive. They have no choice other than sticking to linear models. For example, they train a linear fitness function to rank the stories for the newsfeed (using e.g. hinge loss or logistic loss). It should be trained to satisfy multiple, often conflicting, criteria. For example, maximizing personal user experience (showing the most interesting stories) might hurt the experience of other users (if someone has few friends, those friends are the only users who can read his/her posts) or degrade the system as a whole (certain types of news might not be really interesting to anyone, yet be necessary to improve the connectivity of the social network). Those criteria have to be balanced in the learning objective, and the coefficients change over time. Even the personal user experience cannot be measured easily. The obvious thing to try is to ask users to label interesting stories (or to use their Likes). However, such tests are always biased: Facebook tried to use this subjective labelling three times, and every attempt was unsuccessful. Users just don’t tell what they really like.
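To make the "linear model with logistic loss" idea concrete, here is a minimal sketch of a linear scoring function and one stochastic gradient step on the logistic loss. It is only an illustration of the general technique, not Facebook's ranker: the function names, the dense feature vectors and the learning rate are all my assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Linear score of a story: dot product of weights and features.
// (Hypothetical setup: both vectors have the same length.)
double score(const std::vector<double>& w, const std::vector<double>& x) {
    double s = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
    return s;
}

// Logistic loss for a label y in {-1, +1}: log(1 + exp(-y * score)).
double logistic_loss(const std::vector<double>& w,
                     const std::vector<double>& x, int y) {
    return std::log1p(std::exp(-y * score(w, x)));
}

// One SGD step on the logistic loss with learning rate lr.
void sgd_step(std::vector<double>& w, const std::vector<double>& x,
              int y, double lr) {
    // Derivative of the loss with respect to the score.
    const double g = -y / (1.0 + std::exp(y * score(w, x)));
    for (std::size_t i = 0; i < w.size(); ++i) w[i] -= lr * g * x[i];
}
```

Prediction is a single dot product, which is what makes such models cheap enough for a tight latency budget.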

Another challenge is the quickly changing environment. For example, interest in specific ads may be seasonal. In advertising, one of the strategies is to maximize the click-through rate (CTR). The model for personalized ads should be able to learn online in order to adapt to changes efficiently. They use probit regression, where online updates can be written in closed form, unlike logistic regression (note the linear model again!). It is based on Microsoft’s TrueSkill™ method for learning the ranks of players to find good matches, and seems similar to what Bing uses for CTR maximization [Graepel et al., 2010].
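For the curious, here is a sketch of what such a closed-form online update can look like. It follows my reading of the update equations in [Graepel et al., 2010] for binary features and Gaussian beliefs over the weights. I do not know what Facebook actually runs, and the names (GaussianWeights, predict, update, beta) are mine.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Independent Gaussian belief N(mean[i], var[i]) over each weight.
struct GaussianWeights {
    std::vector<double> mean;
    std::vector<double> var;
};

double normal_pdf(double t) {
    return std::exp(-0.5 * t * t) / std::sqrt(2.0 * 3.14159265358979323846);
}

double normal_cdf(double t) { return 0.5 * std::erfc(-t / std::sqrt(2.0)); }

// Predicted click probability for an impression whose binary features
// are given as the indices of the active ones; beta is the noise scale.
double predict(const GaussianWeights& w,
               const std::vector<std::size_t>& active, double beta) {
    double m = 0.0, s2 = beta * beta;
    for (std::size_t i : active) { m += w.mean[i]; s2 += w.var[i]; }
    return normal_cdf(m / std::sqrt(s2));
}

// Closed-form update after observing y = +1 (click) or y = -1 (no click).
// Note: for very negative t, normal_cdf(t) underflows; a production
// version would need a numerically safer ratio.
void update(GaussianWeights& w, const std::vector<std::size_t>& active,
            int y, double beta) {
    double m = 0.0, s2 = beta * beta;
    for (std::size_t i : active) { m += w.mean[i]; s2 += w.var[i]; }
    const double s = std::sqrt(s2);
    const double t = y * m / s;
    const double v = normal_pdf(t) / normal_cdf(t);  // mean correction
    const double wt = v * (v + t);                   // variance correction
    for (std::size_t i : active) {
        const double var = w.var[i];
        w.mean[i] += y * (var / s) * v;
        w.var[i] = var * (1.0 - (var / s2) * wt);
    }
}
```

Each observation touches only the active features, and the shrinking variances act as per-feature learning rates, which is what makes the scheme attractive for online learning.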

Finally, Max mentioned the problem of evaluating new features. The common practice in the industry is A/B testing, where a group of users is randomly selected to test some feature, and the rest of the users are treated as the control group. Then they compare the indicators for those two groups (e.g. average time spent on the website, or clicks made on the newsfeed stories) and apply statistical tests. As usual, the effective samples are small. For example, if they want to test a feature for search in Chinese, they take a small group of 10 million users, and hope that some of them will query in Chinese (recall that Facebook is unavailable in China). It is typically hard to prove a statistically significant improvement.
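To illustrate why small samples hurt, here is a sketch of one standard test that could be used for such a comparison: a two-sided two-proportion z-test on click counts. The talk did not say which tests Facebook applies, so the choice of test, the struct and the function name are my assumptions.

```cpp
#include <cmath>

struct ZTestResult {
    double z;        // test statistic
    double p_value;  // two-sided p-value under the normal approximation
};

// Compares the success rate of the control group (a) with that of the
// test group (b). Assumes both groups are non-empty and the pooled rate
// is strictly between 0 and 1.
ZTestResult two_proportion_z_test(long long successes_a, long long n_a,
                                  long long successes_b, long long n_b) {
    const double pa = static_cast<double>(successes_a) / n_a;
    const double pb = static_cast<double>(successes_b) / n_b;
    const double pooled =
        static_cast<double>(successes_a + successes_b) / (n_a + n_b);
    const double se =
        std::sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b));
    const double z = (pb - pa) / se;
    // 2 * (1 - Phi(|z|)) written via the complementary error function.
    const double p = std::erfc(std::fabs(z) / std::sqrt(2.0));
    return {z, p};
}
```

With the same 10% vs. 11% rates, 10,000 users per group give a p-value below 0.05, while 100 users per group give a p-value far above it. So if only a tiny fraction of the selected users ever exercises the feature, the improvement is hard to demonstrate.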

It was partially a hiring event. If you are looking for an internship or a full-time job, you may contact Marina, their HR specialist in Eastern Europe. Facebook also keeps in touch with universities, e.g. invites professors to give talks in their office or to develop joint courses. Professors may apply, but I don’t know a contact for that.


PGM-class and MRF parameter learning

I’m taking Stanford CS 228 (a.k.a. pgm-class) on Coursera. The class is great; I guess it provides close to the maximum one can do under the constraints of remote teaching and massive enrollment. The thing I miss is the theoretical problems, which were left out of the online version because they could not be graded automatically.

There is an important thing about graphical models that I fully realized only recently (partly due to the class). It should be articulated clearly in every introductory course, but is often just mentioned in passing, probably because lecturers consider it obvious. The thing is that MRF potentials have no probabilistic meaning whatsoever. The partition function is not there just for convenience: in contrast to Bayesian networks, there is no general way to assign the potentials of an undirected graphical model so that normalization can be avoided. The loops make it impossible. The implication is that one should not assign potentials by estimating the frequencies of assignments to factors (possibly conditioned on features), like I did earlier. This is quite a bad heuristic because it is susceptible to overcounting. Let me give an example.

For the third week's programming assignment we needed to implement a Markov network for handwriting OCR. The unary and pairwise potentials are somewhat obvious, but there was also a task to add ternary factors. The accuracy of the pairwise model is 26%. Mihaly Barasz tried to add ternary factors with values proportional to trigram frequencies in English, which decreased the performance to 21% (link for those who have access). After removing the pairwise factors, the performance rose to 38%. Why did the joint model fail? The reason is overcounting of evidence: different factor types enforce the same co-occurrences, thus creating a bias towards more frequent assignments, and this example shows that the bias can be significant. Therefore, we should train models with cycles discriminatively.
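The overcounting effect can be reproduced on a toy model that has nothing to do with the OCR assignment. Take two binary variables with a known joint distribution, set the unary potentials to the marginals and the pairwise potential to the joint itself, then normalize. The function name and the numbers are made up for illustration.

```cpp
#include <array>

// joint[2 * a + b] = P(x1 = a, x2 = b) for binary x1, x2.
// Returns the distribution defined by the "frequency" potentials
// phi1(a) = P(x1 = a), phi2(b) = P(x2 = b), phi12(a, b) = P(a, b).
std::array<double, 4> overcounted_model(const std::array<double, 4>& joint) {
    const double m1[2] = {joint[0] + joint[1], joint[2] + joint[3]};
    const double m2[2] = {joint[0] + joint[2], joint[1] + joint[3]};
    std::array<double, 4> p{};
    double z = 0.0;  // partition function
    for (int a = 0; a < 2; ++a) {
        for (int b = 0; b < 2; ++b) {
            p[2 * a + b] = m1[a] * m2[b] * joint[2 * a + b];
            z += p[2 * a + b];
        }
    }
    for (double& v : p) v /= z;
    return p;
}
```

For the joint (0.5, 0.2, 0.2, 0.1) the model assigns about 0.72 to the most frequent assignment instead of 0.5: the marginals are already contained in the pairwise table, so the same evidence is counted twice.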

One more thought I’d like to share: graphical model design is similar to software engineering in that the crucial thing for both is eliminating insignificant dependencies at the architecture design stage.


When and why it is safe to cast floor/ceil result to integer

Happy new year, ladies and gents! I am back, and the following couple of posts are going to be programming-related.

The C/C++ standard library declares the floor function so that it returns a floating-point number:
     double floor (      double x );
      float floor (       float x );  // C++ only
I used to be tortured by two questions:

  1. Why would the floor function return anything other than an integer?
  2. What is the most correct way to cast the output to integer?

At last I figured out the answer to both questions. Note that the rest of the post applies to the ceil function as well.

For the second point, I knew the common idiom was just type-casting the output:

double a;
int intRes = int(floor(a));  // use static_cast if you don't like brevity

If you’ve heard anything about the peculiarities of floating-point arithmetic, you might start worrying that floor might return not the exact integer value but $\lfloor a \rfloor - \epsilon$, so that the type cast would give a wrong result (assume $a$ is positive). However, this is not the case. Consider the following statement.

If $a$ is not an integer and is representable as an IEEE-754 floating-point number, then both $\lfloor a \rfloor$ and $\lceil a \rceil$ are representable within that domain. This means that for any finite float/double value, floor can return a number that is an integer and fits in the float/double format.

The proof is easy but requires an understanding of how floating-point numbers are represented. Suppose $a = c \times b^q, c \in [b^{-1}, 1)$ has $k$ significant digits, i.e. the $k$-th digit after the point in $c$ is nonzero, but all the further ones are zero. Since $a$ is representable, $k$ does not exceed the maximum width of the significand. Since $a$ is not an integer, its last significant digit belongs to the fractional part, so both $\lfloor a \rfloor$ and $\lceil a \rceil$ have significands with fewer than $k$ significant digits. Rounding up, however, might increase the order of magnitude, but only by one, in which case the result is a power of the base. So, the rounded number’s significand fits in $k$ digits, Q.E.D.

However, this does not mean one can always type-cast the output safely, since, for example, not every integer stored in a double can be represented as a 4-byte int (and this is the answer to question #1). Double-precision numbers store 52 significand bits, which together with the implicit leading bit give 53 bits of precision, so any integer up to $2^{53}$ in absolute value can be stored exactly. For the int type, the limit is only $2^{31} - 1$. So if there is a possibility of overflow, check before the cast or use a 64-bit integer type.
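A range check before the cast could look like the following sketch. The helper name floor_to_int is made up, and the code assumes that the limits of int are exactly representable as double, which holds for the usual 32-bit int.

```cpp
#include <cmath>
#include <limits>

// Stores floor(a) in out and returns true if the result fits in int;
// returns false (leaving out untouched) on overflow, infinity or NaN.
bool floor_to_int(double a, int& out) {
    const double f = std::floor(a);
    const double lo = std::numeric_limits<int>::min();
    const double hi = std::numeric_limits<int>::max();
    // Written as a negated conjunction so that NaN fails the check too.
    if (!(f >= lo && f <= hi)) return false;
    out = static_cast<int>(f);  // exact: f is an integer within range
    return true;
}
```

The comparison is done in double, before the conversion, because converting an out-of-range floating-point value to int is undefined behaviour.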

On the other hand, a lot of ints cannot be represented exactly as floats (the same is true for int64 and double). Consider the following code:
std::cout << std::showbase << std::hex 
      << int(float(1 << 23)) << " " << int(float(1 << 24)) << std::endl
      << int(float(1 << 23) + 1) << " " << int(float(1 << 24) + 1) << std::endl;
The output is:

0x800000 0x1000000
0x800001 0x1000000

$2^{24}+1$ cannot be represented as a float exactly (it is rounded to $2^{24}$), so one should not use the float version of floor for numbers greater than $2^{24}$, i.e. about 16.7 million.
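If you need to know whether a particular int survives the conversion, a round-trip check like the one below can help. The function name is made up. The comparison is done in double, which holds every 32-bit int exactly, to avoid casting an out-of-range float back to int.

```cpp
// True if n is represented exactly by the float type.
// Assumes int has at most 53 value bits, so double holds it exactly.
bool fits_float_exactly(int n) {
    const float f = static_cast<float>(n);  // may round to a neighbour
    return static_cast<double>(f) == static_cast<double>(n);
}
```

For instance, $2^{24}$ and $2^{24}+2$ pass the check, while $2^{24}+1$ does not.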

There are classes of numbers that are guaranteed to be represented exactly in floating point format. See also the comprehensive tutorial on floating-point arithmetic by David Goldberg.
