How does Yelp make its recommendations?

A first attempt at collaborative filtering

Posted by Andrew Reece on November 15, 2013

I used the Yelp dataset from Phoenix, AZ to create a rudimentary recommendation algorithm.

All the source code for this project can be found here.

Here's some background from the initial problem specification (an assignment from the 2013 version of CS109):

To motivate our recommendation system, consider the following example. Let's pretend we are in Boston for a second. Lets say the average rating of restaurants here by all the users is 3.5. Sandrine's at Harvard square is better than an average restaurant, so it tends to be rated 0.5 stars above the average (over all the users). However, you are a curmudgeon, who tends to rate 0.2 stars below the average. Then a baseline estimate for the recommendation for Sandrine's, for you, is 3.5+0.5-0.2=3.8. These baseline estimates thus adjust the data by accounting for the systematic tendencies for some users who give higher ratings than others, and for some restaurants to recieve higher ratings than others. We can write the baseline estimate Y^baseline_um for an unknown rating Yum for user u and restaurant or business m as:
where the unknown parameters θu0 and γm0 indicate the deviations, or biases, of user u and item m, respectively, from some intercept parameter μ. Notice that the θu0 and γm0 are parameters which need to be fit. The simplest thing to start with is to replace them by their "mean" estimates from the data. Thus:
where Y¯u = user_avg, the average of all a user u's ratings and Y¯m = business_avg, the average of all ratings for a restaurant m. Y¯ is the average rating over all reviews. The final two terms correspond to the user-specific and item-specific bias in ratings, that is, how their ratings tend to systematically diverge from the global average. This is the simplest possible way to predict a rating, based only on information about this user and this restaurant. Can we do a better job of predicting the rating Yum user u would give to restaurant r? According to the central dogma of Collaborative Filtering, we ought to be able to use the responses of similar users regarding similar restaurants to get a better prediction. We can make an estimate of Yum as:
where sk(m) is the k neighbor items of item m based on some pooling criterion, for example, those items which have been rated by user u.

If that's a little mathy, an easier way to put it is that we need to know roughly how an average restaurant-goer ("user") behaves, and we also need to know roughly how an average restaurant ("item") gets rated across all users. Those two measures serve as our baselines when we're trying to figure out how to connect users' preferences for different items.

One problem that needs addressing with review algorithms is known as sparsity. That's basically the idea that there are a lot of users who don't rate many things, and there are a lot of items without many ratings. So, regardless of the number of nodes in a network, a sparse one means there aren't many edges.

When trying to get a good handle on a local scene of restaurants and user preferences, it's helpful to train your model on a subset of users and items that are highly active (ie. a denser sub-network). In our case, you can see that, in fact, the majority of users only rate something once or twice, and the majority of items have very few ratings:

So I selected a subset of all the data using parameters that ensured higher network density. Now the frequency graphs look better:

The next task is to come up with some way of rating similarity between users. To do this, I used a K-nearest-neighbors approach, with Pearson's zero-order correlation coefficients as indicators. Now, given a specific item (ie. restaurant), I can describe its nearest "neighbors" in terms of similar users and their preferences. If a bunch of people who eat at Joe's, also like eating at Hank's, then I would say that Hank's is a close neighbor to Joe's. Here's an example for one restaurant called "Cafe Monarch":

By way of comparison, here are nearest neighbors for a local burger shack - you can tell by the names that this joint keeps different neighbors nearby!

We can also perform a similar computation to predict items that users might like. In the case of one user ("Vern"), we can give him recommendations for 5 restaurants he'd probably like, based on what other users similar to him have enjoyed. Here are the top 5 recoms for Vern:

More interesting, though, would be to predict not only what restaurants Vern might like, but also what ratings he would give. It's not too far a leap from simply predicting item names, to computing actual predicted ratings. It's still based on the idea that similar users do similar things - so we can ballpark what we think Vern's rating will be for nearby restaurants.

Now, since the subset we're using actually has Vern's true ratings for these items, we can see how well our predictions fared:

You may notice that the predicted ratings are a little low compared to Vern's actual ratings. Why might this be? Well, on average, Vern is a slightly "curmudgeonly" user within the smaller dataset.

To relate that back to our mathy origins, this means his baseline rating is affected negatively by his user average, as its difference from the overall mean is subtracted from his baseline M and J scores. It's not only a deduction in baseline M, but also in the neighbor/similarity value that gets added onto the baseline M score. So it counts as a double whammy against the Yhat_um. When the recommended restaurants are above the overall average and the similarity is high, this attenuates Vern's grump factor somewhat. </math>

To get a better estimate of Vern's preferences and ratings, we can crack open the full dataset and compute similarity scores based on the whole enchilada. The real dataset is massive, though - many millions of cells in a huge matrix - so we'll come back to that task a bit later.

We haven't really touched upon the idea of measurement error yet. There are actually a few parameters in the algorithms we've been using, which can be tweaked to allow a greater or lesser ratio of variance-to-bias (for more on the variance-bias tradeoff, see this great explanation). Without going too much in detail here (there's quite a bit more commented explanation in this project's iPython notebook), you can see that changing the parameters by which we compute our values (here, "k" and "reg") can make our prediction signal extremely fuzzy - the gray swath around the fit lines indicates variance:

I used a different approach to get a better handle on error measurement and predictive accuracy. This time I tried a Bayesian model: Gibbs sampling from the posterior distribution. In very general terms, the sampler picks up latent factors which are implied by the data it has to work with, and it makes its predictions, ratings, and error measurements based on these latent, or hidden, factors which drive user preferences and item ratings.

This model worked pretty well - compare its prediction graph (especially the error variance) with the ones above:

So, we have a better model for making predictions. Good! Now, let's come back to the idea of using the full dataset to make more accurate recommendations for our friend Vern. This poor MacBook of mine would flip out and die if I asked it to run our model on all the Yelp data, so we need some help. That's where MapReduce comes in (along with its Yelp-designed Python package, MRjob). In brief, MapReduce is a way of distributing really big computing tasks across many different machines. Amazon Web Services offers a feature that takes care of this distribution for us - we just need to tell it what to do.

Using MapReduce requires a different way of thinking about writing code. Since you're not actually holding things (like arrays and lists) in a single computer's working memory, you're really asking a program to only proceed one iteration at a time, and to wait as new data goes out or comes in from the swarm of helpers. It still confuses me a bit.

The upshot is that we were able to compute much more accurate rating predictions for old Vern - have a look:

These ratings are much closer than before. This is because now we are basing our predicted ratings off of the full dataset - this means a more informed Y_avg, Y_avg_M, Y_avg_J, and, perhaps most importantly, a more stable set of similarity scores.

All of these values go into constructing our predicted ratings for Vern, and now that we have a massive table of similarities and averages to draw from, it makes sense that our predictive accuracy should improve. I guess that's why Yelp made MRjob!