Kallistec

August 29, 2009

Deciding What to Watch on Github

So, the Github contest is over. To spoil the suspense, I didn’t win (I got crushed, actually) but with some Netflix Prize alumni and pro data mining consultants in the mix, my odds were pretty long to begin with. Given these realities, I viewed my entry in the contest mostly as a way to improve my ruby machine learning library, The Decider, so I actively avoided some Github-specific hacks that might have given me a much better score.

My approach to the contest was to find similar users or repos using a k nearest neighbors algorithm. This approach matches our intuition about making recommendations: if you liked Rails, you’ll love will_paginate! Similarly, we can think of finding similar users, as the Amazon recommendation system (apparently) does: “people with prototype in their cart also bought scriptaculous.”

Vectors, Distance Metrics, and Lots of Flops

To start with, we need a way to determine the similarity between users or repos. To do that, we start by representing them as vectors, which for our purposes are basically just arrays with lots of zeroes and a few ones (actually, there’s so many zeroes that it takes too much memory, so, in practice, optimizations need to be made to account for the sparseness of the data). To illustrate this, look at the table below which shows what a small part of the github data might look like:

  Rails   Prototype   Scriptaculous   jQuery   EventMachine   Nanite   Rip  
Alice  yes yes yes no no no no
Bob no no no no yes yes yes
Chandra    yes no no yes no no yes
Dave    yes yes yes yes no no no

In ruby, we’d probably represent that data like this:

github_users = {
	:alice    => [1,1,1,0,0,0,0],
	:bob      => [0,0,0,0,1,1,1],
	:chandra  => [1,0,0,1,0,0,1],
	:dave     => [1,1,1,1,0,0,0]
}

So, now that we have lots of zeroes and ones, how do we figure out how similar two users are? There’s several ways, but we’re limiting ourselves to distance measures and excluding similarity measures, for reasons I’ll explain shortly. What’s the difference? A similarity metric will be exactly 1 for two identical vectors, zero or -1 (depending on the measurement used) for completely dissimilar vectors, and somewhere in between for partially similar vectors. A distance measurement, on the other hand, will be zero for identical vectors, and something greater than zero for vectors that are different.

The simplest distance metric we could implement is probably the Hamming Distance. I won’t discuss it here, except to say that it worked much better than I expected for finding similar users, but, for finding similar repos, it was abysmal. One deficiency of the Hamming distance is that it doesn’t make any corrections for the magnitude of the vectors, that is, the number of repos followed by a user (or the number of followers a repo has). To understand what I mean by this, think about two users who have exactly the same preferences, but one has used github for a much longer time and follows many more projects. The Hamming distance in this case will be fairly large, which doesn’t reflect the two users’ true similarity. Cosine similarity, which is also called the Tanimoto coefficient in the case of binary attributes, has the desired property of taking the size of the vectors into account. We can see how this works for two dimensional vectors below.

VectorExample.png
In the diagram, the vectors represent individual github users, their lengths represent the number of projects the user is following, and the directions show the users’ preferences based on which projects they follow. Given this information, we can see intuitively that Bill and Charlie are the most similar in a way that is useful for making recommendations, even though Anna and Bill follow about the same number of projects and might have a smaller hamming distance.

Observant readers will have noticed that I mentioned using cosine similarity, but ruled out similarity measures. After all, cos(0) is 1, but that’s exactly what I said I didn’t want. Well, one way to solve this is to use the arccosine function to get the angle, which will give us a distance measure, but I simply used (1 – cos(θ)), which gives the desired properties.

To compute the Tanimoto coefficient for binary-valued vectors, we use the Jaccard index formula, which is given by the set intersection divided by the set union. If we compare Alice and Chandra from the table above, we see that the only project they have in common out of the seven total is rails. Taken together, at least one of them watches rails, prototype, scriptaculous, jQuery, and rip (5 total). Therefore, the Tanimoto coefficient for their vectors is 1 / 5 or 0.2. If we compare Alice and Dave, we see that they both watch rails, prototype, and scriptaculous (3 total); at least one of them watches rails, prototype, scriptaculous, and jQuery (4 total) giving them a Tanimoto coefficient of 3 / 4 or 0.75.

Need for Speed

Probably the biggest shortcoming of the k nearest neighbors approach is that these relatively slow distance measurements need to be made between every combination of users or repos, which is quite an undertaking with 50,000+ users and 120,000+ repos. Making an analogy to relational databases, the brute force approach is like tens of thousands of full table scans. In database-land, we use indexes, which create tree structures (ususally B+ Trees) of our data, to limit the number of comparisons needed for each query. So, the natural question to ask is, “can I use a tree structure to make my kNN search faster?”

BK Trees

The reason I limited comparisons between users to distance measurements is to take advantage of BK Trees. BK Trees are organized like this:

  1. Make any arbitrary user’s vector the root node.
  2. For the next user, compute the distance between his vector and the root node’s vector.
  3. Attach the user to the root node along an edge, where the edge is simply the distance between the root node and the user.
  4. Continue the process. If there is already a child node with the same distance, attach the user to the child node

This concept actually is quite simple, but it’s a bit difficult to explain, so an illustration will help.
bktree-concepts.png
In the illustration, we’ve arranged seven users into a BK Tree. We arbitrarily selected Alice to be the root node. We then attached Bob at distance==3, Chris at distance==5, Dee at distance==10, and Evelyn at distance==23. Frank and Ginger also have distance==5 from Alice, but we already added Chris at distance==5, so we attach them to Chris’ node. When we attached Frank to Chris’ node, we measured the distance between Frank and Chris, found that it is two, so we attached Frank’s node to Chris’ node at the edge for distance==2. Similarly, the distance between Chris and Ginger is three, so we attach Ginger’s node to Chris’ node at the edge for distance==3. To represent the tree in ruby, we might start with something like this:

# The BK Tree in Ruby
bob     = :leaf
dee     = :leaf
evelyn  = :leaf
frank   = :leaf
ginger  = :leaf
chris   = {2 => frank, 3 => ginger}
alice   = {3 => bob, 5 => chris, 10 => dee, 23 => evelyn}

Now that we have our users organized into a tree, how do we go about finding the k most similar users? We exploit the properties of Metric Spaces. Metric Spaces are sets where the objects in the set have a distance between each other. The distance function d(x,y) (a.k.a., the metric) must have the following properties:

triangle-inequality.png
The Triangle Inequality
  • The distance beween a vector (or any object) and itself is 0. In math lingo:
    d(x,y) = 0 if and only if x = y
  • d(x,y) ≥ 0 i.e., metrics are scalars, and the shortest possible distance between two objects is zero.
  • d(x,y) = d(y,x) That is, the distance from x to y is the same as the distance from y to x. “Uphill both ways” is not allowed.
  • d(x,z) ≤ d(x,y) + d(y,z) This is the triangle inequality.

Of all of these, the triangle inequality is probably the most interesting. Essentially, it says that if you go from point A to point C by going from A to B to C, you have to have gone at least as far as the distance from A to C.

bktree-search.png
So how does all that help us search the BK Tree and find the k Nearest Neighbors? Let’s start with a slightly easier problem. Say we now have a user HAL, and we want to find all of the users within a distance of 4 from HAL. We start by measuring the distance from HAL to Alice. Let’s say the result is 6. Using the triangle inequality, we know that we need to check every user with a distance ≤ 10 from Alice. How? I’ll leave the formal proof to you, but you can see for yourself in the diagram on the left. You can also see that many nodes that match our triangle inequality criteria aren’t within a distance of 4 from HAL, so each node needs to be checked before we can add it to the result set. To descend through the tree, we simply repeat the process for each non-leaf child node that matched our triangle inequality criteria.

To search through the tree for the k nearest neighbors, we simply set our initial “cutoff distance” to infinity. As soon as we have k results, we know that any results we’re interested in must, at minimum, be better than the worst result we have so far. So we continually adjust the cutoff distance as we descend the tree, hopefully converging on a small value for the cutoff distance that will allow us to search the smallest possible part of the tree.

The hyper-observant reader will have noticed that BK Trees will only be advantageous for integer distance functions, but I’ve used one that gives floating point values. Well, I cheated and multiplied my distance measure by a large constant factor and, for the purposes of the BK Tree edges, converted to an integer. In practice, the improved accuracy of the cosine similarity-based distance metric more than made up for any inaccuracies caused by this slight of hand.


Practical Concerns in the Github Contest

To generate the actual recommendations, I took two approaches. The first one I tried (which gave me a slighly better score by about 1%) was to group the data by user. Having a list of the k most similar users for each user in the test set, I then recommended the most popular repos watched by these “neighbors” that the test set user wasn’t already watching. The second was to find the k most similar repos to each repo in the entire dataset. For each user in the test set, I recommended repos similar to the ones the user was already watching.

As I mentioned above, I decided early on not to use any Github specific “hacks,” such as replacing forked repos with their parents, as I was interested in learning about recommendations in a more general case.

Computing the Recommendations

Even with the optimizations afforded by the BK Tree, generating the recommendations takes a lot of time. So I took the opportunity to get acquainted with Jruby. In fact, without it, it would have been impossible to use ruby for this. Some of the truly enormous computations, such a generating the 50 nearest neighbors of every repo in the dataset, required days of CPU time. With a trusty high CPU XL EC2 instance and Jruby, wall clock time was still measured in hours, but manageable.

Speaking of the BK Tree, because it has no balance constraints, the amount of the tree searched for a given kNN computation isn’t guaranteed to be anything less than the whole tree. On the several occasions that I inspected the efficiency of the BK Tree searches, I noticed wildly differing percentages of the tree were searched, between 5% and 90%. One interesting result was that as I increased the value of k, many search’s efficiency dive-bombed, that is, instead of increasing the amount of the tree searched from 10% to 20%, it would increase from 10% to 80% or 90%. I suspect that the users are naturally clustered, and when k exceeds the size of this natural cluster, there are many equally bad results. It would be interesting to verify this by analyzing the dataset with an edge detection-based clustering algorithm.

Improving the Recommendations

Although I tried using both users and repos as the basis for the kNN groupings, I didn’t get around to combining the results of the two. I speculate that this approach might help overcome the weaknesses inherent in each approach. For example, some users in the test set only watched one or two repos. For these users, picking a “similar user” isn’t likely to be a good approach since we have so little information about them. Conversely, for users watching many projects, picking repos that are similar to the ones they already watch ignores the information about their preferences that we could take advantage of by basing the recommendations on similar users.

Aside from being sloooow, another problem with the pure kNN approach is that no effort is made to reduce noise in the dataset. One could imagine that users might have decided to follow a repo outside of their normal preferences on a whim, or they might not follow a big project, such as rails, even if they’re interested in it, because they hear about what’s going on with that project from other sources (the developers’ blogs and twitter accounts, etc.). Smoothing out that noise is likely to give better predictions. One promising way to do this is via Singular Value Decomposition. Be sure to also have a look at Ilya Grigorik’s post on the topic, if you haven’t already. This is the next item on my list to add to the Decider library; in fact, I’ve already found a suitable Java implementation of the whiz-bang linear algebra to use with Jruby.

Github Specific Approaches?

(This paragraph added on 30 Aug, 2009) Scott Chacon, in a post on the Github blog, expressed some dismay that it was possible to get quite a few correct answers (I’ve read that it’s between 15%–20%) by guessing that users are watching projects they’ve forked. I can agree with his sentiment to some extent, but another way of looking at it is that the project/fork tree is simply another “social graph” structure full of interesting and useful information. In an IRC conversation, careo discussed the idea of putting the data in a Neo4j graph database and generating recommendations using queries against the graph. One interesting avenue of research would be to augment the explicit relations (users following projects, projects forked from projects, etc.) with the implicit relationships found by the distance metrics and use queries against that graph to generate recommendations. The obvious issue with this approach is that different types of measurements are being treated as equal (or at least interchangeable), so I anticipate a good deal of tuning would be required to correctly balance the relative weights of each type of relationship.

Fin

The contest was pretty awesome, and I learned a whole lot in the process, even getting reacquainted with some math I haven’t used in a while. So, thanks are definitely in order for the Github crew for putting their dataset in the open and generating open source activity in a realm of computing that’s all too often sequestered in stuffy journals and corporate trade secrets. I’ll be updating this blog with more information and code examples as I continue to improve the decider, though hopefully it won’t be anything quite so long-winded.

Further Reading

Mea Culpa: I messed up the Tanimoto Coefficient calculations (for Alice & Chandra and Alice & Dave) in earlier revisions of this post. They should be correct now. My Bad.

Blog at WordPress.com.