Kallistec

February 1, 2010

The Chef Way Episode 2: Chef Speak

Filed under: Chef — Daniel DeLeo @ 10:24 am
Tags: , , , ,

Chef Speak

Express yourself completely,
then keep quiet.

Perhaps the biggest hurdle to overcome when learning Chef, especially for those new to configuration management software, is the terminology. Learning the language is an absolute necessity for understanding Chef and being able to explain problems when they arise. The difficulty in learning Chef’s language, I think, has two root causes: firstly, Chef’s abstractions of system configuration concepts appear novel to users unfamiliar with them—users might be learning the abstractions with little frame of reference; and secondly, it’s hard to find authoritative definitions of what the terms mean in Chef. This makes it more difficult to learn Chef by reading the wiki, since you won’t know what to look at until you know what the words mean. Luckily, Chef’s concepts are pretty intuitive, so learning to speak like a chef won’t be difficult. Here we go:

Resource

A resource is usually a cross platform abstraction of the thing you’re configuring on the host. For example, packages may be installed via apt, yum, or the BSD ports and packages systems, but the package resource abstracts these differences away so you can specify that a package should be installed in a cross-platform way. Chef’s resources are mostly just containers for data, with some basic validation functionality.

Attribute (On a Resource)

As I just noted, resources are mostly containers for data. Attributes are the pieces of data that resources contain. In the case of managing a package, this might be the name of the package you want to install, the version you want to install, or options to pass to the package manager.

NOTE: The Chef developers have discussed renaming these “properties” or something similar, to avoid overloading the term “attribute,” which is also used to describe data associated with nodes and roles, described below. For now, though, the documentation says “attribute,” so that’s what I’m using here.

Action

The action is what you want Chef to do with the resource: should the package be installed? Upgraded to the latest version? Removed? Actions are usually specific to the resource, but all resources support the nothing action, which does what its name suggests.

Provider

The provider is the specific implementation of what the resource abstracts. On Red Hat or CentOS, a package resource will use the yum package provider to get the package installed, but on Debian and Ubuntu, the apt package provider will be used. Providers contain most of the intelligence: they’re responsible for making Chef idempotent by checking if an action needs to be taken and issuing the commands to the system to take that action. In the case of package providers, they first check if the desired version of a package is installed and run the yum, apt-get, or or other package manager commands to install or upgrade as needed. When working with Chef, you normally don’t need to worry about providers. For the occasions when you do, Chef provides “shortcut” resources which will always use the desired provider. The dpkg_package and rpm_package resources allow you to install packages directly from the filesystem, using providers specific to these package managers, for example.

Templates

Templates, as one would expect, are files with all of the important data substituted with code so we can fill them in later. You can use templates to create any kind of file you like, though the most common use is to create configuration files with host-specific parameters filled in dynamically. Using templates involves a template resource, which Chef backs with a template provider, and the template file itself.

Chef templates use ERb (Embedded Ruby) syntax, though Chef uses the erubis implementation of ERb for speed.

Recipe

Recipes are the files where you write your resources. Recipes can also contain arbitrary ruby code, but you need to understand a little bit about how Chef runs to make productive use of this. Each Chef run is a two stage process: in the first step, usually called the compilation step, Chef evaluates the recipe files, building up a list of the resources. In the next step, Chef executes the desired action for each resource on the provider for that resource. Any arbitrary code in a recipe will be run during the compilation step, not the execution step. To defer execution until the execution phase of the Chef run, use the ruby_block resource.

Node

A node is a host that runs Chef. The primary features of a node, from Chef’s point of view, are its attributes and its run list. In the distant future, Chef may support a “virtual node” concept, where the Chef client runs on one host but configures another, such as a router or switch that can’t run ruby. For now, though, the Chef client is assumed to be running on the node it configures. This means that every host you want to manage with chef will need ruby and the Chef client installed.

Role

A role provides a means of grouping similar features of similar nodes. At web scale, you almost never have just one of something, so you can use roles to express the parts of the configuration that are shared by a group of nodes. Roles consist of the same parts as a node: attributes and a run list. When the Chef client runs, it merges its own attributes and run list with those of any roles it has been assigned.

Attribute (On a Node or Role)

Nodes and roles have associated attributes, which are a structure of nested key–value pairs. Node and role attributes are commonly used as inputs for resource attributes. For example, your production boxes might be using one version of nginx, but you’d like to install a newer version on your staging servers for testing. By using a node or role attribute to specify the version, you can use the same recipe in both environments.

Chef allows attributes to be set in attribute files (among other myriad ways). Code in these files accesses the node chef is running on, and manages attributes on that node directly. In ruby terms, the value of self within attribute files is the node. Using attribute files, you can rely on a node having a sane value for an attribute when writing a recipe without having to worry that the node may not have defined that attribute.

Advanced Chef users often make heavy use of attributes defined in roles to manage attributes on many nodes at once.

Cookbook

A cookbook is a collection of the various related files, such as recipes, templates, and attributes files that chef uses to configure a system, plus metadata. Cookbooks are typically grouped around configuring a single package or service. The MySQL cookbook, for instance, contains recipes for both client and server, plus an attributes file to ensure the attributes used in the recipes are available if they haven’t been defined on the node in some other way.

Metadata

Cookbooks often rely on other cookbooks for pre-requisite functionality. In order for the server to know which cookbooks to ship to a client, a cookbook that depends on another one needs to express that dependency somewhere. That “somewhere” is in its metadata. Dependency tracking is the most visible part of metadata, but metadata also can contain information about authorship, licensing, a description of the cookbook, what platforms the cookbook is expected to work on, and whether or not a cookbook plays nice with other cookbooks. At the moment, Chef supports many more fields for metadata than it actually uses, but maintaining accurate dependency information is absolutely essential, since nodes may not get all of the cookbooks they need if this information is incomplete.

Run List

In the simplest case, the run list is just an ordered list of the recipes that a node should run. Assuming the cookbook metadata is correct, you can put just the recipes you want to run in the run list, and dependent recipes will be run automatically as needed.

In the more complicated case, the run list will include roles that a node has been assigned along with any recipes set on the node explicitly. In this case, when the Chef client runs, the run list is “expanded” into a list of recipes by replacing the role’s entry in the run list with the list of recipes the role specifies in its run list.

References

January 23, 2010

The Chef Way

Filed under: Chef — Daniel DeLeo @ 11:22 am
Tags: , , , , , ,

In the beginning

A few years ago, I discovered the ruby language. After just a few hours using it, I became so enamored that I decided to use it for all of the little systems administration automation tasks that have traditionally been the domain of perl and shell scripts. As my journey into ruby continued, I discovered that many rubyists are also big proponents of the programming methodologies collectively termed “agile.” Reading the agile manifesto, I had two reactions: firstly, it made a lot of sense; secondly, I wondered how, if embracing change is the norm, does one reconcile this with the inherent difficulties in adapting operations to a changing environment? Much to my dismay, googling for “agile systems administration” gave me a single result, a blog post whose author was asking those same questions without providing the answers. That has all changed: now we have pervasive virtualization, infrastructure as a service, and many high profile proponents of an agile approach to operations, some retaining the term agile, others labeling the movement “devops” (I think “opsdev” sounds better, but you can’t win ‘em all). The changes in this approach to operations are far-reaching, with organizational and even philosophical aspects, accompanied by new (or just newly discovered) tools to match the new understanding.

Arguably the most important shift in tooling is the embrace of configuration management and automation tools. Enabled by the greatly reduced lead times required to provision new systems, configuration management is key to building the kind of dynamic, change-responsive infrastructures I began dreaming about while reading the agile manifesto all those years back. Although the history of configuration management may stretch back to the original Bourne or even Thompson shells, and hit a major milestone in 1993 with the first release of Cfengine, both the variety and use of these tools has risen dramatically.

The Chef Way

The Master observes the world
but trusts his inner vision.

Given the goal of “Operations Zen,” not fighting change, but using it to our advantage, embracing a duality of operations and development, and, yes, being productive and doing more with less, how should we achieve it? I contend that the tool to use is Chef, and the path to follow is the Chef way. This is the beginning of a series of posts covering Chef’s design decisions, its terminology, and how to use Chef. Once we get to using Chef, I’ll highlight how doing things the Chef way helps us build agile, fully automated infrastructures, but in this post I’ll describe what the Chef way means in terms of Chef’s design.

So, what is the Chef way? For starters, flexibility. Opinionated software is great, but if it’s so opinionated that it makes your job harder… that’s too opinionated. So Chef tries to be opinionated where it simplifies, and let you call the shots where it matters.

An Internal Ruby DSL.

The more prohibitions you have,
the less virtuous people will be

Chef’s configuration language is implemented within Ruby, instead of outside of it. This is probably the most defining characteristic of Chef, and maybe also the most controversial. Instead of an application that the user merely configures, by using ruby as the configuration language, Chef is both a tool and a framework. Some have described Chef as “half of an application” for this reason. But what does this actually mean in practice? Can you use Chef if you’re not a “1337″ Merb hacker? Well, yes and no, or “mu!”

Chef provides all of the basic building blocks one would expect from a configuration managment application: one can install packages, write templated configuration files, start and stop services, manage users, and generally automate all of the common administration tasks completely within the pre-defined DSL. If you’re learning Chef with no prior knowledge of ruby, this may be a bit more difficult than learning a simpler language specific to a configuration management tool, but not by much. However, in the process of learning the Chef DSL, something else has happened: the user has learned basic ruby syntax, and taken a step into a world much wider than configuration management. An important benefit of this is that it makes it easy to modify the workings of Chef, from small tweaks and additions, to much more drastic changes, even from within Chef recipes. There’s no cognitive barrier of using an external DSL and then needing to learn or use a different language to extend Chef.

This leads directly to the great advantage Chef gains by using a pure ruby DSL: as Bryan McLellan wrote recently, Chef embraces a dual nature of being both a tool and a mere library, and fills the continuum in between. Some of the advantages are very simple: we can use ruby’s array literal syntax to conveniently express many similar actions. Others are more complex: we can use a SQL library to configure systems based on information in a database. Using ruby as the configuration language allows us the expressiveness to describe our systems as they actually exist within the entire infrastructure and gives us access to a large collection of libraries to integrate with that infrastructure. Using Chef, we are not limited to seeing our systems as isolated pieces, we may also consider the whole.

Determinism

Though Chef provides great flexibility in what we may ask it to do, it enforces a deterministic order in which operations occur. People who still remember manual systems administration (what?) will find this a familiar concept: when you write or follow documentation, a how-to for example, you never see (or write) anything like, “do steps 1a, 1b, and 1c in some random order, then do steps 2a, 2b, 2c in a random order, then…” The idea that order matters is not original to Chef, and is also a matter of some controversy among the infrastructure management community. Expressed simply, order matters because each step our configuration management tool takes to deliver a fully configured system is taken within the context of the state of the entire system, and there can be complex, non-obvious dependencies between these actions. Steve Traugott and Lance Brown explored the benefits of deterministic order in depth in their 2002 paper, “Why Order Matters: Turing Equivalence in Automated Systems Administration” and concluded, “it appears that no tool, written in any language, can predictably administer an enterprise infrastructure without maintaining a deterministic, repeatable order of changes on each host.” Among many salient points, Traugott and Brown refer to a 2002 study showing that installing RPMs with no declared dependencies, and with installation scripts disabled, in different orders can lead to different outcomes. Given the overwhelming complexity of mapping these hidden dependencies, it is not surprising that the map rarely matches the territory, or that Chef’s answer is not to build a better map.

Chef’s method for ordering dependent operations is simple and elegant: it runs actions in the order written. To install an application and then configure it, you simply declare that the package should be installed at the top of the file, and declare its configuration at the bottom. Of course, a system’s configuration can be split up between many files, with dependencies across files; in Chef, these dependencies are evaluated according to simple, deterministic rules so that it’s easy to know exactly what will happen and in what order. And once you’ve built one system and tested it, you can be confident that every other system you build will be built in exactly the same way.

Consistency

If you don’t realize the source,
you stumble in confusion and sorrow.

This one is related to determinism: Chef tries to manage systems so that they can always be rebuilt from scratch. To illustrate this point, consider the way Chef manages dynamic, templated files. Chef always writes the complete file instead of inserting text willy-nilly in some arbitrary spot. If you make a change to the template or data used to generate the file, Chef rewrites the entire file. This encourages consistency because there is no way the file can have content that was inserted by Chef at one time but later removed from Chef’s management. As a result, you get the same configuration files no matter the history of the server being managed.

Declarative and Idempotent

Chef is declarative: when working with Chef, you specify the end result, and let Chef worry about how it’s achieved. In English, you can think of every action in Chef having the word “shall” attached to it. “The system shall have this package installed and shall have this configuration file.” Chef is also idempotent: no matter how many times you run it, it leaves the system in the same state. This has a variety of benefits. For one, you now have executable documentation of how your boxes are configured. Want to know how a system was built? Want to build a new one? You can get the answer to both in the same place. When you need to make a change, upgrading a certain package, for example, you make that change in one place. There’s no drift between what’s actually on the system and what the documentation says, they’re one and the same.

Another important aspect of automating infrastructures over time is staying automated. With shell scripts designed to be run once, changing a few variables and re-running the scripts on a live system is probably not going to end well. Back to our upgrading-a-package example, when we make the change with Chef, Chef will see that we have a different version of the package on the system than specified and install the new version, but take no action where the system matches what’s been specified. This means that systems are built and maintained in exactly the same way. As a result, systems that are supposed to be identical are identical, whether they were built last week or last year, and building a new one to match follows the exact same process. And what happens when things go wrong? With run-once-only scripts, you might hunt through the script to figure out which line it was executing when it failed, comment out everything that came before it in the script and try again, or maybe finish what the script was trying to do manually. Yuk. With Chef, assuming the error was transient, you can just run it again. Everything that it finished before the Chef run failed will match the specification and be skipped. Much better.

Fail Fast

If you want to be reborn,
let yourself die.

When something goes wrong configuring a host, Chef fails immediately. From one point of view, there’s nothing else Chef could do, since many dependencies are only expressed implicitly. In the example where we install a package and then configure it, the configuration step clearly depends on the installation step, but Chef doesn’t track this dependency at all (unless we specifically ask it to), so it wouldn’t know to cancel configuration if the package install failed. If we step back a bit and look at the whole view of what Chef is trying to accomplish, however, we see that this isn’t really a problem. After all, we’re asking Chef to deliver us a fully configured webserver, load balancer, coffee machine, whatever. If it works fine with half of its configuration missing, we probably didn’t need that half of the configuration in the first place. So it turns out that the best thing Chef can do in case of failure is to fail. If the failure was transient, say, a network issue, you can run Chef again and you’ll get the fully configured server. If the failure was caused by something more significant, you’d have to fix it before you could get the fully configured server you wanted, whether or not Chef ignored the failure. By failing fast, Chef alerts us to problems immediately, without losing any important functionality.

Configuration In Context

Another unique feature of Chef is that it allows individual hosts to query for information about the rest of your infrastructure—Chef supports this out of the box. This enables truly dynamic infrastructures and keeps our configurations DRY. The advantages of this feature are most apparent when configuring applications such as load balancers and monitoring systems, where the configuration depends on what other infrastructure is available. A load balancer, for example, needs to know what backends are available to balance the load across. If Chef didn’t have the ability to query for information, then for each backend we add or remove, we’d have to also update the load balancer’s configuration. Instead, all we have to do is ask for a list of the backends, and add each one to the load balancer’s configuration programmatically. When a new backend is added or removed, the next time Chef runs on the load balancer host, it will automatically generate a new configuration file, keeping the load balancer’s view of the available backends up to date.

Fat Client, Skinny Server

If you’ve ever used monitoring system software, you’re familiar with this debate: does the client or the server do all of the work? With Chef, the clients do the work. It’s a bit inaccurate to call the server “skinny:” especially with the upcoming 0.8.0 release, the server has a fair amount of complexity, mostly in support of making its searchable infrastructure features rock-solid and efficient. Where “fat client, skinny server” comes in to play is that Chef runs all of the code and template rendering required to configure a system on the client. This design allows Chef to scale down: Chef has a lightweight, serverless, “solo” version of the client, suitable for running on a single host (as an aside, chef-solo is a great way to get started with Chef—that way you can tackle Chef’s features in chunks instead of having to learn all of the parts at once). With the server doing less, Chef can also scale up diagonally like a modern web application, and handle more clients-per-server than it could if the server shouldered more of the burden.

“Skinny server” also means that Chef never runs outside code on the server. Running a centralized configuration management system does entail placing trust in the central server not to hand your entire infrastructure over to the baddies. Running external code safely is one of the most complicated problems in all of computing, and the benefits of doing so in Chef are negligible compared to the risks. Pushing computation to the clients buys us scalability and gives us some extra security for free.

APIs: you build the other half of this.

If you want to become whole,
let yourself be partial.

We touched on this point briefly while discussing Chef’s pure ruby DSL, but throughout its design, Chef strives to make external tools first class citizens. Any information stored by Chef can be accessed via an HTTP API. If you can imagine a new use for Chef’s data or a new way to interact with it, you can build it.

Pragmatic Abstraction

The mark of a moderate man
is freedom from his own ideas.

Chef uses abstractions where they are helpful, but Chef never pretends that abstractions are perfect. Usually, we’d like to simply say “install the MySQL package” and have it work on every platform and package manager. Unfortunately, different distros and OSs disagree on what packages should be named. So Chef lets us break the abstraction and specify a package name for each platform. Another common issue with abstractions is that they commonly force us to use only the lowest common denominator subset of what each underlying implementation is capable of. Again, Chef lets us choose: we can use the cross-platform subset of features when that’s good enough, but we also have access to fully-featured, platform-specific implementations when we need them.

Chef also embraces the idea that when abstractions are useful, we can often gain even more productivity by layering another abstraction on top of it. Taking an example from outside Chef, TCP and IP abstract away all of the vagaries of wires and wire-level protocols, allowing us to focus on sending data to a remote system. So what do we do with it? Layer other abstractions such as SMTP and HTTP on top, so we can forget about sending data and think about sending mail or getting text documents. Likewise, with Chef, we can build our own abstraction layers on top of the ones Chef provides, and we can build programmatically build up abstractions using data from anywhere. A simple example of this is taking a list of package names and building a list of package objects from it. This way, we can minimize repetition (stay DRY!), express our infrastructures succinctly, integrate Chef with any data source we need, and maximize automation.

Wrapping Up

In this post, we’ve covered the Chef why, seeing that Chef’s design focuses on flexibility, repeatability, and pragmatism. In later posts in the series, we’ll learn to speak like a Chef, and how to use Chef to build infrastructures that are both agile and robust.

More in This Series

References

August 29, 2009

Deciding What to Watch on Github

So, the Github contest is over. To spoil the suspense, I didn’t win (I got crushed, actually) but with some Netflix Prize alumni and pro data mining consultants in the mix, my odds were pretty long to begin with. Given these realities, I viewed my entry in the contest mostly as a way to improve my ruby machine learning library, The Decider, so I actively avoided some Github-specific hacks that might have given me a much better score.

My approach to the contest was to find similar users or repos using a k nearest neighbors algorithm. This approach matches our intuition about making recommendations: if you liked Rails, you’ll love will_paginate! Similarly, we can think of finding similar users, as the Amazon recommendation system (apparently) does: “people with prototype in their cart also bought scriptaculous.”

Vectors, Distance Metrics, and Lots of Flops

To start with, we need a way to determine the similarity between users or repos. To do that, we start by representing them as vectors, which for our purposes are basically just arrays with lots of zeroes and a few ones (actually, there’s so many zeroes that it takes too much memory, so, in practice, optimizations need to be made to account for the sparseness of the data). To illustrate this, look at the table below which shows what a small part of the github data might look like:

  Rails   Prototype   Scriptaculous   jQuery   EventMachine   Nanite   Rip  
Alice  yes yes yes no no no no
Bob no no no no yes yes yes
Chandra    yes no no yes no no yes
Dave    yes yes yes yes no no no

In ruby, we’d probably represent that data like this:

github_users = {
	:alice    => [1,1,1,0,0,0,0],
	:bob      => [0,0,0,0,1,1,1],
	:chandra  => [1,0,0,1,0,0,1],
	:dave     => [1,1,1,1,0,0,0]
}

So, now that we have lots of zeroes and ones, how do we figure out how similar two users are? There’s several ways, but we’re limiting ourselves to distance measures and excluding similarity measures, for reasons I’ll explain shortly. What’s the difference? A similarity metric will be exactly 1 for two identical vectors, zero or -1 (depending on the measurement used) for completely dissimilar vectors, and somewhere in between for partially similar vectors. A distance measurement, on the other hand, will be zero for identical vectors, and something greater than zero for vectors that are different.

The simplest distance metric we could implement is probably the Hamming Distance. I won’t discuss it here, except to say that it worked much better than I expected for finding similar users, but, for finding similar repos, it was abysmal. One deficiency of the Hamming distance is that it doesn’t make any corrections for the magnitude of the vectors, that is, the number of repos followed by a user (or the number of followers a repo has). To understand what I mean by this, think about two users who have exactly the same preferences, but one has used github for a much longer time and follows many more projects. The Hamming distance in this case will be fairly large, which doesn’t reflect the two users’ true similarity. Cosine similarity, which is also called the Tanimoto coefficient in the case of binary attributes, has the desired property of taking the size of the vectors into account. We can see how this works for two dimensional vectors below.

VectorExample.png
In the diagram, the vectors represent individual github users, their lengths represent the number of projects the user is following, and the directions show the users’ preferences based on which projects they follow. Given this information, we can see intuitively that Bill and Charlie are the most similar in a way that is useful for making recommendations, even though Anna and Bill follow about the same number of projects and might have a smaller hamming distance.

Observant readers will have noticed that I mentioned using cosine similarity, but ruled out similarity measures. After all, cos(0) is 1, but that’s exactly what I said I didn’t want. Well, one way to solve this is to use the arccosine function to get the angle, which will give us a distance measure, but I simply used (1 – cos(θ)), which gives the desired properties.

To compute the Tanimoto coefficient for binary-valued vectors, we use the Jaccard index formula, which is given by the set intersection divided by the set union. If we compare Alice and Chandra from the table above, we see that the only project they have in common out of the seven total is rails. Taken together, at least one of them watches rails, prototype, scriptaculous, jQuery, and rip (5 total). Therefore, the Tanimoto coefficient for their vectors is 1 / 5 or 0.2. If we compare Alice and Dave, we see that they both watch rails, prototype, and scriptaculous (3 total); at least one of them watches rails, prototype, scriptaculous, and jQuery (4 total) giving them a Tanimoto coefficient of 3 / 4 or 0.75.

Need for Speed

Probably the biggest shortcoming of the k nearest neighbors approach is that these relatively slow distance measurements need to be made between every combination of users or repos, which is quite an undertaking with 50,000+ users and 120,000+ repos. Making an analogy to relational databases, the brute force approach is like tens of thousands of full table scans. In database-land, we use indexes, which create tree structures (ususally B+ Trees) of our data, to limit the number of comparisons needed for each query. So, the natural question to ask is, “can I use a tree structure to make my kNN search faster?”

BK Trees

The reason I limited comparisons between users to distance measurements is to take advantage of BK Trees. BK Trees are organized like this:

  1. Make any arbitrary user’s vector the root node.
  2. For the next user, compute the distance between his vector and the root node’s vector.
  3. Attach the user to the root node along an edge, where the edge is simply the distance between the root node and the user.
  4. Continue the process. If there is already a child node with the same distance, attach the user to the child node

This concept actually is quite simple, but it’s a bit difficult to explain, so an illustration will help.
bktree-concepts.png
In the illustration, we’ve arranged seven users into a BK Tree. We arbitrarily selected Alice to be the root node. We then attached Bob at distance==3, Chris at distance==5, Dee at distance==10, and Evelyn at distance==23. Frank and Ginger also have distance==5 from Alice, but we already added Chris at distance==5, so we attach them to Chris’ node. When we attached Frank to Chris’ node, we measured the distance between Frank and Chris, found that it is two, so we attached Frank’s node to Chris’ node at the edge for distance==2. Similarly, the distance between Chris and Ginger is three, so we attach Ginger’s node to Chris’ node at the edge for distance==3. To represent the tree in ruby, we might start with something like this:

# The BK Tree in Ruby
bob     = :leaf
dee     = :leaf
evelyn  = :leaf
frank   = :leaf
ginger  = :leaf
chris   = {2 => frank, 3 => ginger}
alice   = {3 => bob, 5 => chris, 10 => dee, 23 => evelyn}

Now that we have our users organized into a tree, how do we go about finding the k most similar users? We exploit the properties of Metric Spaces. Metric Spaces are sets where the objects in the set have a distance between each other. The distance function d(x,y) (a.k.a., the metric) must have the following properties:

triangle-inequality.png
The Triangle Inequality
  • The distance beween a vector (or any object) and itself is 0. In math lingo:
    d(x,y) = 0 if and only if x = y
  • d(x,y) ≥ 0 i.e., metrics are scalars, and the shortest possible distance between two objects is zero.
  • d(x,y) = d(y,x) That is, the distance from x to y is the same as the distance from y to x. “Uphill both ways” is not allowed.
  • d(x,z) ≤ d(x,y) + d(y,z) This is the triangle inequality.

Of all of these, the triangle inequality is probably the most interesting. Essentially, it says that if you go from point A to point C by going from A to B to C, you have to have gone at least as far as the distance from A to C.

bktree-search.png
So how does all that help us search the BK Tree and find the k Nearest Neighbors? Let’s start with a slightly easier problem. Say we now have a user HAL, and we want to find all of the users within a distance of 4 from HAL. We start by measuring the distance from HAL to Alice. Let’s say the result is 6. Using the triangle inequality, we know that we need to check every user with a distance ≤ 10 from Alice. How? I’ll leave the formal proof to you, but you can see for yourself in the diagram on the left. You can also see that many nodes that match our triangle inequality criteria aren’t within a distance of 4 from HAL, so each node needs to be checked before we can add it to the result set. To descend through the tree, we simply repeat the process for each non-leaf child node that matched our triangle inequality criteria.

To search through the tree for the k nearest neighbors, we simply set our initial “cutoff distance” to infinity. As soon as we have k results, we know that any results we’re interested in must, at minimum, be better than the worst result we have so far. So we continually adjust the cutoff distance as we descend the tree, hopefully converging on a small value for the cutoff distance that will allow us to search the smallest possible part of the tree.

The hyper-observant reader will have noticed that BK Trees will only be advantageous for integer distance functions, but I’ve used one that gives floating point values. Well, I cheated and multiplied my distance measure by a large constant factor and, for the purposes of the BK Tree edges, converted to an integer. In practice, the improved accuracy of the cosine similarity-based distance metric more than made up for any inaccuracies caused by this slight of hand.


Practical Concerns in the Github Contest

To generate the actual recommendations, I took two approaches. The first one I tried (which gave me a slighly better score by about 1%) was to group the data by user. Having a list of the k most similar users for each user in the test set, I then recommended the most popular repos watched by these “neighbors” that the test set user wasn’t already watching. The second was to find the k most similar repos to each repo in the entire dataset. For each user in the test set, I recommended repos similar to the ones the user was already watching.

As I mentioned above, I decided early on not to use any Github specific “hacks,” such as replacing forked repos with their parents, as I was interested in learning about recommendations in a more general case.

Computing the Recommendations

Even with the optimizations afforded by the BK Tree, generating the recommendations takes a lot of time. So I took the opportunity to get acquainted with Jruby. In fact, without it, it would have been impossible to use ruby for this. Some of the truly enormous computations, such a generating the 50 nearest neighbors of every repo in the dataset, required days of CPU time. With a trusty high CPU XL EC2 instance and Jruby, wall clock time was still measured in hours, but manageable.

Speaking of the BK Tree, because it has no balance constraints, the amount of the tree searched for a given kNN computation isn’t guaranteed to be anything less than the whole tree. On the several occasions that I inspected the efficiency of the BK Tree searches, I noticed wildly differing percentages of the tree were searched, between 5% and 90%. One interesting result was that as I increased the value of k, many search’s efficiency dive-bombed, that is, instead of increasing the amount of the tree searched from 10% to 20%, it would increase from 10% to 80% or 90%. I suspect that the users are naturally clustered, and when k exceeds the size of this natural cluster, there are many equally bad results. It would be interesting to verify this by analyzing the dataset with an edge detection-based clustering algorithm.

Improving the Recommendations

Although I tried using both users and repos as the basis for the kNN groupings, I didn’t get around to combining the results of the two. I speculate that this approach might help overcome the weaknesses inherent in each approach. For example, some users in the test set only watched one or two repos. For these users, picking a “similar user” isn’t likely to be a good approach since we have so little information about them. Conversely, for users watching many projects, picking repos that are similar to the ones they already watch ignores the information about their preferences that we could take advantage of by basing the recommendations on similar users.

Aside from being sloooow, another problem with the pure kNN approach is that no effort is made to reduce noise in the dataset. One could imagine that users might have decided to follow a repo outside of their normal preferences on a whim, or they might not follow a big project, such as rails, even if they’re interested in it, because they hear about what’s going on with that project from other sources (the developers’ blogs and twitter accounts, etc.). Smoothing out that noise is likely to give better predictions. One promising way to do this is via Singular Value Decomposition. Be sure to also have a look at Ilya Grigorik’s post on the topic, if you haven’t already. This is the next item on my list to add to the Decider library; in fact, I’ve already found a suitable Java implementation of the whiz-bang linear algebra to use with Jruby.

Github Specific Approaches?

(This paragraph added on 30 Aug, 2009) Scott Chacon, in a post on the Github blog, expressed some dismay that it was possible to get quite a few correct answers (I’ve read that it’s between 15%–20%) by guessing that users are watching projects they’ve forked. I can agree with his sentiment to some extent, but another way of looking at it is that the project/fork tree is simply another “social graph” structure full of interesting and useful information. In an IRC conversation, careo discussed the idea of putting the data in a Neo4j graph database and generating recommendations using queries against the graph. One interesting avenue of research would be to augment the explicit relations (users following projects, projects forked from projects, etc.) with the implicit relationships found by the distance metrics and use queries against that graph to generate recommendations. The obvious issue with this approach is that different types of measurements are being treated as equal (or at least interchangeable), so I anticipate a good deal of tuning would be required to correctly balance the relative weights of each type of relationship.

Fin

The contest was pretty awesome, and I learned a whole lot in the process, even getting reacquainted with some math I haven’t used in a while. So, thanks are definitely in order for the Github crew for putting their dataset in the open and generating open source activity in a realm of computing that’s all too often sequestered in stuffy journals and corporate trade secrets. I’ll be updating this blog with more information and code examples as I continue to improve the decider, though hopefully it won’t be anything quite so long-winded.

Further Reading

Mea Culpa: I messed up the Tanimoto Coefficient calculations (for Alice & Chandra and Alice & Dave) in earlier revisions of this post. They should be correct now. My Bad.

August 11, 2009

Qusion Updated!

Filed under: amqp — Daniel DeLeo @ 11:24 am
Tags: , , ,

After a little trial-by-fire, I’ve updated Qusion, my “AMQP made easy” library/plugin/what-have-you. Since I last wrote about it, I’ve added/addressed the following:

  • The preferred way to run Qusion is now to call Qusion.start; in Rails, this goes inside an after_initialize block in your environment.rb, like this:
    Rails::Initializer.run do |config|
      # stuff...
      config.after_initialize do
        Qusion.start
      end
    end
    
  • YAML-based configuration. Create a file named config/amqp.yml, fill in your details, and you’re good to go. Environments supported.
  • Install Qusion as a Rails plugin, just like you’re used to: script/plugin install git://github.com/danielsdeleo/qusion.git
  • Qusion plays nice with Workling
  • Rails, from maybe 2.2-ish to about a week ago in edge, required thin when starting up via script/server as a workaround for some issue that’s long since been solved. This behavior tricked Qusion into behaving like it was running under thin when starting mongrel with script/server. This has been fixed, and the behavior is verified in the specs.
  • More servers are experimentally supported: Evented Mongrel, WEBrick, SCGI. However, I primarily test Qusion with Passenger, Mongrel and Thin.

Now, it would be a dick move if I didn’t mention the the few gotchas you might run into with Qusion, and I’m not a big fan of dick moves. First of all, for non-evented servers, that is, everything other than Thin and Evented Mongrel, Qusion runs EventMachine in a thread. As MRI doesn’t run threads in parallel, this means that your app is running on a single core and switching contexts back and forth between the threads. This overhead may or may not be something you can tolerate. Secondly, if your broker (RabbitMQ) goes down, the AMQP library typically loses a few messages and silently attempts to reconnect. Any messages sent while there is no TCP connection to the broker are buffered, but during a graceful restart of RabbitMQ, there’s a few seconds of shutdown time. During the shutdown process RabbitMQ doesn’t accept messages but maintains the TCP connections with clients, meaning that any messages published during this time are silently dropped. The only way around this currently is to use the Bunny AMQP client which isn’t supported by Qusion right now. Brightbox’s Warren plugin might be for you if you want to configure your AMQP settings with YAML and use the Bunny client.

Gotchas aside, if you’re interested in using the AMQP gem with your web app and you want everything to just work on a variety of Ruby web servers, Qusion is your new best friend.

July 11, 2009

AMQP + Phusion = Qusion!

Filed under: amqp — Daniel DeLeo @ 7:30 pm
Tags: , , ,

3672208065_fed4c44bc8_m.jpg

One thing that’s pretty obvious from the server logs for this blog is that using AMQP with Phusion Passenger should be easier. When the topic came up yet again on the amqp mailing list, I decided there needed to be an easier way. So I wrote a small library that takes care of getting AMQP running on any ruby app server, named it Qusion, and put the code on github.

Qusion!

Qusion is pretty small and simple, with only two real features. First, it uses a monkey patch to add an AMQP.start_web_dispatcher method. This method takes the same options as AMQP.start, and it takes care of setting up whatever threads and event callbacks are necessary to run EventMachine and AMQP on whatever webserver it’s running on. To use it in rails, just add something like this to your config/environment.rb:

  require "#{RAILS_ROOT}/vendor/gems/qusion/lib/qusion" # or whatever
  AMQP.start_web_dispatcher(:host => "localhost")

Taking Care of Your Rabbit

The other feature that Qusion provides is a pool of AMQP channels. AMQP channels are created every time you call MQ.new and they’re handled by individual Erlang processes inside RabbitMQ. The danger here is that if you create a new one for every request you serve, you can eventually cause big problems for RabbitMQ. To solve this problem, Qusion creates a pool of AMQP channels when your app starts. To use them you just need to call Qusion.channel instead of MQ.new. To illustrate:

# instead of this:
MQ.new.queue("my-work-queue")

# Do this:
Qusion.channel.queue("my-work-queue")

Harder Better Faster Stronger—Maybe

Qusion isn’t intended to do very much: it lives only to make it easy to use AMQP for deferring the time intensive tasks that can make your app feel slow. With that said, there might be some other pain points that would be a good fit for Qusion to address. If you come across any, feel free to git fork and hack or contact me if you’ve got a bug you can’t pin down.


Photo from the Fusion Festival 2009 by GuyInkognito CC BY-NC-SA

June 27, 2009

Could Google Wave Be a Panacea for Remote Pair Programming?

Filed under: ask the community — Daniel DeLeo @ 12:19 pm
Tags: , , ,

Gemini 1 by agent smith on flickr

Pair programming is hailed by agile adherents as an effective way to both implement and reinforce other agile principles. Code reviews? How about instant, real-time code reviews? Collaboration and teamwork? Why not do it constantly? Training team members and sharing knowledge? Make it part of your normal workflow. Moreover, Wikipedia’s take on pair programming references controlled studies demonstrating striking efficiency gains over independent solo programming. What has been consistently challenging about pair programming is how to do it remotely.

Ignoring the human interaction component, the solutions for remote collaborative editing come down to two approaches: one is screen sharing using remote desktop software, as Pivotal’s Chad Woolley does; the other is to use a purpose built collaborative editor. Both of these are suboptimal solutions. What if I use TextMate but my pair is a Vim addict? And another developer hops in the mix wanting to use emacs? Agreeing on one will make some of us uncomfortable; using a different editor entirely might make all of us uncomfortable. Sharing a screen also strains bandwidth which is especially precious if you’re trying to videoconference at the same time on the same circuit.

Is Google Wave the Answer?

I’m not in on the beta, and I’ve so far only skimmed the protocol, but I’m thinking Google Wave might be the answer. Now, Wave has a big pile of features, but it’s ability to show character by character typing (is it multicursor?) by any of the participants is the killer feature for this purpose.

One way to view what I’m proposing here is just an application of the traditional way programmers solve complex problems: delegate each part of the problem to a smaller component. We don’t make web apps out of loadbalancerappserverdatabases; we make apps out of load balancers, app servers, and databases. So why not have an editor and a collaboration server instead of collaborativeeditor?

Challenges

There are some hurdles to be cleared before we can create this pairing nirvana, most obviously:

  • Wave is focused on a messaging type paradigm, whereas we want a more document style paradigm
  • You might not want to use Google’s servers
  • Most editors have a hidden assumption that you’re editing a file on the filesystem, and it’s not being changed by another user or process while editing

These problems aren’t insurmountable. The Wave-centric challenges can be overcome by creating our own Wave server; this is almost guaranteed to happen anyway. It might be helpful to add a virtual host concept, similar to RabbitMQ’s vhosts, to our Wave server, and set up one per project.

As for the editors, I’m the optimistic type who always thinks it must be possible. The flip side of this is that I haven’t written any plugins and don’t know what APIs are available for arbitrating text. For this solution to work and be awesome, we’ll need text arbitrating plugins with external or built-in Wave clients for at least vim, TM, and emacs, ideally also for the newer IDEish beasts that are (finally) starting to (I hear) be really useful for Ruby and Ruby-centric projects.

First!! Wait, What?

Thinking about the possibilities that might open up if we had a project like this, I started to consider the advantages of integrating Mozilla’s browser-based bespin editor as a fallback option if nothing else, and I checked their blog. Lo and behold, they’re already way ahead of me, envisioning a real-time github sort of experience for editing. If you need a little encouragement to get pumped about this idea, look at this:

Oh, the possibilities!

It seems inevitable that, given Wave’s features, even more cool uses could be found than simple pairing. Here’s some suggestions:

  • Working with your designer remotely (or not!), she attaches wireframes as images to the project and you work together to create your view templates.
  • Continuous and automatic versioning. Wrote some good code and some crappy code since your last commit? Rewind!
  • Your pair (sitting right next to you) is an awesome coder who insists on an ugly theme or that editor you can’t (under)stand? No big deal.
  • Hard drive blew up and need to commit a fix right now? Get your bespin on!
  • You think of something. No, really.

Parting Thoughts

I try to back up most of my posts here with solid code, but in this case I’m a bit out of my element. Ideally some of you will talk about the following things in the comments or on twitter or whatever you kids are using these days:

  1. Can plugins arbitrate text in the editors I’ve mentioned? In your top favorite editor/IDE/code thingy?
  2. What’s a super cool awesome thing we could do with this that I haven’t thought of?
  3. You have super secret Google access or some other method of building a compliant Wave server without having to wait for Google to end their closed beta. (Looks like our Python friends are beating us to it! PyGoWave Server)

Photo “Gemini 1″ by Flickr User Agent Smith (Jonas Smith), CC By

June 21, 2009

Introducing Moqueue

Filed under: amqp — Daniel DeLeo @ 12:53 pm
Tags: , , , , ,

Rorshach Test 1 by _hadock_ via flickr

TATFT is awesome. AMQP is awesome. If they could be awesome together at the same time, that’s like the dream team. If you can’t get them on the same team, then it’s annoying, like those Kobe and Lebron puppet commercials. The biggest obstacles to creating the dream team, a.k.a. easily testing amqp code, are 1) that amqp objects such as queues and exchanges have relationships to each other that are fairly complicated and don’t lend themselves to simple mocking; 2) the way that AMQP requires a connection to the server (broker) for many of its common operations; and 3) even for simple cases, such as testing code provided to Queue#subscribe, the way amqp shuffles blocks of code around adds a layer of complexity that can be tough to deal with. Given all of this, it’s clear that what’s needed is a way to mock-out the various classes of objects, such Queues, Exchanges, etc., so that they function as normally as possible without requiring a connection to an AMQP broker and give insight into whether or not they’ve received a certain message, whether or not they’ve acknowledged it, and so on.

Moqueue FTW

If you read the title of this post, there won’t be much suspense about what the solution to this problem is. So, without further ado, let me introduce Moqueue, a companion library to amqp (pronounce that Moe-queue, it’s a portmanteau of mocha and queue). Moqueue aims to provide mock objects with an API identical to that provided by real amqp objects, and enough functionality to make code built on amqp run without modification. Moqueue’s test doubles also keep a history of messages received and acknowledged, making it easy to assert that code you’ve built on top of amqp sends and receives the correct messages. Moqueue, is, no surprise, available on github. I’ll make a gem available soon, but for now you gotta use the source. Moqueue doesn’t have any dependencies aside from what’s required by the AMQP library itself. So git yer clone on (git clone git://github.com/danielsdeleo/moqueue.git), and take a quick tour of what Moqueue has to offer.

All the World’s a Nail, but I Only Have This Shotgun

One of Moqueue’s goals is to allow you to mock out amqp at any level. At a very fine level, you can use your favorite mocking library to replace individual queues and exchanges that your application might create. At the course level, Moqueue provides an overload_ampq method, which, as you would expect, overloads methods on AMQP and MQ to return Moqueue’s test doubles instead of real objects. Let’s start there with a little irb exploration:

require "moqueue"
overload_amqp

mq = MQ.new
#=> #<MQ:0x1197ae8>

queue = mq.queue("mocktacular")
#=> #<Moqueue::MockQueue:0x1194550 @name="mocktacular">

topic = mq.topic("lolz")
#=> #<Moqueue::MockExchange:0x11913dc @topic="lolz">

queue.bind(topic, :key=> "cats.*")
#=> #<Moqueue::MockQueue:0x1194550 @name="mocktacular">

queue.subscribe {|header, msg| puts [header.routing_key, msg]}
#=> nil

topic.publish("eatin ur foodz", :key => "cats.inUrFridge")
# cats.inUrFridge
# eatin ur foodz

queue.received_message?("eatin ur foodz")
#=> true

What Just Happened?

This example uses the coarse-grained “mock everything” approach, which we enabled with the #overload_amqp call. Behind the scenes, Moqueue required a file (overloads.rb) that monkey patches the AMQP module and MQ class to use Moqueue’s test doubles for everything. In the next few lines, we see calls to MQ#queue and MQ#topic return a MockQueue and MockExchange. After this, we see the full power of Moqueue. Moqueue has provided us with the ability to bind queues to topic exchanges, just like we do normally. After that, we can publish to the exchange and have the message delivered to the queue, just like normal. On the inside, Moqueue implements the topic exchange routing algorithm, and delivers the message only to the right queues. For example, let’s publish a non-lolcats message:

topic.publish("good grammar is not lolz", :key => "notLolCats")
# queue does not receive message

At the end of the example, we see that the queue has a record of received messages, and allows us to interrogate it about whether it received a particular message. For Test::Unit lovers, it should be pretty apparent that you’re just a assert queue.received_message?("I love T::U!") away from testing your amqp-based code with ease.

I Prefer a Scalpel

If going wild and mocking the whole universe at once isn’t for you, you can use a more surgical approach. The example below is taken right from the spec/examples directory. With some imagination, you should get an idea of how you could use your favorite mocking/stubbing library to replace real Queue and Exchange objects with Moqueue’s test doubles.

require File.dirname(__FILE__) + '/../spec_helper'

describe "AMQP", "when mocked out by Moqueue" do

  before(:each) do
    reset_broker
  end

  it "should have direct exchanges" do
    queue = mock_queue("direct-exchanges")
    queue.publish("you are correct, sir!")
    queue.subscribe { |message| "do something with message" }
    queue.received_message?("you are correct, sir!").should be_true
  end

  it "should have direct exchanges with acks" do
    queue = mock_queue("direct-with-acks")
    queue.publish("yessir!")
    queue.subscribe(:ack => true) { |headers, message| headers.ack  }
    queue.received_ack_for_message?("yessir!").should be_true
  end

  it "should have topic exchanges" do
    topic = mock_exchange(:topic => "TATFT")
    queue = mock_queue("rspec-fiend")
    queue.bind(topic, :key => "bdd.*").subscribe { |msg| "do something" }
    topic.publish("TATFT FTW", :key=> "bdd.4life")
    queue.received_message?("TATFT FTW").should be_true
  end

  it "should have topic exchanges with acks" do
    topic = mock_exchange(:topic => "animals")
    queue = mock_queue("cat-lover")
    queue.bind(topic, :key => "cats.#").subscribe(:ack => true) do |header, msg|
      header.ack
      "do something with message"
    end
    topic.publish("OMG kittehs!", :key => "cats.lolz.kittehs")
    topic.received_ack_for_message?("OMG kittehs!").should be_true
  end

end

The Inevitable Safety Lecture, Part 2358: The Dangers of Mocks

One thing that will be pretty obvious if you’ve used amqp much is that the above examples worked without using EventMachine. For the most part, this is a win, as it makes tests much simpler. The downside, which should be fairly obvious, is that this makes it possible for all of your tests to pass even if your code doesn’t actually run. To mitigate this danger, make sure you run higher-level tests without any mocking. You probably already know that cucumber is awesome for this.

Where It’s At

What it Does Well

As we have seen, Moqueue supports topic exchanges and direct exchanges. Acknowledgements are also supported. With topic exchanges, there’s a small gotcha in that you must call Queue#subscribe before publishing in order for the acknowledgement to be propagated back to the exchange. Direct exchanges are able to record acks regardless of the order of publishing/subscribing. Moqueue has a MockBroker class that keeps references to topic exchanges and queues, ensuring that you get the exact same object if you create a queue with the same name twice or if you create an exchange with the same topic twice. For example:

mock_queue("wibble").object_id == mock_queue("wibble").object_id
# => true

mock_exchange(:topic=>"ruby").object_id == mock_exchange(:topic=>"ruby").object_id
# => true

Although tremendously helpful, this can create dependencies between separate tests if you’re not careful. To avoid this, just add a call to reset_broker in your before(:each) (or equivalent) setup code.

What it May Never Do

Between the intricacies of EventMachine, the way AMQP brokers in general (and RabbitMQ in particular) work, and other factors, there are some particular behaviors you should be aware of that Moqueue does not and might not ever emulate. The first of these is the way that EventMachine schedules message publishing. In my last post, I spent some time showing how EventMachine can build up messages without being able to publish them if you call publish within a loop or iterator. Moqueue is blissfully unaware of EventMachine for the most part (it starts up the EM reactor in the overloaded version of AMQP.start, but that’s all), so calls to MockExchange#publish and MockQueue#publish are blocking–messages are published exactly when you call #publish.

Another behavior that Moqueue does not correctly emulate is the way that real AMQP brokers handle connections and acknowledgments. The amqp library, for instance, requires a graceful shutdown when using acknowledgements in order to correctly notify the server of pending acks. Moqueue doesn’t even have a proper concept of a connection, and handles all acknowledgements immediately, so it cannot mimic the real behavior in this case.

Finally, one of the most commonly discussed issues on the amqp list is the behavior when a client subscribes to a queue containing a backlog of messages. This behavior is most commonly noticed when chaining subscribes, i.e., a message is processed in the subscribe block, then published to another queue. This gist is a good example. The actual behavior of amqp in this situation is to process all of the messages from the first queue, then publish the messages to the next queue, and so on. When using RabbitMQ 1.6 (recently released as a beta), this behavior can be modified by using a “prefetch” option to specify the exact number of messages you want to be sent to a queue at one time. Moqueue doesn’t understand any of this. With Moqueue, a backlog of messages will be processed as soon as a subscribe block is given, however, because all of Moqueue’s #publish calls are blocking, the messages will be passed to subsequent subscribe blocks immediately.

The moral of the story here is that Moqueue is at its best when helping you test small isolated chunks of code: you know, unit testing. Even though one of my primary motivations for making Moqueue so elaborate was to help test the subscribe-then-publish case, keep in mind that Moqueue cannot be relied upon to provide an accurate representation of the global behavior of your app when run in production.

What it Should Do, But Doesn’t Yet

Moqueue started as one of those over-a-weekend coding storms. The common feature of software created this way is usually that it’s missing some obvious features. For Moqueue these are:

  • Rspec Matchers. I want to write queue.should have_received_message("msg txt")
    Update: DONE. See it in action. (Scroll down a bit)
  • Support for fanout exchanges. You can probably fake this using MockExchange#attach_queue, but it would be tedious.
    Update: DONE. You’ll find examples in spec/examples/basic_usage_spec.rb and spec/examples/logger_spec.rb
  • Support for RPC exchanges would also be cool.
  • Mocha style expectations on queues, such as queue.expects_message("hello, world!")
  • Possibly the most glaring deficiency is that Moqueue doesn’t support Queue#pop, only Queue#subscribe.

Of course, comments, compliments, hate mail, and, most importantly bug fixes are welcomed. Got fork?

Happy TATFTing!

May 15, 2009

Basic Topic Publishing with AMQP

Filed under: amqp — Daniel DeLeo @ 8:30 pm
Tags: , ,

telegraph keyRemember the buzz surrounding Erlang when the Pragmatic Programmers released Joe Armstrong’s book? If you’re like me, you bought the book, tried Erlang some, and maybe built a few toy apps, then found that Erlang’s quirks are hard to get over when the problem domain isn’t well tailored to Erlang’s strengths. Now, however, when your ruby program blocks the web server with work that could be done later or maxes out just one core out of two or four or sixty-four, you get back that lovin’ feeling for Erlang’s message-passing, massively parallel charms. One solution is to show off your mad polyglot skillz; JRuby, being GIL-less, is another, but the one we’ll discuss here is AMQP. Using AMQP to pass messages between processes, we can accomplish feats of magic such as processing two weeks worth of data in a day without ever leaving the comfort of ruby.

The Guts

Note: I’m presenting an over-simplified view of AMQP here. If you’re already a pro, skip this. If you’re a noob, keep in mind that I’m just presenting the bare minimum to get you started.

The-amqp-model-for-wikipedia.png

AMQP consists of a few main parts. The server that accepts messages and gets them to their destination(s) is called a broker. Within the broker, we have (very simply):

  • Queues: As you would guess, queues hold messages until we are ready to use them.
  • Exchanges: Exchanges are responsible for handling the messages we send to the server. Exchanges act as a kind of router, passing the messages we send them to one or more queues.
  • Bindings: Bindings are like routes for exchanges; they create a relationship between a single queue and and a single exchange.
  • Connections and Channels: these deal with our physical and logical connections between client and server. You’ll eventually want to learn about them, but the ruby AMQP libraries do a good job of abstracting them, so we’ll skip them for now.

By combining these pieces, we can pass messages in the following ways:

  • One to One: A message is passed from a single producer to a single consumer.
  • One to N: A message is passed from a single producer to many consumers.
  • One to One-of-N: A message is passed from a single producer to one consumer out of a pool.
  • Any of the above with multiple producers (N to One, N to N, etc.)

When you’re ready to learn more about AMQP, definitely read the OpenAMQ project’s background paper.

AMQP in Ruby

To make the following examples work, you’ll need Erlang, RabbitMQ, and the EventMachine, AMQP, and Bunny gems. Getting a decent Erlang install on OS X was quite a bear when I did it last, but I unfortunately didn’t keep any notes. Google is your friend here. RabbitMQ can be obtained with a simple

sudo port install rabbitmq-server

(I’m not sure but I believe Ubuntu users can simply substitute aptitude or apt-get for port)
and you can get the gems from rubyforge (EventMachine is a prerequisite for the amqp gem):

sudo gem install amqp bunny

Topic Publishing

The examples here will all deal with topic exchanges. Topic exchanges route messages based on simple keyword matching, with keywords separated by dots. Using the canonical stocks example, we might have keys of the form country.stock-exchange.company like us.nasdaq.aapl and us.nyse.ge. When we subscribe, we get Apple’s stock price by binding our queue to us.nasdaq.aapl or we could get all US stocks by binding to us.* or all NASDAQ stocks by binding to us.nasdaq.* There are a few other types of exchanges, including RPC, direct exchanges, and fanout exchanges. Be sure to check out amqp’s and bunny’s examples to get started with these.

To keep it simple, we’ll be publishing simple counters with timestamps as strings. Note that we can only use string messages with AMQP. If you need to send arbitrary ruby objects, you’ll need to use Marshal or JSON (or YAML) to serialize them to/from strings.

To begin make sure RabbitMQ is running:

sudo rabbitmq-server start

If you’re on Ubuntu, apt-get/aptitude might have done this for you, so don’t worry about errors.

Now you should be able to run this example:

require "rubygems"
require "eventmachine"
require "mq"
require "amqp"

# comment out one of these to pick communication model
PUBLISH_MODE = :one_to_n
#PUBLISH_MODE = :one_to_one_of_n

START_TIME = Time.new

def finish
  EM.forks.each {|pid| Process.kill("KILL", pid)}
  EM.stop { AMQP.stop }
end

def gen_queue_name
  if PUBLISH_MODE == :one_to_n
    rand(2 ** 16).to_s(16)
  else
    "foobarbaz"
  end
end

def timestamp
  delta_t = Time.new - START_TIME
  delta_t.to_s
end

EM.fork(3) do
  mq = MQ.new
  qname = gen_queue_name
  puts "starting queue ``#{qname}'', PID: #{Process.pid.to_s}"
  q = mq.queue(qname).bind(mq.topic("test"), :key => "test.key")
  q.subscribe do |msg|
    puts "[queue #{qname} (pid #{Process.pid.to_s})] received msg: #{msg} @ #{timestamp}"
  end
end

AMQP.start do
  i = 0
  topic = MQ.new.topic("test")
  EM.add_periodic_timer(0.1) do
    i += 1
    topic.publish("[#{i}] " + timestamp, :key => "test.key")
    finish if i >= 10
  end
end

The code begins by forking three new processes with EM.fork. This is essentially just a wrapper for Kernel.fork that additionally runs the provided block inside EM.run. Inside the forks, we create a new queue, bind it to the topic “test”, and filter for “test.key”. Then we start subscribing to the queue so we can receive the messages. The important part here is how we switch between a one to N or one to one of N communication model. When all of the subscriber processes use unique queue names, they have unique queues and each queue gets a copy of the message. When the subscribers all share the same queue name, they are subscribed to one shared queue, and each message will be received by only one the subscribers. Also note that, through the magic of EventMachine, we don’t need to explicitly poll for messages. When they arrive, EventMachine takes care of executing the code in the subscribe block.

Now that our subscribers are running, we set up our publisher. Here we’re explicitly using AMQP.start to start our connection to RabbitMQ. Our subscribers called this method implicitly when we called MQ#queue. Note also that we’re placing our publisher code in a block given to AMQP.start, but this isn’t strictly necessary: we could have called AMQP.start without a block or even omitted it and let MQ#topic call it implicitly. If we had done this, we would need to run the publisher code as a block given to EM.run.

For our publisher, we’re using EventMachine’s add_periodic_timer to schedule publishing every 1/10th of a second. This helps avoid some quirks that we’ll look at later. Note that we’re using Topic#publish here and ignoring queues completely.

When you run the example with PUBLISH_MODE == :one_to_n you should see something like this:

starting queue ``3170'', PID: 2912
starting queue ``62b2'', PID: 2910
starting queue ``9828'', PID: 2911
[queue 3170 (pid 2912)] received msg: [1] 0.249223 @ 0.252471
[queue 62b2 (pid 2910)] received msg: [1] 0.249223 @ 0.252466
[queue 9828 (pid 2911)] received msg: [1] 0.249223 @ 0.253655
[queue 62b2 (pid 2910)] received msg: [2] 0.430483 @ 0.4334
[queue 3170 (pid 2912)] received msg: [2] 0.430483 @ 0.433398
[queue 9828 (pid 2911)] received msg: [2] 0.430483 @ 0.434498

...

[queue 3170 (pid 2912)] received msg: [9] 1.698417 @ 1.702544
[queue 62b2 (pid 2910)] received msg: [9] 1.698417 @ 1.702538
[queue 9828 (pid 2911)] received msg: [9] 1.698417 @ 1.703674

With PUBLISH_MODE == :one_to_one_of_n You’ll see:

starting queue ``foobarbaz'', PID: 4911 at 0.00475
starting queue ``foobarbaz'', PID: 4910 at 0.005266
starting queue ``foobarbaz'', PID: 4909 at 0.016942
[queue foobarbaz (pid 4911)] received msg: [1] 0.355391 at 0.358376
[queue foobarbaz (pid 4909)] received msg: [2] 0.53673 at 0.539733

...

[queue foobarbaz (pid 4909)] received msg: [8] 1.623612 at 1.626233
[queue foobarbaz (pid 4910)] received msg: [9] 1.804772 at 1.807327

Blocking EventMachine

In the example so far, we’ve used EM.add_periodic_timer to ensure that we published messages in an orderly way. For many use-cases, however, this won’t be the case. For example, if we have all of our input data available before we start publishing, we might be tempted to use Enumerable#each, like this:


AMQP.start do
  topic = MQ.new.topic("test")
  (1..10).each do |i|
    topic.publish("[#{i}] " + timestamp, :key => "test.key")
    sleep(0.1)
  end

  # Give the subscribers time to catch up before killing them:
  EM.add_timer(2) do
    finish
  end
end

If we run this, we’ll see:

starting queue ``foobarbaz'', PID: 5438
starting queue ``foobarbaz'', PID: 5440
starting queue ``foobarbaz'', PID: 5439
[queue foobarbaz (pid 5438)] received msg: [1] 0.070594 at 1.084451
[queue foobarbaz (pid 5438)] received msg: [4] 0.37131 at 1.085436
[queue foobarbaz (pid 5440)] received msg: [2] 0.17085 at 1.085788
[queue foobarbaz (pid 5438)] received msg: [7] 0.671874 at 1.086368
[queue foobarbaz (pid 5440)] received msg: [5] 0.471554 at 1.086662
[queue foobarbaz (pid 5438)] received msg: [10] 0.972494 at 1.087264
[queue foobarbaz (pid 5440)] received msg: [8] 0.772093 at 1.087417
[queue foobarbaz (pid 5439)] received msg: [3] 0.27107 at 1.088342
[queue foobarbaz (pid 5439)] received msg: [6] 0.571704 at 1.088939
[queue foobarbaz (pid 5439)] received msg: [9] 0.872309 at 1.089509

Notice how all of the messages are received at about the same time? This happens because the sleep is blocking EventMachine, so it doesn’t get a chance to publish a message before #each moves on to the next message. In extreme cases, our connection to RabbitMQ might timeout while waiting for messages, causing errors when we get around to actually publishing them. There are few ways to work around this, but the simplest is to use a synchronous (non-Event Machine) AMQP library for publishing.

Bunny Hop

My personal choice of synchronous AMQP library is bunny. Here’s how we’d use bunny to fix our problem with the previous example:

require "rubygems"
require "eventmachine"
require "mq"
require "amqp"
require "bunny"

#PUBLISH_MODE = :one_to_n
PUBLISH_MODE = :one_to_one_of_n

START_TIME = Time.new

def finish
  EM.forks.each {|pid| Process.kill("KILL", pid)}
end

def gen_qname
  if PUBLISH_MODE == :one_to_n
    rand(2 ** 16).to_s(16)
  else
    "foobarbaz"
  end
end

def timestamp
  delta_t = Time.new - START_TIME
  delta_t.to_s
end

EM.fork(3) do
  mq = MQ.new
  qname = gen_qname
  puts "starting queue ``#{qname}'', PID: #{Process.pid.to_s}"
  q = mq.queue(qname).bind(mq.topic("test"), :key => "test.key")
  q.subscribe do |msg|
    puts "[queue #{qname} (pid #{Process.pid.to_s})] received msg: #{msg} @ #{timestamp}"
  end
end

connection = Bunny.new
connection.start

topic = connection.exchange("test", :type => :topic)
(1..10).each do |i|
  topic.publish("[#{i}] " + timestamp, :key => "test.key")
  sleep(0.1)
end

finish
connection.stop

The API for bunny has a few differences from the amqp library API, but is also very simple. The examples are also quite helpful.

The main difference with using bunny to publish the messages is that topic.publish blocks the execution of #each until the message is actually published to the server, whereas amqp is using EventMachine’s magic to buffer the messages and continue execution right away. There are advantages to each approach, of course, but as you can see in these examples, using the blocking, synchronous library is in many ways the easier approach. Another “+1″ for this approach, as I mentioned in a previous post, is that using EventMachine within Phusion Passenger can lead to grumpiness, or at least snarky comments on the internet. For example, if you are doing something like generating large PDFs and emailing them to users based on actions in a web app, you could use AMQP to send the relevant parameters to a separate process that does the PDF generating to avoid blocking the web server. Using bunny instead of amqp in your web app code lets avoid the special engineering required to get passenger and EventMachine to play nice with each other.

Resources

  • Download the examples: Demo #1, Demo #2, Demo #3
  • If you have problems with the examples, give me a shout in the comments, or on twitter.
  • The readme for amqp links to a huge amount of AMQP and messaging resources, including some classic critiques of the RPC paradigm.
  • Follow Bunny development at its github page.
  • Take a look at nanite if you’re interested in a more framework-ey approach.
  • This in-depth description of AMQP is worth linking to again. Skim it a few times, then give it a full read once you’ve groked the basics.
  • (Update) I wrote a library that provides test doubles for the AMQP module and MQ classes. If you’re interested in doing TDD with AMQP, give it a try: CODE | WORDS
  • (Update) I also wrote a library to make AMQP (the EventMachine one) work with web apps more easily. It checks to see if it’s running in Passnger, Thin, or Mongrel and takes care of setting up whatever worker threads and/or callbacks are needed to make AMQP “just work.” WORDS | CODE

Telegraph Key picture by Flickr user photobunny (synchronicity, baby!). CC by-nc-nd
AMQP diagram by Tony Garnock-Jones for RabbitMQ via Wikipedia.

May 2, 2009

A Few Notes on AMQP

Filed under: amqp — Daniel DeLeo @ 7:21 pm
Tags: , , ,
  • When testing with tmm1’s amqp, don’t forget to call AMQP.stop in your after() or teardown hooks: not calling AMQP.stop can cause strange failures. One symptom of this problem is that using rspec’s -R (reverse order) option will change which examples are failing.
  • Anyone who writes a mocking library for amqp will be my hero. It would be awesome to have something like this work without the broker running:
    describe AMQPcode do
    
      it "should publish to a queue" do
        @queue.publish("messaging FTW!")
        @queue.should have(1).published_messages
      end
    
      it "should bind the queue to topic X with key Y.Z" do
        @queue.should be_bound_to(:topic => "X", :key => "Y.Z")
      end
    
    end
    

    Update: I’m my own hero! Get Moqueue from github and read my introduction.

  • Bunny is my favorite synchronous ruby AMQP library. With the newest release, it can be used simultaneously with the evented amqp library. This is a big plus if you have back-end code using amqp, but you want to use a synchronous library in a web app and you are testing integration between the two. Could also be helpful in avoiding the issues and workarounds inherent in using EventMachine within Passenger. Bunny is super-simple to use, see for yourself in the examples.

April 25, 2009

Testing Evented and AMQP Apps with EM-Spec

Filed under: amqp — Daniel DeLeo @ 4:07 am
Tags: , , , ,

Argon-ion laser launches into fiberI was recently working on some example code to demonstrate amqp and got frustrated with the impedance mismatch betweenEventMachine’s asynchronous nature and rspec’s assumption of synchronous test code. I had something seemingly working, but it was flaky and would fail examples seemingly at random. I got some nice tips from the mailing list pointing me to em-spec, an async-enabled BDD library. The Rspec support was initially pretty lacking, but I was able to submit some patches to get it usable. Since then I’ve made a bunch of improvements on my own fork. Hopefully these patches will get pushed upstream soon. Until then, if you want to follow along with the examples below, be sure to grab my version.

This Ain’t (Exactly) Yer Dad’s Rspec

…but it’s close. There are a few differences:

  • Your examples magically run inside an EM.run block. Hooray! Unfortunately, so do your before() and after() blocks. We’ll see the consequences below.
  • Your examples are run in a Fiber. On pre-1.9 Rubies, fibers are emulated with Threads.
  • You’ll need to run the #done method at the end of your test to tell the fiber that your example is done with it.

Getting Started

Let’s begin with a simple example, taken from em-spec’s own specs.

describe EventMachine do
  include EM::Spec

  it 'should have timers' do
    start = Time.now

    EM.add_timer(0.5){
      (Time.now-start).should be_close( 0.5, 0.1 )
      done
    }
  end
end

To enable the evented behavior, we start by including EMSpec. The first example shows some basic EventMachine behavior, using a timer to schedule code to run at some future time. The first thing to note is that we need to call #done to keep the example from running its fiber forever. Also, note that both our expectation (#should) and #done are called inside the add_timer block. When testing evented code, you’ll find that this is the easiest and most natural way to go. If we had instead placed the call to #done outside of the add_timer block, we would stop execution of the example before the expectation runs:

  it 'should not be used like this' do
    # Really, this is the wrong way!
    start = Time.now
    EM.add_timer(0.5){
      (Time.now-start).should be_close( 12345, 0.1 ) #never runs!
    }
    done
  end

# Example Passes (OOPS!)
# EventMachine
# - should not be used like this
#
# Finished in 0.004999 seconds
#
# 1 example, 0 failures

The moral of the story is to pay attention to placing your expectations and #done in the right place. If you’re adding tests to existing code (lecture omitted), be sure your tests can fail by putting bogus values in the matchers and running your specs.

All of the other EventMachine behaviors, such as periodic timers and deferrables work similarly:

  it 'should have periodic timers' do
    num, start = 0, Time.now

    timer = EM.add_periodic_timer(0.5){
      if (num += 1) == 2
        (Time.now-start).should be_close( 1.0, 0.1 )
        EM.__send__ :cancel_timer, timer
        done
      end
    }
  end

  it 'should have deferrables' do
    defr = EM::DefaultDeferrable.new
    defr.timeout(1)
    defr.errback{
      done
    }
  end

Gotchas

Unfortunately, the way that em-spec is monkey patched into Rspec makes your #before() and #after() blocks run inside their own EM-enabled fibers. For example, if you use #before(:each), you need to top off your code with #done:

before(:each)
  @test_object = Something.new
  done
end

Additionally, each before block will run inside its own personal EM.run block, and the EventMachine reactor is stopped when #done is called. This means that you can set instance variables based on the result of evented code in a before block, but any state set up within the reactor will be lost. This is particularly important when testing AMQP applications.

If this bothers you, or only a few of your expectations use evented behavior, you can use EMSpec as a helper. You’ll have to explicitly wrap evented code inside a block you pass to the em method:

describe EventedCode do
  include EM::SpecHelper

  it "should do something evented" do
    em do
      # Your test here
    end
  end

  it "does something not evented" do
    # ...
    @one_thing.should == another_thing
    # No need to call done
  end
end

When running evented tests, it’s a good idea to use rspec’s timeout ( -t ) option. This will automatically fail your tests if you botched something and #done never gets called.

AMQP and EMSpec

You can use EMSpec to test applications using Aman Gupta’s amqp library; in fact, that’s why I got started playing with it in the first place. For example, a test using a direct exchange looks like this:

  it "should have direct exchanges" do
    q = MQ.new.queue("example")
    q.publish("hi mom!")
    q.subscribe { |msg|
      msg.should == "hi mom!"
      done
    }
  end

Once again, placement of the call to done is crucial. Here, we’ve put it in the subscribe block to ensure that it only runs after amqp has received the message.

When testing amqp code, if you have several tests that all use amqp, you may encounter an error message like this: NOT_ALLOWED - attempt to reuse consumer tag 'another-example-635559437038' in AMQP::Protocol::Basic::Consume. I don’t know what this is, but you can work around this issue by using an empty after block:

  after(:each) do
    done
  end

At this point, proponents of mocking and stubbing will certainly be jeering. “Your test code relies on the AMQP server, sacrilege!” they’ll say. I agree, but, as a matter of pragmatism, I’m putting up with this for now.

Go Forth and TATFT!

EMSpec makes testing evented code brain-dead simple. When using it in the helper flavor, you can even use it with cucumber and probably Test::Unit. If you’re developing evented code, you can’t say “we ain’t got no rspec” anymore.


The image at the top by fatllama, CC2 by-cc-sa

Next Page »

Blog at WordPress.com.