The future of MailRank's open source technologies

You've probably seen our exciting news about our upcoming move to Facebook.

It's been a total blast working on our product, and of course as we did so we released a number of open source libraries and tools. It only added to our pleasure to see so much of that code used outside of our own domain. I will continue to develop and maintain the code that we have released.

Here is a quick rundown of the code we have released, roughly ordered by significance. Yep, we wrote all of these projects in Haskell, a decision that in retrospect I'm very happy about.

  • pronk (not yet actually released) is an application for load testing web servers. Think of it as similar to httperf or ab, only more modern, simpler to deal with, and with vastly better analytic and reporting capabilities.

  • configurator is a library that allows fast, dynamic reconfiguration of a Haskell application or daemon.

  • aeson is a JSON encoding and decoding library optimized for high performance and ease of use.

  • text-format is a library for printf-like text formatting.

  • mysql-simple is an easy-to-use client library for the MySQL database. It is several times faster than its competitors, and easier to use. It is built on top of the low-level mysql library.

  • riak-haskell-client is a client for the Riak decentralized data store.

  • blaze-textual is a library for efficiently rendering Haskell data as text.

  • double-conversion is a very fast library for rendering double precision floating point numbers as text, based on the code from the V8 JavaScript engine.

  • resource-pool is a fast resource pooling library.

  • snappy provides Haskell bindings to Google's extremely fast snappy compression library.

  • base16-bytestring provides fast handling of base16-encoded data.

  • hdbc-mysql provides a MySQL transport for the HDBC database access library. (Yes, we recommend using mysql-simple instead!)

Thanks to all of you who have contributed patches and bug reports. It's going to be an exciting future!

Bryan at Strange Loop 2011

Bryan has been at the Strange Loop 2011 conference in St. Louis the past few days. He ran a 3-hour "Intro to Haskell" workshop, and you can check out those slides here.

Slides for his talk, "Running a Startup on Haskell", are available via GitHub here. The talk hits the highlights of some of the decisions we've made at MailRank over the past several months, along with some entertaining lessons learned in the process.

The right database, or the right database for now?

When we started working on our initial product, I was very interested in putting a newish distributed data store named Riak through its paces. It's a pretty solid piece of technology, but also somewhat new, so when we started down that road, I was far from convinced that we'd stick with it. And sure enough, we recently switched from Riak to MySQL. This is a brief tale of how that came about.

The good

Riak generally performs as advertised on the label: it's a bare-bones distributed key/value store, with little extra. It behaves quite solidly and predictably.

The not-so-good

What drove us to switch away from Riak was a mismatch that emerged between our needs and what it currently offers. Like many applications, we internally deal with not one, but a large number of collections of data, so we need to be able to manipulate those collections efficiently. The executive summary is that Riak does not currently provide a way to do this, so we were forced to drop it, at least for now.

The slightly longer version of this story is that the usual way to manage a collection is via an index. Riak supports exactly one index, which is by object name. The Riak engineering team is working on support for secondary indices, but I only found out about this by accident (if I were them, I'd be clarifying my public roadmap), and it's not clear when secondary indices will be available or whether they'll perform well.

It's possible to emulate a secondary index in Riak by storing a collection of object names as an object in its own right, but this is intrinsically slow due to the indirection it introduces, and managing a collection quickly becomes a bottleneck if it must be updated frequently. The first chart below tells the story: it illustrates the cost of adding an item to a collection as the collection grows in size. The blue line represents a hand-maintained secondary index in Riak. Ignore the other lines for now (they're what we're doing instead; read on).

[Chart: time per batch]

That looks a little uncomfortable, but it's hardly surprising: our "hey, let's fake a secondary index" approach gets slower as the collection grows. The overall effect of those increasing costs should make us squirm some more: check out the blue line in the cumulative chart below. This shows the passage of time as we successively add items to a collection.

[Chart: cumulative time]

That's far from the kind of performance curve you can comfortably ship with.
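
To make the cost of the hand-maintained index concrete, here is a minimal sketch of the idea as a toy model. It is not our production code and it does not talk to Riak: the store is a plain in-memory map, and the names `Store`, `readIndex`, `addToCollection`, and `fetchCollection` are all made up for illustration.

```haskell
import qualified Data.Map.Strict as Map
import Data.Maybe (fromMaybe)

-- A toy stand-in for a key/value store: object name -> opaque payload.
-- (Hypothetical model; a real Riak round trip goes over the network.)
type Store = Map.Map String String

-- The index is just another object whose payload lists member names,
-- one per line.
readIndex :: String -> Store -> [String]
readIndex ix store = lines (fromMaybe "" (Map.lookup ix store))

-- Adding a member means fetching, decoding, extending, re-encoding, and
-- writing back the entire index object, so the work grows with its size.
addToCollection :: String -> String -> String -> Store -> Store
addToCollection ix name payload store =
  let members  = readIndex ix store
      members' = if name `elem` members then members else members ++ [name]
  in Map.insert ix (unlines members') (Map.insert name payload store)

-- Reading a collection costs one fetch for the index, then one per member.
fetchCollection :: String -> Store -> [String]
fetchCollection ix store =
  [ v | name <- readIndex ix store, Just v <- [Map.lookup name store] ]
```

Every call to `addToCollection` rewrites the whole index object, which is the same shape of cost that makes the blue line climb as the collection grows.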

Our initial response

As an initial proof of concept, we switched from Riak to MySQL, using the Haskell HDBC library to talk to MySQL. This improved our performance dramatically, since we can use MySQL's indices to manage our collections. MySQL can of course incrementally update an index, and this incrementality is really the part that's critical to predictably decent performance for our application.
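
To illustrate what "incremental" buys us, here is a loose analogy in Haskell. It is not how MySQL implements its indices, just an in-memory model with made-up names (`Index`, `insertRow`, `rowsIn`), in which adding one entry touches a single path through a balanced tree instead of rewriting the whole collection.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Hypothetical in-memory stand-in for an indexed column:
-- collection id -> ordered set of row keys.
type Index = Map.Map Int (Set.Set Int)

-- Adding a row updates one entry in a balanced tree; the rest of the
-- collection is left untouched.
insertRow :: Int -> Int -> Index -> Index
insertRow coll row = Map.insertWith Set.union coll (Set.singleton row)

-- List the rows of one collection in key order.
rowsIn :: Int -> Index -> [Int]
rowsIn coll = maybe [] Set.toAscList . Map.lookup coll
```

The cost of `insertRow` grows only logarithmically with the size of the collection, so a long run of insertions stays predictable.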

I would be untroubled by Riak being much slower than MySQL (which it is) if I could manage collections with it. It has enough other nice operational features (and I've been burned in production enough times by MySQL) that it still looks pretty attractive. I'll be quite interested to follow Basho's work on secondary indices.

What we did next

The HDBC library (the green lines in the charts above) has been around quite a long time, and it's a little long in the tooth. I am unthrilled by a few aspects of working with it: its performance is poor; its API isn't particularly easy to use; and it's too easy to write code that's vulnerable to SQL injection attacks.

As a result, I wrote the mysql-simple library, which addresses all three issues:

  • Performance: simply switching from HDBC to mysql-simple improved the overall performance of our application by 50%. That's a big deal!
  • Ease of use: the amount of DB-related code in the application dropped by 30%, and the code became easier to write and to read.
  • Security: the library intentionally makes it difficult to construct arbitrary query strings, and automatically performs safe quoting of data.
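
Here is a small sketch of what the security point looks like in practice. The `messages` table, its columns, and the names `messagesForUser` and `fetchMessages` are hypothetical, and running it for real assumes a reachable MySQL server with such a table.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Database.MySQL.Simple

-- The template is a 'Query', not a 'String', so building one out of
-- untrusted input takes deliberate effort. The ? marks a parameter.
messagesForUser :: Query
messagesForUser = "select id, subject from messages where recipient = ?"

-- 'query' escapes the parameter and substitutes it for the placeholder
-- before sending the statement. 'Only' wraps a single parameter.
fetchMessages :: Connection -> Text -> IO [(Int, Text)]
fetchMessages conn recipient = query conn messagesForUser (Only recipient)
```

Given a connection from `connect defaultConnectInfo`, a call such as `fetchMessages conn "alice@example.com' or '1'='1"` treats the whole argument as one quoted string value, so the stray quote characters cannot change the shape of the query.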

And because no good thing should be enjoyed alone, we've released this library under the liberal 3-clause BSD license. Enjoy!

Introducing some open source technologies

I'm Bryan O'Sullivan, and Bethanye Blount and I cofounded MailRank at the beginning of December. We are heads-down building a product that we're very excited about, and which we'll be ready to talk about more soon.

Like many startups, as we work on our product, we create a lot of "incidental" technology: we need it, but it's not some kind of special secret that we'd benefit from keeping under wraps. Since both of us have long backgrounds in the worlds of open source and open content, we are excited to be able to share some of our work early, in the hope that other folks out there will find it useful.

Innovative Riak bindings

We have built a client library for the Riak decentralized data store, from our friends at Basho Technologies. Our library focuses on high performance, flexibility, and correctness. It sports several features that we are pleased with:

  • Written in, and for, the powerful Haskell programming language
  • Uses the protocol buffer API, and some low-level networking tricks, to achieve high performance
  • Supports pipelining of requests, so you can issue multiple requests before receiving any responses
  • Uses the Haskell type system to support automatic resolution of vector clock conflicts

We think that this combination of features makes our Riak library particularly nice to work with.
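
To give a flavour of how a type system can drive conflict resolution, here is a minimal sketch. The class and functions are defined locally for illustration and are not necessarily the library's actual interface; `Tags` and `resolveSiblings` are made-up names.

```haskell
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.Set as Set

-- Hypothetical class: a type that knows how to merge two conflicting
-- versions of itself into one.
class Resolvable a where
  resolve :: a -> a -> a

-- Example domain type: a set of tags, where a conflict is settled by
-- keeping every tag that either version has.
newtype Tags = Tags (Set.Set String)
  deriving (Eq, Show)

instance Resolvable Tags where
  resolve (Tags a) (Tags b) = Tags (Set.union a b)

-- When a read returns several conflicting siblings, fold them down to
-- a single value using the type's own merge rule.
resolveSiblings :: Resolvable a => NonEmpty a -> a
resolveSiblings (x :| xs) = foldl resolve x xs
```

The appeal of this design is that the merge rule lives with the data type, so client code that reads a value does not need to handle conflicts case by case.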

If you're already using Haskell, you can install the Riak library with a single command:

cabal install riak

We've made the source code available as riak-haskell-client on GitHub.

A faster JSON library

We like building sleek, performant code, so we wrote our own JSON library. Compared to the most popular existing JSON library for Haskell, ours is a lot faster and uses more compact data structures. The library is named Aeson (after Jason's father in Greek mythology).

To install:

cabal install aeson

We've made the source code available as aeson on GitHub.
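
Here is a short sketch of encoding and decoding with aeson. It assumes a recent release of the library with support for generically derived instances, so details may differ from the version described in this post; the `Message` type and `roundTrip` function are made up for illustration.

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (FromJSON, ToJSON, decode, encode)
import Data.Text (Text)
import GHC.Generics (Generic)

-- A hypothetical record to serialize.
data Message = Message
  { sender :: Text
  , score  :: Int
  } deriving (Eq, Show, Generic)

-- Empty instance bodies pick up the generic default implementations,
-- which map record fields to JSON object keys of the same name.
instance ToJSON Message
instance FromJSON Message

-- Encode to a lazy ByteString, then decode it back.
-- 'decode' returns Nothing if the input is not a valid Message.
roundTrip :: Message -> Maybe Message
roundTrip = decode . encode
```

A value that survives `roundTrip` unchanged is a handy sanity check when adding a new type to an application.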

The future

We hope to continue releasing more code as the opportunities arise. As our use of GitHub implies, we are happy to receive bug reports and patches, and we'll act on them as time permits.

Have fun with our code! We hope you'll find it to be useful, solid, and well documented.
