Running CouchDB against a RAMDisk

Struggling with a slow-running test suite, I’d previously experimented with one db per test. Although superficially elegant, the fact that the dbs were pre-created meant that tests that should fail, passed, and vice versa.

Unsatisfied, I started looking for a different solution. CouchDB allows dbs to be created under a sub-dir, so creating a db named foo/bar will result in the following structure (assuming a default install):

  • /usr/local/var/lib/couchdb/foo/bar.couch
  • /usr/local/var/lib/couchdb/.foo/bar_design/my_design_doc.view
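As a rough sketch of how such a name maps to a request URL and to a file on disk: the slash has to be percent-encoded in the URL path, while on disk it becomes a directory separator. The server address, data directory and helper names below are my own assumptions for illustration, not anything CouchDB provides.

```ruby
require "uri"

# Assumed defaults for illustration; adjust for your own install.
COUCH_SERVER = "http://localhost:5984"
DATA_DIR     = "/usr/local/var/lib/couchdb"

# Hypothetical helper: the slash in a db name must be percent-encoded
# when the name appears in a URL path.
def db_url(db_name, server = COUCH_SERVER)
  "#{server}/#{URI.encode_www_form_component(db_name)}"
end

# Hypothetical helper: on disk the same slash is a directory separator.
def db_file(db_name, data_dir = DATA_DIR)
  File.join(data_dir, "#{db_name}.couch")
end

puts db_url("foo/bar")   # http://localhost:5984/foo%2Fbar
puts db_file("foo/bar")  # /usr/local/var/lib/couchdb/foo/bar.couch
```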

I mounted two RAM disks following the example in the man page for hdid (OS X).

DB_MOUNT=/usr/local/var/lib/couchdb/sd_test_rd
NUM_SECTORS=128000 # 2 * 1024 * Size in MB

RAM_DEV=`hdid -nomount ram://$NUM_SECTORS`
newfs_hfs $RAM_DEV
mkdir -p $DB_MOUNT
mount -t hfs $RAM_DEV $DB_MOUNT

A quick test with dd showed improved, if not stellar, performance. Running dd if=/dev/zero of=$DB_MOUNT/foo bs=1024 count=50000 took about 0.5s vs 3.5s against the HDD.
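For anyone wanting to check the arithmetic, the sector count and the dd figures work out as follows. This is only a back-of-the-envelope sketch; the method names are mine, and it assumes the 512-byte sectors implied by the "2 * 1024 * Size in MB" comment.

```ruby
SECTOR_BYTES = 512             # ram:// sizes are counted in 512-byte sectors
MB           = 1024.0 * 1024.0 # bytes per MB

# Sectors needed for a RAM disk of the given size (2 * 1024 * size in MB).
def sectors_for_mb(mb)
  (mb * MB / SECTOR_BYTES).to_i
end

# Size in MB of a RAM disk with the given number of sectors.
def mb_for_sectors(sectors)
  sectors * SECTOR_BYTES / MB
end

# Throughput in MB/s for a dd run of `count` blocks of `block_size` bytes.
def dd_throughput(block_size, count, seconds)
  block_size * count / MB / seconds
end

puts mb_for_sectors(128_000)                    # 62.5
puts sectors_for_mb(64)                         # 131072
puts dd_throughput(1024, 50_000, 0.5).round(1)  # 97.7
puts dd_throughput(1024, 50_000, 3.5).round(1)  # 14.0
```

So the 128000 sectors above give a disk of 62.5 MB, and the dd run wrote roughly 49 MB at about 98 MB/s to the RAM disk against about 14 MB/s to the HDD.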

Unfortunately, this improvement didn’t translate to CouchDB. Run time for my application’s test suite was virtually unchanged. I haven’t yet had the time to study why, but at a guess, small, frequent writes simply don’t see the same runtime improvement as large writes on a RAM disk.

I also tried a couple of alternatives, like the following, that resulted in no discernible difference.

  • diskutil erasevolume HFS+ "ram disk" `hdiutil attach -nomount ram://4629672`

I haven’t given up yet though. #couchdb’s manveru has been talking about an in-memory Ruby implementation of CouchDB. The nascent JS implementations of CouchDB, coupled with something like Helma or POW could also be ideal for frequent quick runs of a test suite.

Update: Spotlight chose to rebuild its indexes after I rebooted, presumably as a result of my device / volume mangling.

Posted by Paul Mon, 27 Apr 2009 10:01:00 GMT


Semi structuring CouchDB databases

A recurring question on the CouchDB mailing list and IRC channel is one of document structure. Most developers exploring CouchDB bring their SQL heritage along for the ride. Shifting your way of thinking doesn’t happen overnight, but using CouchDB effectively requires ditching the relational mindset.

Alex Lang’s recent Scotland on Rails talk about CouchDB and its Ruby libraries generated a little flutter on Twitter. At heart was his assertion that while you could take a relational approach when structuring your documents, to do so would be to miss the point. I concur. Given that the terminology around relational dbs and normalization is often abused, I’ll make what I mean explicit.

The relational in RDBMS has a well defined meaning that has nothing to do with multiple tables relating to one another. It means simply that a tuple has a value defined over it - the cells in a table row are related in some way and that relation has a meaning associated with it. That’s it.

A document oriented database has no real analog of a table and so is inherently non-relational. You could, of course, impose relational semantics by structuring your documents in a particular way. You could also normalize your structured documents, taking care not to assign an object or array to a document property, as doing so would preclude your database from being considered normalized (1NF). But why bother? Why not just use an RDBMS?

I’ll add, in case of ambiguity, that storing non-normalized data is completely orthogonal to storing doc ids as document properties, thereby allowing inter-document navigation.

So given that the relational model is effectively abandoned, how should document databases be structured? I’ll simply offer my own opinion, noting that many different approaches exist. (One of the more interesting deviations from the standard RDBMS model is the single db per user approach.)

  1. If new data supersedes existing data, should the existing data be kept (e.g. for analysis), or may it be deleted?

    In general, larger databases and views take longer to query. The difference is typically marginal, unless querying with a group_level > 0 in which case the repeated executions of the reduce function may become significant. A simple benchmark I carried out resulted in an avg 0.035s response time when querying against 1.6m docs with group_level=0. Querying with group_level=2 took 2.010s. It’s worth pointing out that I conducted the benchmarks in December; CouchDB improves rapidly.

  2. Should writes be contention free?

    Guaranteeing contention free writes means a new doc per write.

  3. How do you want to retrieve the data?

    While views offer great flexibility in aggregating data, an up-front understanding of how you’ll want to retrieve your data is beneficial. Issuing multiple requests to CouchDB and performing client side aggregation is a fine approach, but sometimes it’s simply easier to retrieve all the required data in a single request. This is particularly the case when paginating a result set. Trying to paginate across the combined results of multiple queries would be a real PITA.

  4. Judicious denormalization

    You can’t really denormalize something that wasn’t already normalized, but the idea is to embed unlikely-to-change properties in documents that aren’t the canonical source of that property.

    For example, an Invite may contain a recipient name and sender name, but also contain the doc ids of the docs representing the recipient and sender. This approach allows invites and their relevant presentational information to be retrieved with just a single CouchDB query, but it also allows for simple navigation to the referenced documents (recipient and sender). I make heavy use of this approach myself and RelaxDB supports it explicitly.
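To make the Invite example concrete, here is a minimal sketch of what such a document might look like. The property names, ids and the build_invite helper are my own illustrative assumptions; they are not a RelaxDB convention.

```ruby
require "json"

# Hypothetical helper: embed the unlikely-to-change names for presentation,
# and keep the doc ids so the canonical documents can still be reached.
def build_invite(id, sender, recipient)
  {
    "_id"            => id,
    "type"           => "Invite",
    "sender_id"      => sender["_id"],
    "sender_name"    => sender["name"],
    "recipient_id"   => recipient["_id"],
    "recipient_name" => recipient["name"]
  }
end

sender    = { "_id" => "user-1", "name" => "Ann" }
recipient = { "_id" => "user-2", "name" => "Bob" }
invite    = build_invite("invite-1", sender, recipient)

# One query for the invite is enough to render "Ann invited Bob".
puts JSON.pretty_generate(invite)
```

The trade-off is that if a name ever does change, the embedded copies go stale until they are rewritten, which is why this only suits properties that are unlikely to change.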

Posted by Paul Sun, 05 Apr 2009 11:12:00 GMT


Quick and Effective - Per test databases with CouchDB

Testing is surely the aspect of software development that I find most irksome. If the goal is an efficacious, fast-running test suite, it’s not an easy target and as a result, I’m almost never happy with my tests.

A few years ago, tired of a ‘unit test’ suite that took twenty minutes to run, and a functional test suite that took almost five times as long, I jumped on the mock bandwagon as it went by. I’ve since jumped off. Mocks have their uses, but they don’t typically stand the acid test of tests - “Broken interface means failing tests, Working interface means passing tests”. Almost axiomatic, but often ignored.

Which brings me onto model level tests and CouchDB. I’m only going to have test confidence when my models are operating on data retrieved from the database. That confidence comes at a price - my test suite is slow. But as I mentioned before, working with CouchDB encourages new ways of thinking.

Traditionally, tests that interact with a database clear it out before every test case, load it up with the required data and finally run the actual test. But a single CouchDB instance supports many databases, potentially many many databases. What if I had a database per test, pre-loaded with the required data? Tests would still issue GET requests, but the costly setup stage would be obviated. As it turns out, this is fairly straightforward to do. The basic premise is as follows:

  • Specify the test setup code in a lambda (or any delayed execution construct)
  • When the test starts, query CouchDB for a known document id against a database whose name matches the current test
  • If the known document exists and if it contains a property whose contents match the test setup code exactly, you’re done. Just run the test.
  • If the condition above doesn’t hold, the test setup code has changed. Delete the database named by the test, create it again, run the test setup code against it and store a document containing the test setup code. Now run the test.
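The steps above can be sketched roughly as follows. This is not the code I published; to keep it runnable without a server, FakeCouch is a hypothetical in-memory stand-in for CouchDB, the setup code is passed in as a plain string, and SETUP_DOC_ID and ensure_fixture are names invented for the illustration.

```ruby
# Hypothetical in-memory stand-in for a CouchDB server:
# db name => { doc id => doc }
class FakeCouch
  def initialize
    @dbs = {}
  end

  def get(db, id)
    (@dbs[db] || {})[id]
  end

  # Delete the database if it exists and create it again, empty.
  def recreate(db)
    @dbs[db] = {}
  end

  def put(db, id, doc)
    @dbs.fetch(db)[id] = doc
  end
end

SETUP_DOC_ID = "test_setup" # the known document id

# Runs the block only if the stored setup code differs from setup_source.
def ensure_fixture(couch, test_name, setup_source)
  stored = couch.get(test_name, SETUP_DOC_ID)
  return :reused if stored && stored["source"] == setup_source

  couch.recreate(test_name)
  yield couch, test_name
  couch.put(test_name, SETUP_DOC_ID, { "source" => setup_source })
  :rebuilt
end

couch = FakeCouch.new
setup = 'Player.stock(:name => "p1")'
seed  = lambda { |c, db| c.put(db, "p1", { "name" => "p1" }) }

puts ensure_fixture(couch, "should_know_one_another", setup, &seed) # rebuilt
puts ensure_fixture(couch, "should_know_one_another", setup, &seed) # reused
```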

The following is extracted from a real spec. The code inside the cdb block is executed if and only if it hasn’t already been run.


it "should know one another" do
  cdb do
    p1, p2 = Player.stock(:name => "p1"), Player.stock(:name => "p2")
    p1.acquaint p2
  end
  # p1 and p2 are methods that load players by names p1 and p2
  p1.should know(p2)
  p2.should know(p1)
end

It’s worth pointing out that you’ll want to rerun the lambda if the object creation code changes, even if the lambda itself is unchanged. This is done easily with a command line switch.

I’ve published the code I use for doing this with RSpec on github. The speed up is of course test dependent, but the specs that I’ve applied it to run almost an order of magnitude faster. Happy days!

Posted by Paul Sun, 18 Jan 2009 14:27:00 GMT


Visualizing inter doc relationships in CouchDB

One of the many enjoyable aspects of working with CouchDB is the scope it offers for exploration. Developing against CouchDB presents such a different paradigm for working with data that it really does stimulate thought.

I wanted to illustrate this at my talk at LRUG last week. I’d been thinking for some time how useful a graphical document browser would be for CouchDB, and that it would be fairly simple to write one. So, rather than preparing a talk, I spent last Monday writing fuschia. The idea was straightforward - a user enters a seed document id, fuschia displays it, all docs that it links to, and that link to it. Docs are represented by colored nodes, and labelled with their most descriptive attribute (as defined by the user). Click a node to repeat the process.
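The lookup at the heart of that idea can be sketched in a few lines. This is not fuschia's code: it works over an in-memory hash of documents rather than a CouchDB view, and it assumes, purely for illustration, that any string property whose value is another document's id counts as a link.

```ruby
# Ids of the documents that `doc` links to: any string property (other
# than _id) whose value is the id of a known document.
def outbound_links(doc, docs_by_id)
  doc.reject { |key, _| key == "_id" }
     .values
     .select { |value| value.is_a?(String) && docs_by_id.key?(value) }
     .uniq
end

# Everything one hop away from the seed document, in both directions.
def neighbours(seed_id, docs_by_id)
  seed = docs_by_id.fetch(seed_id)
  linked_from = docs_by_id.values
                          .reject { |doc| doc["_id"] == seed_id }
                          .select { |doc| outbound_links(doc, docs_by_id).include?(seed_id) }
                          .map { |doc| doc["_id"] }
  { "links_to" => outbound_links(seed, docs_by_id), "linked_from" => linked_from }
end

docs = [
  { "_id" => "user-1", "name" => "Ann" },
  { "_id" => "user-2", "name" => "Bob" },
  { "_id" => "invite-1", "sender_id" => "user-1", "recipient_id" => "user-2" }
]
docs_by_id = docs.each_with_object({}) { |doc, index| index[doc["_id"]] = doc }

p neighbours("invite-1", docs_by_id)
p neighbours("user-1", docs_by_id)
```

Against a real database the inbound half would more naturally come from a view whose map function emits the target id of each link as its key, so that one query by seed id returns every document pointing at it.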

Given CouchDB’s HTTP interface, UUIDs as identifiers, and an ability to aggregate data with a map function, developing fuschia required just a few dozen lines of core code. Kudos also goes to prefuse, but my relationship with that library is more a love / hate one. Dragons lie on either side of its hidden golden path.

Somewhat predictably, writing fuschia took longer than expected and I didn’t have time to prepare a talk. Which wouldn’t have been so bad had I not forgotten to demo fuschia. Not my finest hour.

While fuschia works well with sample data, using it with real world data didn’t offer the insights I’d hoped for. An ability to filter out data is needed and views offer a natural way to achieve this. Work for the future so…

Posted by Paul Sun, 18 Jan 2009 13:16:00 GMT


Illegally embedding attributes for fun and profit

Imagine a web interface where you drag items into clusters. Dropping an item onto a cluster should make a request that persists the association. To do this, two ids are required - one for the item and one for the cluster. The most natural way to list these attributes would be on the HTML elements themselves. Maybe something like

<div class="item" item-id="478">idea</div>
<div class="cluster" cluster-id="112">group name</div>

(Let’s assume that we can’t simply use the id attribute - id collision or whatever.) Unfortunately, the HTML above isn’t valid. Neither item-id nor cluster-id is a recognized HTML attribute and the spec makes no allowance for adding arbitrary attributes. This is a real shame as any alternatives involve more markup, scripting or both.

Is knowingly generating invalid HTML such a bad thing? Well, the W3C validator states that

Validity is one of the quality criteria for a Web page, but there are many others. In other words, a valid Web page is not necessarily a good web page, but an invalid Web page has little chance of being a good web page.

Yikes! Sounds quite imperious. So how does this little axiom stand up in the wild?

  • http://ajaxian.com/ is XHTML 1.0 Transitional with 682 errors, 16 warning(s)
  • http://www.google.com/ is HTML 4.01 Transitional with 66 errors, 9 warning(s)
  • http://www.facebook.com/index.php is XHTML 1.0 Strict with 69 errors, 27 warning(s)
  • http://groups.google.com/group/jquery-en is XHTML 1.0 Transitional with 752 errors, 122 warning(s)
  • http://developer.mozilla.org/ is XHTML 1.0 Transitional with 43 errors, 36 warning(s)
  • http://validator.w3.org/ was successfully checked as XHTML 1.0 Strict

Ok, so I no longer care about validation. Read that advisedly - I have no intention of producing tag soup and I’m fully aware that many of the errors above were caused by malformed URLs, but I’m not going to worry if my page fails validation.

What happens when the page is parsed and invalid attributes are encountered? Well, the spec merely makes a recommendation

If a user agent encounters an attribute it does not recognize, it should ignore the entire attribute specification (i.e., the attribute and its value).

In practice, however, browsers do not do this. A webkit blog post tells us that

Many technically illegal constructs, like misnested tags or bad attribute names, are allowed or safely ignored. This error-handling is relatively consistent between browsers.

This bodes well, but what of the future? Will HTML 5 outlaw or support arbitrary attributes? The news is good, embedded attributes are explicitly supported by HTML 5. They take the form data-*="". Let’s rewrite the attributes above so we’re at least consistent with HTML 5.

<div class="item" data-item-id="478">idea</div>
<div class="cluster" data-cluster-id="112">group name</div>
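If the markup is generated on the server, a small helper keeps the attribute naming consistent. A minimal sketch in Ruby; div_with_data is a hypothetical helper of my own, not part of Merb or Rails.

```ruby
require "erb"

# Hypothetical helper: renders a div whose data hash becomes data-* attributes.
# Underscores in the keys are turned into hyphens, and values are HTML-escaped.
def div_with_data(css_class, text, data = {})
  attrs = data.map { |name, value|
    key = name.to_s.tr("_", "-")
    %( data-#{key}="#{ERB::Util.html_escape(value.to_s)}")
  }.join
  %(<div class="#{css_class}"#{attrs}>#{ERB::Util.html_escape(text)}</div>)
end

puts div_with_data("item", "idea", :item_id => 478)
puts div_with_data("cluster", "group name", :cluster_id => 112)
```

Run as is, this prints the same two divs shown above.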

So, I’m explicitly generating invalid HTML 4 and, DOCTYPE excepted, valid HTML 5. But the real question - does it work? I haven’t tested in IE6, because as we all know, it’s teh suck, but in other browsers all is well.

A test page consisting of 26 elements, each with two attributes, is consistently aggregated in less than 1ms by Safari. Firefox takes between 8-12ms and IE7 is a little slower, usually about 50ms.

The end result is simple markup that can be easily inspected with a minimum of fuss with a library like jQuery. For example, to establish the relationship between item and cluster, we’d merely write something like the following.

$.post("/relationship", {
  item_id: $draggedItem.attr("data-item-id"),
  cluster_id: $droppable.attr("data-cluster-id")
});

Update: Scott Byers had problems with the selector syntax above - jQuery merely returned an empty set. He got around it by escaping the - character e.g. $("[data\-foo='bar']").

Posted by Paul Thu, 04 Sep 2008 08:55:00 GMT


Relax with Merb and CouchDB

I’ve posted a tutorial describing how to create a Merb app backed by CouchDB on strawberrydiva.com

If you’re running Rails rather than merb, you should still be able to follow along.

Posted by Paul Fri, 29 Aug 2008 15:34:00 GMT


DNS to lose relevance around the edges?

As noted elsewhere, the trend for url free advertisements for web sites is growing. Relying on high search engine rankings for your product name is clearly a risky business, but let’s assume you have it nailed. Do you still need to line the pockets of a domainer?

Well, DNS is often employed as a load balancing technique for heavily trafficked sites, so having search engine results direct to a domain would typically be a good thing. But that load balancing service could potentially be replaced with the help of a chunky EC2 instance sitting up front directing requests, and an elastic IP address - an IP address which can be dynamically reassigned, e.g. in case of instance failure.

So, assuming that Google’s ranking algorithm doesn’t penalise domain free sites, how long before we see a website launch that simply ignores domains and DNS and relies exclusively on search engine ranking?

Posted by Paul Tue, 10 Jun 2008 07:36:00 GMT


Developers' License

In October of last year, the cavernous Turbine Hall of London’s Tate Modern opened to a work of art by Colombian sculptor Doris Salcedo. The work of art is simply a crack in the floor. Salcedo states that the fracture symbolises the gap between white Europeans and the rest of humanity. When asked how deep the crack is, she replied "It’s bottomless. It’s as deep as humanity."

Now, to be honest, I don’t think it is. In fact, I can’t imagine it’s more than 40 or 50cm deep. Not that that’s stopped some people falling into it. Nonetheless, her reply made me wonder if we, as software developers, shouldn’t be afforded the opportunity to use our own form of artistic license…

[Image: Developers' License]

I’m not convinced.

Note: Possessing little artistic ability of my own, the characters above were inspired by Nemi and Itachi.

Posted by Paul Sat, 16 Feb 2008 12:24:00 GMT


Apple of a Vulture's Eye

Apple’s last decade has been impressive indeed. Jobs’ stewardship has seen Apple release products that have done more than just change people’s perception of technology. Starting with the iMac and culminating with the iPhone, Apple’s products have changed people’s behaviour - quite a feat in a world saturated with technology.

But Apple’s innovation comes at a price. Apple’s products typically work seamlessly and integrate seamlessly. A very limited hardware range helps achieve a level of system stability that’s all too difficult to reach in the combinatorial world of PC hardware. But Apple benefits from that chaos: the cut and thrust of PC hardware development encourages device innovation at a rapid pace that could never be known in a homogeneous environment. Nvidia has released more chipsets in the last year alone than Apple has released computers in the last few years combined. Nvidia can do this only because its chipsets will find a home with many different OEMs.

Think back to 1984. What if the hammer really had slain Big Brother, if Apple had maintained its 1982 market share lead? The computing world today would almost certainly be the poorer for it. Consider a scenario where Apple holds more than an 80% share of the home computer market. Releasing just a few computers a year, losing or winning a contract would be critical to the health of ATI and Nvidia, Intel and AMD. Could they all survive? Maybe. But even if they did, such a climate would surely dictate a level of risk aversion that would gut their R&D budgets.

What Apple does better than anyone else is innovate by combining existing technologies, packaging them into must-have products. But the existence of the underlying technologies depends on the heterogeneity of the PC market. Put simply, Apple couldn’t be Apple, couldn’t create the products that define it, if it itself dominated the market.

The homogeneous Apple environment is a good place to be in, cosy, looking outside to where the nuts and bolts innovation happens, happy in the knowledge that Apple will suck it up, and apply it in a way that no other company could. But as Macs become increasingly popular, will it stay that way?

Posted by Paul Sat, 02 Feb 2008 18:21:00 GMT