Semi structuring CouchDB databases
A recurring question on the CouchDB mailing list and IRC channel is one of document structure. Most developers exploring CouchDB bring their SQL heritage along for the ride. Shifting your way of thinking doesn’t happen overnight, but using CouchDB effectively requires ditching the relational mindset.
Alex Lang’s recent Scotland on Rails talk about CouchDB and its Ruby libraries generated a little flutter on Twitter. At heart was his assertion that while you could take a relational approach when structuring your documents, to do so would be to miss the point. I concur. Given that the terminology around relational dbs and normalization is often abused, I’ll make what I mean explicit.
The relational in RDBMS has a well defined meaning that has nothing to do with multiple tables relating to one another. It means simply that a tuple has a value defined over it - the cells in a table row are related in some way and that relation has a meaning associated with it. That’s it.
A document oriented database has no real analog of a table and so is inherently non-relational. You could, of course, impose relational semantics by structuring your documents in a particular way. You could also normalize your structured documents, taking care not to assign an object or array to a document property, as doing so would preclude your database from being considered normalized (1NF). But why bother? Why not just use an RDBMS?
I’ll add, in case of ambiguity, that storing non-normalized data is completely orthogonal to storing doc ids as document properties, thereby allowing inter-document navigation.
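To make that distinction concrete, here is a minimal sketch using plain Python dicts as stand-ins for documents. The field names (type, title, tags, author_id) and the is_flat helper are assumptions for illustration only; nothing here is a CouchDB API.

```python
# Hypothetical documents, shown as plain Python dicts rather than real
# CouchDB responses. All field names are illustrative.

def is_flat(doc):
    """Return True if no property holds an object (dict) or array (list)."""
    return not any(isinstance(value, (dict, list)) for value in doc.values())


# A flat document: every property is an atomic value, as 1NF would require.
flat_post = {
    "_id": "post-1",
    "type": "post",
    "title": "Hello",
    "author_id": "user-7",
}

# A nested document: the tags array rules out 1NF, yet author_id still
# references another document by id.
nested_post = {
    "_id": "post-2",
    "type": "post",
    "title": "Hello again",
    "tags": ["couchdb", "modelling"],
    "author_id": "user-7",
}

print(is_flat(flat_post))    # True
print(is_flat(nested_post))  # False
```

Both documents carry an author_id, so whether a document is normalized says nothing about whether it can point at other documents.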
So given that the relational model is effectively abandoned, how should document databases be structured? I’ll simply offer my own opinion, noting that many different approaches exist. (One of the more interesting deviations from the standard RDBMS model is the single db per user approach.)
If new data supersedes existing data, should the existing data be kept (e.g. for analysis), or may it be deleted?
In general, larger databases and views take longer to query. The difference is typically marginal, unless querying with a group_level > 0, in which case the repeated executions of the reduce function may become significant. A simple benchmark I carried out resulted in an avg 0.035s response time when querying against 1.6m docs with group_level=0. Querying with group_level=2 took 2.010s. It’s worth pointing out that I conducted the benchmarks in December; CouchDB improves rapidly.
Should writes be contention free?
Guaranteeing contention free writes means a new doc per write.
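As a rough sketch of why a new doc per write avoids contention, the following models revision-checked updates in memory. The Store class, the Conflict exception and the vote documents are all assumptions made up for this example; CouchDB itself enforces the check over HTTP by rejecting an update whose revision is stale.

```python
# In-memory model of revision-checked writes. This is an assumption for
# illustration, not the CouchDB HTTP API.
import uuid


class Conflict(Exception):
    """Raised when a writer supplies a stale revision."""


class Store:
    def __init__(self):
        self.docs = {}  # doc id -> (revision number, body)

    def put(self, doc_id, body, rev=None):
        current = self.docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(doc_id)
        new_rev = (current_rev or 0) + 1
        self.docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = Store()

# Shared document: both writers read revision 1, so the second write conflicts.
rev = store.put("counter", {"votes": 0})
store.put("counter", {"votes": 1}, rev=rev)      # writer A succeeds
try:
    store.put("counter", {"votes": 1}, rev=rev)  # writer B holds a stale rev
    conflicted = False
except Conflict:
    conflicted = True

# One document per write: each writer creates its own doc, so nothing collides.
for voter in ("a", "b"):
    store.put(f"vote-{uuid.uuid4().hex}", {"type": "vote", "voter": voter})

vote_count = sum(
    1 for _, body in store.docs.values() if body.get("type") == "vote"
)
print(conflicted, vote_count)
```

The trade-off is that the total now has to be derived by aggregating the individual vote docs rather than read from a single counter.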
How do you want to retrieve the data?
While views offer great flexibility in aggregating data, an up-front understanding of how you’ll want to retrieve your data is beneficial. Issuing multiple requests to CouchDB and performing client side aggregation is a fine approach, but sometimes it’s simply easier to retrieve all the required data in a single request. This is particularly the case when paginating a result set. Trying to paginate across the combined results of multiple queries would be a real PITA.
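To illustrate the kind of aggregation a view can do in one request, here is a simplified model of what the group_level parameter mentioned earlier means for array keys. The rows, the [year, month, day] key shape and the query helper are hypothetical, and the sketch says nothing about how CouchDB computes or caches reductions internally.

```python
from itertools import groupby

# Hypothetical view rows: (key, value) pairs as a map function might emit
# them, with [year, month, day] array keys.
rows = [
    ([2009, 1, 5], 3),
    ([2009, 1, 9], 4),
    ([2009, 2, 1], 5),
    ([2010, 1, 1], 6),
]


def reduce_sum(values):
    """Stand-in for a reduce function that sums emitted values."""
    return sum(values)


def query(rows, group_level):
    """Reduce rows grouped by the first `group_level` elements of each key."""
    if group_level == 0:
        # No grouping: a single reduce result over every row.
        return [(None, reduce_sum(v for _, v in rows))]
    ordered = sorted(rows, key=lambda row: row[0])

    def prefix(row):
        return tuple(row[0][:group_level])

    return [
        (list(key), reduce_sum(v for _, v in group))
        for key, group in groupby(ordered, key=prefix)
    ]


print(query(rows, 0))  # one reduce result over everything
print(query(rows, 2))  # one reduce result per [year, month]
```

In this model a higher group_level yields more groups, and each group needs its own reduce result, which is consistent with grouped queries costing more than an ungrouped one.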
Judicious denormalization
You can’t really denormalize something that wasn’t already normalized, but the idea is to embed unlikely-to-change properties in documents that aren’t the canonical source of that property.
For example, an Invite may contain a recipient name and sender name, but also contain the doc ids of the docs representing the recipient and sender. This approach allows invites and their relevant presentational information to be retrieved with just a single CouchDB query, but it also allows for simple navigation to the referenced documents (recipient and sender). I make heavy use of this approach myself and RelaxDB supports it explicitly.
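A minimal sketch of such an Invite, again using plain Python dicts in place of stored documents. The field names (sender_id, sender_name and so on) and the describe helper are assumptions for illustration, not RelaxDB's actual conventions.

```python
# Hypothetical documents as plain dicts; field names are illustrative.
docs = {
    "user-1": {
        "_id": "user-1",
        "type": "user",
        "name": "Alice",
        "email": "alice@example.com",
    },
    "user-2": {
        "_id": "user-2",
        "type": "user",
        "name": "Bob",
        "email": "bob@example.com",
    },
    "invite-1": {
        "_id": "invite-1",
        "type": "invite",
        "sender_id": "user-1",
        "sender_name": "Alice",    # copied from user-1; unlikely to change
        "recipient_id": "user-2",
        "recipient_name": "Bob",   # copied from user-2; unlikely to change
    },
}


def describe(invite):
    """Render an invite using only the invite document itself."""
    return f"{invite['sender_name']} invited {invite['recipient_name']}"


invite = docs["invite-1"]            # a single lookup
summary = describe(invite)
sender = docs[invite["sender_id"]]   # follow the id when the full doc is needed
print(summary, sender["email"])
```

The user documents remain the canonical source for names; the copies on the invite exist purely so that a list of invites can be rendered without fetching every sender and recipient.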