Last week I attended a conference about MongoDB - a leading player in the emergent NoSQL space. (The sessions were videoed by the excellent SkillsMatter; you can watch them here.)
Other players in the space that look interesting are CouchDB, Neo4j and Cassandra; I’ll likely be investigating all of them in future posts. Now these various projects are not drop-in replacements for each other, but at a high level they are all solving a problem in the BIG / DISTRIBUTED / REALTIME / CALCULATING ™ space.
There are a lot of issues to consider before we even start looking at the technology. A few questions that immediately spring to mind:
What do we mean when we say ‘BIG’?
One of the presenters talked about their system handling massive amounts of data, then went on to say they processed 40Gb a day.
Another presenter claimed to have quite a lot of data and then mentioned that they had 7Tb.
Quite a difference.
What do we mean when we say ‘DISTRIBUTED’?
I’ve worked on a distributed end-of-day risk calculation engine for a large investment bank. Around 40,000 blade servers were available, although typically a single instance of the calculation engine would use several thousand (calculations are basically partitioned by data).
At the MongoDB conference the presenters were talking about 30-50 servers, usually Amazon Web Services virtual machines.
Are we happy with all servers in one data center/geographic region?
What do we mean when we say ‘REALTIME’?
In the past I did some work on flight simulators (strictly speaking I worked on the management console, rather than the avionics/flight simulation piece). This is where real-time really means, “You must finish the next calculation in x ms.”
An interesting use case is when you need a snapshot/consistent view of data. In finance this is often required for end of day regulatory reporting. But the action of snapshotting the data needs to happen quickly, because lots of downstream activity depends on this consistent view of data.
What do we mean when we say ‘CALCULATING’?
We have two basic modes - batch and incremental. In batch mode we take a consistent view of the data and apply some Maths to it to come up with some results. Incremental calculation means that as new data comes in, the calculation somehow incorporates it into the current output.
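To make the distinction concrete, here’s a minimal sketch in Python, with a running mean standing in for whatever Maths your calculation actually does (the names and structure are purely illustrative): the batch version needs a consistent snapshot of all the data, while the incremental version folds each new value into the current output as it arrives.

```python
def batch_mean(snapshot):
    """Batch mode: operate on a consistent snapshot of all the data at once."""
    return sum(snapshot) / len(snapshot)


class IncrementalMean:
    """Incremental mode: fold each new value into the current output as it arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Standard running-mean update; no need to revisit earlier data.
        self.mean += (value - self.mean) / self.count
        return self.mean


snapshot = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
print(batch_mean(snapshot))      # one pass over the whole snapshot

inc = IncrementalMean()
for v in snapshot:
    current = inc.update(v)      # an answer is available after every new value
print(current)
```

The trade-off is visible even in this toy: the batch version is trivially correct but needs all the data to hand, while the incremental version gives you an answer continuously at the cost of carrying state.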
An interesting use case here is where you have some sort of usage-based pricing. For example you may offer a customer a rate of, say, 0.1 pence/transaction for the first 1,000 transactions; then 0.09 pence for 1,001-10,000; and 0.05 pence for >10,000, up to a maximum of £10,000. So the calculation may alter its character depending on the data. (As an aside, think about offering these tiers over time periods different to your charging period… we’ll visit this in a future post.)
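As a rough sketch of that tiering in Python - the tier boundaries, rates and cap are taken from the example above; the function name and structure are just assumptions for illustration:

```python
TIERS = [
    (1_000, 0.10),         # first 1,000 transactions at 0.1 pence each
    (10_000, 0.09),        # 1,001 - 10,000 at 0.09 pence each
    (float("inf"), 0.05),  # everything above 10,000 at 0.05 pence each
]
CAP_PENCE = 10_000 * 100   # £10,000 cap, expressed in pence


def charge_pence(transactions):
    """Total charge in pence for a given transaction count, capped at £10,000."""
    total = 0.0
    previous_limit = 0
    for limit, rate in TIERS:
        in_tier = max(0, min(transactions, limit) - previous_limit)
        total += in_tier * rate
        previous_limit = limit
    return min(total, CAP_PENCE)


# The marginal rate (the 'character' of the calculation) changes as the
# count crosses each tier boundary.
for n in (500, 5_000, 50_000):
    print(n, charge_pence(n))
```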
These concerns are not orthogonal, and that makes the problem more difficult - for example, it’s difficult to offer consistent calculations without consistent data, and it’s difficult to offer guaranteed ‘realtime-ness’ if the calculations take serious clock time.
Back to the projects mentioned earlier. It’s clear from their documentation that each of them is coming at these problems from a different direction. It’s important to understand the preferences/biases of each project to ensure that it fits the shape of the problem you’re trying to solve.
Watch this space for further evaluation…