Reading List on Big Data

This is purely theoretical blog post to summarize my last few days of studies into the big data (which were triggered by one homeless guy and sequence of highly unprobable events that actually took place). No fancy intro, just assuming that dealing with big data processing is really cool or at least has an outstanding financial reward potential (given the trajectory taken by modern IT and economics).

Current Approach at Lokad

Or at least - simple part of it, that is not touched yet by secret fairy dust of our outstanding analytics team.

In short, big data is bulky, complex, requires a lot of CPU and does not fit in RAM. So we've got to break calculations into smaller batches. We want to process everything as fast as possible, so we push these batches to separate machines. Concurrent processing can cause a lot of problems with race conditions, so we cheat:

  • Batches of big data are immutable once written (that should scare anybody who is coming from the old world where storage was expensive and SQL ruled the world). So we can share them.
  • We keep bulk processing as dead-simple immutable functions, that are woken up by a message, consume some immutable data, do heavy CPU job and then produce output message (which could include reference to a newly created immutable block of data)
  • These immutable functions construct a processing graph of (extremely simplified) map reduce implementation, that can actually do a lot of things.
  • When graph elements need synchronization, we do that via messages that flow to aggregate roots controlling the process. They don't do any heavy-lifting (they don't even touch actual data, but just metadata level information), but encapsulate some complex behaviors to navigate execution through the computational graph. We don't have problems with implementing this seemingly complex part (or testing it), since Greg Young and Eric Evans provided us with CQRS/DDD toolset and some event sourcing.

This is a poor-man's architecture that can be implemented by a single developer from scratch in a month. It will work and have decent elastic scaling capacities, provided that you have abundance of CPU, Network and Storage capacities (which are provided by cloud environments these days).

This approach is explained in a lot more detail in Processing Big Data in Cloud à la Lokad

Potential caveats:

  • Domain modeling should be done carefully from the start. Lay out the computations and think through the algorithms.
  • Cloud queue latency and complexity would be your bottlenecks in the cloud (all the other limitations are solved by adding more VMs)
  • This is a batch-processing approach, which is not fit for real-time processing.
  • Yes, this is a hand-made implementation of MapReduce.

How can we improve that?

So what are the limitations of the previous approach?

  • Complexity
  • Messaging latency
  • Absence of real-time processing

First complexity limitation can be handled by separating the system into separate elements. No, I'm not talking about layers (these will do more harm than good), but rather separate bounded contexts, that have clear:

  • boundaries;
  • language;
  • contracts for exchanging data and communicating with other bounded contexts;
  • whatever choice of technology that is fit.

Second limitation of messaging latency can be worked by saying au revoir to any solution with man-in-the-middle architecture (in cloud these implementations are called "cloud queues" or "cloud service buses"). Broker-based architectures are a logical dead-end for the large-scale distributed computations (just like relational databases are for persistence). They limit our scaling capabilities.

So we need to rewire our brains just a little bit and leave the world of ACID, transactions and tables (wake up, Neo):

Or, if you want a simpler path, just use Hadoop.

Third limitation is lack of real-time processing. You see, this approach to big data is still good old batch processing. It grabs a chunk of data and then takes some time to process it. If we are working with real-life information, then by the movement we finish processing history, we'll have some fresh data that requires recomputation.

If this were end of the road, Twitter would never exist. But they cheat. They provide real-time capabilities by incrementally processing new data using some crude and rough algorythms. They are not as precise as batch processing, can make mistakes, can handle only a little bit of data, but they are fast.

So whenever a user loads up his twitter profile, he gets result that is composed from thoroughly processed batch data plus whatever changes happened since this slow process started. The latest bits might not be precise (i.e. they could miss a few mentions from fellows in the other hemisphere and also include a few spam-bots), but they are almost real-time, which is what matters. And actually as the time goes, results will be magically corrected, because a little bit later batch map reduce algorithms will get there and replace fast approximations with slow but thorough results.

Twitter Storm incorporates end explains this dual approach to big data in greater details.

Edge Cases

People say that there are no silver bullets (which is actually wrong, you can buy that stuff on internet) and hence one approach to big data will not fit all cases. Let's ignore total absence of logic in this statement and focus on potentially specific edge case of big data that might benefit from a separate approach. I'm talking about event streaming.

Enterprise companies consider event streams this to be a rather complex scenario, when "a lot of events come in a short amount of time". Such complexity even created niche called Complex Event Processing which sounds like really complicated and expensive field. Partially this is justified because events often require proactive reaction to something that happens in real-time. Yet this reaction could depend upon the events that happened millions of events back (this means "a lot of" data has to be stored).

From now on, please replace "a lot of" with the term that you will hear in 12th episode of distributed podcast which was recorded this weekend.

Let's see if there is a simple poor man's approach to handle "a lot of events". We will need following ingredients.

First ingredient is fast storage, that is dead-simple, provides immediate access to any data by key (at most two disk seeks for cases, where entire key index does not fit in memory). We don't ask for ACID or query language here, but ease of replication is definite plus. Basho bitcask and Google's SStable provide extensive guidance here.

By the way, IndexDB in WebKit uses LevelDB, which is based on SSTable.

Second ingredient is distributed architecture design that would support arbitrarily large amount of nodes operating in a single prepartitioned ring-shaped hash-space, where nodes can come, fail and go as they wish. Data should survive no matter what (and be easily available), while scaling can be elastic. If this sounds a bit scary for you, don't worry, there is plenty of material (and source code) from:

Third ingredient is cloud computing, that would provide CPU, Network and computing resources needed to power that architecture, while not requiring any upfront investments. I've been blogging about this aspect for quite a few years for now. At this point in time we already have a number of cloud computing providers with some competition between them. For example:

  • Windows Azure Cloud;
  • Amazon AWS;
  • Rackspace Cloud.

Competition in this area already drives prices down and rewards efficient use of resources.

Fourth ingredient is about lower-level implementation principles that fit into the distributed environments, append-only persistence and eventually consistent world that even tolerates conflicts. Great head-start in this area is provided by ZeroMQ tutorial and filled by the mess of patterns and practices united by the name of CQRS architectures.

Since these CQRS-based principles actually provide a uniform mental model for dealing with events and events streams in distributed world (unless you are playing it old-school with SQL) along with techniques of designing and modeling such abstractions (known as DDD), they already give solutions to most common problems that you might face, while bringing together in practice distributed architectures, fast streaming key-value stores and cloud computing.

Fifth ingredient is the Force. May it be with you, in this new but certainly exciting field :)

[2012021402321337237189] Memory dump complete...

- by .