Handling Big Data in the Cloud

Big Data is one of the new hype terms that is steadily gaining popularity. No big surprise - the amount of data around us keeps growing, and by properly mining it we can gain a competitive advantage and eventually make more money. Money attracts money.

Usually, by the term big data we mean a collection of datasets so large that they become too hard to process on a single machine. This data can even be so big that it takes hours or days to process in a data warehouse.

Fortunately, cloud computing comes to the rescue. It provides the following resources:

  • nearly unlimited, scalable storage capacity for data (e.g. Azure Blob Storage or Amazon S3)
  • elastic compute capacity to process this data (e.g. Azure Worker Roles or Amazon EC2)
  • network capacity and services to transfer data and coordinate processing (queues and the actual networks).

These resources are elastic (you can get as much as you need) and paid on demand. The latter is really important, since you pay only for what you use.

I'm assuming here that in your business model, more data processed means more money made.

There are three distinct approaches to handling big data, based on the specific challenges.

Batch Processing

The first one is batch processing, where you do not need extremely fast computation results but have terabytes or petabytes of data. This is essentially about implementing MapReduce functionality in your system, often resorting to hacky but extremely practical tricks. Below I'll share some of the lessons learned in this direction at Lokad (we provide forecasts as a service without any hard limit on the amount of data to process).
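To make the idea concrete, here is a minimal, single-machine sketch of the MapReduce pattern in Python - a toy word count. Real frameworks distribute the map, shuffle and reduce phases across many machines; the example data is made up.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return (key, sum(values))

documents = ["big data in the cloud", "big money in big data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# result["big"] == 3, result["data"] == 2
```

The point of the pattern is that each phase is trivially parallel: map runs per document, reduce runs per key, and only the shuffle needs coordination.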

Obviously, storage of this data becomes a primary concern with this approach. Fortunately, starting companies can leverage existing cloud computing resources, from Google's GFS up to Azure Storage. Or, if you are big enough (and it is worth it), you can roll a data center of your own. Idem for computing and network capacities.

Stream Processing

The second approach deals with cases where you need high throughput and consistently low latency on gigabytes of data. High-frequency trading is the best-known domain here (various real-world telemetry systems being the second one). Abusing event streams and ring buffers (as in the LMAX architecture), along with partitioning, helps to deal with the challenge.
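For illustration, here is a heavily simplified, single-threaded sketch of a ring buffer in Python. The real LMAX Disruptor is a lock-free Java structure relying on memory barriers and cache-line padding; this only shows the core idea of pre-allocated slots addressed by a wrapping sequence number.

```python
class RingBuffer:
    """Fixed-size ring buffer: events reuse pre-allocated slots,
    avoiding allocation on the hot path (the basic LMAX idea)."""

    def __init__(self, size):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.size = size
        self.mask = size - 1      # cheap modulo for power-of-two sizes
        self.slots = [None] * size
        self.write_seq = 0        # next sequence to publish
        self.read_seq = 0         # next sequence to consume

    def publish(self, event):
        if self.write_seq - self.read_seq == self.size:
            raise OverflowError("consumer is too slow")
        self.slots[self.write_seq & self.mask] = event
        self.write_seq += 1

    def consume(self):
        if self.read_seq == self.write_seq:
            return None           # nothing new yet
        event = self.slots[self.read_seq & self.mask]
        self.read_seq += 1
        return event

buf = RingBuffer(8)
for price in (101, 102, 99):  # made-up tick events
    buf.publish(price)
# consume() now yields 101, 102, 99 in publication order
```

Because sequence numbers only ever increase and the buffer size is a power of two, slot lookup is a single mask operation rather than a modulo, which matters at millions of events per second.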

Due to latency requirements, cloud computing might not be the best choice for latency-sensitive solutions. It is more efficient to roll out fine-tuned infrastructure of your own, including specialized hardware and software (things like InfiniBand and ASICs, or even mixed-signal chips).

However, if you are OK with a bit of latency and are more interested in high-throughput continuous computation, then the flexible network and CPU resources provided by the cloud are a good match.

Real-time Batch Processing

The third approach to Big Data involves dealing with a more complex requirement: providing real-time capabilities over vast amounts of data (so we have to be both fast and capable of handling petabytes of data). Twitter is a vivid example here - it needs to provide (almost) real-time analysis over billions of tweets from around the world (although, fortunately, it does not need to go down to the microsecond level).

This challenge can be solved by mixing the first two approaches: we use slower batch processing for dealing with the actual bulk of the data (tricks like preprocessing, append-only storage and MapReduce work here), while the latest data is handled through real-time streams (stream processing and continuous computation). Results are merged as they come out of these two data pipes. Over the course of time (e.g. daily), the latest data is pushed from the "hot" zone into the bulk data.
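A minimal sketch of that merge step, assuming both layers expose their results as key/count dictionaries (the hashtags and numbers here are purely illustrative):

```python
def merged_view(batch_view, realtime_view):
    """Combine the precomputed batch counts with counts from the 'hot'
    stream. Keys present in both layers are summed; once the batch layer
    recomputes over the absorbed data, the hot view is reset."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view = {"#bigdata": 10_000, "#cloud": 7_500}  # recomputed daily
realtime_view = {"#cloud": 42, "#storm": 3}         # last few minutes
view = merged_view(batch_view, realtime_view)
# view["#cloud"] == 7542
```

The query side never needs to know which layer a number came from - it only ever sees the merged view, which is what makes the two pipes composable.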

More often than not, real-time data processing is done by simplified approximating algorithms that deal with a subset of the data. Batch processing is more thorough and precise (at the cost of being slower). By pushing real-time data back into the bulk data, the results of the computations are effectively corrected, as they are recomputed by the batch algorithms. Twitter's Storm project features a nice overview of this approach.
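A toy example of that trade-off: a sampling-based approximate count, cheap enough for the real-time path, next to the exact count the batch layer would later recompute. The sample rate and event data are invented for the illustration.

```python
import random

def approximate_count(stream, sample_rate=0.1, seed=0):
    """Estimate the number of events by counting only a random sample -
    cheap enough for the real-time path, but only approximate."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sampled = sum(1 for _ in stream if rng.random() < sample_rate)
    return int(sampled / sample_rate)

def exact_count(stream):
    """The batch layer later recomputes the exact answer over all data."""
    return sum(1 for _ in stream)

events = list(range(100_000))      # made-up event stream
estimate = approximate_count(events)
exact = exact_count(events)
# the estimate lands close to the exact count, which later replaces it
```

Real systems use sturdier sketches than naive sampling (e.g. probabilistic counting structures), but the division of labor is the same: a fast rough answer now, corrected by the batch recomputation later.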
