Tuesday, Jun 05, 2012

Design Observations on Big Data for Retail

Changing technologies and approaches tends to bring a lot of challenges and problems (which eventually turn into "lessons learned"). This is especially true when you probe paths that are not common.

Curiously enough, as Charles de Gaulle once noted, such less common paths are also the ones where you are likely to encounter much less competition.

At the time of writing, one of the current projects at Lokad is a rewrite of our Salescast product, a cloud-based business intelligence platform for retail (see tech case study).

This rewrite features a better design that captures core business concepts at a deeper level. This allows a simpler implementation, better cloud affinity and scalability, while discarding technologies such as IoC Container, SQL and NHibernate ORM.

If you are interested in the reasons for discarding these technologies: SQL - too expensive and complex for dealing with big data in the cloud; ORM - complex and unneeded; IoC Container - I prefer simple designs that don't need it. Obviously, messes such as WCF, WWF, Dynamic Proxies, AOP, MSMQ etc. are also something I try to avoid at all costs.

One of the side effects is that this system no longer needs complex setups for local development: message queues, event stores, documents, BLOBs and persistent read models are stored in files.

We are using event sourcing for the behavioral elements of the system, while "big data" number crunching is based on a different approach.

This approach has an interesting side effect that I didn't expect.

If anybody on the team hits an exception bubbling up from some complex data processing pipeline (or any other logic, including business rules, a map-reduce step, report generation, etc.), the exact state of the system can be reproduced on a different machine as follows:

  • Stop the solution.
  • Archive the data folder and send it to the person responsible for the problem (usually me).
  • The responsible person unarchives the data folder and starts the solution.
  • The same exception bubbles up.

You see, when an exception bubbles up in the development environment, the message remains in the message queue (as a file in a folder). So when we transfer all the data to another machine and start the solution, the system picks that same message up and tries to reprocess it. Since all data dependencies are included in the data folder, this leads to the same exception showing up.
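A minimal Python sketch of this behavior follows. It is an illustration only: the real system is built on .NET and Lokad.CQRS, and the names `FileQueue`, `put`, `pending` and `process_next` are hypothetical, not taken from that code base. The one property it tries to show is that a message file is deleted only after its handler returns normally, so a failing message stays in the folder and travels with it.

```python
import os
import tempfile


class FileQueue:
    """Toy queue where every message is a file in a folder (illustrative only)."""

    def __init__(self, folder):
        self.folder = folder
        os.makedirs(folder, exist_ok=True)

    def put(self, name, payload):
        # Write under a temporary name first, then rename, so a reader
        # never picks up a half-written message.
        tmp = os.path.join(self.folder, name + ".tmp")
        with open(tmp, "wb") as f:
            f.write(payload)
        os.replace(tmp, os.path.join(self.folder, name + ".msg"))

    def pending(self):
        # Messages are ordered by file name.
        return sorted(n for n in os.listdir(self.folder) if n.endswith(".msg"))

    def process_next(self, handler):
        """Handle the first pending message; delete it only on success."""
        names = self.pending()
        if not names:
            return False
        path = os.path.join(self.folder, names[0])
        with open(path, "rb") as f:
            payload = f.read()
        handler(payload)  # an exception here leaves the file in place
        os.remove(path)   # reached only after successful handling
        return True


def faulty_handler(payload):
    # Stands in for any processing step that throws.
    raise ValueError("cannot process " + payload.decode("utf-8"))


queue = FileQueue(os.path.join(tempfile.mkdtemp(), "queue"))
queue.put("0001", b"recalculate-forecast")
try:
    queue.process_next(faulty_handler)
except ValueError:
    pass  # the message file is still in the folder at this point
```

Copying the queue folder (together with the rest of the data folder) to another machine and calling `process_next` there with the same handler would raise the same exception, because the failing message and its inputs arrive together.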

Obviously, production deployment of such a system is quite different (using cloud-specific implementations for data storage, messaging and event streams), yet the principles still work. This is because I mostly store data in append-only structures (BLOBs for large data and event streams for behavioral domain models); the rest is data that can be thrown away (persistent read models that are automatically rebuilt from event streams).

I'm using the Lokad.CQRS Sample Project as a baseline for developing this and similar systems.
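To make the split between append-only data and disposable read models concrete, here is a small Python sketch. It is not how Lokad.CQRS stores events; the JSON-lines file format, the event names `ItemsReceived` and `ItemsSold`, and all function names are assumptions made up for the illustration.

```python
import json
import os
import tempfile


def append_event(stream_path, event):
    """Append one event as a JSON line; existing lines are never rewritten."""
    with open(stream_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def read_events(stream_path):
    """Yield events in the order they were appended."""
    with open(stream_path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def rebuild_stock_view(stream_path):
    """Replay the whole stream to recompute a read model from scratch."""
    stock = {}
    for event in read_events(stream_path):
        sku = event["sku"]
        if event["type"] == "ItemsReceived":
            stock[sku] = stock.get(sku, 0) + event["quantity"]
        elif event["type"] == "ItemsSold":
            stock[sku] = stock.get(sku, 0) - event["quantity"]
    return stock


# Hypothetical stream for a single store.
stream = os.path.join(tempfile.mkdtemp(), "store-42.events")
append_event(stream, {"type": "ItemsReceived", "sku": "A", "quantity": 10})
append_event(stream, {"type": "ItemsSold", "sku": "A", "quantity": 3})
view = rebuild_stock_view(stream)
```

Since the view is a pure function of the stream, it can be deleted at any time and recomputed on whichever machine receives the stream file.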

Here are a few more technology-specific observations:

  • TSV + GZIP is quite good for storing large non-structured streams of data in table form and with little effort (plus, you don't need any tools to view and check such data);
  • When you need decent performance while storing sequences of complex structures with little effort (e.g. a sequence of object graphs), Google Protocol Buffers (prefix-based serialization) offer a fast approach (wrap it with GZIP and SHA1 if there are repetitive strings);
  • When it is worth spending a few days to optimize storage and processing of big data to insane levels (e.g. for permanent storage), a custom case-specific serialization and compression algorithm can do magic (rule of thumb: this might be needed in only 1 or 2 places);
  • Do not optimize until it is really necessary; quite often you can save a massive amount of time by skipping the optimization and simply using a bigger virtual machine in the cloud (which is cheaper);
  • Whenever possible, stream big data through memory, as opposed to loading huge datasets entirely. You'll be surprised how much data your small machines will be able to process.
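The first and last observations combine naturally. The Python sketch below writes a small table as gzip-compressed TSV and then aggregates it row by row, so memory use depends on the number of stores, not on the number of rows. The column names and the sample rows are invented for the example; only standard-library modules are used.

```python
import csv
import gzip
import os
import tempfile
from collections import defaultdict


def write_sales(path, rows):
    """Write (store, sku, quantity) rows as gzip-compressed TSV."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["store", "sku", "quantity"])
        writer.writerows(rows)


def total_by_store(path):
    """Stream the file row by row; only the running totals stay in memory."""
    totals = defaultdict(int)
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            totals[row["store"]] += int(row["quantity"])
    return dict(totals)


# Made-up sample data.
path = os.path.join(tempfile.mkdtemp(), "sales.tsv.gz")
write_sales(path, [("paris", "A", 3), ("paris", "B", 2), ("lyon", "A", 5)])
totals = total_by_store(path)
```

The same `total_by_store` loop works unchanged on a file with millions of rows, since it never holds more than one row and the per-store totals at a time.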

You don't need expensive licenses and hardware (e.g. the Oracle, IBM or Microsoft setups usually offered by consultants) to store and process thousands of stores with years of sales history. Likewise, you don't need large teams or big budgets to get the thing ready and delivered. A lot of that can be avoided with an appropriate design, especially if that design not only accounts for technological and organizational factors but also shares an affinity with the company's business model.


Reader Comments (12)

Interesting! But how do you make sure that you don't get problems that occur only in production scenarios? E.g. if you're using MSMQ in your production environment, how do you ensure that it doesn't behave subtly differently from your filesystem-based queue?

June 5, 2012 | Unregistered Commenter Roy Jacobs

Roy,

The core abstractions that we are using (messaging, doc storage, streaming storage and event streams) were specially designed to behave the same way in various deployment options (and then improved based on the production experience with multiple projects).

In short - we had all sorts of problems, which allowed us to polish the design that is now reused in new projects.

June 5, 2012 | Registered Commenter Rinat Abdullin

Ah, right. Maybe you can elaborate a bit more on those problems in a future post?

June 5, 2012 | Unregistered Commenter Roy Jacobs

@Roy, it would be hard, since there are a lot of little details I no longer remember.

However, if you are interested in the actual design, I would recommend checking out the Lokad.CQRS Sample Project, which I use as a baseline for almost all my recent projects.

June 6, 2012 | Registered Commenter Rinat Abdullin

Great post. First off, let me say thank you for all the work you do in the community. I have been looking over the code for the Lokad.CQRS Sample Project and find it very informative. What I think would make an interesting post would be to demonstrate a simple implementation of a queue-based system using your atomic storage. In my opinion, it would help prove your point that you should avoid big expensive tools like SQL when possible. I would also like to hear how Lokad.CQRS integrates with your BI infrastructure. Sorry for my wordiness; it's the end of the day, and posting comments from a tablet is not as easy as from a computer. Thanks again.

June 6, 2012 | Unregistered Commenter Jason Wyglendowski

Jason, just a few quick answers.

Atomic storage is just an abstraction that is used to persist and update documents (and that happens to fit both cloud and file persistence). It does not relate to queues by default (atomic/document storage and queues are separate concepts). However, one sample of these two working together is in projections (just search for "projection" in the anatomy branch of the latest Lokad.CQRS sources).

Lokad.CQRS does not integrate with our BI infrastructure per se. Some parts of the BI infrastructure are implemented on top of it (some other parts, which were historically developed by a different team, are implemented on top of another set of abstractions called Lokad.Cloud).

By the way, within the Domain-Driven Design initiative I'm planning to update Lokad.CQRS with a proper (more complex) domain model that should demonstrate how the various moving pieces come together. Hopefully I'll be kicked enough to publish more materials and articles on the subject as well.

Hope this helps.

June 6, 2012 | Registered Commenter Rinat Abdullin

Rinat, thanks for the reply. I look forward to the example and will look into what you suggested. I am really interested in the mechanics of storage like an event store that does not use a database but instead uses some sort of local file store and can be used to retrieve entities. Most of the examples I have seen use a relational DB. I have not had a chance to investigate Lokad Event Store; perhaps I will find my answers there. Have you ever considered offering courses like Udi Dahan does, or maybe approaching Pluralsight, Tekpub or even Dot Net Rocks to do a DNRTV? Again, keep up the good work, and hopefully you guys at the distributed podcast will have something. A plus... Jason

June 7, 2012 | Unregistered Commenter Jason Wyglendowski

OK, but if you remove all of these technologies from your design, which technologies will you use, and how do you substitute them?

June 7, 2012 | Unregistered Commenter Manue

@Jason, I believe Rinat meant to say the "anatomy" branch, not "atomic" branch of this sample solution:
https://github.com/Lokad/lokad-cqrs/tree/anatomy

@Manue
The best way to get an idea of how Rinat replaces those technologies is to look at the Lokad.CQRS sample solution:
Solution Overview: http://lokad.github.com/lokad-cqrs/
Sample Code In Latest Branch: https://github.com/Lokad/lokad-cqrs/tree/anatomy

Inside you will see examples like:

Replace IoC Container literally with a class named:

"SetupClassThatReplacesIoCContainerFramework.cs"
(inside of the "SaaS.Wires" project)

Persistence:
Replace SQL and NHibernate ORM with:

Objects get serialized to storage as they are. There is no object-relational impedance mismatch because the objects are not being mapped to a relational schema. Event Sourcing is used in this sample so that objects load their state from event streams that are queued and persisted in either:

Local Development:
Regular files and directories on the local file system (see screen shot in this blog post above as an example)

Production Deployment Example:
Azure Blobs, Azure Queues (see the Cqrs.Azure project inside the sample solution)

There are many ways to implement Distributed Systems. Some approaches use the technologies that Rinat prefers to eliminate, and others do not. See this post for more info: http://abdullin.com/journal/2012/4/17/ddd-from-reality-to-implementation.html

June 8, 2012 | Unregistered Commenter Kerry Street

@Manuel, I published a new post that tries to address your question.

June 8, 2012 | Registered Commenter Rinat Abdullin

@Jason, please hold on for a few more weeks. I'm currently working on some sample code that specifically focuses on event sourcing and implementing aggregate roots.

This code will accompany some text on the same topic that I'm contributing to Vaughn Vernon's IDDD book; however, it will also be available on GitHub for everybody's benefit.

Re courses - I haven't really considered them so far, but this is quite a possibility in the future, if I get some free time.

June 8, 2012 | Registered Commenter Rinat Abdullin

@Kerry, thanks for the detailed response and double kudos for catching my "anatomy" typo. I'll need to merge the darn thing into master pretty soon :)

June 8, 2012 | Registered Commenter Rinat Abdullin
