
Essential Reading on Big Data and Persistence

In my previous post we discussed some design considerations for handling big data in retail. Let's continue from there.

Joannes Vermorel has just completed a really interesting whitepaper on storing sales data in retail. He outlines a few rather simple principles that make it possible to store one year of detailed sales history for 1000 stores on a smartphone. Both the whitepaper (PDF) and the source code are shared by Lokad on GitHub.
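To give a feel for why a figure like that is plausible, here is a minimal sketch of one such principle: encoding each sale line as a compact fixed-width binary record instead of a wide SQL row. The layout and the 5000-lines-per-store-per-day figure below are my own assumptions for illustration, not Lokad's actual format.

```python
import struct

# Hypothetical fixed-width layout for one sale line (an assumption,
# not Lokad's format): day offset (uint16), store id (uint16),
# product id (uint32), quantity (uint16) -> 10 bytes per line.
RECORD = struct.Struct("<HHIH")

def pack_sale(day, store, product, qty):
    return RECORD.pack(day, store, product, qty)

def unpack_sale(blob):
    return RECORD.unpack(blob)

# Back-of-the-envelope: one year, 1000 stores, an assumed
# 5000 sale lines per store per day.
lines = 365 * 1000 * 5000
raw_gb = lines * RECORD.size / 1e9  # ~18 GB before any compression
```

Delta-encoding the ids and running a generic compressor over such sorted records typically shrinks them by a large factor, which is what brings the dataset down to smartphone scale.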

I'm not claiming that this is a production-ready scenario, since it is missing things like continuous replication (to another smartphone), checksumming, and BI capabilities. The point, however, is that a SQL server or a generic NoSQL server is not necessarily the best fit for this situation.

Curiously enough, when companies need to store similar amounts of sales history, they rarely take a simple and cheap approach like this one. Instead, consultants sell them expensive Oracle or Microsoft (insert any big-data vendor here) software and hardware setups that still fail to keep up with the throughput of the data. For some reason, being able to write 50,000 ticket receipts per second to a file (where each receipt usually contains a dozen products) does not mean you can get the same throughput inserting rows into your favorite SQL database cluster. So why do we even use them?

I don't hold anything against SQL (or any other relational storage), except that SQL databases are being sold as a silver bullet for cases where they are clearly not applicable. And I hate to see huge amounts of money wasted this way (at least donate it to a charity or a noble cause instead).

By the way, check out this great paper by Erik Meijer and Gavin Bierman: A co-Relational Model of Data for Large Shared Data Banks. It provides a nice insight into the nature of relational (SQL) and document (Not Only SQL) persistence options.

So why do we keep applying expensive, sub-optimal solutions to problems that do not fit them? Probably because "nobody gets fired for buying IBM", while trying a non-conventional approach and failing is riskier for your career.

However, this will not necessarily hold true in the coming years. The economic and technological forces are too strong. Just read this amazing whitepaper by Pat Helland, written way back in 2007 (and don't be surprised if you find a lot of things that look like the modern principles behind event sourcing and domain-driven design).

I do not intend to criticize SQL databases or any other product, but rather to give a broader perspective: they are not the only data persistence solutions out there. There are more options. And sometimes a few specialized lines of code can beat a generic product hands down, simply because they can be tailored to the problem more closely than a generic product could ever be.

Published: June 07, 2012.
