Monday
Dec032012

Being the Worst Updates

This is an update on Being the Worst, the podcast on software design that I started with Kerry Street at the end of this summer.

A few things have changed since then.

Module 1 (or Season 1, if you wish) is almost finished. In the second module we will switch to a slightly more complicated domain (Factory is too simple). Further episodes will build on previously covered material, with additional focus on:

  • Production experience of a SaaS project run by one man or a small team;
  • Collaboration between team members (or with external outsourcing parties);
  • Collaboration between multiple sub-domains; integration with external systems;
  • Deeper level of Domain-Driven Design;
  • Patterns of Client UI development (including mobile clients, web UI).

We'll see how it goes. So far, we are still the very worst and excited about staying this way. The growing community and the support it provides encourage us to keep moving forward personally while sharing the lessons learned:

  • The 250-subscriber milestone has been reached, and the count is still growing.
  • Tonight I plan to record the first episode of Being The Worst in Russian with Anton Vinogradenko. It will mirror the English version, benefiting from its (relatively) coherent viewpoint, heavily commented samples and all the additional reference implementations that will be added later.
  • Tom Janssens contributed sample code in Erlang for one of our episodes. There is also work in the community to provide a Java equivalent.

Sharing knowledge has become an even deeper part of the Lokad approaches (which are presented in this podcast). The same materials are reused in development training. Additionally, our new Lokad Data Platform initiative shares its theoretical foundation with the BTW podcast. It is currently being introduced to the largest retail companies in Europe as a way to enable high-speed data integration between formerly locked systems. The approaches and experience behind Data Platform will be covered in later modules of "Being The Worst", along with all these "big data", "real-time", "cloud computing" and "business intelligence" topics.

Besides, I'm personally just curious what would happen to the development world if some educational equivalent of an "Advanced Distributed Systems Course with DDD and Event Sourcing" were made available to the community for free (along with a coherent set of additional study materials, assignments and samples). This could be an empowering social elevator for people from underdeveloped regions or poor families.

What do you think about the podcast so far?

Monday
Nov262012

Rule of Time Limiting

I don't know who passed this feedback along, but I'm extremely grateful to this person. It's relatively easy to get positive feedback, but constructive ways of improving yourself are priceless.

The cardinal rule of any opportunity to present is this: Thou shall stick to your time limit. If it is being pointed out that you are overtime, then you will stop immediately. Anything else shows disrespect and has the aura of "I am more important than you, just suck it up"

Thanks, man. This will also help me stay focused with the BTW podcast.

Friday
Nov232012

Recommended Reading on Big Systems

Here's a list of videos and resources I've been studying from recently (more valuable - on top):

Overall, I'm really impressed by the amount of innovation and sharing that happens in the industry.

Friday
Nov022012

Don't Abuse SQL for Large Datasets

Some companies have data-intensive workflows that produce and consume a lot of information. This is both good and bad. The good thing is that they are aware of the importance of their data, often treating it as immutable (or "too valuable to ever be lost"). The bad side effect is that data can be captured uncontrollably using the most familiar technology, which is often SQL. In other words, you might have rather simple data structures, like sales history or visit logs, stored in relational databases for analysis and archival (simply because that's what company IT is comfortable with).

While such an approach is a good choice for small datasets, it can have a lot of drawbacks as the amount of captured data grows. You need either a really expensive Oracle setup or really skilled DBAs in order to store billions of records in a single database. Even then, generating weekly or monthly reports might slow everybody down a little bit.

The root of this problem is rather simple: the entire concept of relational databases was born at a time when servers, disk space and memory were scarce and expensive. These were the times when people believed: "640K ought to be enough for anybody".

However, times have changed. Prices and availability of resources have improved drastically as well. For example, these are the options that were introduced just recently:

We are currently living in days when it might be cheaper to store precalculated read models on fast SSDs than to compute them on the fly while caching in RAM. By the way, Netflix switched their memcached layer for SSDs to serve videos, while reducing the total number of servers.

The availability of new storage options is just the tip of the iceberg. Technologies and research have not been standing still either. For example, consider Twitter. Twitter would be incapable of doing their thing with just a relational database. So instead they use highly specific technologies and data manipulation approaches for different functionality within their system:

  • Event streaming to deliver changes in real-time to different teams and components.
  • Fanout with view precomputation and caching in heavily partitioned Redis clusters for simple lookups.
  • Full ingestion of all events by an indexing cluster to provide complex search queries via scatter-gather.
  • Pushing events to cluster of real-time streaming servers (organized in cascades) with filtering on the edge - for real-time filtered push notifications with millions of concurrently opened sockets per cluster.
  • Incremental batch-compute process with Apache Hadoop to provide big analytics over the entire dataset; it also corrects any small mistakes and wrong estimates made by the real-time precomputation.
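
The fanout item in the list above can be illustrated with a minimal Python sketch. This is not Twitter's implementation: the class `TimelineFanout`, its methods and the in-memory dictionaries (standing in for partitioned cache clusters) are hypothetical names chosen for illustration. The only point is that the work is done at write time, so that a read becomes a plain lookup.

```python
from collections import defaultdict, deque


class TimelineFanout:
    """Toy fanout-on-write: each follower's timeline is precomputed at post time.

    Hypothetical illustration only; in-memory dicts stand in for a real cache.
    """

    def __init__(self, timeline_size=3):
        # author -> set of users following that author
        self.followers = defaultdict(set)
        # user -> bounded, newest-first list of (author, text) entries
        self.timelines = defaultdict(lambda: deque(maxlen=timeline_size))

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def post(self, author, text):
        # One write per follower; the oldest entry falls off when the deque is full.
        for follower in self.followers[author]:
            self.timelines[follower].appendleft((author, text))

    def timeline(self, user):
        # Reading is a simple lookup of the precomputed view.
        return list(self.timelines[user])


fanout = TimelineFanout(timeline_size=2)
fanout.follow("bob", "alice")
fanout.follow("carol", "alice")
for text in ("one", "two", "three"):
    fanout.post("alice", text)
print(fanout.timeline("bob"))
```

The trade-off in this sketch is the usual one: posting costs one write per follower, in exchange for reads that never have to scan or join anything.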

As you can guess, trying to use one single technology ("to rule them all") could be a lot more complicated and expensive for Twitter than their existing setup. The same applies to other companies with their own stacks. For instance, LinkedIn uses Kafka and Databus for event delivery, Voldemort DB for fast read-only lookups of precomputed Hadoop results and Apache Lucene for some indexing.

As you can see, both Twitter and LinkedIn have rather data-intensive places where SQL is not used at all. Obviously, this is just part of the big picture. For example, LinkedIn still uses Oracle for user-to-user emails (since it's cheaper to pay the fees than to rewrite this specific subsystem), while Facebook is a well-known user of MySQL.

That said, I believe that relational databases still have their merits and uses. They are good tools optimized for specific tasks. It's their misuse that can cost you money, risk and time.

Sunday
Oct212012

Status Update

Things have been moving really fast over the last few months, expected and unexpected alike. Here's a quick status update to provide a context map of my professional life. This is the equivalent of the roadmap blog posts we do internally at Lokad once in a while.

The Being the Worst podcast has moved forward nicely. We have already published 11 episodes, and Kerry is clocking overtime to get the 12th ready for publishing. There are 200-250 subscribers and a lot of positive comments that help us keep going.

The existing set of episodes covers the basics of messaging and domain-driven design (the part which covers aggregates with event sourcing, or A+ES). This material:

  • backs up the IDDD sample on A+ES;
  • explains application services with A+ES as they are supported by Lokad.CQRS (or by any messaging bus);
  • serves as a foundation for moving forward.

There are plans to move forward, most of which relate to the "Don't break the Chain" sample for Lokad.CQRS (thanks to the communities of BTW and Lokad.CQRS).

The existing BTW material will serve as the foundation for my part of the class with Vaughn Vernon in Joburg in November (provided I manage to get a visa in time). If there is a DDD road trip around Europe, I might also talk there about A+ES and related things from DDD.

Lokad.CQRS is rather stable as a set of abstractions for building simple distributed systems (which can run in the cloud and locally). However, at Lokad we are facing new challenges that require slightly more capable infrastructure, hence the introduction of Data Platform. Developing it has already brought a lot of new experience in collaboration, performance and other practical things (kudos to the Ufa team). The core concepts of this platform will stay in sync with IDDD and Lokad.CQRS: same terminology and building blocks, especially with regard to aggregates and projections. The latter is extremely important, since we want to keep on:

  • building a coherent body of knowledge on building distributed systems;
  • sharing that knowledge and supporting the community.

If all goes well, the Data Platform .NET client could act as another adapter for storage abstractions within Lokad.CQRS (the fourth, in addition to memory, files and Azure). Besides, it also brings a lot of practical experience in simplifying Lokad.CQRS and making it more stable. "Big" data with DDD, NoSQL and EDA architectures made simple is a topic I would love to bring to the classes, both in the BTW podcast and in real life.

The Lokad Code DSL tool has just been pushed to a separate project for simpler reuse outside of the Lokad.CQRS world.

All this is just a small step in a long journey ahead. Thank you for travelling along, helping, teaching and supporting.

Saturday
Oct202012

Scalability targets of Lokad Data Platform 

Lokad Data Platform (introduced in the previous post) will consist of a set of simple building blocks, shared with the community for free as an open source project:

  • Simple event storage server optimised for cloud computing environments; with a REST API.
  • .NET client library for pushing to event storage or reading from it.
  • Set of samples, showing how to use this in practice. One of the samples imports data dump from Stack Overflow and then computes aggregated reports across gigabytes of data.
  • Accompanying documentation and guidance.

Just like it is with Lokad.CQRS projects, this server can be hosted both on Windows Azure and on a local machine (without any Azure dev fabric). We found this to be a liberating factor for developers.

Because of the REST API, you are not limited to .NET world: Java, Scala or Go are among the other options.

Data Platform scalability targets on Windows Azure:

  • Unlimited size of data, since Windows Azure would actually handle all the heavy lifting. We are aiming for gigabytes and terabytes in the supported scenarios at the very beginning.
  • Sequential write throughput of 40 messages per second with absolute consistency and 3 replicas (if pushed sequentially by 1-10 clients running in parallel; measured for messages of 10 bytes to 1 KB). This can be improved significantly at the cost of some consistency.
  • Batched write throughput of 60000 messages per second (and higher) with absolute consistency and 3 replicas, if messages are batched together in groups of 50000. We have a throughput of a few MBytes per second right now.
  • Reading throughput - up to 50MBytes per second; geo-replication and content-delivery networks are supported.
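
To see why batching changes the write numbers so much, here is a rough Python sketch. It assumes (this is not stated in the post) that the dominant cost is a fixed price per commit, such as a replicated round trip, regardless of how many messages the commit carries. `CountingStore`, `write_sequential` and `write_batched` are hypothetical names; the store counts commits instead of doing real I/O.

```python
class CountingStore:
    """Hypothetical append-only store that counts commits instead of doing I/O."""

    def __init__(self):
        self.messages = []
        self.commits = 0

    def append_batch(self, batch):
        # Assumption: every call costs one commit, whatever the batch size.
        self.messages.extend(batch)
        self.commits += 1


def write_sequential(store, messages):
    """Commit every message on its own."""
    for message in messages:
        store.append_batch([message])


def write_batched(store, messages, batch_size):
    """Commit messages in groups of at most batch_size."""
    for start in range(0, len(messages), batch_size):
        store.append_batch(messages[start:start + batch_size])


messages = [b"x" * 10 for _ in range(100_000)]

sequential = CountingStore()
write_sequential(sequential, messages)

batched = CountingStore()
write_batched(batched, messages, batch_size=50_000)

print(sequential.commits, batched.commits)
```

Under that assumption, writing 100000 messages one at a time needs 100000 commits, while groups of 50000 need only 2, which is the kind of gap the sequential and batched figures above reflect.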

Although Windows Azure is our major production deployment scenario, here are the numbers for file-hosted deployments (without replication):

  • Sequential write throughput of 200 messages per second.
  • Batched write throughput of 1000000 messages per second (up to 50-70 MBytes per second).
  • Reading throughput up to 50-70 MBytes per second.

As you might have noticed, these numbers are rather modest compared to what could be achieved in this tech environment.

However, at this point we are not worried about breaking records via some complex code. We just want to ship something that is good enough to be:

  • simple and educational for everybody;
  • rather reliable;
  • scalable.

If you really need something more performant for a price (with a slightly different set of priorities), you might want to check out Greg's Event Store. Additionally, you can drop an email to Lokad for help with a custom solution.

If you are a sole developer or a small startup, I'd recommend starting by going through the freely available materials (after they get published) and asking questions in the Lokad community.

Saturday
Oct202012

Introducing Lokad Data Platform 

The past few weeks were extremely busy for Lokad in many areas. I guess that's how it feels when a start-up starts lifting off.

One of my personal priorities was a new Lokad project called Data Platform. If you are interested in business details, there is a blog post about it with a nice infographic.

Data Platform is essentially going to be:

  • methodology and guidance on bringing together "big" data in an organization and making it easily consumable across that same organization, all at a fraction of the cost and complexity usually associated with doing the same on an outrageously expensive "enterprise" setup;
  • an open source reference project demonstrating how to aggregate and process relatively complex domain data either in a local data center or in the Windows Azure cloud;
  • should you need it, consulting, teaching and support from Lokad and partners, both on the technological setup and on the details of dealing with business intelligence;
  • as you can guess, a lot of details and concepts will be explained in greater detail in due time within this blog and the BeingTheWorst.com podcast.

From the technical perspective there is absolutely nothing new or innovative in Data Platform. A lot of high-tech companies have been doing "big data" and "cloud computing" for years, from high-frequency trading to the Large Hadron Collider. If you have seen Greg's Event Store, Lokad.Cloud and Lokad.CQRS projects, you already know what to expect.

From the business perspective, situation is totally different. All this publicly available knowledge is as good for a vast majority of our customers as if it were developed and used on the dark side of the moon. It's too far and too hard to get people who can handle it.

As you can guess, IT enthusiasts would not normally feel excited about managing data coming from cash registers in a retail company. This slows down technological progress a lot in such companies and creates an opportunity for sales-oriented software and consulting companies. These companies jump right in, selling mediocre but expensive stuff and not really solving the problems.

This situation is the reason why 20GB of sales history is actually considered to be big data in these companies.

That's what we are trying to break with DataPlatform. It's too painful for us to help customers with the business intelligence (our core competency), when we can't even get the data out of "expensive stuff" - it either breaks, slows everything down or requires literally months of effort to extract data.

With Data Platform we want to show how it is possible in some specific cases to replace an "expensive stuff" setup with something much simpler and cheaper, while getting even better performance and reliability. For instance, sometimes a 1000000 EUR cluster can be replaced with a few virtual machines on Windows Azure for reliable storage of terabytes of data, while having decent throughput and a dead-simple way to consume such data.

The idea behind DataPlatform is to make it extremely simple and cheap to give it a try (at least as a way to finally stop discarding valuable business history data). Think of a dead-simple event store (even simpler than the one used in the latest Lokad.CQRS) with projections on top, designed to get the most out of Windows Azure and store gigabytes or terabytes of messages.
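
As a rough illustration of what "a dead-simple event store with projections on top" can mean, here is a minimal Python sketch. It is not the Data Platform code or its API: `FileEventStore`, `sales_by_store`, the JSON-lines file format and the event fields (`type`, `store`, `amount`) are all assumptions made up for this example, and a local file stands in for cloud storage.

```python
import json
import os
import tempfile


class FileEventStore:
    """Hypothetical append-only event log: one JSON document per line."""

    def __init__(self, path):
        self.path = path

    def append(self, events):
        # Events are only ever added at the end; nothing is updated in place.
        with open(self.path, "a", encoding="utf-8") as log:
            for event in events:
                log.write(json.dumps(event) + "\n")

    def read_all(self):
        # Replays every event in the order it was written.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                yield json.loads(line)


def sales_by_store(events):
    """Projection: fold sale events into a total amount per store."""
    totals = {}
    for event in events:
        if event["type"] == "sale":
            totals[event["store"]] = totals.get(event["store"], 0) + event["amount"]
    return totals


def demo():
    with tempfile.TemporaryDirectory() as folder:
        store = FileEventStore(os.path.join(folder, "events.log"))
        store.append([
            {"type": "sale", "store": "paris", "amount": 10},
            {"type": "sale", "store": "ufa", "amount": 7},
        ])
        store.append([
            {"type": "visit", "store": "paris"},
            {"type": "sale", "store": "paris", "amount": 5},
        ])
        return sales_by_store(store.read_all())


print(demo())
```

Because the log keeps the full history, a new report is just another function over the same events, and it can be rebuilt from scratch at any time.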

That's a small and obvious step for a lot of people reading my blog; however, for a lot of enterprise companies that's a huge leap. If it were made, there would suddenly be a huge number of opportunities for moving forward: enterprise-wide Domain-Driven Design, efficient development organization around CQRS models, continuous delivery of certain elements, inherent scalability for these reports, capabilities for real-time and occasional connectivity, etc.


Plus Lokad can stop wasting time on data integration issues and actually focus on business analytics.

PS: If all works out, DataPlatform will be pluggable into Lokad.CQRS as another event storage engine.