Lokad CQRS - Message Throttling and Auto Scaling in the Cloud
There was an interesting question on implementing message throttling with Lokad.CQRS in Windows Azure recently.
Message throttling answers questions like:
How can I throttle messages to ensure that a specific endpoint does not get overloaded, or we don't exceed an agreed SLA with some external service?
In essence, we might need to ensure, say, that a handler does not get more than 3 messages of some kind per second to process. A technical solution to this problem with Lokad.CQRS (leveraging the advanced scheduler scenario) has been posted to the community (feedback is welcome).
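To make the "no more than 3 messages per second" rule concrete, here is a minimal sliding-window sketch in Python. It is not the Lokad.CQRS scheduler solution mentioned above; the class and method names are hypothetical, and the state lives in memory only, so it would not survive a worker recycle.

```python
from collections import deque


class SlidingWindowThrottle:
    """Allows at most `limit` messages per `window` seconds (hypothetical helper)."""

    def __init__(self, limit=3, window=1.0):
        self.limit = limit
        self.window = window
        self._stamps = deque()  # times of recently accepted messages

    def try_acquire(self, now):
        """Return True if a message may be handled at time `now` (in seconds)."""
        # Forget accepted messages that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) < self.limit:
            self._stamps.append(now)
            return True
        return False  # the caller should defer the message, not drop it
```

A handler would call `try_acquire(time.monotonic())` before processing and put the message back for later delivery when it returns `False`. Throttling per customer would mean keeping one such object per customer key.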
However, let's consider the implications of this scenario. For example, say we are throttling messages per customer, to ensure that if customer X sends 100 messages at once, this does not slow down responsiveness for customer Y.
If we implement throttling, this will solve the problem for customer Y. Yet a proper implementation (resistant to Azure Worker Role recycles and other failures) will add considerable complexity to the solution, increasing maintenance and development costs. As we know from experience, linear increases in solution complexity tend to raise related costs exponentially, until they become prohibitively high.
Another side effect is that our customer X (which might be a rather important customer) will get a somewhat slower experience, losing some business value.
The third problem is that message throttling does not address the situation where the number of customers increases and at some point 2000 of them just happen to send 5-10 messages each within a short interval of time (customers tend to do that). Each customer's numbers might well be under the SLA, so throttling will not apply at all. Yet every single customer will experience a slow-down in processing.
Fortunately, in a CQRS scenario, at least the UI will stay extremely responsive for all read and browsing operations during such usage spikes.
How do we deal with such a problem? I would advise considering a scale-out scenario instead of passive message throttling. Since we are running in the cloud (or can burst tasks to it), we can just ask it to increase the number of workers processing the messages, in response to the increased load.
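As a rough illustration of reacting to load, the sketch below estimates how many workers would be needed to drain the current backlog within a target time. The function name, the parameters and the default limits are all assumptions made for this example; neither Lokad.CQRS nor Azure exposes such a function.

```python
import math


def desired_workers(queued, per_worker_rate, target_drain_seconds,
                    min_workers=1, max_workers=20):
    """Workers needed to drain `queued` messages within the target time.

    `per_worker_rate` is messages per second handled by a single worker.
    The result is clamped to [min_workers, max_workers] to cap the cost.
    """
    if per_worker_rate <= 0 or target_drain_seconds <= 0:
        raise ValueError("rate and target time must be positive")
    needed = math.ceil(queued / (per_worker_rate * target_drain_seconds))
    return max(min_workers, min(max_workers, needed))
```

For instance, with 10,000 queued messages, 5 messages per second per worker and a one-minute target, the raw answer is 34 workers, which the default cap reduces to 20. The cap is a deliberate choice: scaling out costs money, so an upper bound keeps a spike from turning into an unbounded bill.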
Alternatively, we can actively track average loads (e.g. message waiting and processing times, by message type and some other properties), predict them and deploy more workers in advance.
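One simple way to track such averages is an exponential moving average per message type, sketched below. This is only an illustration under my own assumptions (the class name, the `alpha` smoothing factor and the choice of averaging technique are not taken from Lokad.Cloud or Lokad.CQRS), and a real predictor would probably also account for time-of-day patterns.

```python
class LoadTracker:
    """Keeps a smoothed processing time per message type (hypothetical helper)."""

    def __init__(self, alpha=0.3):
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # weight given to the newest observation
        self._avg = {}      # message type -> smoothed seconds

    def record(self, message_type, seconds):
        """Fold a new observation into the average and return the new value."""
        prev = self._avg.get(message_type)
        if prev is None:
            current = seconds  # the first observation seeds the average
        else:
            current = self.alpha * seconds + (1 - self.alpha) * prev
        self._avg[message_type] = current
        return current

    def average(self, message_type):
        """Return the smoothed value, or None if nothing was recorded yet."""
        return self._avg.get(message_type)
```

A scaling decision could then compare the smoothed value for each message type against a threshold and request more workers before the queue actually backs up.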
At the moment Lokad.CQRS does not have dynamic scalability capabilities. Lokad.Cloud has this in the form of Auto Scaling, which works rather nicely in real-world scenarios. However, rather soon we might need to introduce this to Lokad.CQRS as well (either as a reusable module coming from Lokad.Cloud or as separate scalable host functionality).
Any feedback on such a feature is welcome, as it will help to shape the functionality to fit the needs of the community and not just Lokad's self-tuning CQRS vision.
So, what do you think?
Saturday, July 24, 2010 at 17:05