Stress testing the stability

You think that your distributed system is stable and ready for production, don't you?

So did I, before trying out this simple "How to break your distributed system" recipe:

  • Get a fresh dataset for your database (it should be comparable in size to the production data, or even larger)
  • Prepare simple command-line agents that emulate user activity (CRUD actions against different entities).
  • Take 10-100 of these agents and let them boil in stress mode (a 1-5 second delay between actions, or none at all)
  • Fire up all the distributed automation/processing services that you have in the picture (in stress mode as well, obviously)
  • Optional: continuously stir (interrupt and restore) connectivity to the database and application virtual machines
  • Let everything cook for some time
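
To make the agent part of the recipe concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `InMemoryStore` is a hypothetical stand-in for whatever API your real system exposes, and the names `agent` and `run_agents` are made up. Swap the store for real calls against your system and the structure stays the same: many agents, random CRUD actions, little or no delay, and every exception recorded.

```python
import random
import threading
import time
from collections import Counter


class InMemoryStore:
    """Hypothetical stand-in for the system under test."""

    def __init__(self):
        self._rows = {}
        self._next_id = 0
        self._lock = threading.Lock()

    def create(self, value):
        with self._lock:
            self._next_id += 1
            self._rows[self._next_id] = value
            return self._next_id

    def read(self, key):
        with self._lock:
            return self._rows.get(key)

    def update(self, key, value):
        with self._lock:
            if key in self._rows:
                self._rows[key] = value

    def delete(self, key):
        with self._lock:
            self._rows.pop(key, None)


def agent(store, actions, max_delay, seed, errors):
    """Emulate one user: perform random CRUD actions and count the successes."""
    rng = random.Random(seed)
    known = []          # ids this agent has created and not yet deleted
    stats = Counter()
    for _ in range(actions):
        op = rng.choice(["create", "read", "update", "delete"])
        if not known:
            op = "create"  # nothing to read/update/delete yet
        try:
            if op == "create":
                known.append(store.create(rng.random()))
            elif op == "read":
                store.read(rng.choice(known))
            elif op == "update":
                store.update(rng.choice(known), rng.random())
            else:
                store.delete(known.pop(rng.randrange(len(known))))
            stats[op] += 1
        except Exception as exc:  # record the failure instead of dying
            errors.append((op, repr(exc)))
        if max_delay > 0:
            time.sleep(rng.uniform(0, max_delay))
    return stats


def run_agents(store, n_agents=10, actions=100, max_delay=0.0):
    """Run n_agents concurrently; return (total action counts, errors)."""
    per_agent, errors = [], []

    def worker(seed):
        per_agent.append(agent(store, actions, max_delay, seed, errors))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = Counter()
    for stats in per_agent:
        total.update(stats)
    return total, errors


if __name__ == "__main__":
    totals, failures = run_agents(InMemoryStore(), n_agents=10, actions=100)
    print(dict(totals), "errors:", len(failures))
```

With `max_delay=0.0` the agents run with no delay between actions; something like `max_delay=5.0` gives a random pause of up to five seconds, which is close to the 1-5 second range above. Against a real database the interesting output is the `errors` list.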

My first unhandled exception (it was a deadlock) bubbled up within 30 seconds of firing this whole thing up. And it is really easy to reproduce this one: you just have to restart everything and wait for a minute or so.

The system can be called relatively stable if it survives 24 hours in stress mode (and validation proves that all the scheduled tasks have completed properly).
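
That pass/fail criterion could be sketched roughly as follows, again in Python and again under assumptions: `soak`, `step` and `validate` are hypothetical names, `step` stands for one round of stress activity, and `validate` is expected to return the list of scheduled tasks that did not complete (an empty list means everything finished).

```python
import time


def soak(step, validate, duration_s, clock=time.monotonic):
    """Call step() until duration_s has elapsed, then validate the outcome.

    The run counts as stable only if no exception escaped from step()
    and validate() reports no incomplete tasks.
    """
    deadline = clock() + duration_s
    iterations = 0
    while clock() < deadline:
        try:
            step()
        except Exception as exc:  # first unhandled exception ends the run
            return {
                "stable": False,
                "iterations": iterations,
                "failure": repr(exc),
                "incomplete": None,  # validation was never reached
            }
        iterations += 1
    incomplete = list(validate())
    return {
        "stable": not incomplete,
        "iterations": iterations,
        "failure": None,
        "incomplete": incomplete,
    }
```

For the 24-hour run, `duration_s` would be `24 * 60 * 60`. The `clock` parameter exists only so that the loop can be exercised with a fake clock without waiting a day.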
