Three years ago I was invited to SkuVault to help with a set of business and technical challenges:
- short-term: figure out logging and telemetry for a distributed system at scale;
- mid-term: scale out the software in response to the business growth, while also improving the performance;
- long-term: simplify the design of SkuVault, making it easier to maintain and add new features.
For me it all boiled down to a single goal: ensure that SkuVault can handle Black Friday peak load without any major issues (meaning that an ordinary busy day would just be a breeze).
The journey took 3 years. Finally, this year SkuVault had a very boring and uneventful Black Friday. During the same period some of its competitors experienced customer-impacting performance issues.
This achievement was a result of work done by multiple teams:
- V2 team drove the migration, continuously learning and fearlessly adopting new technologies and development practices;
- V1 team had everyone's back by supporting the existing product and switching parts of it to the new infrastructure;
- QA and Customer Support helped ensure the best customer experience despite all the changes, going to great lengths to do so;
- DevOps/SysAdmin team tirelessly tended to the ever-changing and growing garden of services as the infrastructure evolved to adapt to the new requirements.
Let's see what was achieved in the past 3 years.
Three years ago SkuVault was based on the Lokad.CQRS framework, which was designed to assist in building distributed message-based systems that could be developed locally and deployed to Azure.
This implementation, later called "SkuVault V1", was good enough to build a product. However, it wouldn't scale beyond a certain point. As the product grew in complexity, adding new features also became difficult.
The V2 implementation made it possible to scale out critical parts of the system, taking the load off the existing V1 components during the last Black Friday. V2 is also the place where new and complex features are being implemented.
It will still take some time to migrate the remaining V1 parts to V2. However, the process is well understood and supported by experience and some tooling.
Wave Picking is an example of a feature that could only be developed on the V2 design (if we wanted to support big customers with huge warehouses and diverse inventories, that is).
This feature initially evolved together with the understanding of the "V2 design", serving as a litmus test of its maturity and usability. It was later extended with new functionality such as "carts", "zone picking" and "partial orders" (which warranted a UI/UX overhaul).
At this moment Wave Picking is a stable feature that has been in use for a long time. A dedicated team is currently enhancing it with capabilities that would let small retailers keep up with Amazon.
An extensive telemetry infrastructure, which was non-existent just 3 years ago, is an essential part of the SkuVault system these days.
There are multiple Elasticsearch and InfluxDB clusters ingesting gigabytes of logs and millions of metrics on a continuous basis.
Grafana and Kibana offer real-time insight into this data, providing tools for building visualizations, exploring the data and building up an intuition about the system's behavior.
SkuVault has always had some remote developers. However, 3 years ago I was the only one working from Ufa (Russia). Nowadays there is an office with 11 people, a team with experience of working together. The experience and the roles include:
- doing business analysis and writing specs for V2 systems;
- designing and developing V2 frontend;
- developing, optimizing and maintaining V2 backend systems;
- testing and ensuring quality of the software;
- managing and organizing the development process;
- doing systems administration for the entire SkuVault and managing V2 infrastructure.
My personal goal at SkuVault is complete: the last Black Friday was boring. There is also a shared understanding of how the software system could evolve in the coming years to handle more features and meet new scalability targets.
For instance, over the next year the teams at SkuVault will be quite busy completing the V1 migration. However, this completion will only mark the start of the next phase of research and development towards V3.
Some of the long-term challenges include:
- completely decouple the V2 implementation from Windows Azure, once the V1 migration is complete (this involves developing a new version of Message Vault);
- build up a vision and tooling to develop a consistent user experience on multiple form factors (tablet, desktop and mobile) while running on different platforms (Android, iOS, Web and Win CE) and continuously delivering new features;
- bring a platform experience to partners, allowing them to extend the user experience in SkuVault or even deliver completely new features.
If you have experience that could complement this work and are interested in relocating to Louisville, KY, then SkuVault might have a compelling job offer for you. The company is growing (which often happens when the software doesn't get in the way of the business or even empowers it a little), so they are on the lookout for capable talent.