High Availability and Performance

This is a summary of interesting things we've achieved at SkuVault since the previous blog post on LMDB and emerging DSL.

System Architecture

The system architecture hasn't changed much since the last update. However, we have improved our ability to reason and communicate about it.

Let's go through it quickly.

The core building block of the system is something we call a service. It has a public API which can handle read and write requests according to the public contract of the service.

service

We do an event-driven variation of service-oriented architecture. Whenever a service needs to persist a state change, instead of overwriting DB state with new values, we craft change events and append them to the distributed commit log.

event-driven service

Any interested service can subscribe to this commit log, replay the events it cares about and build its own version of the state. That state stays in sync with the global commit log via the subscription mechanism.
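
A minimal sketch of both sides of this pattern. The ICommitLog interface and the ProductRenamed event are hypothetical stand-ins for our actual infrastructure and generated message types, and JSON is used only to keep the example short:

    using System.Collections.Generic;
    using System.Text.Json;

    // Hypothetical commit-log abstraction; the real client differs.
    public interface ICommitLog
    {
        long Append(byte[] eventBytes);                                 // returns global offset
        IEnumerable<(long Offset, byte[] Bytes)> ReadFrom(long offset);
    }

    public sealed record ProductRenamed(string ProductId, string NewName);

    public static class Example
    {
        // Write side: craft a change event and append it instead of overwriting rows.
        public static long RenameProduct(ICommitLog log, string id, string newName) =>
            log.Append(JsonSerializer.SerializeToUtf8Bytes(new ProductRenamed(id, newName)));

        // Read side: replay the log and fold events into a service-local view.
        public static Dictionary<string, string> BuildNameIndex(ICommitLog log)
        {
            var names = new Dictionary<string, string>();
            foreach (var (_, bytes) in log.ReadFrom(0))
            {
                var evt = JsonSerializer.Deserialize<ProductRenamed>(bytes);
                names[evt.ProductId] = evt.NewName;
            }
            return names;
        }
    }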

This means that you can roll out new service versions, features or even specialized services (e.g. analytics and reporting that have different storage requirements) side-by-side with the existing services.

event-driven cluster

You can scale by placing a load balancer in front of the cluster to route different requests to the service instances designed to handle them. The same load balancer lets you fail over to replica nodes in case of an outage or a simple rolling upgrade in the cluster.

This approach also allows multiple teams to work together on a large system while building software on different platforms.

event-driven collaboration

In this environment, the service behavior can be expressed via its interactions with the outside world:

  • events it subscribes to;
  • API requests it handles;
  • API responses it returns;
  • events it publishes to the commit log.

service contract
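
Expressed in code, such a contract could look like the following sketch. The message types are hypothetical; our real contracts are generated from the DSL:

    using System.Collections.Generic;

    // Hypothetical message types standing in for the generated ones.
    public sealed record ItemReceived(string Sku, int Quantity);
    public sealed record ItemShipped(string Sku, int Quantity);
    public sealed record GetStockRequest(string Sku);
    public sealed record GetStockResponse(string Sku, int Quantity);

    // A service described purely through its interactions with the outside world.
    public interface IInventoryService
    {
        void When(ItemReceived evt);                       // events it subscribes to
        void When(ItemShipped evt);

        GetStockResponse Handle(GetStockRequest request);  // requests in, responses out

        IReadOnlyList<object> PendingEvents { get; }       // events it publishes to the commit log
    }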

If we could somehow capture all the important combinations of these interactions, we would specify the service's behavior well enough to prevent future regressions.

This is what event-driven specifications are for.

specification
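
As an illustration, one such specification could be written down as plain data: given past events, when a request comes in, expect a response and newly published events. The names are hypothetical, reusing the message records from the contract sketch above:

    public sealed record ShipItemRequest(string Sku, int Quantity);
    public sealed record ShipItemResponse(int RemainingQuantity);

    public sealed record Specification(
        string Name,
        object[] GivenEvents,
        object WhenRequest,
        object ExpectedResponse,
        object[] ExpectedEvents);

    public static class ShipItemSpecs
    {
        public static Specification ShippingReducesStock() => new(
            Name:             "Shipping 3 units out of 10 leaves 7",
            GivenEvents:      new object[] { new ItemReceived("SKU-1", 10) },
            WhenRequest:      new ShipItemRequest("SKU-1", 3),
            ExpectedResponse: new ShipItemResponse(RemainingQuantity: 7),
            ExpectedEvents:   new object[] { new ItemShipped("SKU-1", 3) });
    }

A small test runner can feed GivenEvents into a fresh service instance, send WhenRequest and assert on the response and the published events; the same specification keeps passing across implementation rewrites.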

These specifications come with a few potential benefits:

  • separate design phase from coding phase (which reduces cognitive burden);
  • help you to decompose complex requirements into simple event-driven tests;
  • break rarely (see aside);
  • let you refactor services by moving around requirements, behaviors and functionality.

Imagine you have a service with some complex functionality inside. This functionality has evolved over time, and the service has accumulated specifications for its behavior, weird edge cases and strange regressions that slipped through.

specs in detail

Over time, it became apparent that this service implementation houses two distinct feature sets. They are currently intertwined but would live better apart.

What you can do in this situation:

  • split the specifications into two groups, one for each future service;
  • discard the implementation completely (or keep it as a reference);
  • iterate on new versions of the services until all tests pass.

spec refactoring

Throwing out the implementation logic and rewriting services from scratch might sound scary. In practice it is a bit simpler and safer than it looks:

  • these event-driven specifications capture all important design decisions and service interactions (including edge cases and error responses);
  • developers can focus on one specification at a time, iterating till all specs pass.

These specifications are the reason why the SkuVault V2 backend was able to evolve through multiple storage backends until it arrived at a place we are quite happy with: LMDB.

Cassandra to LMDB

At this point we have completely migrated away from Cassandra to embedded LMDB as the data store. The outcomes: better (and more predictable) performance, reduced DevOps load and a simpler scalability model.

This also means that:

  • We have to write our own application-specific data manipulation layer, which would've been difficult without the lisp DSL and code generation.
  • Our reads are served from memory (best case) or a local SSD (worst case). This simplifies scenarios like route generation for the wave-picking algorithm or running complex search filters - you have to worry less about performance (see the sketch below).
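
Here is a minimal sketch of what such a read looks like in practice. ILocalView is a hypothetical wrapper over our LMDB read transactions, and the key layout is made up for the example:

    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical snapshot-isolated view over the local LMDB file.
    public interface ILocalView
    {
        IEnumerable<(string Key, byte[] Value)> Scan(string keyPrefix);
    }

    public static class SearchHandler
    {
        // A search filter becomes an in-process prefix scan over keys shaped like
        // "loc/{locationId}/{sku}": no network hop, worst case a local SSD read.
        public static List<string> SkusInLocation(ILocalView view, string locationId) =>
            view.Scan($"loc/{locationId}/")
                .Select(pair => pair.Key.Split('/')[2])
                .ToList();
    }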

It also means that our operations are currently mostly IO-bound, while the CPU is under-utilized. Were we to switch the API from JSON to a binary format, while also compressing event chunks, the CPU would become the next bottleneck.

For now, event replay speeds on smallish nodes look like this:

event-driven replay

This is the last step in our journey from one storage engine to another. It looks like we'll stay here for some time.

backend evolution

Idempotency didn't work out

Exactly-once delivery of messages or events in a distributed system is hard. Initially we tried to work around this problem by enforcing idempotency in the code via auto-generated scenarios.

However, as it turned out, it can be hard or plain impossible to implement idempotent event handlers in some cases, especially when you have to deal with events coming from legacy code.

So we are planning to push this deduplication concern into the subscription infrastructure. It would handle deduplication automatically by checking known message IDs before passing messages to handlers. New IDs would then be written to LMDB within the same transaction in which the handlers execute.

This would also solve an edge case where a node crashes after pushing an event to the commit log but before recording the acknowledgement from it.
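
A rough sketch of that plan, with a hypothetical write-transaction wrapper standing in for the real LMDB binding:

    using System;

    // Hypothetical LMDB write-transaction wrapper; names are illustrative.
    public interface IWriteTransaction
    {
        bool ContainsKey(string table, string key);
        void Put(string table, string key, byte[] value);
        void Commit();
    }

    public static class Subscription
    {
        // Skip messages whose ID was already processed, and record new IDs in the
        // same transaction that carries the handler's own writes.
        public static void Dispatch(
            IWriteTransaction tx,
            string messageId,
            byte[] payload,
            Action<IWriteTransaction, byte[]> handler)
        {
            if (tx.ContainsKey("seen-ids", messageId))
                return;                                      // duplicate delivery: ignore

            handler(tx, payload);                            // state changes
            tx.Put("seen-ids", messageId, Array.Empty<byte>());
            tx.Commit();                                     // state + dedup record are atomic
        }
    }

Because the ID is recorded atomically with the state change, a crash between the two cannot leave a half-processed message behind.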

Cluster Simulation

At the time of writing, the V2 part of the SkuVault backend has 3 different types of modules: public (serving the web UI), internal (serving legacy systems) and export (handling export jobs). Each module has a couple of replicas, and the public module is also split into 4 partitions (each with its own replica).

These modules collaborate with each other via our cluster infrastructure:

  • a distributed key-value store acting as an event buffer, a tenant version dictionary and a queue;
  • a replicated commit log;
  • a forwarder process - the single writer responsible for pushing events from the buffer into the replicated commit log.

The existing cluster infrastructure works fine for development, QA and production. It requires some setup, though.

We needed another way to run the entire cluster locally with zero deployment. That's when we finally started tapping into simulation mode (the swap is sketched right after the list):

  • replaced the distributed key-value store with a local LMDB implementation, while keeping all the cluster-enabling logic intact;
  • replaced the Azure-locked version of the commit log with a file-based simulation;
  • launched all modules in one process in this simulated environment, which works very similarly to the production setup (with some caveats, of course).
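
The swap hinges on a seam like the one below; the interface and the file layout are illustrative, not our actual code:

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    // Hypothetical seam implemented by both the real cluster client and the simulation.
    public interface ICommitLogClient
    {
        long Append(byte[] chunk);
        IEnumerable<byte[]> ReadAll();
    }

    // File-based stand-in used in simulation mode: same contract, zero deployment.
    public sealed class FileCommitLog : ICommitLogClient
    {
        private readonly string _dir;
        private long _next;

        public FileCommitLog(string dir)
        {
            Directory.CreateDirectory(dir);
            _dir = dir;
            _next = Directory.GetFiles(dir, "*.chunk").Length;
        }

        public long Append(byte[] chunk)
        {
            File.WriteAllBytes(Path.Combine(_dir, $"{_next:D10}.chunk"), chunk);
            return _next++;
        }

        public IEnumerable<byte[]> ReadAll() =>
            Directory.GetFiles(_dir, "*.chunk")
                     .OrderBy(name => name)                // zero-padded names keep log order
                     .Select(File.ReadAllBytes);
    }

All modules are then wired against this implementation and launched inside one process.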

While it is possible to reason about the behavior of a distributed system in edge cases, seeing it happen under the debugger with your own eyes is much better. It helps reproduce and fix infrastructure-related bugs while building an intuitive understanding of the system's behavior.

Besides, a simulated local cluster works well for demo deployments, where you regularly wipe everything and restart from a preset state. It is also a prerequisite for on-premises deployments that can sync up with the remote data center or fail over to it.

Note that this isn't yet a deterministic simulation that can be used to test distributed systems. It is merely the first step in that direction.

Missing bits include:

  • running all event-handling loops on a single thread (for determinism and the ability to fast-forward time);
  • simulating transaction conflicts in the cluster infrastructure (the current LMDB implementation never conflicts - it takes a write lock);
  • simulating load balancing, its error handling and failovers;
  • injecting faults and adding code buggification in the simulation layer and above;
  • running this setup for a couple of months.

Why Does Deterministic Simulation Matter?

Let's imagine we are building a distributed system that deals with business workflows across multiple machines. It might handle financial transfers or resource allocation. Let's also assume that bugs in this system are very expensive.

In theory, implementing that distributed logic shouldn't be a challenging task. We either have a cluster with distributed transactions or implement a saga (I'm oversimplifying the situation, of course). We can even prepare for the problems by planning compensating actions at every stage (and covering this logic with unit and integration tests).

In practice, how do you know that this implementation would actually hold up in the real world? How would it deal with misconfigured routers, failing hardware or CPU bugs? What if network packets arrive out of order, double-bit flips happen or a disk controller corrupts data?

Remember, we've said that bugs in this distributed system are very expensive.

This is where deterministic simulation comes into play. If you design your system to run in a virtual environment that is deterministic, simulates all external communications and runs all concurrent processes on a single thread (pseudo-concurrency), then you can get a fair chance of:

  • fuzz-testing random problems and your system's reactions to them;
  • injecting additional faults and timeouts for extra robustness;
  • being able to reproduce any discovered failure, debug and fix it.
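
Here is a minimal sketch of such a virtual environment: one thread, virtual time and a seeded random generator, so every run is reproducible from its seed. The structure and fault probabilities are made up for illustration:

    using System;
    using System.Collections.Generic;

    // Single-threaded simulation: virtual time, seeded randomness and injected faults
    // make every run reproducible from its seed.
    public sealed class Simulation
    {
        private readonly Random _rng;
        private readonly SortedList<long, Queue<Action>> _agenda = new();
        public long NowMs { get; private set; }

        public Simulation(int seed) => _rng = new Random(seed);

        public void Schedule(long delayMs, Action task)
        {
            var at = NowMs + delayMs;
            if (!_agenda.TryGetValue(at, out var queue))
                _agenda[at] = queue = new Queue<Action>();
            queue.Enqueue(task);
        }

        // Deliver a "network" message with random latency, occasionally losing it.
        public void Send(Action deliver)
        {
            if (_rng.NextDouble() < 0.01) return;          // injected fault: lost message
            Schedule(_rng.Next(1, 50), deliver);           // injected fault: variable latency
        }

        public void Run()
        {
            while (_agenda.Count > 0)
            {
                var due = _agenda.Keys[0];
                var queue = _agenda.Values[0];
                _agenda.RemoveAt(0);
                NowMs = due;                               // time jumps forward, nobody sleeps
                while (queue.Count > 0) queue.Dequeue()(); // pseudo-concurrency on one thread
            }
        }
    }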

In essence, this approach is about designing the distributed system upfront for testability and determinism. It lets us throw more machines at exploring the problem space for costly edge cases, giving us a chance to build a system that is better prepared for the random failures the universe might throw at us in real life.

Zero-downtime Deployments

Not much to say here. You put a couple of load balancers in front of your nodes and tell them how to route traffic and fail over. With that, a deployment turns into a process:

  • spin up new node instances and let them warm up their local state (if needed);
  • switch the LB forward to the new version for a portion of the users (ssh in, swap the config and do an nginx reload);
  • if all goes well - switch more users forward;
  • if things go bad - switch users back.

The switch takes only a few seconds, since we don't have long-running requests that need to be drained.
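
For the curious, the "swap config and reload" step amounts to editing an nginx upstream block roughly like this one (host names and weights are made up) and running nginx -s reload:

    # Illustrative nginx upstream: most traffic stays on the current release,
    # a small slice goes to the freshly deployed one.
    upstream backend_v2 {
        server 10.0.0.11:5000 weight=9;   # current version
        server 10.0.0.21:5000 weight=1;   # new version, already warmed up
    }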

Single-Writer Caveat

The system is designed to have a single writer per process. This design decision lets us benefit from LMDB and its ability to serve an unlimited number of isolated reads in parallel.

The caveat: write requests have to be optimized for low latency (more than 15-50 ms per write is too much). At the same time, we have a lot to do within a write (a condensed sketch follows the list):

  • check local state to ensure that the write request is valid;
  • generate a new event (or a couple of them);
  • apply this event to the local state;
  • commit events, jobs and any other messages to the cluster;
  • generate a response to the user (e.g. include newly generated IDs or totals).
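
Condensed into code, the write path looks roughly like this. The types are hypothetical stand-ins for our generated contracts and infrastructure, repeated here so the sketch stands on its own:

    using System;
    using System.Collections.Generic;

    // Hypothetical stand-ins for local state, the cluster client and messages.
    public interface ILocalState { int GetStock(string sku); void Apply(object evt); }
    public interface ICluster    { void Commit(IReadOnlyList<object> messages); }
    public sealed record ShipItemRequest(string Sku, int Quantity);
    public sealed record ShipItemResponse(int RemainingQuantity);
    public sealed record ItemShipped(string Sku, int Quantity);

    public static class WritePath
    {
        public static ShipItemResponse Handle(ILocalState state, ICluster cluster, ShipItemRequest request)
        {
            var stock = state.GetStock(request.Sku);                  // 1. validate against local state
            if (stock < request.Quantity)
                throw new InvalidOperationException($"Only {stock} units of {request.Sku} left");

            var evt = new ItemShipped(request.Sku, request.Quantity); // 2. generate the event(s)
            state.Apply(evt);                                         // 3. apply to local state
            cluster.Commit(new object[] { evt });                     // 4. commit to the cluster
            return new ShipItemResponse(stock - request.Quantity);    // 5. respond with new totals
        }
    }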

As it turns out, it isn't very hard to profile and improve this performance. For that we hijacked the Google Chrome tracing tools.

chrome tracing
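
The "hijacking" boils down to writing spans in the Chrome Trace Event JSON format, which chrome://tracing can open directly. A rough sketch of collecting complete events (phase "X", timestamps in microseconds); the class itself is illustrative:

    using System.Collections.Generic;
    using System.Text.Json;

    // Collects spans in the Chrome Trace Event format for chrome://tracing.
    public sealed class TraceWriter
    {
        private readonly List<object> _events = new();

        // "X" is a complete event: one record with a start timestamp and a duration.
        public void Complete(string name, long startMicros, long durationMicros) =>
            _events.Add(new { name, ph = "X", ts = startMicros, dur = durationMicros, pid = 1, tid = 1 });

        public string ToJson() => JsonSerializer.Serialize(new { traceEvents = _events });
    }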

Note: for some important requests we also check with the cluster to ensure that the node has the latest version of the relevant event stream. This adds a network round-trip to the commit cost.

If we find the need to increase write throughput of the system, there are a few options:

  • split the system into more partitions (the single-writer limitation is per partition); a single machine can handle multiple partitions;
  • move heavy reads to read-only replicas;
  • trade off some latency for throughput by batching multiple writes together;
  • migrate from Windows to Linux/Unix with optimized network and IO;
  • switch from virtual machines to the dedicated hardware;
  • move from .NET C# to golang (or even lower).

The only memorable lesson from these optimizations: don't use services provided by some cloud vendors if you value consistently low latency. These services tend to favor overall throughput across all customers over consistently low latency for a single customer (which makes sense).

Next Problems

Below is a list of the problems that lie ahead for the project.

Development Processes and Training

The initial development processes were quite flexible. This was possible thanks to the small size of the team: we just went to the office kitchen and drew on the whiteboard until consensus emerged.

As new developers join the project (including developers in remote locations), it has to switch to a more formal development model. A training process, methodology and materials also have to be established.

Client libraries and DSL

Code generation of client libraries (along with the API) will speed up development inside the company, while simplifying the integration story for partners and customers.

It will save API consumers from doing double work and re-implementing half of the domain validation logic in their own language. Besides, shipping client libraries with a nice interface would save us the effort of precisely documenting our API along with the intricacies of the transport layer, failover, back-off, authorization and version management.

Cluster Management and Visualization

Imagine a clustered system with 3-4 types of modules, where:

  • each module has a different deployment profile (e.g. one scales by partitioning, another - by running multiple replicas, a third - by adding more queue consumers);
  • each module can have multiple versions running on the cluster (some versions could be enabled, others disabled, some more - partially rolled out);
  • some modules could be in warm-up state (replaying commit log to build their local state);
  • there could be additional deployments of modules used to test or demo new features (currently incompatible with the rest of the cluster);
  • in order to save resources, multiple module instances could be crammed into a single VM (as long as doing that doesn't hurt high availability).

This situation might look like a mess. It definitely can confuse project managers, stakeholders, testers and even developers. Confusion can lead to mistakes, anger and the dark side. We don't want that.

One way to remedy the situation is to visualize cluster state for people and provide tooling to manage this state, while enforcing domain-specific rules.

For example (a couple of these checks are sketched after the list):

  • never remove the last active node from the load balancer;
  • never add a stale node to the load balancer;
  • keep partition replicas on different machines;
  • ensure replication factor of 3 for VIP replicas.
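
Here are a couple of those rules as executable checks against a deliberately simplified cluster model; everything in this sketch is illustrative:

    using System.Collections.Generic;
    using System.Linq;

    // Simplified cluster model for illustration only.
    public sealed record Node(string Machine, string Partition, bool Active, bool Stale);

    public static class ClusterRules
    {
        // Never remove the last active node from the load balancer.
        public static bool CanRemove(IReadOnlyList<Node> balanced, Node node) =>
            balanced.Any(n => n.Active && !n.Equals(node));

        // Never add a stale node to the load balancer.
        public static bool CanAdd(Node node) => !node.Stale;

        // Keep partition replicas on different machines.
        public static bool ReplicasSpread(IReadOnlyList<Node> nodes) =>
            nodes.GroupBy(n => n.Partition)
                 .All(group => group.Select(n => n.Machine).Distinct().Count() == group.Count());
    }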

Summary

This design isn't doing anything particularly novel. We are essentially building an application-specific database engine tuned for the challenges and constraints of the domain. Event sourcing and Domain-Driven Design are used as higher-level design techniques that let us align "what people want to achieve" with "what computers are capable of" in an efficient way.

In more concrete terms, the current solution:

  • solves existing scalability problems and provides foundation for solving future challenges, should the business experience rapid growth;
  • lays groundwork for bringing in new developers, training them and organizing multi-team collaboration around the core product, services and add-ons;
  • maps a way to tackle complexity in the software, while adding new features (including some that are unique to this industry) and maintaining them through implementation rewrites.