A few days ago Apple open-sourced FoundationDB - a transactional (ACID) key-value database with crazy testing and ability to scale to peta-byte clusters. I wrote more on why this is so exciting in FoundationDB is Back!
A few companies have been using FoundationDB in the past years:
- Apple - bought FoundationDB 3 years ago;
- WaveFront by VMWare - running petabyte clusters;
- Snowflake - cloud data warehouse.
What I didn't tell at the time: SkuVault happily used FoundationDB in the past years as well. This is the story.
Commit Log Layer
Initially SkuVault V2 was designed to run on top of a MessageVault - a custom highly-available hybrid between Kafka and EventStore, designed to run on top of Windows Azure Storage. We referred to it as the Distributed Commit Log.
That implementation was good enough to let the development move forward.
Further on it became apparent that we wanted better throughput, predictable and low tail latencies. It was hard to achieve while writing to Azure Storage. Commits that took a few seconds were fairly common.
We also needed some kind of stream version locking while doing commits. Otherwise nodes had a slim chance of violating aggregate invariants during fail-over.
FoundationDB felt like the only option that would let us implement these features fairly quickly.
It worked as planned. Given the green light, I was able to implement the first version of the new Commit Log Layer for FDB a few days. In comparison, it took a few months to implement the initial Message Vault.
Layer implementation was simple. We designated a subspace to act as a highly available buffer to which nodes would commit their events. Key tuples included a stream id and UUID (that was also a lexicographically ordered timestamp when persisted) which ensured that:
- there would be no transaction conflicts between the different nodes, allowing high commit throughput;
- events would be ordered within the stream (and somewhat ordered globally).
Another subspace acted as a stream version map. Commit Log Layer took care of consulting with it while performing a commit. If node acted on stale data, there would be a conflict, forcing the application to back off, get the latest events and retry the operation again.
Both the commit and version check were a part of the same transaction, ensuring the atomicity of the entire operation.
Performance picture improved noticeably as well:
Given the new Commit Log implementation, server-side logic became fast. Commit latencies were low and predictable.
However, occasionally, the server would just block for a long time doing nothing. This often happened when we had to dispatch a message to Azure Queues. The tracing picture looked like this:
20ms delays were common, however we saw occasional queuing operations that blocked everything for a second or two.
Given an existing FoundationDB setup, it was trivial to add a Queue Layer that would allow us to push messages in the same transaction with the events. Tenant version checks were improved along the way (switched to async mode) leading to a prettier performance profile:
Eventually a few more application-specific layers were implemented:
- Cluster Jobs - to schedule jobs for the background workers.
- Settings and CRUD Storage - to persist metadata and documents that don't need event sourcing.
- Async Request Processing - for the cases when a client wants to process a lot of request in bulk, caring only about throughput.
Application logic can chain changes to all these layers in the same transaction, making sure that all operations either succeed or fail together.
In the end we had half a dozen of various application-specific data models living on the same FoundationDB cluster. All that - without introducing and maintaining a bunch of distinct servers.
API of the FoundationDB is quite similar to the one of Lightning Memory-Mapped Database (LMDB). Both operate with key-value ranges where keys are sorted lexicographically. Both support true ACID transactions.
- FoundationDB is a distributed database used to coordinate the cluster;
- LMDB is an in-process database with a single writer, used to persist node data.
This synergy allowed us to leverage FoundationDB layer modeling principles and primitives while working to the LMDB databases.
We were also able to use LMDB to simulate FoundationDB cluster for the purposes of local development and testing. The last part was particularly amazing.
You could literally launch an entire SkuVault V2 cluster in a single process! Nodes would still use all these FDB Layers to communicate with one another, background repartitioner would still run, jobs would still be dispatched. Except, in that case all the layers would work against LMDB.
This meant that you could at bring up the debugger, freezing the entire "cluster" at any point, then start poking around the memory and stored state of any node or process. How amazing is that?!
Done Anything Different?
Would I have done anything differently at SkuVault? Yes, I'd make a better use of FoundationDB.
- Use FoundationDB to split event blobs into chunks and store the metadata, making it easier to operate event stores at the terabyte scale. This would also simplify event versioning and GDPR compliance.
- Switch event pub-sub mechanism from blob polling to watches, reducing event delivery latency.
- Switch aggregate version locking from a single version number to aggregate conflict ranges (just like FoundationDB does it internally), increasing write throughput to large aggregates.
- Embrace the synergy between LMDB and FoundationDB, making sure that internal tools (debugging, dumps, REPL, code generation DSL) can target both from the start.
- Double down onto deterministic simulation testing, making it easier to learn, debug and develop fault-tolerant application logic.
All the way through the development process at SkuVault, FoundationDB was an invaluable asset. SysOps team particularly liked it - they never had to touch the production cluster. The cluster required zero maintenance during the Black Friday (barely breaking a sweat under the load) or when Microsoft started pushing Meltdown and Spectre patches across the cloud.
I'm overjoyed that this outstanding database is now open-source and available to everybody.