Benchmarking and tuning the stack
I focused on testing our current stack, understanding how it behaves under load and trying to improve it. We are currently running everything in a cloud environment with VMware virtualization, setting everything up from scratch at the beginning of the day and tearing everything down at the end of the day. This helps us focus on automation from the very start.
Our testing setup is quite simple at the moment:
- Benchmark Box (2 cores with 4GB RAM) - we run weighttp and wrk load tests from this one.
- Proxy and Application Boxes (8 cores with 4GB RAM) - `proxy` box hosts terminator and web aggregator services, while `app` box hosts specialized services.
- FoundationDB Box (2 cores with 5GB RAM) - a single FoundationDB node.
Each of the boxes is by default configured with:
- Ubuntu 12 LTS and upgraded to the latest kernel (docker and FDB need that);
- Docker is installed (with override to let us manage lifecycle of images);
- ETCD container is pulled from our repository and installed, using new discovery token for the cluster;
- Logsd and statsd containers (our logging/statistics daemons) are downloaded and installed on `proxy` and `app`;
- Appropriate services are downloaded and installed on `proxy` and `app` boxes as containers (our build script creates containers of our code, pushing them to a private docker repository);
- FoundationDB is installed on the `fdb` box.
All services and containers are wired into Ubuntu upstart (the equivalent of Windows service management). Whenever a service starts, it talks to the ETCD cluster to publish its own endpoints or to look up the endpoints it needs.
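To make the registration part concrete, here is a minimal Go sketch of how a service could announce its endpoint in etcd via the v2 HTTP keys API. This is an illustration only, not our actual code: the key layout (`/services/<name>`), addresses and TTL values are made up.

```go
// Minimal sketch of a service announcing its endpoint in etcd on startup,
// using etcd's v2 HTTP keys API. Key layout, addresses and TTLs are
// illustrative, not the actual HappyPancake conventions.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// announce publishes this service's endpoint under a well-known key with a TTL,
// so peers can discover it and stale entries expire if the service dies.
func announce(etcdURL, service, endpoint string, ttl time.Duration) error {
	key := fmt.Sprintf("%s/v2/keys/services/%s", etcdURL, service)
	form := url.Values{
		"value": {endpoint},
		"ttl":   {fmt.Sprintf("%d", int(ttl.Seconds()))},
	}
	req, err := http.NewRequest("PUT", key, strings.NewReader(form.Encode()))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("etcd returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Re-announce periodically so the TTL never expires while we are alive.
	for {
		if err := announce("http://127.0.0.1:4001", "terminator", "tcp://10.0.0.2:6001", 30*time.Second); err != nil {
			fmt.Println("announce failed:", err)
		}
		time.Sleep(10 * time.Second)
	}
}
```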
So for the last week I have been polishing these install scripts (refactoring Bash is actually a fun exercise) and performing some tuning and optimization of that code.
Currently we are using plain bash scripts to set up our environment. However, bash scripts are imperative: they spell out exactly what you want to do, step by step. I'd say that trying out more declarative tools (Ansible, Puppet, Chef or something like that) might be beneficial for us in the longer term.
We have the following baseline scenario right now:
- We run the weighttp load testing tool on `bench` with keep-alive, 256 concurrent clients, 2 threads (1 per core) and enough requests to keep everything busy for 10 minutes;
- Each http request goes to the terminator service on the `proxy` box. Terminator, running the basic golang http server, handles each http request in a new goroutine. It simply serializes the request to a message and pushes it to `nanobus` (our own thin wrapper library around nanomsg for golang). This creates an http context, which consists of a single go channel. The goroutine then sleeps and waits for the response to arrive on that channel, or for a timeout;
- Nanobus adds a correlationId to the message and publishes it to a TCP endpoint via the `BUS` protocol of nanomsg. Semantically this message is an `event`, telling that an http request has arrived;
- Any subscribed service can get this message and choose to handle it. In our case there currently is a `web aggregator` service running in a different container and showing interest in these messages. Nanobus in `web` will grab the message and dispatch it to the associated method (while stripping the correlationId);
- This method will normally deserialize the request and do something with it. Currently we simply call another downstream service through a nanobus using the same approach. That downstream service is located on another box (for a change) and actually calls the FoundationDB node to retrieve a stored value;
- When the `web` service is done with the processing, it publishes a response message back to the `BUS` socket of `terminator`. `nanobus` makes sure that the proper correlationId is associated with that message;
- Nanobus in the `terminator` service grabs all incoming messages on the `BUS` socket and matches them against currently outstanding requests via correlationId. If a match is found, we dispatch the response body into the associated go channel (a simplified sketch of this correlation pattern follows after this list);
- The http handler method in `terminator` is brought back to life by the incoming message on its go channel. It writes the contents back to the http connection and completes the request. In case of timeout we simply write back `Invalid Server Operation`.
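The heart of this flow is matching a response coming back on the `BUS` socket with the goroutine that is still parked on its http request. Below is a simplified Go sketch of that correlation pattern; it is not the actual nanobus code, and the names (`pending`, `call`) and the fake `send` function are invented for illustration.

```go
// Simplified sketch of the correlation-id pattern used by terminator:
// each HTTP handler parks on a channel keyed by a correlation id, and a
// dispatcher routes bus responses back to the right channel.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type pending struct {
	mu  sync.Mutex
	seq uint64
	out map[uint64]chan []byte
}

// register creates a fresh correlation id and the channel the caller will wait on.
func (p *pending) register() (uint64, chan []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.seq++
	ch := make(chan []byte, 1)
	p.out[p.seq] = ch
	return p.seq, ch
}

// resolve delivers a response to whoever is waiting on that correlation id.
func (p *pending) resolve(id uint64, body []byte) {
	p.mu.Lock()
	ch, ok := p.out[id]
	delete(p.out, id)
	p.mu.Unlock()
	if ok {
		ch <- body
	}
}

var errTimeout = errors.New("timed out waiting for response")

// call publishes a request on the bus (send is left abstract here) and blocks
// until a matching response arrives or the timeout fires.
func call(p *pending, send func(id uint64, msg []byte), msg []byte, timeout time.Duration) ([]byte, error) {
	id, ch := p.register()
	send(id, msg)
	select {
	case body := <-ch:
		return body, nil
	case <-time.After(timeout):
		return nil, errTimeout
	}
}

func main() {
	p := &pending{out: map[uint64]chan []byte{}}
	// Fake "bus": echo the request back after a short delay, as if a remote
	// service had handled it and published a response.
	send := func(id uint64, msg []byte) {
		go func() {
			time.Sleep(5 * time.Millisecond)
			p.resolve(id, append([]byte("echo: "), msg...))
		}()
	}
	body, err := call(p, send, []byte("GET /profile/42"), time.Second)
	fmt.Println(string(body), err)
}
```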
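Further down the chain, the downstream service on `app` reads a value from FoundationDB. For illustration, here is a minimal point read using the current FoundationDB Go bindings (the 2014 bindings used a different import path and API version); the key and default cluster setup are assumptions, not our actual schema.

```go
// Minimal sketch of a FoundationDB point read with the Go bindings.
// The key ("profile/42") and default cluster file are illustrative only.
package main

import (
	"fmt"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
	fdb.MustAPIVersion(710)     // pin the client API version
	db := fdb.MustOpenDefault() // uses the default cluster file

	// Transact retries the function automatically on transient errors.
	val, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		return tr.Get(fdb.Key("profile/42")).MustGet(), nil
	})
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("stored value: %q\n", val.([]byte))
}
```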
When I started benchmarking and optimizing this stack, we had the following numbers (as reported by our statsd daemon):
- 12.5k http requests per second handled;
- 99th percentile of latency: ~18ms (99% of requests take less than 18ms, as measured from the `terminator`);
- CPU load on the `proxy` box: 9 (1-minute average as reported by htop).
Here are some improvements (resulting from a few successful experiments out of dozens of failed ones):
- Replacing BSON serialization/deserialization in nanobus with simple byte manipulation (see the sketch after this list): +1k requests per second, -1ms in latency (99th), CPU load reduced by 1;
- Switching to the new `libcontainer` execution driver in docker: +0.5k requests per second, -0.5ms in latency (99th), CPU load reduced by 0.5;
- Removing an extra byte buffer allocation in nanobus (halving the number of memory allocations per nanobus message sent): +1k requests per second, -1ms in latency (99th), CPU load reduced by 1;
- Tweaking our statistics capturing library to avoid string concatenation in cases where the sample is discarded afterwards: +1.5k requests per second, -1ms latency (99th).
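To illustrate the serialization change: instead of running every message through BSON, the payload can be framed with plain byte manipulation. The sketch below shows a hypothetical framing (8-byte correlation id followed by the raw payload), not the actual nanobus wire format; decoding just slices the incoming buffer, which is also where the saved allocation comes from.

```go
// Hypothetical framing sketch: 8-byte big-endian correlation id followed by
// the raw payload. One allocation on encode, zero-copy slicing on decode.
// This illustrates the idea, not the actual nanobus wire format.
package frame

import (
	"encoding/binary"
	"errors"
)

// Encode writes the correlation id and payload into a single freshly
// allocated buffer, avoiding intermediate buffers and reflection.
func Encode(correlationID uint64, payload []byte) []byte {
	buf := make([]byte, 8+len(payload))
	binary.BigEndian.PutUint64(buf[:8], correlationID)
	copy(buf[8:], payload)
	return buf
}

// Decode slices the incoming message in place; the returned payload aliases
// msg, so no extra copy or allocation is made.
func Decode(msg []byte) (correlationID uint64, payload []byte, err error) {
	if len(msg) < 8 {
		return 0, nil, errors.New("frame too short")
	}
	return binary.BigEndian.Uint64(msg[:8]), msg[8:], nil
}
```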
Hence, the final results are:
- 18k http requests per second;
- ~12.5ms latency (99th percentile).
Our next steps would be to add more realistic load to this stack (like dealing with profiles, news feeds and messages), while watching the numbers go down and trying to bring them back up.
Published: March 19, 2014.
Next post in HappyPancake story: Change of Plans