Benchmarking and tuning the stack

I focused on testing our current stack, understanding how it behaves under the load and trying to improve it. We are currently running everything in a cloud environment with VMWare virtualization, setting everything up from scratch at the beginning of the day and tearing everything down at the end of the day. This helps to focus on automation from the very start.

Our testing setup is quite simple at the moment:

  • Benchmark Box (2 cores with 4GB RAM) - we run weighttp and wrk load tests from this one.
  • Proxy and Application Boxes (8 cores with 4 GB RAM) - proxy box hosts terminator and web aggregator services, while app box hosts specialized services.
  • FoundationDB Box (2 cores with 5GB RAM) - a single FoundationDB node

Each of the boxes is by default configured with:

  • Ubuntu 12 LTS and upgraded to the latest kernel (docker and FDB need that);
  • Docker is installed (with override to let us manage lifecycle of images);
  • ETCD container is pulled from our repository and installed, using new discovery token for the cluster;
  • Logsd and statsd containers (our logging/statistics daemons) are downloaded and installed on proxy and app.
  • Appropriate services are downloaded and installed on proxy and app boxes as containers (our build script creates containers of our code, pushing them to a private docker repository)
  • FoundationDB is installed on fdb box.

All services and containers are wired into the Ubuntu upstart (equivalent of windows services management). Whenever a service starts, it interacts with ETCD cluster to publish its own endpoints or get the needed endpoints from it.

So for the last week I was polishing these install scripts (refactoring BASH is a fun exercise, actually) and also performing some tuning and optimization of that code.

We have following baseline scenario right now:

  1. We run weighttp load testing tool on bench with keep-alive, 256 concurrent clients, 2 threads (1 per core) and enough requests to keep everything busy for 10 minutes;
  2. Each http request goes to terminator service on proxy box. Terminator, running basic http server of golang, handles each http request in a new goroutine. It simply serializes request to a message and pushes it to nanobus (our own thin wrapper library around nanomsg for golang). This will create an http context, which consists of a single go channel. Then goroutine will sleep and wait for the response to arrive on that channel. Timeout is another alternative.
  3. Nanobus will add a correlationId to the message and publish it to TCP endpoint via BUS protocol of nanomsg. Semantically this message is event, telling that an http request has arrived.
  4. Any subscribed service can get this message and choose to handle it. In our case there currently is a web aggregator service running in a different container and showing interest in these messages. Nanobus in web will grab the message and dispatch it to associated method (while stripping correlationID).
  5. This method will normally deserialize the request and do something with it. Currently we simply call another downstream service through a nanobus using the same approach. That downstream service is located on another box (for a change) and actually calls FoundationDB node to retrieve stored value.
  6. When web service is done with the processing, it will publish response message back to the BUS socket of terminator. nanobus will make sure that the proper correlationID is associated with that message.
  7. Nanobus in terminator service will grab all incoming messages on BUS socket and match them against currently outstanding requests via correlationId. If match is found, then we dispatch the response body into the the associated go channel.
  8. http handler method in terminator will be brought back to life by incoming message in go channel. It will write its contents back to the http connection and complete the request. In case of timeout we simply write back Invalid Server Operation.

When I started benchmarking and optimizing this stack we had the following numbers (as reported by our statsD daemon):

  • 12.5k http requests per second handled;
  • 99th percentile of latency: ~18ms (99% of requests take less than 18 ms, as measured from the terminator);
  • CPU load on the proxy box: 9 (1 min average as reported by htop).

Here are some improvements (resulting from a few successful experiments out of dozens of failed ones):

  • Replacing BSON serialization/deserialization in nanobus with simple byte manipulation: +1k requests per second, –1ms in latency (99th), CPU load is reduced by 1;
  • Switching to new libcontainer execution driver in docker: +0.5k requests per second, –0.5ms in latency (99th), CPU load reduced by 0.5;
  • Removing extra byte buffer allocation in nanobus (halfing the number of memory allocations per each nanobus message being sent): +1k requests per second, –1ms in latency (99th), CPU load reduced by 1;
  • Tweaking our statistics capturing library to avoid doing string concatenation in cases where sample is discarded afterwards: +1.5k requests per second, –1ms latency (99th).

Hence, the final results are:

  • 18k http requests per second;
  • ~12.5ms latency (99th percentile).

Our next steps would be to add more realistic load to this stack (like dealing with profiles, news feeds and messages), while watching the numbers go down and trying to bring them back up.

- by .