I focused on testing our current stack, understanding how it behaves under load and trying to improve it. We are currently running everything in a cloud environment with VMware virtualization, setting everything up from scratch at the beginning of the day and tearing everything down at the end of the day. This helps us focus on automation from the very start.
Our testing setup is quite simple at the moment:
- Benchmark Box (2 cores with 4 GB RAM) - we run weighttp and wrk load tests from this one.
- Proxy and Application Boxes (8 cores with 4 GB RAM) - proxybox hosts the terminator and web aggregator services, while appbox hosts specialized services.
- FoundationDB Box (2 cores with 5 GB RAM) - a single FoundationDB node.
Each of the boxes is by default configured with:
- Ubuntu 12.04 LTS, upgraded to the latest kernel (Docker and FoundationDB need that);
- Docker is installed (with an override that lets us manage the lifecycle of images ourselves);
- An ETCD container is pulled from our repository and installed, using a new discovery token for the cluster;
- Logsd and statsd containers (our logging and statistics daemons) are downloaded and installed;
- Appropriate services are downloaded and installed on appboxes as containers (our build script packages our code into containers and pushes them to a private Docker repository);
- FoundationDB is installed on the FoundationDB box.
All services and containers are wired into Ubuntu upstart (an equivalent of Windows service management). Whenever a service starts, it interacts with the ETCD cluster to publish its own endpoints or to look up the endpoints it needs.
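The registration step can be as simple as an HTTP PUT against etcd's keys API. Here is a minimal sketch in Go, assuming a `/v2/keys/services/<name>` key layout, the default client port 4001 and a 60-second TTL; these are illustrative assumptions, not our actual conventions:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// publishEndpoint registers a service endpoint in etcd via its HTTP keys API.
// The key layout, port and TTL are assumptions for illustration only.
func publishEndpoint(etcdURL, service, endpoint string) error {
	form := url.Values{}
	form.Set("value", endpoint)
	form.Set("ttl", "60") // entry expires unless the service keeps refreshing it

	req, err := http.NewRequest("PUT",
		fmt.Sprintf("%s/v2/keys/services/%s", etcdURL, service),
		strings.NewReader(form.Encode()))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	// e.g. terminator announcing the nanomsg endpoint it listens on
	publishEndpoint("http://127.0.0.1:4001", "terminator", "tcp://proxybox:5560")
}
```

Using a TTL means a crashed service disappears from discovery on its own, as long as healthy services keep refreshing their keys.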
So for the last week I have been polishing these install scripts (refactoring Bash is actually a fun exercise) and also doing some tuning and optimization of the code.
We have the following baseline scenario right now:
- We run the weighttp load-testing tool on the benchmark box with keep-alive, 256 concurrent clients, 2 threads (1 per core) and enough requests to keep everything busy for 10 minutes;
- Each HTTP request goes to the terminator service on the proxybox. Terminator, which runs the basic Go HTTP server, handles each request in a new goroutine. It simply serializes the request into a message and pushes it to nanobus (our own thin wrapper library around nanomsg for Go). This creates an HTTP context, which consists of a single Go channel. The goroutine then goes to sleep and waits for the response to arrive on that channel, or for a timeout (see the sketch after this list).
- Nanobus adds a correlationId to the message and publishes it to a TCP endpoint via the BUS protocol of nanomsg. Semantically this message is an event, telling subscribers that an HTTP request has arrived.
- Any subscribed service can get this message and choose to handle it. In our case there is currently a web aggregator service running in a different container and showing interest in these messages. The nanobus in the web aggregator grabs the message and dispatches it to the associated method (while stripping the correlationId).
- This method normally deserializes the request and does something with it. Currently we simply call another downstream service through a nanobus, using the same approach. That downstream service is located on another box (for a change) and actually calls the FoundationDB node to retrieve a stored value.
- When the web aggregator service is done with the processing, it publishes a response message back to the bus; nanobus makes sure that the proper correlationId is associated with that message.
- The terminator service grabs all incoming messages on the BUS socket and matches them against the currently outstanding requests via correlationId. If a match is found, we dispatch the response body into the associated Go channel.
- The HTTP handler method in terminator is brought back to life by the incoming message in the Go channel. It writes its contents back to the HTTP connection and completes the request. In case of a timeout we simply write back "Invalid Server Operation".
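The request/response correlation in terminator boils down to a map of reply channels keyed by correlationId, plus a select between the reply channel and a timer. Here is a minimal sketch of that pattern in Go; the helper names, the timeout value and the way correlationIds are generated are illustrative assumptions, not our actual code:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// pending maps correlationIds of outstanding requests to their reply channels.
var pending = struct {
	sync.Mutex
	m map[string]chan []byte
}{m: make(map[string]chan []byte)}

// register creates a buffered reply channel for a new correlationId.
func register(id string) chan []byte {
	ch := make(chan []byte, 1)
	pending.Lock()
	pending.m[id] = ch
	pending.Unlock()
	return ch
}

// dispatch is called for every message arriving on the BUS socket:
// if its correlationId matches an outstanding request, deliver the body.
func dispatch(id string, body []byte) {
	pending.Lock()
	ch, ok := pending.m[id]
	delete(pending.m, id)
	pending.Unlock()
	if ok {
		ch <- body
	}
}

// publishToBus is a stand-in for serializing the request and pushing it
// through nanobus; the real implementation is omitted here.
func publishToBus(id string, r *http.Request) {}

// handler registers a reply channel, publishes the request on the bus and
// then sleeps until either the reply arrives or the timeout fires.
func handler(w http.ResponseWriter, r *http.Request) {
	id := fmt.Sprintf("%d", time.Now().UnixNano()) // correlationId (illustrative)
	reply := register(id)
	publishToBus(id, r)

	select {
	case body := <-reply:
		w.Write(body)
	case <-time.After(5 * time.Second): // timeout value is an assumption
		http.Error(w, "Invalid Server Operation", http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```

Because the reply channel is buffered, a late response can still be delivered by dispatch without blocking, even if the handler has already timed out and given up.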
When I started benchmarking and optimizing this stack we had the following numbers (as reported by our statsd daemon):
- 12.5k HTTP requests per second handled;
- 99th percentile of latency: ~18 ms (99% of requests take less than 18 ms, as measured from the benchmark box);
- CPU load on the proxybox: 9 (1-minute average as reported by htop).
Here are some improvements (resulting from a few successful experiments out of dozens of failed ones):
- Replacing BSON serialization/deserialization in nanobus with simple byte manipulation (see the sketch after this list): +1k requests per second, –1 ms in latency (99th), CPU load reduced by 1;
- Switching to the new libcontainer execution driver in Docker: +0.5k requests per second, –0.5 ms in latency (99th), CPU load reduced by 0.5;
- Removing an extra byte buffer allocation in nanobus (halving the number of memory allocations per nanobus message sent): +1k requests per second, –1 ms in latency (99th), CPU load reduced by 1;
- Tweaking our statistics-capturing library to avoid string concatenation in cases where the sample is discarded afterwards: +1.5k requests per second, –1 ms latency (99th).
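To give a flavor of the first improvement in this list: instead of running every message through a BSON codec, the correlationId and payload can be framed with a fixed, hand-rolled layout. Below is a minimal sketch assuming a simple length-prefixed format; the actual wire layout in nanobus may differ:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encode frames a message as: [2-byte id length][id bytes][payload bytes].
// No reflection, no intermediate documents, just one buffer allocation.
func encode(correlationId string, payload []byte) []byte {
	buf := make([]byte, 2+len(correlationId)+len(payload))
	binary.BigEndian.PutUint16(buf, uint16(len(correlationId)))
	copy(buf[2:], correlationId)
	copy(buf[2+len(correlationId):], payload)
	return buf
}

// decode reverses encode; the payload is returned as a sub-slice of the
// original buffer, so no extra copy is needed.
func decode(buf []byte) (string, []byte, error) {
	if len(buf) < 2 {
		return "", nil, errors.New("message too short")
	}
	n := int(binary.BigEndian.Uint16(buf))
	if len(buf) < 2+n {
		return "", nil, errors.New("truncated correlationId")
	}
	return string(buf[2 : 2+n]), buf[2+n:], nil
}

func main() {
	msg := encode("req-42", []byte(`{"user":"alice"}`))
	id, body, _ := decode(msg)
	fmt.Println(id, string(body))
}
```
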
Hence, the final results are:
- 18k HTTP requests per second;
- ~12.5 ms latency (99th percentile).
Our next steps would be to add more realistic load to this stack (like dealing with profiles, news feeds and messages), while watching the numbers go down and trying to bring them back up.