Friday, Jul 30, 2010

Challenges of the Code Documentation

Here's an interesting problem.

There are many situations where code contains a lot of important information. This code can also change very frequently.

Let's say that we need to relay this important information to somebody who is not intimately familiar with the codebase. For example:

  • Researchers who depend on the conventions and transformations in some data pumping project.
  • New users being introduced to a project via articles with a lot of samples.
  • Managers who need to know certain business constants and rules.
  • Third-party developers who have to integrate with an API and need access to the latest samples, restrictions and constraints.

Needless to say, important pieces of code could be scattered across multiple projects, adding friction for people who need to look at them quickly.

We want to keep this friction to a minimum. This increases the chances that a question can be resolved by looking at the documentation, instead of wasting time and potentially involving somebody else in the quest for answers. Saved time essentially translates into reduced expenses and faster reactions by the organization (resulting in an improved ability to compete in the market).

There might also be important contextual information about this code. It might or might not be valuable to a given party, but developers would want to write it down somewhere (enabling them to forget the details and free up Brain RAM for other tasks). Comments usually help here, but they have to stay with the code and are limited to plain text (no graphs, images, tables or even bold).

One common way of relaying this information (in some specific context) is to document the code in external docs, while including the latest snippets. However, the code tends to change a lot. This is especially true for fast-paced environments with tight feedback loops and low-friction development (and deployments).

So we have got ourselves a problem here:

  • we either need to waste time and concentration on updating the documentation after every significant code change (i.e.: a few times a day);
  • or we have to accept the fact that the documentation is out-of-date and essentially useless;
  • or we have to include links like: "for the actual details look in the method DoomsdayMachine.RefreshWorld() and any other methods it might call". We'll also need to remember to update the links, should the class be renamed or moved.

One logical solution is to have auto-generated documentation that could be compiled from some text, while automatically linking to the code sources. And it has to survive refactoring and class renames.

I know that Lokad researchers use LaTeX with some scripts for such tasks. However, the whole LaTeX thing looks like a bit of overkill here. Besides, I'm not sure it can bind to MSIL-level markers within .NET code while still providing common publishing functionality.

Ideally this would work like this:

  • Project has documentation files stored and versioned side-by-side with the sources (ideally in the same solution).
  • These documentation files are expressive enough to contain graphs, images, tables and all the other nice publishing things, while referencing some code blocks in the project.
  • Editing the documentation would be WYSIWYG-friendly, while the original document format would be friendly to version control (and to reviewing changes).
  • Changing the original code (e.g. adding a few lines at the beginning of the file, or moving a method around) should not break the documentation.
  • Whenever needed (or continuously on the integration server) these separate doc files are assembled and rendered to the desired publishing format (e.g. online docs or PDF).
  • Any document-level compilation problems are detected immediately (i.e. when building documentation).
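As a rough illustration of the wish list above, here is a minimal sketch of name-based snippet binding. It assumes a hypothetical convention that is not part of any existing tool: code blocks are wrapped in `#region doc:<name>` markers, and documentation files refer to them as `{{snippet:<name>}}`. This is a simpler, text-level stand-in for the MSIL-level markers mentioned earlier. It survives moved code and added lines, but it does not follow class renames by itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DocSnippets
{
    // Hypothetical marker convention: "#region doc:<name>" ... "#endregion".
    static readonly Regex RegionPattern = new Regex(
        @"#region\s+doc:(?<name>[\w\-]+)[^\n]*\n(?<body>.*?)^[ \t]*#endregion",
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Hypothetical placeholder syntax inside documentation files.
    static readonly Regex PlaceholderPattern =
        new Regex(@"\{\{snippet:(?<name>[\w\-]+)\}\}");

    // Collects all named snippets from a source file, wherever they are located.
    public static Dictionary<string, string> Extract(string source)
    {
        var snippets = new Dictionary<string, string>();
        foreach (Match match in RegionPattern.Matches(source))
        {
            snippets[match.Groups["name"].Value] = match.Groups["body"].Value.TrimEnd();
        }
        return snippets;
    }

    // Replaces placeholders with the latest code. An unknown snippet name
    // fails the documentation build immediately instead of going unnoticed.
    public static string Render(string documentation, IDictionary<string, string> snippets)
    {
        return PlaceholderPattern.Replace(documentation, match =>
        {
            var name = match.Groups["name"].Value;
            string body;
            if (!snippets.TryGetValue(name, out body))
                throw new InvalidOperationException("Unknown snippet: " + name);
            return body;
        });
    }
}
```

Because snippets are looked up by name rather than by line number, adding lines at the top of a file or moving a method around does not break the binding, and a broken reference surfaces as a build error.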

Does anybody have similar problems and ways of solving them? What do you think?

Thursday, Jul 29, 2010

Lokad CQRS - V1 for Windows Azure in September 2010

Quite a few developers expressed interest in using Lokad.CQRS for production purposes. We would like to thank them for the inspiration this gives to us.

There is an important note, however. Although this project is already used in high-scale production scenarios, the core design and libraries have not been stabilized yet. This means that method names and configuration conventions are likely to change as we publish more features and try to provide a more consistent usage experience across the various Lokad libraries.

There are also a few core CQRS pieces that need to be published before we have logically complete functionality for version 1 of Lokad.CQRS for Windows Azure:

  • Relational message locks.
  • View adapters for MS SQL and Azure Tables.
  • Integration with Azure storage helpers from Lokad.Cloud.
  • Web application sample (showcasing full CQRS stack).

Lokad CQRS v1 is expected in September 2010. For a consistent production experience with Lokad.CQRS, we recommend waiting until this version comes out.

Wednesday, Jul 28, 2010

Lokad CQRS - Debugging Service Bus Failures

In the previous post on the Lokad CQRS project we covered the use of Protocol Buffers as the recommended serialization for your service bus messages. Let's continue the discussion with debugging these messages and keeping an eye on our workers in general.

It is hard to beat Protocol Buffers serialization on performance, size, cross-platform support and evolution capabilities all at once. That's why Google uses it extensively for message serialization and persistence.

However, people tend to have concerns about this format: "Humans can't decipher messages by opening them in Notepad, so it's too complex." For them, this concern tends to negate all the benefits.

Well, you generally should not be able to read messages in Notepad in the first place (security concerns). However, should you really need this (in case of exceptions, for example), it is trivial to print any ProtoBuf message in a human-readable format:

PingPongCommand
{
  "Ball": 161,
  "Game": "My Game"
}

Let's dig into Sample-04 from Lokad.CQRS Guidance.

This sample is a PingPong implementation based on Sample-01. It brings along a few improvements: ProtoBuf serialization (covered in the previous article), the ability to record failing messages, and a few more features to talk about later.

As you know, the Message Handler Feature (which implements the core Service Bus functionality within the Lokad.CQRS App Engine) is capable of retrying messages that fail. If a message fails continuously (the current default value is 4 retries), then we consider it to be poison and move it to the corresponding queue, to keep processing healthy. Poison queues are auto-created, just like any other.
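The retry-then-poison behavior can be sketched in a few lines. This is a minimal in-memory illustration and not the Lokad.CQRS implementation: `RetryingDispatcher`, its `Poison` queue and the in-process retry loop are hypothetical stand-ins. With a real queue, a failed message would typically become visible again and be retried later rather than in a tight loop.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a message handler with a poison queue.
public sealed class RetryingDispatcher
{
    readonly int _maxAttempts;

    // Stand-in for the auto-created poison queue.
    public readonly Queue<string> Poison = new Queue<string>();

    public RetryingDispatcher(int maxAttempts)
    {
        _maxAttempts = maxAttempts;
    }

    // Returns true if the handler succeeded within the allowed attempts,
    // false if the message was moved to the poison queue.
    public bool Dispatch(string message, Action<string> handler)
    {
        for (var attempt = 1; attempt <= _maxAttempts; attempt++)
        {
            try
            {
                handler(message);
                return true;
            }
            catch (Exception)
            {
                // Swallow the failure and try again until attempts run out.
            }
        }
        // Every attempt failed: set the message aside to keep processing healthy.
        Poison.Enqueue(message);
        return false;
    }
}
```

Moving a message from `Poison` back to the main queue after a fix is then simply another call to `Dispatch`.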

We demonstrate this behavior by throwing exceptions at random. The ping-pong bouncing will never stop unless we get really unlucky (meaning 4 exceptions in a row, with a probability of 17% each). In that case, the failing message will go to the "samples-04-poison" queue.
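To put a number on "really unlucky", here is a small sketch of the arithmetic. It assumes that the 4 attempts fail independently with the same 17% probability; `PoisonOdds` is a name made up for this illustration.

```csharp
using System;

public static class PoisonOdds
{
    // Probability that a single message fails on every one of its attempts,
    // assuming independent attempts with the same failure rate.
    public static double PoisonProbability(double failureRate, int attempts)
    {
        return Math.Pow(failureRate, attempts);
    }

    // Expected number of messages processed up to and including the first
    // poison one (mean of a geometric distribution).
    public static double ExpectedMessagesUntilPoison(double failureRate, int attempts)
    {
        return 1.0 / PoisonProbability(failureRate, attempts);
    }
}
```

With a 17% failure rate and 4 attempts, `PoisonProbability` gives 0.17^4 = 0.00083521, or about 0.08% per message. Under these assumptions the ball bounces roughly 1,200 times on average before a message ends up in the poison queue.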

In order to fix the problem post-mortem, we need to know the exception details. For example, we could instruct our App Engine to print message processing details to the Azure blob storage. Configuration is trivial:

// we'll handle all messages incoming to this queue
builder.HandleMessages(mc =>
{
  // .... configuration skipped ....      

  // let's record failures to the specified blob 
  // container using the pretty printer
  mc.LogExceptionsToBlob(c => {});
});

This will create a text file with exception details for every message that fails in this handler. It will be located in the "errors" blob container, with content like:

PingPongCommand

  Topic          : PingPongCommand
  ContractName   : PingPongCommand
  Sender         : http://127.0.0.1:10001/devstoreaccount1/sample-04
  Identity       : 1ff9a799-f442-4d27-abd8-9dc100cc851c
  CreatedUtc     : 07/28/2010 12:24:38

Exception
=========
Type: Sample_04.Worker.BounceFailedException
Message: Bouncing failed for: Ping #161 in game 'ea'.
Source: Lokad.Cqrs.Stack
TargetSite: Void InvokeConsume(System.Object, System.Object, System.String)
StackTrace:

  // stack trace skipped

This text includes the message attributes (system information persisted by Lokad.CQRS in the Lokad Message Format) and the exception details with the preserved stack trace. Files are named using the pattern date-azure-message-id:

errors/2010-07-28-12-24-38-1ff9a799-f442-4d27-abd8-9dc100cc851c.txt
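The naming pattern can be reproduced with standard .NET formatting. The sketch below only illustrates the pattern visible in the sample file name. The `ErrorBlobNames` helper is hypothetical, and how Lokad.CQRS builds the name internally is not shown in this article.

```csharp
using System;
using System.Globalization;

// Hypothetical helper that mimics the naming pattern shown above.
public static class ErrorBlobNames
{
    // Builds "<yyyy-MM-dd-HH-mm-ss>-<message id>.txt" from the message attributes.
    public static string For(DateTime createdUtc, Guid messageId)
    {
        // The invariant culture keeps the name identical on every machine.
        var stamp = createdUtc.ToString("yyyy-MM-dd-HH-mm-ss", CultureInfo.InvariantCulture);
        return stamp + "-" + messageId.ToString() + ".txt";
    }
}
```

Since the timestamp comes first, sorting blob names alphabetically also sorts the failures chronologically.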

We can make the exception details friendlier with:

mc.LogExceptionsToBlob(c =>
  {
    c.ContainerName = "sample-04-errors";
    c.WithTextAppender(RenderAdditionalContent);
  });

The first statement provides a custom container name. The second one appends a custom content renderer, which uses the JSON.NET library. This renderer might look like:

static void RenderAdditionalContent(UnpackedMessage message, 
  Exception exception, TextWriter builder)
{
  builder.WriteLine("Content");
  builder.WriteLine("=======");
  builder.WriteLine(message.ContractType.Name);
  try
  {
    // we'll use JSON serializer for printing messages nicely
    var text = JsonConvert.SerializeObject(message.Content, Formatting.Indented);
    builder.WriteLine(text);
  }
  catch (Exception ex)
  {
    builder.WriteLine(ex.ToString());
  }
}

This will include the actual message details in the text file:

Content
=======
PingPongCommand
{
  "Ball": 161,
  "Game": "ea"
}

You can use something like Open Source Azure Storage Explorer to keep an eye on message handling failures.

[Image: Azure Storage Explorer]

As soon as we discover a new error, we can investigate the problem, fix it and deploy a new version. If we move the poison message back to the processing queue, the process will resume where it failed. This simplifies debugging and fixing long-running server-side processes.

Also, if the problem was caused by a connectivity issue or DB failure that is known to be fixed, then there is no need to debug or deploy at all: just resend the poison messages.

However, manually checking your message queues for problems is not really the best way. It takes away precious time and focus, and does not guarantee a real-time response. We need something more flexible and efficient.

We'll discuss that in the next article in Lokad CQRS Guidance series. Meanwhile, you can download the samples and subscribe to the updates.

Dear Reader, what do you think about that?

Saturday, Jul 24, 2010

Lokad CQRS - Using Protocol Buffers Serialization for Azure Messages

Lokad CQRS, just like any other Application Engine, can use multiple serialization formats to persist and transfer messages. We've tried various options, from XML serialization to BinaryFormatter and WCF Data Contracts with binary encoding.

They all had their own issues. The serialization format that has performed best in our production scenarios is Protocol Buffers.

Protocol Buffers

Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.

protobuf-net is a great implementation of ProtoBuf serialization for .NET by Marc Gravell (you have probably seen him on Stack Overflow).

Lokad.CQRS uses ProtoBuf serialization internally for transferring and persisting system message information. ProtoBuf serialization is also the recommended approach for serializing messages in Lokad.CQRS:

  • format is extremely compact and fast, better than XML Serialization, Data Contracts or BinaryFormatter (see below);
  • format is evolution-friendly from the start (renaming, refactoring or evolving messages gets much simpler);
  • format is cross-platform.

Here's what the performance looks like compared to the other .NET options:

[Image: ProtoBuf performance comparison]

However, in Lokad.CQRS you do not need to worry about these specifics; the potential problems have already been taken care of. You just define message contracts with ProtoBuf attributes:

[ProtoContract]
public sealed class PingPongCommand : IMessage
{
  [ProtoMember(1)] public int Ball { get; private set; }
  [ProtoMember(2)] public string Game { get; private set; }

  public PingPongCommand(int ball, string game)
  {
    Ball = ball;
    Game = game;
  }

  // private parameterless constructor, used by the serializer when deserializing
  PingPongCommand() { }
}

and switch to this serialization in the domain module:

.Domain(d =>
{
  // let's use Protocol Buffers!
  d.UseProtocolBuffers();
  d.InCurrentAssembly();
  d.WithDefaultInterfaces();
})

More information is available in the "ProtoBuf in Lokad.CQRS" documentation. Additionally, Sample-04 (in the latest Lokad.CQRS code) shows an implementation of the Ping-Pong scenario with ProtoBuf.

Starting from this sample, we'll be using Protocol Buffers as the default serialization in our samples.

BTW, a side effect of using fast and compact ProtoBuf serialization is that it increases overall performance. Smaller messages are less likely to exceed the 6144-byte limit of Azure Queues. The App Engine handles messages that do exceed it by saving them in Azure Blob Storage. This essentially allows persisting messages as large as a few GB. Yet a second round-trip to Blob Storage is something that we want to avoid, if possible. ProtoBuf serialization in Lokad.CQRS significantly improves our chances here.
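The overflow idea can be sketched as follows. This is an in-memory illustration and not the App Engine code: `Envelope`, `OverflowingQueue` and the dictionary standing in for blob storage are all hypothetical, and only the 6144-byte limit comes from the text above.

```csharp
using System;
using System.Collections.Generic;

// One queue entry: either the payload itself or a pointer to a blob holding it.
public sealed class Envelope
{
    public byte[] Inline;    // set when the payload fits into the queue
    public string BlobName;  // set when the payload was moved to blob storage
}

// Hypothetical in-memory stand-ins for an Azure queue and a blob container.
public sealed class OverflowingQueue
{
    public const int QueueLimitBytes = 6144;

    public readonly Queue<Envelope> Messages = new Queue<Envelope>();
    public readonly Dictionary<string, byte[]> Blobs = new Dictionary<string, byte[]>();

    public void Send(byte[] payload)
    {
        if (payload.Length <= QueueLimitBytes)
        {
            // Small message: a single queue operation is enough.
            Messages.Enqueue(new Envelope { Inline = payload });
            return;
        }
        // Large message: store the body in a blob and enqueue only a reference.
        var blobName = Guid.NewGuid().ToString();
        Blobs[blobName] = payload;
        Messages.Enqueue(new Envelope { BlobName = blobName });
    }

    public byte[] Receive()
    {
        var envelope = Messages.Dequeue();
        if (envelope.BlobName == null)
            return envelope.Inline;

        // The second round-trip that compact serialization helps to avoid.
        var payload = Blobs[envelope.BlobName];
        Blobs.Remove(envelope.BlobName);
        return payload;
    }
}
```

The smaller the serialized message, the more often `Send` and `Receive` stay on the single-operation path.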

By the way, if you still think that ProtoBuf makes it hard to view and debug failing messages, check out the next article!

Saturday, Jul 24, 2010

Lokad CQRS - Message Throttling and Auto Scaling in the Cloud

There was an interesting question on implementing message throttling with Lokad.CQRS in Windows Azure recently.

Message throttling answers questions like:

How can I throttle messages to ensure that a specific endpoint does not get overloaded, or we don't exceed an agreed SLA with some external service?

In essence, we might need to ensure, say, that the handler does not get more than 3 messages of some kind per second to process. A technical solution to this problem with Lokad.CQRS (leveraging an advanced scheduler scenario) has been posted in the community (feedback is welcome).
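For illustration only, the "no more than 3 messages per second" rule can be expressed as a sliding-window check. This is not the scheduler-based solution posted in the community. `SlidingWindowThrottle` is a hypothetical in-memory class, so its state is lost whenever the worker restarts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory throttle: at most `limit` messages per `window`.
public sealed class SlidingWindowThrottle
{
    readonly int _limit;
    readonly TimeSpan _window;
    readonly Queue<DateTime> _accepted = new Queue<DateTime>();

    public SlidingWindowThrottle(int limit, TimeSpan window)
    {
        _limit = limit;
        _window = window;
    }

    // Returns true if a message may be processed at the given moment.
    // The caller passes the current time, which keeps the class easy to test.
    public bool TryAcquire(DateTime nowUtc)
    {
        // Forget messages that have already left the window.
        while (_accepted.Count > 0 && nowUtc - _accepted.Peek() >= _window)
        {
            _accepted.Dequeue();
        }
        if (_accepted.Count >= _limit)
            return false; // the caller should delay or reschedule the message

        _accepted.Enqueue(nowUtc);
        return true;
    }
}
```

A throttle for the example above would be created as `new SlidingWindowThrottle(3, TimeSpan.FromSeconds(1))`. A message that is refused is not dropped; it is up to the caller to put it back for later processing.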

However, let's consider the implications of this scenario. For example, we are throttling messages per customer, to ensure that if customer X sends 100 messages at once, this does not slow down responsiveness for customer Y.

If we implement throttling, this will solve the problem for customer Y. Yet a proper implementation (resistant to Azure Worker Role recycles and other failures) will add considerable complexity to the solution, increasing maintenance and development costs. As we know from experience, linear increases in solution complexity tend to raise related costs exponentially until they become prohibitively high.

Another side effect is that our customer X (which might be a rather important customer) will get a somewhat slower experience, losing some business value.

The third problem: message throttling does not address the situation where the number of customers increases and at some point 2000 of them just happen to send 5-10 messages each in a short interval of time (customers tend to do that). The numbers might well be under the SLA, so throttling will not apply at all. Yet every single customer will experience a slow-down in processing.

Fortunately, in a CQRS scenario, at least the UI will stay extremely responsive for all read and browsing operations under such usage spikes.

How do we deal with such a problem? I would advise considering a scale-out scenario instead of passive message throttling. Since we are running in the cloud (or can burst tasks to it), we can just ask it to increase the number of workers processing messages in response to the increased load.

Alternatively, we can actively track average loads (e.g. message waiting and processing time by message type and some properties), predict them and deploy more workers in advance.
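A back-of-the-envelope version of such a scaling decision is sketched below. The formula and every name in it are assumptions made for this illustration, not Lokad.Cloud or Lokad.CQRS functionality. It simply asks how many workers are needed to drain the current backlog within a target time.

```csharp
using System;

// Hypothetical sizing rule for a scale-out decision.
public static class ScaleOut
{
    // queuedMessages       - current backlog in the queue
    // avgSecondsPerMessage - measured average processing time per message
    // targetDrainSeconds   - how quickly we want the backlog to be gone
    public static int WorkersNeeded(
        int queuedMessages,
        double avgSecondsPerMessage,
        double targetDrainSeconds,
        int minWorkers,
        int maxWorkers)
    {
        // Total work in seconds, divided by the time we are willing to wait.
        var raw = (int)Math.Ceiling(queuedMessages * avgSecondsPerMessage / targetDrainSeconds);

        // Never go below the baseline or above the budget.
        return Math.Max(minWorkers, Math.Min(maxWorkers, raw));
    }
}
```

For instance, if 2000 customers send 10 messages each, every message takes half a second and the backlog should be drained within a minute, the rule asks for 167 workers before the upper bound is applied. Predictive scaling would feed forecast values into the same calculation instead of the current queue length.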

At the moment Lokad.CQRS does not have dynamic scalability capabilities. Lokad.Cloud has this in the form of Auto Scaling, which works rather nicely in real-world scenarios. However, rather soon we might need to introduce this to Lokad CQRS as well (either in the form of a reusable module coming from Lokad.Cloud or as separate scalable host functionality).

Any feedback on such feature is welcome, as it will help to shape such functionality to be compatible with the needs of the community and not just the self-tuning CQRS vision of Lokad.

So, what do you think?