Migrating Legacy Systems to Event Sourcing

These days I'm working on migrating a really old legacy system towards a simplified CQRS/DDD design with event sourcing for the cloud.

As part of the migration process, I'm reverse engineering a legacy SQL database into a stream of events. These events are not a precise representation of what happened in the past (that exact information is irreversibly lost, as in almost any data-driven system), but rather a pretty good estimate that can be used to prepopulate the new version.

Essentially, reverse engineering events is about writing a throw-away utility that scans database tables (or MS Access files, or punch cards) and spits out events that can be used to reproduce that state.

For instance, consider this customer record in a DB table:

Customer {
  Name : "GoDaddy",
  Id : SomeGuid,
  Created : 2008-13-12,
  Status : Deleted,
  Phone : "111-22-22",
  Reason : "Supporting SOPA was poor PR move"
}

This record could be reversed into the following events:

{
  Id: SomeGuid,
  Name: "GoDaddy",
  Created: 2008-13-12
}

{
  Id: SomeGuid,
  Phone: "111-22-22"
}

{
  Id: SomeGuid,
  Reason: "Supporting SOPA was poor PR move",
  Deleted: 2011-12-24
}

Note that we actually had to improvise while coming up with this event stream: the date of deletion was not stored in the original database (that information was simply being lost). So we just substitute some predefined date here (e.g. the date of the upgrade to CQRS/DDD+ES).
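To make the idea concrete, here is a minimal sketch in Python of what such a throw-away converter might look like for the record above. The event names (CustomerCreated, CustomerPhoneAdded, CustomerDeleted), the dict-shaped row and the MIGRATION_DATE constant are assumptions made for illustration only; they are not the names used in the real system.

```python
from dataclasses import dataclass
from typing import Iterator

# Hypothetical event types; the names are illustrative assumptions.
@dataclass(frozen=True)
class CustomerCreated:
    id: str
    name: str
    created: str

@dataclass(frozen=True)
class CustomerPhoneAdded:
    id: str
    phone: str

@dataclass(frozen=True)
class CustomerDeleted:
    id: str
    reason: str
    deleted: str

# Assumed stand-in for dates the legacy schema never recorded.
MIGRATION_DATE = "2011-12-24"

def reverse_customer(row: dict) -> Iterator[object]:
    """Turn one legacy customer row into a plausible sequence of events."""
    yield CustomerCreated(row["Id"], row["Name"], row["Created"])
    if row.get("Phone"):
        yield CustomerPhoneAdded(row["Id"], row["Phone"])
    if row.get("Status") == "Deleted":
        # The deletion date was never stored, so we substitute a fixed one.
        yield CustomerDeleted(row["Id"], row.get("Reason", ""), MIGRATION_DATE)

# Dates are kept as the raw strings found in the legacy record.
row = {
    "Id": "SomeGuid",
    "Name": "GoDaddy",
    "Created": "2008-13-12",
    "Status": "Deleted",
    "Phone": "111-22-22",
    "Reason": "Supporting SOPA was poor PR move",
}
events = list(reverse_customer(row))
```

The real utility would loop over every row of every relevant table and append the resulting events to a stream; the per-row function above is the only interesting part.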

When you have a system with a few years of history, quite a few events are generated. The system that I'm currently migrating has data that dates back to the early days of Lokad, so 300-400 thousand events is to be expected.

As part of the development process, these events are run through the aggregate state objects and also through the projections. The goal here is to pass all possible sanity checks and get read models that exactly match the UI currently visible in the old system. If the new system looks and behaves exactly like the old one (even if the guts are completely simplified), then we are moving in the right direction.
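The replay step could be sketched roughly as follows, again in Python. Everything here is hypothetical: the two event types, the CustomerState checks, the shape of the read model, the "Active" status value and the legacy_view dictionary standing in for whatever the old UI shows.

```python
from dataclasses import dataclass

# Hypothetical event types, reduced to the fields this sketch needs.
@dataclass(frozen=True)
class CustomerCreated:
    id: str
    name: str

@dataclass(frozen=True)
class CustomerDeleted:
    id: str
    reason: str

class CustomerState:
    """Aggregate state rebuilt purely by applying events; doubles as a sanity check."""
    def __init__(self) -> None:
        self.created = False
        self.deleted = False

    def apply(self, event: object) -> None:
        if isinstance(event, CustomerCreated):
            if self.created:
                raise ValueError("customer created twice")
            self.created = True
        elif isinstance(event, CustomerDeleted):
            if not self.created or self.deleted:
                raise ValueError("cannot delete a missing or already deleted customer")
            self.deleted = True

def project_customer_list(events) -> dict:
    """Projection: builds a read model mapping customer id to name and status."""
    view = {}
    for e in events:
        if isinstance(e, CustomerCreated):
            view[e.id] = {"name": e.name, "status": "Active"}  # assumed status value
        elif isinstance(e, CustomerDeleted):
            view[e.id]["status"] = "Deleted"
    return view

events = [
    CustomerCreated("SomeGuid", "GoDaddy"),
    CustomerDeleted("SomeGuid", "Supporting SOPA was poor PR move"),
]

# 1. Run every event through the state object of its aggregate.
states = {}
for e in events:
    states.setdefault(e.id, CustomerState()).apply(e)

# 2. Build the read model and diff it against what the old system displays.
new_view = project_customer_list(events)
legacy_view = {"SomeGuid": {"name": "GoDaddy", "status": "Deleted"}}  # assumed legacy data
mismatches = {
    key
    for key in legacy_view.keys() | new_view.keys()
    if legacy_view.get(key) != new_view.get(key)
}
```

An empty mismatches set means the new read model agrees with the old screen for every customer; anything left in it is a candidate for closer inspection.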

Obviously, a lot of problems show up during this process, especially with logically inconsistent or corrupt data (e.g. accounting inconsistencies caused by race conditions and deadlocks in the legacy database). These things generally have to be resolved manually; there is no silver bullet.
