Friday, November 23, 2007

How to create a canonical data model?

When used properly, the canonical form can provide great benefits in an SOA world, such as loose coupling of applications, ease of integration maintenance, and a common understanding of information. When used improperly, it can create a maintenance nightmare. So, how do you create the canonical form for an object?

I have seen several different approaches to addressing this issue that fall into two basic categories: create a superset of information or create the minimal subset of information. The sections below describe those approaches and their drawbacks.

Use all of the information that is available in the source system since sooner or later some application will want it.
This approach is probably the simplest way to create a canonical format: every piece of information that is known about an object is passed around, regardless of its usefulness to other applications. This approach has several easily identifiable drawbacks:
  • The size of your canonical form will be unnecessarily large for what is needed. This means that the consuming applications will have to sift through all of the unneeded data to get to what they need. This adds to the potential for development bugs as developers try to figure out which fields they need to use.
  • The application that produces the event must needlessly create and provide information that no other application will ever use.
  • The XML representation of the form will be large and will consume a lot of resources when creating, transporting, and consuming the message.
  • The benefit of this approach is that the canonical form already has all of the information that can be provided, so you will never have to modify it (until the source application is enhanced with new information :)).

Try to think of every possible piece of information that any application now or in the future could ever need.
This approach is another way to create a superset of fields in the canonical form.
  • This approach obviously will take a long time to determine all of the information that will be required by receiving applications.
  • I have seen several integration projects try to take this approach and fail before they even define the canonical form and start to integrate things.
  • Each application will have its own unique set of information that it requires, and all of the other applications will need to sift through it needlessly.
  • The benefit of this approach is that if you can actually create a true superset, then you will never have to modify your canonical form.
  • If you can predict what information future applications will need, you are in the wrong business; you should be predicting the stock market.

Create a base form and then as new applications are added the new data fields are added to the canonical format.
This approach starts out as a minimal subset approach and then quickly turns into a superset.
  • The canonical form will grow to be extremely large as more and more applications are added.
  • Each time the canonical form is modified/updated all of the consuming applications need to account for the changes.
  • Eventually you will end up with the unmanageable supersets described above.
  • The benefit of this approach is that you can create the initial canonical form fairly quickly, since it will contain only the information required by the currently known consuming applications.

Provide only the minimal amount of information required to identify an object as unique.
  • This approach is obviously the simplest to create. All that is required are the fields that make an object unique.
  • This approach normally leads to unnecessary work/overhead because now every application needs to make an extra service call back to the source to retrieve the information that is not in the message. As more and more applications are added the burden on the source application becomes greater until it can no longer handle the load.
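To make the load problem concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case, where every consuming application calls back to the source once for every message it receives; the function name and the traffic figures are made up for illustration.

```python
def callbacks_per_hour(events_per_hour, consumers):
    """Extra lookups hitting the source when messages carry only the keys.

    Assumes every consumer makes exactly one call back per message.
    """
    return events_per_hour * consumers


# Hypothetical traffic: 1,000 events per hour from the source application.
for consumers in (2, 10, 50):
    print(consumers, "consumers ->", callbacks_per_hour(1_000, consumers), "lookups/hour")
```

The event rate stays the same, but the number of lookups the source has to answer grows with every application that is added.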
So, what is the answer? How can you create the canonical format that provides all of the perceived benefits without the maintenance nightmare?

There are a couple of good approaches to take and rules of thumb that will provide the balance between too much information and too little information in your canonical form.

  1. Start small - Begin with the fields that make an object unique.
  2. Add common fields - Add the fields that are common among most of the consuming applications that you are currently working with. This one can be tricky, so I normally go with the rule of thumb that if 80% of the applications need it, I should provide it.
  3. Add information that is expensive to retrieve later - If there is information that is not in the message that is required by a consuming application, the application will need to retrieve it from the source application. If retrieving it from the source application is expensive (either the information is expensive to recreate or there would be a high volume of retrievals), then add that information to the canonical form.
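The three rules above can be sketched as a small filter over a survey of candidate fields. This is only an illustration under my own assumptions: the `FieldSurvey` record, the `select_canonical_fields` function, and the sample fields and counts are hypothetical, and in practice the "expensive to retrieve" flag is a judgment call rather than a boolean someone hands you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSurvey:
    """What is known about one candidate field (all values hypothetical)."""
    name: str
    is_identity: bool         # part of what makes the object unique
    consumers_needing: int    # how many consuming applications want it
    expensive_to_fetch: bool  # costly to recreate, or fetched at high volume


def select_canonical_fields(surveys, total_consumers, threshold=0.8):
    """Apply the three rules in order and return the chosen field names."""
    chosen = []
    for s in surveys:
        share = s.consumers_needing / total_consumers
        if s.is_identity:
            chosen.append(s.name)   # rule 1: start small
        elif share >= threshold:
            chosen.append(s.name)   # rule 2: needed by 80% of consumers
        elif s.expensive_to_fetch and s.consumers_needing > 0:
            chosen.append(s.name)   # rule 3: expensive to retrieve later
    return chosen


# Hypothetical survey across five consuming applications.
surveys = [
    FieldSurvey("ssn", True, 5, False),
    FieldSurvey("home_address", False, 4, False),   # 4 of 5 = 80%
    FieldSurvey("fax_number", False, 1, False),     # 1 of 5 = 20%
    FieldSurvey("credit_history", False, 2, True),  # rare but expensive
]
print(select_canonical_fields(surveys, total_consumers=5))
```

With these made-up numbers the fax number stays out of the form: only one application wants it, and that application can afford to ask the source for it.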
Following these simple rules should prevent you from changing the form a lot. If the canonical form is good you should only need to change it when:
  • The source application is enhanced/upgraded to contain additional information that is needed by 80% of the consuming applications.
  • The cost of consuming applications retrieving additional information from the source application becomes too high.
If you find that your canonical form is changing a lot after it is initially created, do not just add new fields every time. Step back and try to determine why all of the additional fields are required and how they were missed in the initial creation process.

In the end, the canonical form should contain 80-90% of the information that is required across all of the consuming applications (that is, 80-90% of the ultimate superset). This will minimize the time spent creating the initial canonical form and reduce the number of times that the canonical form needs to change.

There are also several standards organizations (like the Object Management Group) that have already created object representations that can be used as a starting point. These forms have been thought out over the years by many industry experts who are very knowledgeable about their respective spaces. Treat these forms as a starting point only: they normally contain a lot of additional information that you may not need and can trim out.

Now that you have a canonical form in place, the dirty word "governance" comes into play as other developers in other groups need information that is not already in the canonical form. A governance model must be put in place to prevent fields from being added to the form just to make integrating with a single consuming application easier. Governance must look at the use of the canonical form as a whole and not just change the form to meet the needs of one application, as this will lead to a complete superset and too many changes to the form.

What is a canonical data model?

Recently, I have often been asked, "What is a canonical data model?" In the SOA world, the term "canonical data model" is thrown around as a way to impress the lay person. In reality, the canonical data model is a simple concept:


The canonical data model (CDM) is a representation of common information produced and consumed by applications. The CDM is normally used to publish events in the form of messages out of one application into several other applications. The CDM is used as the format of the message so that all of the receiving applications know what information to expect in the message.


For example, let's create the canonical data model for a person. The CDM for a person must include information that uniquely identifies that person:

  • Name
  • Birth Date
  • Social Security Number/Passport number

The information that is unique must be included in the CDM so that it is easy to distinguish if two person messages relate to the same person or different people. That is the lowest common denominator for a canonical message. A canonical message normally contains other useful information about an object that is used by most of the receiving applications. Continuing to build out the CDM for our person object, these other useful fields could be included:

  • Work Address
  • Home Address
  • Work phone number
  • Home phone number
  • Cell phone number
  • Fax number
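As a rough sketch, the person CDM so far could be written down as a small Python data class. The class name `PersonMessage`, the field names, and the `same_person` helper are my own choices for illustration; the point is only that the identifying fields are always present while the other fields are optional.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PersonMessage:
    # Identifying fields: every message must carry these.
    name: str
    birth_date: date
    government_id: str  # Social Security number or passport number
    # Other useful fields: optional, since not every source will have them.
    work_address: Optional[str] = None
    home_address: Optional[str] = None
    work_phone: Optional[str] = None
    home_phone: Optional[str] = None
    cell_phone: Optional[str] = None
    fax: Optional[str] = None

    def identity(self):
        """The fields that decide whether two messages describe one person."""
        return (self.name, self.birth_date, self.government_id)


def same_person(a, b):
    """Compare only the identifying fields and ignore everything else."""
    return a.identity() == b.identity()


# Placeholder data, not real people or real numbers.
a = PersonMessage("Pat Example", date(1970, 1, 1), "000-00-0000", home_phone="555-0100")
b = PersonMessage("Pat Example", date(1970, 1, 1), "000-00-0000", cell_phone="555-0101")
c = PersonMessage("Pat Example", date(1980, 1, 1), "000-00-0001")
print(same_person(a, b), same_person(a, c))
```

Messages `a` and `b` carry different phone numbers but describe the same person, because only the identifying fields are compared.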

As you can see the list of information describing a person can go on and on. The information each application needs to know about a person can vary greatly, so how do you know what to put in the canonical data model? In my next post, I will talk about how to create a canonical data model that provides the benefits without running into a maintenance nightmare.