Part 2 of our article “Data normalization reconsidered” is now available at

http://www.ibm.com/developerworks/data/library/techarticle/dm-1201normalizationpart2/index.html

The second part discusses alternatives to a traditional normalized relational representation of data. Such alternatives include for example XML, JSON, and RDF because they can often help you overcome normalization issues or improve schema flexibility, or both. In the 21st century, digitized business records are often created in XML to begin with, which makes XML an attractive choice as the database level storage format.

This article also contains a performance comparison between XML and relational data that was conducted for a real-world application scenario at a major international logistics company.

At the end of the article you find comparison tables that summarize the pros and cons of different data representations.

Advertisements

Normalization is a design methodology for relational database schemas and aims to minimize data redundancy and avoid data anomalies, such as update anomalies. The consequence of normalization is that business records (such as a purchase order, an insurance claim, a financial transaction, etc.) are split into pieces that are scattered over potentially many relational tables.

In addition to its benefits, normalization also introduces several drawbacks:

  • The insert of a single logical business record requires the insertion of multiple (often many) physical rows
  • The retrieval of a single business record requires complex multi-way joins or a series of separate queries
  • Business records undergo a potentially expensive conversion from their original representation outside the database to a normalized format and back
  • The normalized representation of business records is often difficult to understand because it is very different from the original format of the business record, such as a paper form or an XML message.

These issues raise the question whether normalization should be applied as categorically as some people believe. Indeed, there are several reasons for reconsidering normalization, such as:

  • Throughout history, humans have always stored their business records “intact”, and it was only the introduction of databases that has “required” normalization
  • Normalization was introduced when storage space was extremely scarce and expensive, which is not (or much less) the case anymore today
  • Today, business records are often much more complex than they used to be in the 1970s when normalization was introduced, and this complexity amplifies the disadvantages of normalization
  • De-normalization is becoming more and more popular, e.g. in star schemas for data warehousing, but also in emerging storage systems such as HBase, Google’s BigTable, etc.

Today, business records are often created and exchanged in a digital format, and this format is often XML. XML is a non-normalized data format that can provide several benefits:

  • A single business record often maps naturally to a single XML document
  • A single business record/XML document can be inserted (and retrieved) in an XML database as a single operation
  • If you store XML as XML (i.e. without conversion to relational), the representation of a business record is the same inside and outside the database, which is tremendously valuable

When business records already exist in XML format outside the database anyway, then it is usually best to also store them as XML and not to convert into a normalized relational schema.

My colleague Susan Malaika and I have collected our thoughts and observations on normalization in a 2-part article titled “Data Normalization Reconsidered“. The first part has recently been published on developerWorks and can be found here:

http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/index.html

The 2nd part will appear soon. Happy reading!