Data Normalization Reconsidered

January 8, 2012

Normalization is a design methodology for relational database schemas and aims to minimize data redundancy and avoid data anomalies, such as update anomalies. The consequence of normalization is that business records (such as a purchase order, an insurance claim, a financial transaction, etc.) are split into pieces that are scattered over potentially many relational tables.

In addition to its benefits, normalization also introduces several drawbacks:

  • The insert of a single logical business record requires the insertion of multiple (often many) physical rows
  • The retrieval of a single business record requires complex multi-way joins or a series of separate queries
  • Business records undergo a potentially expensive conversion from their original representation outside the database to a normalized format and back
  • The normalized representation of business records is often difficult to understand because it is very different from the original format of the business record, such as a paper form or an XML message.

These issues raise the question whether normalization should be applied as categorically as some people believe. Indeed, there are several reasons for reconsidering normalization, such as:

  • Throughout history, humans have always stored their business records “intact”, and it was only the introduction of databases that has “required” normalization
  • Normalization was introduced when storage space was extremely scarce and expensive, which is not (or much less) the case anymore today
  • Today, business records are often much more complex than they used to be in the 1970s when normalization was introduced, and this complexity amplifies the disadvantages of normalization
  • De-normalization is becoming more and more popular, e.g. in star schemas for data warehousing, but also in emerging storage systems such as HBase, Google’s BigTable, etc.

Today, business records are often created and exchanged in a digital format, and this format is often XML. XML is a non-normalized data format that can provide several benefits:

  • A single business record often maps naturally to a single XML document
  • A single business record/XML document can be inserted (and retrieved) in an XML database as a single operation
  • If you store XML as XML (i.e. without conversion to relational), the representation of a business record is the same inside and outside the database, which is tremendously valuable

When business records already exist in XML format outside the database anyway, then it is usually best to also store them as XML and not to convert into a normalized relational schema.

My colleague Susan Malaika and I have collected our thoughts and observations on normalization in a 2-part article titled “Data Normalization Reconsidered“. The first part has recently been published on developerWorks and can be found here:

http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/index.html

The 2nd part will appear soon. Happy reading!

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 51 other followers

%d bloggers like this: