Data Normalization Reconsidered
January 8, 2012
Normalization is a design methodology for relational database schemas and aims to minimize data redundancy and avoid data anomalies, such as update anomalies. The consequence of normalization is that business records (such as a purchase order, an insurance claim, a financial transaction, etc.) are split into pieces that are scattered over potentially many relational tables.
In addition to its benefits, normalization also introduces several drawbacks:
- The insert of a single logical business record requires the insertion of multiple (often many) physical rows
- The retrieval of a single business record requires complex multi-way joins or a series of separate queries
- Business records undergo a potentially expensive conversion from their original representation outside the database to a normalized format and back
- The normalized representation of business records is often difficult to understand because it is very different from the original format of the business record, such as a paper form or an XML message.
These issues raise the question whether normalization should be applied as categorically as some people believe. Indeed, there are several reasons for reconsidering normalization, such as:
- Throughout history, humans have always stored their business records “intact”, and it was only the introduction of databases that has “required” normalization
- Normalization was introduced when storage space was extremely scarce and expensive, which is not (or much less) the case anymore today
- Today, business records are often much more complex than they used to be in the 1970s when normalization was introduced, and this complexity amplifies the disadvantages of normalization
- De-normalization is becoming more and more popular, e.g. in star schemas for data warehousing, but also in emerging storage systems such as HBase, Google’s BigTable, etc.
Today, business records are often created and exchanged in a digital format, and this format is often XML. XML is a non-normalized data format that can provide several benefits:
- A single business record often maps naturally to a single XML document
- A single business record/XML document can be inserted (and retrieved) in an XML database as a single operation
- If you store XML as XML (i.e. without conversion to relational), the representation of a business record is the same inside and outside the database, which is tremendously valuable
When business records already exist in XML format outside the database anyway, then it is usually best to also store them as XML and not to convert into a normalized relational schema.
My colleague Susan Malaika and I have collected our thoughts and observations on normalization in a 2-part article titled “Data Normalization Reconsidered“. The first part has recently been published on developerWorks and can be found here:
The 2nd part will appear soon. Happy reading!