Referential Integrity and XML Data

November 18, 2008

A few people have recently asked about the ability to ensure referential integrity in XML data. In this case, they are referring to the referential integrity feature that is common to relational databases. I thought I’d take a few moments to share one possible answer with you.

Referential integrity is a relational database feature that ensures consistency is maintained between items that reference one another. For instance, if you have a database table with order information that refers to a database table with product information, referential integrity ensures that each order refers to a valid product. In technical terms, referential integrity ensures that each product ID specified in the order table (where the product ID is a foreign key) exists in the product table.
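
To make this concrete, here is a minimal sketch in plain SQL (the table and column names are illustrative):

    -- The "parent" table holding the valid products.
    CREATE TABLE product (
        product_id  INTEGER NOT NULL PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    -- Each order must reference a valid product.
    CREATE TABLE orders (
        order_id    INTEGER NOT NULL PRIMARY KEY,
        product_id  INTEGER NOT NULL,
        -- Referential integrity: the database rejects any order whose
        -- product_id is not present in the product table, and blocks
        -- the deletion of any product that orders still reference.
        CONSTRAINT fk_product FOREIGN KEY (product_id)
            REFERENCES product (product_id)
    );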

Over the years, DBAs have found referential integrity to be a valuable feature in their relational databases. And now, they are asking how to ensure referential integrity with XML data.

The XML standard does not include mechanisms for ensuring referential integrity amongst XML elements. You can use schemas to place constraints on the XML data, but these constraints are not really enforced by the database (except insofar as the database is used to validate documents against a schema). And the ability to place constraints on XML data doesn’t quite add up to the same level of consistency checking offered by referential integrity in a relational database.
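
In DB2, for example, a document can be validated against a registered XML schema as it is inserted. A sketch (the table, column, and schema names here are mine, not from any real application):

    -- Validation checks structure and data types, but it cannot check
    -- that a product ID in the document exists in some other table.
    INSERT INTO orders (order_id, doc)
    VALUES (1001,
            XMLVALIDATE(XMLPARSE(DOCUMENT '<order>...</order>')
                        ACCORDING TO XMLSCHEMA ID orderschema));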

However, there is a possible answer. This is one of the cases where a hybrid relational/XML database proves to be very useful. You extract values from certain XML elements, store those values in relational columns of the same table, and then place constraints on those relational columns to ensure referential integrity. There is a cost associated with this approach, due to the extra storage and the additional programming logic. However, for many, that cost is justified by the ability to ensure referential integrity.
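
Here is a minimal sketch of the approach in DB2-style SQL, reusing the product table from the earlier example (the table, trigger, and element names are mine, and this is a sketch rather than production DDL):

    -- Hybrid table: the full order document plus one extracted value.
    CREATE TABLE orders (
        order_id    INTEGER NOT NULL PRIMARY KEY,
        product_id  INTEGER,
        doc         XML NOT NULL,
        -- The relational constraint provides the integrity check.
        CONSTRAINT fk_product FOREIGN KEY (product_id)
            REFERENCES product (product_id)
    );

    -- Extract the product ID from the document on every insert, so the
    -- relational column cannot drift out of sync with the XML.
    CREATE TRIGGER orders_extract
        NO CASCADE BEFORE INSERT ON orders
        REFERENCING NEW AS n
        FOR EACH ROW MODE DB2SQL
        SET n.product_id =
            XMLCAST(XMLQUERY('$d/order/productID'
                             PASSING n.doc AS "d") AS INTEGER);

With this in place, an insert whose document refers to a nonexistent product fails with a foreign key violation, just as it would in a purely relational design.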

6 Responses to “Referential Integrity and XML Data”

  1. Dave Horsman Says:

    24 Nov 08

    Hi Conor,

    It has been my experience that referential integrity (RI) assures that the information required for applications will be at hand when needed. I also believe your suggestion would be very appropriate in areas such as document and image management. These comments are not a criticism of your suggestion but I believe we have to follow your reasoning where it leads.

    Though it might not seem critical, RI is useful in one regard in that it avoids routing applications through code that handles exceptions. At the same time, it is fairly rare to be missing table information on recent transactions. Having worked with databases that do not enforce “strict” RI, I am not so much in disagreement about the usefulness of RI as I view it as a bare minimum, not sufficient on its own, where databases are concerned.

    In mature applications (or in a typical business), the business rules about how long various information is kept, and under what circumstances it can be removed, are invariably more complex. I’m sure our clients would agree that “I don’t want the computer to forget important information” is an unstated assumption. Instead, there will likely be one set of involved criteria for customer information, distinct from those for transaction history, annotations, inventory, accounting, documents and so on.

    In this respect your suggestion has some merit, by allowing database “triggers” to fire when an attempt is made to delete information. This has the benefit of allowing more complex business rules, and also of keeping the logic on the database server. I believe that this is a far more practical aspect than RI alone.
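
    A rough sketch of the kind of thing I mean (DB2-flavoured SQL, with invented table and column names):

        -- Business rule: a customer may only be deleted once they have
        -- no order history less than seven years old.
        CREATE TRIGGER customer_delete_check
            NO CASCADE BEFORE DELETE ON customer
            REFERENCING OLD AS o
            FOR EACH ROW MODE DB2SQL
            WHEN (EXISTS (SELECT 1 FROM orders
                          WHERE orders.customer_id = o.customer_id
                            AND orders.order_date > CURRENT DATE - 7 YEARS))
                SIGNAL SQLSTATE '75001'
                    SET MESSAGE_TEXT = 'Customer still has recent orders';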

    Moving on from that point, given that disks and servers are not cheap and network bandwidth continues to be an important issue, your suggestion draws attention to some important trade-offs. XML, though readable by a person, is not terribly efficient in how it stores information. At the point where a company has decided to mirror critical file information, such as primary, foreign and candidate keys, in a relational database, it would probably be best to keep the rest in the database as well, since XML can easily be generated from the native database whenever it is needed.

    Though often cited for its ability to transfer database information as well as web pages, XML is particularly useful for storing documents (including software), lists, parameters and a variety of other information. It is extremely important as the “core” of the many other industry standards now in use. I would add further that there are a number of ways to reduce its inefficiency where storage and network traffic are concerned, such as JSON.

    Thus ends the part of my post that was itself in human-readable form. Thanks for your suggestion, Conor. Could you check with some of your team as to the statistical variation in property values for various entity types? At what point does the ability to leave out properties that are absent, null or have a default value become advantageous in terms of space? Given the above, I see XML as still being particularly relevant as a means of adding additional properties, headers and exceptional data.

    Kindest regards,

    Dave Horsman


  2. First of all, I definitely agree with Dave that RI in the database is very important, and its importance is sometimes underestimated. People sometimes say “RI is ensured by our application”, and while this may work in some cases, it can go wrong, stay undetected, and leave applications working with inconsistent data.

    I’d also like to comment on Dave’s statement that
    “XML, though readable by a person, is not terribly efficient in how it stores information. At the point where a company has decided to mirror critical file information, such as primary, foreign and candidate keys, in a relational database, it would probably be best to keep the rest in the database as well, since XML can easily be generated from the native database whenever it is needed.”

    Yes, XML in its textual format is verbose and not very efficient. That’s why DB2 pureXML doesn’t store it as text, but in parsed tree structures where element and attribute names are replaced by integer values and various other storage optimizations are applied. That makes a huge difference for efficiency. So the idea is not to have XML documents in the file system plus some extracted values in the database, but to store the XML documents in XML-type columns of the database, with additional values in relational columns of the same table. Hybrid. This is important for consistency, backup and recoverability, and so on.
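
    To illustrate the hybrid part (a sketch with made-up table and element names): a query can filter on the relational columns, which can be indexed in the usual way, and still reach into the documents where needed:

        SELECT order_id,
               XMLQUERY('$d/order/shipTo' PASSING doc AS "d")
        FROM   orders
        WHERE  product_id = 42                        -- relational predicate
          AND  XMLEXISTS('$d/order[status = "open"]'  -- XML predicate
                         PASSING doc AS "d");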

    I’m not sure if Dave meant to say that you might as well shred the XML data completely to relational tables because it’s easy to generate the XML documents again. Often, this approach does not work well, especially if the XML format is complex and variable so that shredding and reconstructing is (prohibitively) complex. I have seen applications where each XML document had between 200 and 2000 nodes, and the optimal solution was to store each full document in an XML column and have just 5 element values duplicated in relational columns of the same row.

    I’m not sure I fully understand what Dave means by “statistical variation in property values for various entity types”. But I can comment on his question “At what point does the ability to leave out properties that are absent, null or have a default value become advantageous in terms of space?” Let’s look at some industry-specific XML Schemas. The FIXML schema (financial information exchange) defines approx. 3100 distinct elements and attributes, most of which are optional. The FpML schema (financial products markup language) defines ~2000 distinct elements and attributes, most of which are optional. In ARTS (retail) it’s about 8000, and in the IRS1120 schema about 13,000. In HL7 (health care), the clinical document architecture alone defines 1400 elements and attributes.

    For all of these (and many other) industry schemas, we see that any given instance document typically contains only a fraction of all the possible elements and attributes defined in the schema. The fact that XML allows all the non-applicable elements to be absent from an instance document is absolutely critical to keeping these documents manageable. If, for example, a FIXML message contains 200 tags out of 3100 possible tags, and does not include the remaining 2900 tags with empty values, that makes the message a lot smaller. It is faster to transmit, faster to parse, and takes less space to store in a database. All of this comes back to Dave’s very true statement that “disks and servers are not cheap and network bandwidth continues to be an important issue”.

  3. Dave Horsman Says:

    Thanks, Matthias, for your response. You not only clearly understood the question, but your examples are precisely what I was looking for. In terms of shredding the data, it seems my views on storing an “element id” are hardly original, but I guess paired lists are a fairly standard tool in our industry. I am sure that your team’s choice of which columns to include and index was a carefully considered one.

    To further clarify the statistical question: beyond the answer you supplied, I was curious about the frequency of use of each element. For example, it might be safe to say “all” “client entities” have a name and 99.95% have a phone number. 98% (or far less) might have an email address, but from that point the occurrence of a given element would become progressively less predictable.
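
    Something like the following (a rough sketch, assuming the documents sit in an XML column called doc) is the kind of measurement I have in mind:

        -- How many client documents actually contain an email element?
        SELECT COUNT(*) AS total_clients,
               SUM(CASE WHEN XMLEXISTS('$d/client/email' PASSING doc AS "d")
                        THEN 1 ELSE 0 END) AS clients_with_email
        FROM   clients;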

    HL7 would be an excellent place to apply analysis of this sort. Given that your strategy is already efficient in terms of storage, it has a great deal of potential, both in terms of the broad range of information that can be kept and in terms of the benefits such analysis could offer.

    Thanks again,

    Dave H.

  4. Dave Horsman Says:

    Hi Conor,

    I seem to be making a career of writing this last week, but fear not, I won’t quit my day job. Your RI posting regarding XML and hybrid technology requires a close look at relational theory, its limitations, and the appropriateness of using XML with bulk transaction processing. As I invited comments on your own technology, I felt I should repost here to be fair. It might be appropriate for some of the IBM crew to visit ITtoolbox’s DW group and enlighten the masses. I know I could still use some enlightenment.
    ===================================================
    WILL XML KILL THE DATA WAREHOUSE?
    The following post was in response to a posting warning against the use of XML in data warehouses, with warnings not to break Codd’s rule 10 and not to perform processing in the database.
    Link: http://datawarehouse.ittoolbox.com/groups/message.asp?type=strategy-planning&v=dw-projectmanagement&i=2493497
    ====================================================

    Thank you for responding with additional information; your experience in this area is appreciated. I was not disagreeing with any of your original post so much as placing focus on the importance of you DW folks getting involved (IN XML) and why. Though I would like to see you continue experimenting, it would be, um… silly… to ever store transactional data in its native XML format, no matter how few transactions a business has or how they were delivered over a distributed network.

    I had to research to see if there were special RULES FOR BI DW, as this is the second time I have seen references to RULE #? (CODD’S RULE 10) in this forum, and I make no claims to being the foremost expert in DW. I will do a new post on this subject, but in summary I must stress that database experts frequently make use of denormalization to gain performance improvements in their designs. Performance is everything in bulk transaction processing.

    TRIGGERS and processing in databases are based upon leveraging the set theory built into these systems: not moving database data out to code, calculating, and then returning it to the database. It is optimization related to processor, network and memory usage. This is not denormalization in the traditional sense, but there is a very sound rationale behind these technologies.

    I am a very strong advocate of the use of mathematics in database theory and of the need for advances in this area. At the same time, Codd’s attempt at formalizing database theory through set theory is excellent, and it reflects the deeper intent of the vast and complex schemas produced in XML, without the penalty of forcing an instance of every column in the schema to be present. Relational database purists have the same appreciation as I do for the importance of mathematics in databases and the cognitive sciences, but it is imperative that we get past these repetitive debates on theory and advance these sciences into new territory.

    All large enterprise databases and applications avoid any but the most basic processing like the plague. You folks in DW and BI are the technologists who led business to see the wisdom of doing more with this data, but still very efficiently.

    At the small to medium scale, where bulk is not such an issue, other requirements are a higher priority. Comments in the SQL world about how horrid concepts such as cursors are can be quite amusing, but they are not incorrect; they come from a different perspective. I appreciate the rationale being applied here and why DW’s views lean this way.

    In our applications we don’t do badly in dealing with transaction volumes, and we strongly emphasize performance and disk storage optimization, but we deal with different requirements. In my own work I continue to draw on work in the DW and back-office areas, which I regard as quite important. Hardware and performance are a big issue at any scale, and your strategies and “rules” translate to a competitive advantage for applications and enterprises of any size.

    I have no issue with your point, in that you are speaking of millions of transactions where performance and TCO depend on extreme efficiency and optimization. Your comment regarding engines is a very good example; I would love to hear what you would say regarding SQL cursors, which I have no doubt would not be too kind.

    I was wondering if you had any chance to check the performance stats on IBM’s native XML and similar developing technologies. They claim they are getting to the point where these can co-exist and perform well, keeping in mind that the bulk transaction aspects are already well handled.

    I would really appreciate additional comments from yourself or anyone else in this group, particularly if you could quantify any of these concerns or experiments.

    To return to Patrick: the man is giving you very sound advice here in plain English (NOT TO STORE XML IN NATIVE FORMAT), despite our little sidebar on processing in the database. He is quite correct; ignore it at your peril. There are many ways to misuse these technologies.

    Respectfully,

    Dave Horsman


  5. Hi Dave,

    There is no question that XML is verbose. This is because XML is human readable. In many environments, the fact that XML is human-readable is a great advantage when different groups are working with the data.

    There is also no question that working with data in XML format typically takes longer than working with data in a “fixed schema”. However, in many environments, XML’s flexibility introduces other efficiencies that outweigh the slower processing: for instance, more efficient storage or easier development, which can have a very real impact on IT budgets.

    There is no one-size-fits-all answer for representing data. A choice of whether to persist in native XML format should be made on a case-by-case basis, and depends heavily on the environment and intended use of the information.

    The verbosity of XML is also becoming less of an issue, because databases are addressing it in innovative ways. For instance, DB2 replaces XML tag names with unique integers, maintaining the tag names in a lookup table. This not only removes much of the bloat, it also allows for faster performance, because XML node comparisons become integer-based operations instead of string-based operations. For example, if you look at the recent 1TB benchmark, the 1TB of raw XML data requires less than 450GB of storage thanks to these techniques and others. The XML bloat argument is not as strong as it once was.

    As regards the analysis of data in native XML format… comparing it to the analysis of data in a traditional relational format is not an apples-to-apples comparison. The analysis of data in fixed-schema tables is fundamentally different from the analysis of data in tree-like structures. That being said, there are two things to keep in mind:

    First, recent benchmarks demonstrate that the performance of native XML storage is making significant progress, progress that may surprise people who have previously dismissed it. For transactional systems in particular, check out the recent TPoX results.
    Second, I wouldn’t be surprised if we soon see some of the techniques that the industry uses for the analysis of RDBMS data adapted for use with native XML data. And I wouldn’t be surprised to see additional innovations in this regard as well.

    Like any trade-off, there are advantages and disadvantages to persisting in a native XML format. There are times when performance is better for native XML storage, and there are times when performance is better using other representations. Sometimes other considerations, like development costs, admin costs, storage costs, and so on outweigh performance considerations.


  6. The original discussion in this post is from the point of view of someone who works in a relational database environment. Although XML itself does not include mechanisms to ensure the kinds of referential integrity that relational database administrators are seeking, there are a number of technologies related to XML (such as XML Schema, Schematron, and so on) that do allow you to ensure various types of referential integrity.

