Andrew Eisenberg is co-chair of the W3C XML Query Working Group. Andrew and I are delivering a free Webcast about XQuery. We will discuss XQuery, provide some tips for working with XQuery, and also talk about the future direction of XQuery. Make sure to sign up for this Webcast today.

Schema Flexibility

August 21, 2008

Often people overlook schema flexibility when evaluating their XML storage needs.  Before getting locked into an XML database decision, you should understand your XML schema needs and how the different XML databases support those needs.  After all, you don’t want to have a nasty surprise when implementing your application. 

An XML schema defines the structure of XML data. For example, an XML schema describes the actual XML elements and attributes that can appear in XML data, their data types, as well as describing where they can appear and how often they can appear. Often when working with XML data, you need to ensure that the XML data adheres to an XML schema. This is frequently referred to as checking that XML data is valid, or validating the XML data. Validating XML data is a good idea because it ensures that you will not encounter issues when working with the data.

Schema flexibility refers to the ability to cater for a wide range of XML schema needs; needs that go beyond simply associating a single XML schema with a database column.

Consider the situation encountered by tax authorities.  Although this scenario applies to any environment where data formats are subject to change.

An increasing amount of tax information is being filed using online forms.  Online tax forms are a more efficient and cost effective way to gather tax information.  XML is a natural choice as the format for such information because of its suitability for data exchange with other tax authorities.

Each year, there is the likelihood of minor changes to the tax forms. If the tax forms change, then it is likely that the XML schema will also change.  In such situations, simply using the new schema to validate all XML data in that column is not sufficient. Instead, you need to validate each record against the schema that was in existence when that record was created. In other words, you need the flexibility of having the cells in a database column validate against different schemas.

Another aspect of schema flexibility is the ability to enter XML data into a database without validating the XML data during the database insert.  You may want to validate the XML data as part of another step in your application.  To cater for such situations, you should ensure that your XML database management system does not force you to validate the data upon entry.  You would be surprised how often this situation arises.

These are only a couple of considerations to keep in mind.  The main point is to make sure to understand your XML validation needs, and to make sure that the XML database you choose meets those needs.

Tomorrow I’ll blog about XML schema evolution…

If you are looking for an introduction to the topics of storing and retrieving XML data, this free on-demand webcast is hopefully a good place to start:

On demand Webcast: A primer for storing and retrieving XML data

XML is unchallenged as a data exchange format. Most industries have developed standards that use XML as the data format. The list of standards is seemingly endless: ACORD, FIXML, FpML, MISMO, HL7, NIEM, OTA, SVG, TAX1120, UBL, XBRL, and so on.

Your immediate inclination is likely to store such information in its native XML format. After all, when the information is exchanged, it is expressed in XML format. Therefore it seems natural that it would also be stored in a native XML format. Often this is the case. However, it is not always the case. Sometimes it makes more sense to store the information in relational form. This blog post discusses when to use the two approaches.

The key question that you need to ask yourself is whether the information is best expressed in a relational format or in an XML format. Because XML is so popular as a platform-neutral and transparent data exchange format, data is often sent as XML even when XML is not the ideal format for the information. For instance, information is sometimes force-fitted into a hierarchical data structure even when it has no hierarchical characteristics. If this is the case, examine the data to see if it should in fact be stored in a relational format. The general rule is that you should consider a relational structure when the data is better described in tabular format. This is the characteristic that most readily defines the relational schema.

There are also situations when you may be compelled to use a relational schema. For example, if the application that consume this information work only with relational data, then you need to take this into consideration.

You should store information in XML format when:

  • The information is best represented in a hierarchal format.

    When you store hierarchical information in a relational database, you often require a large number of tables. Creating and maintaining such systems is complex and costly. Multiple joins result in complicated SQL statements. In such cases, you should instead use native XML storage.

  • The schema is subject to change.

    Because relational tables are highly-structured, changing the schema is both difficult and costly. Updating a schema for natively stored XML, on the other hand, is a relatively painless exercise. If you expect the schema to change, you should use native XML storage.

  • There are many null values in the relational tables.

    Relational database tables set aside storage for each column. If your information does not populate all of these columns, you can have a large number of null values in your database. This, in turn, affects both storage requirements and performance. Again, you should consider native XML storage in such situations.

Many people come here while searching for the best native XML database. Being employed by a vendor with a leading native XML database, I cannot make impartial judgements in this regard. So I’m not going to try. I can, however, make sure you you know how easy it is to evaluate my employer’s native XML capabilities, so you can make up your own mind.

Not only does IBM offer a production-quality native XML database for no charge. It also provides publication-quality books about it for free. This makes it very easy for you to get started with XML databases.

DB2 9 Express-C has no data storage limits. You can store as much data as you like in the database. DB2 9 Express-C has no evaluation time limit. You can use it as long as you like. The only limits are that you use a server with a maximum of 2 cores and 2GB of RAM. Download it from the DB2 Express-C Web page.

IBM has published two books about native XML storage. These books are available for purchase. However, you can also freely view these books as HTML or download the PDF version for offline viewing and printing. Click on images below to see these books:

DB2 9 pureXML Guide DB2 9: pureXML Overview and Fast Start

Good luck with your evaluations.

= = = = = = = = = = = = = = = = = = = = = = = = = = =

!! Additional comment, December 2012:

NOTE:  This redbook “DB2 pureXML Guide” was a great resource for DB2 9.1. However, DB2 versions 9.5, 9.7 and 10.1 have added a lot of additional XML features and enhancements that are not covered in this early redbook.

Hence, this redbook is now outdated.

For more up-to-date information on DB2 pureXML, see the DB2 pureXML Cookbook, the redbook “Extremely pureXML in DB2 10 for z/OS“, or the articles listed on this page:
http://www.ibm.com/developerworks/wikis/display/db2xml/Technical+Papers+and+Articles

 

Some of you have been asking for more information about XQuery versus SQL/XML. In particular, it appears that you are interested in understanding the levels of support for common operations. I’ll take a few moments to compare both XQuery and SQL/XML in IBM DB2 9.  However, please note that not all vendors provide the same levels of support. For instance, all vendors do not support sub-document update, and those that do support it may not implement the XQuery standard. So please, before making any decisions, verify the levels of support provided by your vendor.

Operation XQuery SQL/XML Comments
Inserting an XML document No Yes You use SQL to insert an entire XML document.
Retrieving an XML document Yes Yes  
Retrieving part of an XML document Yes Yes  
Using predicates with relational data Yes Yes XQuery does not support relational predicates. However, IBM DB2 supports SQL in XQuery, allowing predicates with relational data.
Using predicates with XML data Yes Yes  
Deleting an XML document No Yes You use SQL to delete an entire XML document.
Updating an XML document Yes Yes  
Updating part of an XML document Yes Yes  
Joining XML data Yes Yes Using XQuery is the easier approach. Using SQL/XML is typically difficult to code.
Joining XML wth relational data Yes Yes XQuery does not support joins to relational data. However, IBM DB2 supports SQL in XQuery, allowing joins to relational data.
Transforming XML Yes Yes Using XQuery is the easier approach. Using SQL/XML is typically difficult to code.
Aggregating XML data Yes Yes Using SQL/XML is the easier approach. Using XQuery is possible with embedded SQL, but is typically difficult to code.
Calling external functions No Yes  
Passing parameter markers No Yes  

At first glance, it may appear that SQL/XML has more extensive support. However, this is in part because logically-speaking certain tasks do not belong in XQuery.  Also note that some tasks are easier to code with XQuery.  This ease of coding can make a significant difference in some environments.

XML-only Databases

May 13, 2008

Some database vendors offer XML-only databases. That is, these vendors offer native XML databases that do not support other types of data.

Such databases work well for isolated XML data. You can consider such databases if you do not expect your XML data to work with other information in your organization. However, if you choose this route, and later want to analyze or use the XML data with other related information, you may at best encounter a costly and troublesome integration effort. Also, keep in mind that there will be performance issues resulting from the data being in non-integrated databases.

Many people say that the recent high tech past was about business automation, and that the impending future is about business optimization. In other words, now that you have used technology to automate many aspects of your business, you will focus on optimizing these systems to gain competitive advantage. To do this, you will want to integrate and leverage all information assets in your environment and thus obtain maximum value from those assets. This is a long-winded way of saying that it is highly likely that you will need to harness all information in your environment–including both XML data and the existing data in other formats. In addition, this existing data is not, and probably never will be, stored in XML. If you use an XML-only database, integrating the XML data with other information assets will likely pose a significant problem.

On a parting note, consider the practicalities of working with multiple interfaces for different databases, multiple data management tools, different hardware and software systems, different maintenance schedules, as well as the lack of advanced data management features in XML-only databases, it all adds up to an unwieldy and expensive proposition.

When you take all this into consideration, is an XML-only database a wise choice?