How to index XML documents in the presence of schema diversity?
July 11, 2011
XML is a flexible and extensible data format and is often used for messages or other document types whose structure can have multiple different variations or evolves over time. As a result, you might not be dealing with a single fixed document schema but with significant schema variability, or schema diversity.
Many XML database systems allow you to store documents for many different schemas (or different version of the same schema) in a single collection. This is also true for DB2, where a collection is an XML column that is not necessarily tied to a single XML schema. Hence, applications can store, index, query, validate, and update documents for different schemas in a single XML column.
When you validate different documents in an XML column against different XML schemas, DB2 remembers which document was validated against which schema. When you later retrieve a specific document, you can easily determine which schema it was validated against when it was inserted or last updated.
A common question is: How to effectively index XML documents when there is schema variability or schema diversity?
At a high level, DB2 provides three options:
- Define various individual XML indexes for the various elements and attributes that you require to be indexed for good performance
- Index all elements and/or all attributes
- Creating an XML index on the path //text() indexes all text nodes, i.e. basically all element values
- Creating an XML index on the path //@* indexes all attributes
- Use a full-text index that indexes all elements and attributes, using DB2 Text search
The good thing is that DB2 allows you to “index everything” but it doesn’t force you to do so – unlike some other database systems. This is important, because indexing all elements and all attributes in all documents requires a lot of space and impacts the performance of insert, update, and delete operations. Therefore, indexing everything should be a conscious choice.
Although full-text indexing with DB2 Text search can address schema variability, I think the decision to use Text Search should be predominantly based on the type of searches that need to be performed. If the applications commonly perform full-text search for phrases and substrings, or maybe even require fuzzy search, stemming, and other fancy text features, then DB2 Text Search should be used. However, if the data is more structural than textual in nature, and if most predicates use regular =, >, >=, etc. operators, then I would usually prefer a set of regular XML indexes.
A key benefit of XML indexes in DB2 is that they contain index entries only for those document that they match. For example:
Let’s say we have 10 Million XML documents in an XML column, for 10 different schemas, approx. 1M documents per schema. If we define an XML index on an XML element (XPath) that exists in only 1 of the 10 schemas, then this index will hold entries only for 1M documents. If this element is an optional element that occurs in only 20% of the documents for this 1 schema, then this index will hold entries only for 200,000 documents. Thus, this index will be substantially smaller than an index that contains entries for all documents.
As a result, defining a number of specific XML indexes for various elements and attributes is very space efficient! So, if you can identify which nodes need to be indexed for good performance, you can index them in a space-conscious manner.
The index maintenance cost for insert/update/delete operations follows the same guideline as in the relational world: the more indexes you define the more CPU and I/O is being used for update the indexes. But, let’s say I have defined 50 XML indexes on an XML column, and the document that I am inserting contains elements and attributes that match only 5 of those indexes, then only those 5 indexes will be updated when the document is inserted.
A large number of XML indexes defined on a single XML column is not a concern for query performance. The DB2 query compiler is very efficient at determining quickly which index can (and should) be used for a given query.
Also note that XML indexes in DB2 can be defined with wildcards, which gives you a lot of option between defining an index for just one element and defining an XML for all elements. Some examples:
- Let’s say you are indexing purchase orders and you know that all of most the customer address information should be indexed. If you know that the path to the customer address information is /purchaseorder/customer/address but the subtree under the <address> element is subject to variability, you could define an index on /purchaseorder/customer/address//text(), to index all element values in the subtree under <address>.
- If you know that there is only one level of elements under address, you could define an index on /purchaseorder/customer/address/* to index all of these elements, and maybe another index on /purchaseorder/customer/address/*/@* for the attributes of these elements, if any.
- If you know that you need to index the <zipcode> element under <address>, but the <address> can be in different subtrees for different documents, you could define an index on //address/zipcode or /purchaseorder//address/zipcode.
For more information on XML indexes, see
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.xml.doc/doc/c0024039.html (DB2 for Linux, UNIX, Windows)