Structure-agnostic XML indexes in DB2

July 19, 2010

Structure-agnostic XML indexes are indexes that you can define without knowledge about the structure and format of the XML documents that will be stored and queried. That is, the XML element and attribute names and their nesting is not known in advance. In such cases, the future queries and their search conditions are often also unknown.

Whether you need structure-agnostic XML indexes or not depends on your XML application. Let’s consider two cases:

1. Known XML Schema (often good for path-specific indexes):

Many XML applications have an XML Schema that defines the expected and permissible XML document format, i.e. you do know the XML element/attribute names and their nesting ahead of time. With some knowledge of the application you can then make an educated guess which elements or attributes are likely to be used in search conditions or join predicates.

With this knowledge you can define path-specific XML indexes, to index just those elements and attributes that likely appear in predicates. This is similar to the common practice for relational database tables, where you don’t index all columns but only those that are frequently constrained in queries.

For example, if you expect predicates on the zipcode of customer address information, and you know that the path is /customer/addr/zipcode, then you can define the foll0wing index in DB2:

create index zipIdx on customer(xmldoc) generate keys using
xmlpattern '/customer/addr/zipcode' as sql varchar(20);

DB2 also offers an XML index advisor that looks at an XML workload (a set of XML queries) and recommends a set of beneficial XML indexes for specific elements and attributes.

Path-specific indexes have the benefit that you index only what needs to be indexed, which conserves storage space and supports high insert/update/delete performance for XML documents.

Interestingly, Microsoft SQL Server does not support path-specific XML indexes. It forces you to index ALL elements and ALL attributes, even if you require only some of them to be indexed.

2. Unknown XML Schema (structure-agnostic indexes):

If the schema and expected queries are unknown, then you can choose to “index everything”. For example, in DB2 you can define the following two indexes. The first index covers the text nodes of all XML elements, and the second index covers all XML attributes.

create index elemIdx on customer(xmldoc) generate keys using
xmlpattern '//text()' as sql varchar(7985);

create index attrIdx on customer(xmldoc) generate keys using
xmlpattern '//@*' as sql varchar(7985);

Subsequently, query predicates on an element’s text node or on any attribute can use one of these indexes.

These two indexes are limited to element and attribute values of up to 7985 bytes per value, which is more than sufficient for most applications. Alternatively you can also define these indexes as VARCHAR HASHED, to index strings of unlimited length via hashing. Either way, these indexes are always maintained synchronously, i.e. upon or before the commit of any insert, update, or delete operation on the collection of XML documents.

Another option for structure-agnostic XML indexing is the DB2 Full-Text search feature, which I wrote about in February.

Structure-agnostic indexes that “index everything” are somewhat more expensive than selective path-specific indexes in terms of space and insert/update/delete performance. But, that’s the price for complete flexibility and for supporting unknown queries over unknown documents.

For more information on XML indexing in DB2, see the article on index exploitation or Chapter 13 in the DB2 pureXML Cookbook.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: