Multi-Dimensional Clustering for Tables with Native XML Data

May 3, 2009

This continues my series of posts about the new features for working with native XML data in the IBM DB2 database software.

In my previous blog post, I discussed using hash partitioning and shared-nothing architectures to improve query response times when you have large amounts of XML data. Database partitioning essentially uses a divide-and-conquer approach by spreading the data storage and query processing workloads across multiple database nodes. Another approach to improving query response times is Multi-Dimensional Clustering (MDC). MDC is an existing DB2 and InfoSphere Warehouse feature. The latest release of these products allows you to use MDC with database tables that contain native XML data.

With MDC, the database physically organizes the data to improve performance for certain types of queries. You can organize the data according to the values in one or more database columns (or dimensions as they are called in MDC). However, MDC really shines in situations where there are queries against multiple database columns. You can use both partitioning and MDC on the same database.

When setting up MDC, you specify the database columns that are commonly referenced in queries. MDC then synchronizes the physical order of data in the database with the order of values for the chosen dimensions. It also maintains special indexes that are much smaller than regular record-based indexes (and therefore can be scanned faster). In other words, you determine the information that is most likely to be used in database queries, and MDC organizes the database table to get fast performance for queries against that information.

To illustrate, consider a situation where an application frequently accesses sales information from a large table by product, region, and date. To support queries against these dimensions, an administrator can use MDC to physically organize the database rows by the values of these product, region, and date dimensions. In other words, the database “clusters” the database rows according to the values in the product, region, and time columns. Because the rows are organized in “clusters” like this, it takes less time to answer database queries because the database needs to visit less database pages to find the relevant values.

In warehouse environments, you often need to load data from an external source system (which is called roll-in) and delete data that is no longer needed for analysis (which is called roll-out). Taking advantage of the unique data layout of MDC tables allows DB2 to optimize roll-in and roll-out operations.

In attempting to describe MDC in the simplest possible terms, I have presented only the highest-level details about this feature. If you want to learn exactly what happens at the database cell, block, and page level, please read the DB2 documentation for more details.


One Response to “Multi-Dimensional Clustering for Tables with Native XML Data”

  1. Qi Jin Says:

    Conor, thanks for the detailed post. Just one add, for MDC with XML, the clustering columns are still relational, the XML can be “payload” of MDC. In this case, you also have inlined and non-inlined. In inlined case, the XML data will be sitting inside MDC cells, thus enjoys the benefit of clustering during I/O fetch. If XML is not inlined, then they will sit in XDA, with references in MDC cells. In this case, impact to relational MDC is minimized.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: