Here is a short video showing Matthias Nicola speaking about XML in the data warehouse at the IDUG conference. He talks about the new features in DB2 that support native XML data in data warehouse environments. Apologies for the choppy nature of the video. It was taken by hand with my inexpensive pocket camcorder. You can click on the HQ button in the YouTube viewer to see the higher quality version.

Attendees of the upcoming Information on Demand conference in Berlin will get a Poken with their conference badge. Poken is a new tool that offers a smart way to network and share data.

When you insert the Poken into the USB port of a computer, you are connected to the Poken website where you fill in your personal data and create a profile, including links to your profiles on social networking sites like Linked-In, Facebook, Twitter, Myspace, and so on. This will then be stored on your Poken.

To network with someone at the conference, simply hold your Poken up to theirs and exchange IDs, creating an electronic handshake. The next time you connect your Poken to a computer and go to your profile page, you will see all the profiles of exchanged IDs together with their Social Networking sites. This enables you to easily stay in contact post-conference. No more dog-eared business cards surfacing weeks later!

Not only that, but you can also use your Poken to facilitate information download. For example, you can easily obtain session information and session presentations. This is my virgin Poken experience at a conference, so I am really curious to see how it works out…

Greetings from the IDUG North America Conference in Denver, Colorado. IDUG is the International DB2 Users Group—an independent, not-for-profit organization for DB2 users by DB2 users. If you are a DB2 user, IDUG provide an invaluable resource. Here is a video showing a few short glimpses from Day 1 of the conference:

This continues my series of posts about the new features for working with native XML data in the IBM DB2 database software.

Compression reduces the amount of storage space needed for data. Data storage costs money, so minimizing this cost is very important for many organizations. Especially when storage costs can be reduced by 60% to 80%. Storage-related costs include the actual storage devices themselves, the power consumed by those storage devices, and the time spent maintaining these devices.

Another benefit of data compression is that it often improves database performance. Because the data requires less disk space, you typically have reduced levels of disk I/O activity, which can improve database performance. Also, because more data is being cached, you may also enjoy improved buffer pool hit ratios. In many cases, the performance gain due to reduced I/O and better memory utilization outweighs the extra CPU cycles required to compress and decompress the data.

When storing XML data, DB2 typically places the XML data in a location called the XML Data Area (XDA). However, if the XML data is less than 32KB in size, it can be stored with the relational data (this is called inlining).

With DB2 9.5, you can compress XML data that is inlined, allowing you to reduce storage for XML data. For instance, the XML transactions in the TPoX benchmark are typically smaller than 32k, allowing them to be inlined and compressed. In the most recent TPoX benchmark, one terabyte of raw XML data is stored in 390 gigabytes of storage, giving a compression ratio of 61%.

DB2 9.7 extends compression to all XML data, regardless of whether it is in the XDA or inlined. In other words, DB2 9.7 can compress XML data, regardless of size. (The maximum size of an individual piece of XML data that can be stored in DB2 is 2 gigabytes.)

The degree to which XML data can be compressed depends on the nature of the XML data. IBM has tested the new data compression features with six different data sets. Three of these data sets were supplied by IBM clients, and represent real world client usage. The other three data sets represent XML data sets available in the public domain. The data sets include XML documents that range in size from 2KB to 100MB. The following diagram shows the storage savings that have been achieved (this diagram is from Cindy Saracco and Matthias Nicola’s article titled Enhance business insight and scalability of XML data with new DB2 V9.7 pureXML features).

As you can see, compressing XML data typically results in 60 to 80 percent disk space savings with DB2 9.7.

Finally, I’d also like to mention that if you compress XML data, you can also compresses any indexes for that XML data. Compressed indexes also reduce physical I/O and increase buffer pool hit ratios, which often leads to a net performance gain.

This continues my series of posts about the new features for working with native XML data in the IBM DB2 database software.

In my previous blog post, I discussed using hash partitioning and shared-nothing architectures to improve query response times when you have large amounts of XML data. Database partitioning essentially uses a divide-and-conquer approach by spreading the data storage and query processing workloads across multiple database nodes. Another approach to improving query response times is Multi-Dimensional Clustering (MDC). MDC is an existing DB2 and InfoSphere Warehouse feature. The latest release of these products allows you to use MDC with database tables that contain native XML data.

With MDC, the database physically organizes the data to improve performance for certain types of queries. You can organize the data according to the values in one or more database columns (or dimensions as they are called in MDC). However, MDC really shines in situations where there are queries against multiple database columns. You can use both partitioning and MDC on the same database.

When setting up MDC, you specify the database columns that are commonly referenced in queries. MDC then synchronizes the physical order of data in the database with the order of values for the chosen dimensions. It also maintains special indexes that are much smaller than regular record-based indexes (and therefore can be scanned faster). In other words, you determine the information that is most likely to be used in database queries, and MDC organizes the database table to get fast performance for queries against that information.

To illustrate, consider a situation where an application frequently accesses sales information from a large table by product, region, and date. To support queries against these dimensions, an administrator can use MDC to physically organize the database rows by the values of these product, region, and date dimensions. In other words, the database “clusters” the database rows according to the values in the product, region, and time columns. Because the rows are organized in “clusters” like this, it takes less time to answer database queries because the database needs to visit less database pages to find the relevant values.

In warehouse environments, you often need to load data from an external source system (which is called roll-in) and delete data that is no longer needed for analysis (which is called roll-out). Taking advantage of the unique data layout of MDC tables allows DB2 to optimize roll-in and roll-out operations.

In attempting to describe MDC in the simplest possible terms, I have presented only the highest-level details about this feature. If you want to learn exactly what happens at the database cell, block, and page level, please read the DB2 documentation for more details.

DB2 raised some eyebrows when it used Intel-based hardware to process more than 6700 XML-based transactions per second in the TPoX benchmark. Some people who—for performance reasons—had previously discounted XML as a data format, decided to take another look. It became apparent that, with recent advances in server hardware and database software, good performance is possible for XML-based transactional systems.

Before we knew it, these same people were asking for performance breakthroughs for analytical workloads. Now that they had information in XML format and they could work with that XML data in native format, they wanted to be able to analyze the information in the XML data. In other words, they wanted to be able to extract business intelligence directly from the native XML data. To help these organizations, IBM added a number of new features that improve performance for analytical workloads. In this blog post, I will talk about the addition of pureXML support to database partitioning (which has previously been called Database Partitioning Feature or DPF).

If you need to analyze large amounts of native XML data, IBM offers InfoSphere Warehouse (which is powered by DB2). When you have very large amounts of data in the warehouse, database partitioning helps you scale the data warehouse by spreading the data across multiple database partitions. When storing a row of data, database partitioning uses the value of a hash function to determine the partition in which to place the data. Each row of data is stored only in one partition.

When InfoSphere Warehouse receives a SQL or XQuery request, one of the partitions coordinates the query response. It instructs each partition to execute the query and it assembles the responses for the query result. Because each partition executes the query in parallel (in other words, at the same time), you get fast responses to queries against large amounts of XML data.

Database partitions can be on the same machine, or they can be on separate machines.

If you have multiple partitions on the same machine, you will typically have at least one processor core for each partition. For example, let’s say you have a single machine with two dual-core processors. That is, you have a total of four processor cores, then you may decide to have 4 partitions. Of course, you could configure more partitions than you have processor cores, but you wouldn’t enjoy an performance gains due to parallel processing.

When you have partitions on different machines, InfoSphere Warehouse uses a shared-nothing architecture to scale to hundreds or even thousands of machines. Let’s say you choose to have eight machines, where each machine has 4 processor cores. In such a case, you could choose to have 32 partitions (eight machines multiplied by four cores). With such a shared-nothing architacture, you can start your warehouse with moderate hardware, and add additional machines as your data volume grows.

You can use database partitioning to speed the loading, inserting, querying, updating, deleting, validating, and publishing of XML data. Database partitioning is ideal for situations where you have very large volumes of XML data, or for situations in which you have complex analytical queries that take a long time.

Here is some feedback from participants of the Early Access Program:

“By using XML to store the data and XQuery to query it, we have been able to run complex queries over massive amounts of data and provide researchers timely access to patient’s detailed medical records in a very manageable and understandable format. We are also in a better position to identify patients that meet the criteria for clinical studies and adjust their course of treatment at the appropriate time. We are now excited about the scalability advantages that the DPF features of DB2 will deliver. The parallelization of queries and inserts of the XML data will allow our researchers to gain access to the data they require even faster, which gives them an advantage in requesting grants and doing further research.”
Tom Holdener, BJC HealthCare

“UCLA Medical Center has been leveraging DB2 pureXML to keep more unstructured patient records online and provide more comprehensive health care. The hospital is experiencing the compression benefit reducing the amount of storage space for patient’s medical records in XML by 50% today. In addition to the UCLA enterprise-wide medical record repository, there are three clinical applications currently under development and being re-architected to implement the DB2 9.7-scalable XML features at the core as a result of extensive collaboration with the IBM development team. The enhancements will allow ULCA to generate business intelligence using XML data to help meet patient care needs.”
Charles M. Wang, UCLA Health System