Use Hash Partitioning for Fast Analysis of XML Data
May 1, 2009
DB2 raised some eyebrows when it used Intel-based hardware to process more than 6700 XML-based transactions per second in the TPoX benchmark. Some people who—for performance reasons—had previously discounted XML as a data format, decided to take another look. It became apparent that, with recent advances in server hardware and database software, good performance is possible for XML-based transactional systems.
Before we knew it, these same people were asking for performance breakthroughs for analytical workloads. Now that they had information in XML format and they could work with that XML data in native format, they wanted to be able to analyze the information in the XML data. In other words, they wanted to be able to extract business intelligence directly from the native XML data. To help these organizations, IBM added a number of new features that improve performance for analytical workloads. In this blog post, I will talk about the addition of pureXML support to database partitioning (which has previously been called Database Partitioning Feature or DPF).
If you need to analyze large amounts of native XML data, IBM offers InfoSphere Warehouse (which is powered by DB2). When you have very large amounts of data in the warehouse, database partitioning helps you scale the data warehouse by spreading the data across multiple database partitions. When storing a row of data, database partitioning uses the value of a hash function to determine the partition in which to place the data. Each row of data is stored only in one partition.
When InfoSphere Warehouse receives a SQL or XQuery request, one of the partitions coordinates the query response. It instructs each partition to execute the query and it assembles the responses for the query result. Because each partition executes the query in parallel (in other words, at the same time), you get fast responses to queries against large amounts of XML data.
Database partitions can be on the same machine, or they can be on separate machines.
If you have multiple partitions on the same machine, you will typically have at least one processor core for each partition. For example, let’s say you have a single machine with two dual-core processors. That is, you have a total of four processor cores, then you may decide to have 4 partitions. Of course, you could configure more partitions than you have processor cores, but you wouldn’t enjoy an performance gains due to parallel processing.
When you have partitions on different machines, InfoSphere Warehouse uses a shared-nothing architecture to scale to hundreds or even thousands of machines. Let’s say you choose to have eight machines, where each machine has 4 processor cores. In such a case, you could choose to have 32 partitions (eight machines multiplied by four cores). With such a shared-nothing architacture, you can start your warehouse with moderate hardware, and add additional machines as your data volume grows.
You can use database partitioning to speed the loading, inserting, querying, updating, deleting, validating, and publishing of XML data. Database partitioning is ideal for situations where you have very large volumes of XML data, or for situations in which you have complex analytical queries that take a long time.
Here is some feedback from participants of the Early Access Program:
“By using XML to store the data and XQuery to query it, we have been able to run complex queries over massive amounts of data and provide researchers timely access to patient’s detailed medical records in a very manageable and understandable format. We are also in a better position to identify patients that meet the criteria for clinical studies and adjust their course of treatment at the appropriate time. We are now excited about the scalability advantages that the DPF features of DB2 will deliver. The parallelization of queries and inserts of the XML data will allow our researchers to gain access to the data they require even faster, which gives them an advantage in requesting grants and doing further research.”
—Tom Holdener, BJC HealthCare
“UCLA Medical Center has been leveraging DB2 pureXML to keep more unstructured patient records online and provide more comprehensive health care. The hospital is experiencing the compression benefit reducing the amount of storage space for patient’s medical records in XML by 50% today. In addition to the UCLA enterprise-wide medical record repository, there are three clinical applications currently under development and being re-architected to implement the DB2 9.7-scalable XML features at the core as a result of extensive collaboration with the IBM development team. The enhancements will allow ULCA to generate business intelligence using XML data to help meet patient care needs.”
—Charles M. Wang, UCLA Health System