A 10TB XML Data Warehouse Benchmark
July 14, 2010
In my previous post I promised to write more about the 10TB XML data warehouse benchmark that IBM and Intel performed earlier this year.
Before I get into the details of that benchmarking exercise, let me talk about how and why we got there.
The first wave of XML database applications were (and continue to be) operational systems, often performing message-based transaction processing with XML as the message format. Corresponding to the adoption and growth of such operational XML applications, IBM and Intel have collaborated to verify that current hardware (such as Intel servers) and software (such as DB2) were able to meet the demands of XML transaction processing systems. This started with moderate 50GB and 100GB XML benchmarks in 2006, all the way to the industry’s first 1TB XML benchmark in late 2008.
As companies accumulate large amounts of XML data in their operational system, they realize that they are sitting on a goldmine of information and are eager to use it for reporting and business intelligence purposes. Accustomed to the benefits of mature relational data warehouses, many companies now require and expect the same capabilities for XML data. They want to run complex analytical queries over XML. Therefore we want to establish performance and scalability proofpoints to show that this is indeed possible.
Benchmark goals and methodology
The goal of the 10TB benchmark was to demonstrate linear scalability when an XML data warehouse grows in size. Many data warehouses grow over time as new data keeps being added to the existing data. Running analytical queries over larger amounts of historical data can provide a more accurate understanding of the business and allow more accurate predictions for the future.
The value of linear scalability is the following: you can increase the data volume in your warehouse, add computing resources (CPUs, disks, memory) proportionally to the data, and hence keep the performance of your workload constant. For example, if a query analyzes 2x as much data and is given 2x as much resources, then the elapsed time should be the same.
An equivalent way to think about linear scalability is that you keep the data volume constant and e.g. double the hardware resources to reduce the response times by half.
In this benchmark we ran a decision support workload -consisting of 16 complex analytical SQL/XML queries- on two databases:
(A) 3.33TB of raw XML data and using 1/3 of the available computing resources
(B) 10TB of raw XML data and using all of the computing resources
We expect to see the same performance (response times, throughput) on both databases. On the 10TB database each query processes 3x as much data as on the 3.33TB database. But, with 3x as much resources, this will be accomplished in the same amount of time as on the 3.33TB database. That’s linear scalability.
The 10TB benchmark configuration
DB2 9.7 Fixpack 1, using pureXML and compression, was run on a cluster of three Intel Nehalem EX Servers (Xeon 7500). Each of the servers had four 8-core CPUs and 128GB of memory. The operating system was Linux RHEL 5.4. The storage subsystem was an IBM DS8700 system, housing 48 RAID5 arrays of 8 disks each. The DB2 database partitioning feature was used, running 48 database partitions on this cluster, i.e. 16 on each of the three servers.
The 3.33TB database used only one of the three servers in the clusters, 16 database partitions, and only 16 of the 48 RAID5 arrays.
Data Volume and Workload
We used the XML data generator of the TPoX 2.0 benchmark.
Database A: 3.33 TB of raw data, about 1.83 Billion XML documents:
- 1,666,500,000 orders
- 166,500,000 customers with their accounts
- 20,833 securities
Database B: 10TB of raw data, about 5.5 Billion XML documents:
- 5,000,000,000 orders
- 500,000,000 customers with their accounts
- 20,833 securities
Each of the two databases contained three partitioned tables, one for order documents, one for customer-accounts, and one for securities. About 20 XML indexes were defined to support the decision support workload. The 16 SQL/XML queries included full table scans, grouping and aggregation, OLAP functions, joins across two or all three XML tables, and various combinations of XML predicates. The same 16 queries were run on both databases. The selectivity of each query (in terms of percentage of XML documents touched) is the same in both databases.
The following chart (click it!) compares the response times of the 16 queries in the 3.33TB database (yellow bars) and the 10Tb database (red bars). For each query we can see that the response is approximately the same on both databases, which confirms the linear scalability that is so important in data warehousing.
I have purposefully omitted absolute numbers from the vertical axis in this chart, because this result and this benchmark was not about maximizing absolute performance. It was about scalability, i.e. relative performance between two different scale factors and configurations. For example, some queries could have been even faster -in both databases- if we had used twice as many DB2 database partitions per server.
A summary article on this and other XML database benchmarks can be found in the IBM Data Management Magazine.