I have previously covered how DB2 9.7 supports native XML data with hash partitioning (database partitioning), range partitioning (table partitioning), and multi-dimensional clustering. These new features make it feasible to analyze information in native XML format, side-by-side with relational data, in a data warehouse. And, of course, being able to work with native XML data in such scenarios offers many efficiencies and advantages.
In reality, many data warehouse projects involve pulling different types of information from disparate data sources around an organization. My colleagues have published the first in a series of two articles that provide step-by-step instructions for integrating information from such disparate sources into a data warehouse. The first of those articles is now available at IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate XML operational data into a data warehouse.
This article tells you how to use IBM® InfoSphere™ DataStage to extract and transform XML data managed by DB2® pureXML®. It also explores how DataStage can load this data into a table with traditional SQL data types, and a table with both relational and XML columns. The article includes sample scripts and data that you can download.
The second part of this article series will explore another important scenario: using DataStage to read information from a flat file, convert the data into an XML format, and load this XML data into a data warehouse that contains a table with a DB2 pureXML column.
DB2 pureXML Cookbook – 45% Discount
August 13, 2009
In the past, I discussed the DB2 pureXML Cookbook. This book is very valuable for all DB2 pureXML users, from novice through expert.
If you are interested in buying this book, please be aware that International DB2 User Group (IDUG) members get a 45% discount on IBM Press books. IDUG membership is free. For information about how to get the discount, visit the following Web page: http://www.idug.org/public-spotlights/45-book-discount.html
Also, as a special promotion, IDUG have a competition where they are giving away 3 copies of this book as prizes. For information about entering to win a free copy of the book, visit the following Web page: http://www.idug.org/public-spotlights/free-db2-book.html
SOA Projects: IBM DB2 versus Oracle Database
August 7, 2009
If you are implementing a SOA environment, Solitaire has a very interesting finding for you. Solitaire authored a whitepaper where they analyze database operations at more than 4,100 production systems. As part of their analysis of database operations on IBM System p, they looked at the correlation between the success rate of SOA projects and the choice of database software.
To classify a SOA project as successful, they asked the organization if they now enjoy a 25% or more increase in resource utilization and a 30% or more increase in the speed of provisioning. Here is a chart that shows the relative success rates for SOA projects that involve IBM DB2 and Oracle Database. Solitaire do not say why DB2 does so much better. Perhaps DB2’s superior native XML storage is a factor?
You can read the full Solitaire Report at Whitepaper: DB2 Performance on IBM System p® and System x®.
Almost 24 million people in the US are diagnosed with diabetes. If you know someone with diabetes, you know about the hassles that constant monitoring imposes. MyCareTeam and IBM have collaborated to improve continuous monitoring in such situations, with a solution that both reduces costs and improves the quality of healthcare. I am particularly interested in this collaboration because it involves the use of XML data. IBM and MyCareTeam have written a great paper that covers a number of topics that will be of interest to those in the diabetes and healthcare technologies fields. For instance, there is information about the use of technologies like XML storage and Web services in the context of continuing care. There is also information about related initiatives such as the Continua Health Alliance’s role in selecting appropriate standards. You can read more at Healthcare in the Home: Continuing Care for Diabetes with Collaborative Technologies.
Article about XML for Healthcare
July 17, 2009
The latest issue of IBM Database Magazine has an interesting article titled Healthcare’s XML Heartbeat. In this article, Ken North describes the rise and rise of XML in the healthcare industry. He talks about the key role that XML is playing in the emergence of electronic medical records, the efficient exchange of information, and increasing levels of interoperability. The article gives great insight into the XML-based electronic medical records environment at UCLA Health System.
Why won’t Oracle publish results for the Transaction Processing over XML (TPoX) benchmark?
We know that Oracle has implemented TPoX demonstration and test systems. Oracle has demonstrated TPoX systems at their conferences. Also, Oracle has included TPoX tests and data in their research efforts and as part of their X-Files demonstration. So we know that Oracle has used TPoX. Why won’t they publish benchmark results?
Oracle claims that the TPoX benchmark is narrowly scoped and that it doesn’t handle the diverse use cases of XML. They are correct in that TPoX does not model multiple scenarios. It models only one scenario… a security trading scenario that uses a real-world XML Schema (FIXML). Such a scenario involves a high volume of relatively small XML documents. The benchmark takes into account write, update, delete, indexing, XML schema, logging, concurrency, and other database considerations. While the TPoX benchmark does indeed model only one scenario, it makes sure to incorporate a real-world mix of XML-related operations for that scenario.
Database benchmarks are always focused on a specific usage scenario, and TPoX is no exception. Relational database benchmarks have always taken the same approach: TPC-C focuses on OLTP systems, TPC-W on web-based transaction systems, TPC-H on ad-hoc decision support systems, TPC-R on decision support systems with precomputed and materialized views. There are database benchmarks that focus on SAP workloads, and so on. The reason for this approach is that combining all these diverse use cases into a single benchmark would lead to a test scenario that does not represent anything in the real world. In the same spirit, TPoX focuses on just one of various common XML use cases. Other XML benchmarks that focus on other use cases, such as XML content and full-text search, are also desirable but yet to be defined.
TPoX is entirely open-source (with major contributions from Intel and IBM). In TPoX 1.3 contributors from the University of Furtwangen in Germany have added initial support for Oracle Database and Microsoft SQL Server. In particular, they adjusted the TPoX queries to support Oracle Database and SQL Server syntax, and they have extended the TPoX workload driver so it connects to Oracle Database and Microsoft SQL Server. Anybody, including Oracle, is welcome to enhance, revise, or modify the TPoX benchmark as they deem appropriate for meaningful benchmarking.
The TPoX benchmark is a useful measuring stick for the many organizations who have transactional systems with small XML documents. I am amused that Oracle, on the one hand continually highlights the need for separately handling the diverse XML uses cases, and then on the other hand complains that TPoX handles only one use case and not a diverse range of use cases. Don’t they realize that they are contradicting themselves
Oracle also claims that TPoX attempts to follow the Transaction Processing Performance Council (TPC) approach, and that the TPC approach deviates from production system workloads. It is true that many people, including myself, consider some of the TPC benchmarks to have flaws. However, they still serve a purpose for people who are evaluating database options. Although the benchmarks are not a direct indication of a performance in an end user’s environment, they are still a useful tool for indicating relative performance.
I am not aware of any any alternative XML benchmarks proposed by Oracle. If Oracle has an XML benchmark that they believe is better, it would be great for everyone in the industry if they would bring it forward. Everyone would benefit from such a move, especially the people who are trying to evaluate their XML storage options. I am curious to know why Oracle hasn’t published any benchmark results for its XML capabilities, and instead focuses its efforts on debates that are difficult to resolve. IBM has published both TPoX benchmarks results and internal benchmark results. When are Oracle going to step up to the plate?
Oracle has long claimed that the fact that Oracle Database has multiple different ways to store XML data is an advantage. At last count, I think they have something like seven different options:
- Unstructured
- XML-Object-Relational, where you store repeating elements in CLOBs
- XML-Object-Relational, where you store repeating elements in VARRAY as LOBs
- XML-Object-Relational, where you store repeating elements in VARRAY as nested tables
- XML-Object-Relational, where you store repeating elements in VARRAY as XMLType pointers to BLOBs
- XML-Object-Relational, where you store repeating elements in VARRAY as XMLType pointers to nested tables
- XML-Binary
Their argument is that XML has diverse use cases and you need different storage methods to handle those diverse use cases. I don’t know about you, but I find this list to be a little bewildering. How do you decide among the options? And what happens if you change your mind and want to change storage method?
But back to my original question… Why don’t Oracle publish results for the TPoX benchmark? Perhaps it is because Oracle are still trying to figure out which of their seven storage options is best to use
Building an XML-Based Electronic Forms Solution
June 16, 2009
Many organizations are putting their forms “online”. If you are working on an electronic forms project, I’d like to let you know about a couple of useful resources that my colleague Bryan Patterson has been busy creating:
- A step-by-step tutorial that shows you how to create an electronic forms-based solution. You don’t need to purchase any software to get this demo working in your environment. It uses the trial version of Lotus® Forms to create and manage the online forms; it uses the no-charge version of DB2® Express-C to receive and store the XML data; it uses the no-charge version of IBM Data Studio Developer to create a simple Web service; and it uses the no-charge WebSphere Application Server Community Edition. To see the tutorial, go to Build an intelligent eForms solution based on DB2 pureXML, Lotus Forms, and Web services.
- A video that provides an overview for the above solution and walks through the step-by-step tutorial. To see the video, go to Create an electronic form solution with DB2 pureXML and Lotus Forms.
Make sure to check out the list of resources for both of these… they contain some useful links.
Flirting with Poken
June 9, 2009
It seems like every conference I go to this year has successively stronger ties to virtual networking. There are increasing levels of Twitter activity being shown on giant displays. It’s interesting to watch conference attendees routinely ignore giant displays, until they realize that one is showing real-time tweets about the conference, and then stopping in their tracks to take in the twitter stream.
The advent of blogs dedicated to individual sessions is also interesting. These blogs provide an unintimidating venue for people who are not comfortable asking questions in a large room. They also allow a conversation to continue after the conference. Although, the traffic to them is quite limited. I imagine that we are all struggling to keep up with the bandwidth demands of these new networking tools.
But my most interesting recent experience was with the Poken goven to all conference attendees at the recent IBM Information on Demand conference in Berlin. The Poken allows you to exchange a “digital handshake” with other conference attendees. By touching Pokens, you exchange contact details, including information about your accounts on popular social networking sites like LinkedIn, Twitter, and Facebook. You can also exchange details with a special Poken to get the presentation for the session you are attending.
The first thing I must say is that I was very happy to see a Poken help desk after registering. I did need a little help because initial attempts to exchange information were not succeeding. I was not giving my Poken enough time to exchange details.
The second thing I must say is that, after using the Poken for the entire conference, I still want to give out business cards. You see, after I get a business card, I find a few minutes to write a few notes on that business card. This way I have some additional context when I return from a conference with another stack of business cards. Thankfully I continued to do this at the conference because all I get with the poken is a sequential list of new Poken friends. I can see their profile, including a photo if they uploaded one. But it does not give me enough context for a follow up (unless they are especially memorable).
So, in my opinion, the Poken is a useful addition to the business card, but it will not replace it. You may argue that I could simply write some notes elsewhere and keep track of them. But there something very easy about writing a few key words on the back of a business card. For me the best solution would be to use my smartphone to easily “poke” people. Then, if I could add notes to newly acquired “business cards” on my smartphone, I could truely consider replacing the business card.
DB2 pureXML for Dummies—Get Your Copy!
June 3, 2009
If you want to learn more about native XML databases and DB2 pureXML, this eBook uses the fun and easy-to-understand “for Dummies” format to do just that. It introduces these topics, while guiding you to create your first native XML database. You don’t need to purchase any software to get started creating native XML databases—you simply use the freely available version of DB2. To get your copy, download DB2 pureXML for Dummies. Make sure to download it today as there are a limited number of free downloads available.
Matthias Nicola on XML in the Data Warehouse
May 28, 2009
Here is a short video showing Matthias Nicola speaking about XML in the data warehouse at the IDUG conference. He talks about the new features in DB2 that support native XML data in data warehouse environments. Apologies for the choppy nature of the video. It was taken by hand with my inexpensive pocket camcorder. You can click on the HQ button in the YouTube viewer to see the higher quality version.
Attendees of the upcoming Information on Demand conference in Berlin will get a Poken with their conference badge. Poken is a new tool that offers a smart way to network and share data.
When you insert the Poken into the USB port of a computer, you are connected to the Poken website where you fill in your personal data and create a profile, including links to your profiles on social networking sites like Linked-In, Facebook, Twitter, Myspace, and so on. This will then be stored on your Poken.
To network with someone at the conference, simply hold your Poken up to theirs and exchange IDs, creating an electronic handshake. The next time you connect your Poken to a computer and go to your profile page, you will see all the profiles of exchanged IDs together with their Social Networking sites. This enables you to easily stay in contact post-conference. No more dog-eared business cards surfacing weeks later!
Not only that, but you can also use your Poken to facilitate information download. For example, you can easily obtain session information and session presentations. This is my virgin Poken experience at a conference, so I am really curious to see how it works out…
Short Video from the IDUG North America Confernece
May 13, 2009
Greetings from the IDUG North America Conference in Denver, Colorado. IDUG is the International DB2 Users Group—an independent, not-for-profit organization for DB2 users by DB2 users. If you are a DB2 user, IDUG provide an invaluable resource. Here is a video showing a few short glimpses from Day 1 of the conference:
DB2 Compresses XML Data by 60% to 80%
May 6, 2009
This continues my series of posts about the new features for working with native XML data in the IBM DB2 database software.
Compression reduces the amount of storage space needed for data. Data storage costs money, so minimizing this cost is very important for many organizations. Especially when storage costs can be reduced by 60% to 80%. Storage-related costs include the actual storage devices themselves, the power consumed by those storage devices, and the time spent maintaining these devices.
Another benefit of data compression is that it often improves database performance. Because the data requires less disk space, you typically have reduced levels of disk I/O activity, which can improve database performance. Also, because more data is being cached, you may also enjoy improved buffer pool hit ratios. In many cases, the performance gain due to reduced I/O and better memory utilization outweighs the extra CPU cycles required to compress and decompress the data.
When storing XML data, DB2 typically places the XML data in a location called the XML Data Area (XDA). However, if the XML data is less than 32KB in size, it can be stored with the relational data (this is called inlining).
With DB2 9.5, you can compress XML data that is inlined, allowing you to reduce storage for XML data. For instance, the XML transactions in the TPoX benchmark are typically smaller than 32k, allowing them to be inlined and compressed. In the most recent TPoX benchmark, one terabyte of raw XML data is stored in 390 gigabytes of storage, giving a compression ratio of 61%.
DB2 9.7 extends compression to all XML data, regardless of whether it is in the XDA or inlined. In other words, DB2 9.7 can compress XML data, regardless of size. (The maximum size of an individual piece of XML data that can be stored in DB2 is 2 gigabytes.)
The degree to which XML data can be compressed depends on the nature of the XML data. IBM has tested the new data compression features with six different data sets. Three of these data sets were supplied by IBM clients, and represent real world client usage. The other three data sets represent XML data sets available in the public domain. The data sets include XML documents that range in size from 2KB to 100MB. The following diagram shows the storage savings that have been achieved (this diagram is from Cindy Saracco and Matthias Nicola’s article titled Enhance business insight and scalability of XML data with new DB2 V9.7 pureXML features).

As you can see, compressing XML data typically results in 60 to 80 percent disk space savings with DB2 9.7.
Finally, I’d also like to mention that if you compress XML data, you can also compresses any indexes for that XML data. Compressed indexes also reduce physical I/O and increase buffer pool hit ratios, which often leads to a net performance gain.
This continues my series of posts about the new features for working with native XML data in the IBM DB2 database software.
In my previous blog post, I discussed using hash partitioning and shared-nothing architectures to improve query response times when you have large amounts of XML data. Database partitioning essentially uses a divide-and-conquer approach by spreading the data storage and query processing workloads across multiple database nodes. Another approach to improving query response times is Multi-Dimensional Clustering (MDC). MDC is an existing DB2 and InfoSphere Warehouse feature. The latest release of these products allows you to use MDC with database tables that contain native XML data.
With MDC, the database physically organizes the data to improve performance for certain types of queries. You can organize the data according to the values in one or more database columns (or dimensions as they are called in MDC). However, MDC really shines in situations where there are queries against multiple database columns. You can use both partitioning and MDC on the same database.
When setting up MDC, you specify the database columns that are commonly referenced in queries. MDC then synchronizes the physical order of data in the database with the order of values for the chosen dimensions. It also maintains special indexes that are much smaller than regular record-based indexes (and therefore can be scanned faster). In other words, you determine the information that is most likely to be used in database queries, and MDC organizes the database table to get fast performance for queries against that information.
To illustrate, consider a situation where an application frequently accesses sales information from a large table by product, region, and date. To support queries against these dimensions, an administrator can use MDC to physically organize the database rows by the values of these product, region, and date dimensions. In other words, the database “clusters” the database rows according to the values in the product, region, and time columns. Because the rows are organized in “clusters” like this, it takes less time to answer database queries because the database needs to visit less database pages to find the relevant values.
In warehouse environments, you often need to load data from an external source system (which is called roll-in) and delete data that is no longer needed for analysis (which is called roll-out). Taking advantage of the unique data layout of MDC tables allows DB2 to optimize roll-in and roll-out operations.
In attempting to describe MDC in the simplest possible terms, I have presented only the highest-level details about this feature. If you want to learn exactly what happens at the database cell, block, and page level, please read the DB2 documentation for more details.
DB2 raised some eyebrows when it used Intel-based hardware to process more than 6700 XML-based transactions per second in the TPoX benchmark. Some people who—for performance reasons—had previously discounted XML as a data format, decided to take another look. It became apparent that, with recent advances in server hardware and database software, good performance is possible for XML-based transactional systems.
Before we knew it, these same people were asking for performance breakthroughs for analytical workloads. Now that they had information in XML format and they could work with that XML data in native format, they wanted to be able to analyze the information in the XML data. In other words, they wanted to be able to extract business intelligence directly from the native XML data. To help these organizations, IBM added a number of new features that improve performance for analytical workloads. In this blog post, I will talk about the addition of pureXML support to database partitioning (which has previously been called Database Partitioning Feature or DPF).
If you need to analyze large amounts of native XML data, IBM offers InfoSphere Warehouse (which is powered by DB2). When you have very large amounts of data in the warehouse, database partitioning helps you scale the data warehouse by spreading the data across multiple database partitions. When storing a row of data, database partitioning uses the value of a hash function to determine the partition in which to place the data. Each row of data is stored only in one partition.
When InfoSphere Warehouse receives a SQL or XQuery request, one of the partitions coordinates the query response. It instructs each partition to execute the query and it assembles the responses for the query result. Because each partition executes the query in parallel (in other words, at the same time), you get fast responses to queries against large amounts of XML data.
Database partitions can be on the same machine, or they can be on separate machines.
If you have multiple partitions on the same machine, you will typically have at least one processor core for each partition. For example, let’s say you have a single machine with two dual-core processors. That is, you have a total of four processor cores, then you may decide to have 4 partitions. Of course, you could configure more partitions than you have processor cores, but you wouldn’t enjoy an performance gains due to parallel processing.
When you have partitions on different machines, InfoSphere Warehouse uses a shared-nothing architecture to scale to hundreds or even thousands of machines. Let’s say you choose to have eight machines, where each machine has 4 processor cores. In such a case, you could choose to have 32 partitions (eight machines multiplied by four cores). With such a shared-nothing architacture, you can start your warehouse with moderate hardware, and add additional machines as your data volume grows.
You can use database partitioning to speed the loading, inserting, querying, updating, deleting, validating, and publishing of XML data. Database partitioning is ideal for situations where you have very large volumes of XML data, or for situations in which you have complex analytical queries that take a long time.
Here is some feedback from participants of the Early Access Program:
“By using XML to store the data and XQuery to query it, we have been able to run complex queries over massive amounts of data and provide researchers timely access to patient’s detailed medical records in a very manageable and understandable format. We are also in a better position to identify patients that meet the criteria for clinical studies and adjust their course of treatment at the appropriate time. We are now excited about the scalability advantages that the DPF features of DB2 will deliver. The parallelization of queries and inserts of the XML data will allow our researchers to gain access to the data they require even faster, which gives them an advantage in requesting grants and doing further research.”
—Tom Holdener, BJC HealthCare
“UCLA Medical Center has been leveraging DB2 pureXML to keep more unstructured patient records online and provide more comprehensive health care. The hospital is experiencing the compression benefit reducing the amount of storage space for patient’s medical records in XML by 50% today. In addition to the UCLA enterprise-wide medical record repository, there are three clinical applications currently under development and being re-architected to implement the DB2 9.7-scalable XML features at the core as a result of extensive collaboration with the IBM development team. The enhancements will allow ULCA to generate business intelligence using XML data to help meet patient care needs.”
—Charles M. Wang, UCLA Health System
Fast Performance for Native XML Data in the Warehouse
April 22, 2009
Today, IBM announced new releases of DB2 and InfoSphere Warehouse that support high-performance analytic workloads for native XML data. New features include:
– Support for native XML data with hash partitioning (database partitioning)
– Support for native XML data with range partitioning (table partitioning)
– Support for native XML data with multi-dimensional clustering
– Support for predicate pushdown with native XML data in database views
– Deep compression of native XML data
– And more
These features have been tested by more than 300 participants in the DB2 Early Access Program, which has been active since April of 2008. Here is a small selection of the feedback we have received from those participants:
“…XML in a data warehouse was a very important feature to implement, and it’s now a complete feature…”
Jean-Marc Blaise, Venedim
“…The parallelization of queries and inserts of the XML data will allow our researchers to gain access to the data they require even faster…”
Tom Holdener, BJC HealthCare
“…XML compression reduces disk space consumption by more than 60%, irrespective of the size of the XML documents…”
Phil Nelson, ScotDB
“…The enhancements will allow ULCA to generate business intelligence using XML data to help meet patient care needs…”
Charles Wang, UCLA Health System
“…The new XML scalability and analytic capabilities in DB2 are quite exciting…”
Neal Keene VP of Industry Solutions, Thunderhead
Over the course of the next week or so, I will talk in more detail about these new features.
The XML Rap Superstar
April 22, 2009
Rahsheen Porter and his very clever XML Rap won one of the main prizes in the “Video Contest” part of the XML Challenge. Enjoy…
IDUG announces winners of XML Challenge
April 22, 2009
The International DB2 Users Group (IDUG) are an independent, not-for-profit, user-run organization. If you work with DB2, you should certainly be a member of IDUG. Basic membership is FREE, and they offer some of the best resources for DB2 users.
I see from an article on XML Journal that IDUG has announced seven winners of its global XML Challenge. The challenge had more than 70,000 entries from North America, Europe, China, Southeast Asia and India. Congratulations to the winners:
- Renzo Nuccitelli from Sao Jose dos Campos, Brazil who captured first place in the “XML Programming” contest.
- Anil Mahadev from Karnataka, India who won the “Ported Application” contest.
- Yasuhiro Nonaka from Tokyo, Japan who won the “Query Challenge” contest.
- Zaidan Alaoui from Ontario, Canada and Vibha Pandya from Maharashtra, India who won the “Gadget Challenge” contest.
- Rahsheen Porter from Grayson, Geo. (USA) and Rendra Setiawan from Jakarta, Indonesia who won the “Video Mania” contest.
Teaser for Upcoming DB2 pureXML Features
April 20, 2009
If you want a teaser for upcoming DB2 pureXML features, you can listen to a short developerWorks podcast. In the podcast, Scott Laningham talks to Cindy Saracco about DB2 pureXML. The 10-minute talk covers XML, DB2 pureXML, industry standards, and more. It includes coverage about the DB2 pureXML features that are currently in development. You can hear the podcast at:
http://www.ibm.com/developerworks/blogs/page/scott?entry=cindy_saracco_on_purexml_and


Conor O'Mahony, Program Director for DB2 Product Marketing at IBM.

