Now it’s only one week until the International DB2 User Group (IDUG) holds their annual North American conference, this time in Anaheim, CA, on May 2 – 6. The conference agenda looks terrific, with more than 100 technical sessions, full-day seminars, panel discussions, vendor solution presentations, and of course lots of opportunity for networking.

The technical program also includes several sessions on XML:

Session G04: “Relational vs. pureXML: Battle of the Titans”
Konstantin Tadenev (UPS), Tuesday, May 3,  4:30pm

Session E08: “Real-World XML Application Development Patterns for DB2”
Fred Sobotka (FRS Consulting), Wednesday, May 4, 2:45pm

Session E14: “The XMLTABLE Function– A Swiss Army Knife for Application Developers!”
Matthias Nicola (IBM Silicon Valley Lab), Friday, May 6, 9:15am

I’m really looking forward these sessions.

Konstantin’s session should be particularly interesting, especially for those people who remain doubtful about XML performance. Konstantin and his colleagues have conducted a comparative performance study where they addressed a real-life business problem first with traditional relational database means and then with DB2 pureXML.  The results illustrate considerable throughput improvements and CPU time savings for the pureXML solution!

Fred’s session is certain to be very interesting as he will be sharing his rich experience from a variety of DB2 pureXML projects that he has been involved in. These include projects in e-commerce, government, and Fortune 500 companies where native XML storage, XML shredding, and construction of XML from relational data were required.

I hope to see you in Anaheim!

Advertisements

Since XML has become so pervasive in enterprise computing and service-oriented architectures, it is a given that XML capabilities are required in all parts of an IT environment. Closely related to the area of databases is ETL –  to extract, transform, and load information, e.g. to populate data warehouses or to integrate and connect disparate systems.

The good news is that version 8.5 of IBM InfoSphere DataStage has greatly enhanced XML capabilities. I first saw the new XML features in DataStage 8.5 in demo at the Information On Demand conference last October. And I was very impressed because the XML support goes far beyond the half-hearted XML handling that many tools offer.

For example, it’s easy to import and work with XML Schemas in DataStage 8.5. Many industry standard XML Schemas that are used in the financial sector, health care, insurance, government, retail, etc. are quite complex and consist of dozens or even hundreds of XSD files that comprise a single XML Schema. Examples include FpML, FIXML, HL7, IRS1120, OAGIS, and many others.

You might receive such a schema as a ZIP file that contains multiple folders with XSD files. DataStage 8.5 can simply read the entire ZIP file, which saves you the tedious job of importing all the XSD files separately or dealing with their import and include relationships.

Once the XML Schema is imported, DataStage understands the structure of your XML document and allows you define the transformations that you need.

The XML transformation capabilities certainly include some of the intuitive things. For example, you can:

  • compose new XML documents from relational tables or other sources
  • shred XML documents to a set of relational row sets (or tables)
  • extract selected pieces from each XML document and leave other parts of the XML unparsed and “as-is”
  • extend your XML processing by applying XSLT stylesheets to the incoming XML data

And there is also a powerful set of transformation steps that allow you to implement any custom XML transformation that you may need.

A big bonus is the ability to process even very large XML files very efficiently. When batches of XML documents are moved between systems, they are sometimes concatenated into a single large XML document that can be GBs in size. Such mega-documents typically contain many independent business objects and often need to be split by the consumer. DataStage 8.5 handles documents of 20GB or larger very efficiently and can even parse and process a single large document with multiple threads in parallel! This is very cool and a big win for performance.

There is a nice 2-part article on developerWorks that describes these XML capabilities in more detail:

http://www.ibm.com/developerworks/data/library/techarticle/dm-1103datastages/index.html

http://www.ibm.com/developerworks/data/library/techarticle/dm-1103datastages2/index.html

And the documentation of the “XML stage” starts here:

http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=/com.ibm.swg.im.iis.ds.stages.xml.usage.doc/topics/introduction_to_xml_stage.html

This is the continuation of my previous post, to see how we can write a query that achieves the following:

  • if the value of <a1> equals the value of <c1>, then return <a2>
  • similarly, if <b1> equals <c2>, then return <b2>

And here is the sample data that we want to use. Note the twist that the first document contains two <a> elements!

drop table t1;
create table T1(doc XML);

insert into t1 values(
'<root docNo="1">
   <a>
     <a1>1</a1>
     <a2>1111</a2>
   </a>
   <a>
     <a1>90901</a1>
     <a2>90902</a2>
   </a>
   <b>
     <b1>1111</b1>
     <b2>2222</b2>
  </b>
  <c>
     <c1>1</c1>
     <c2>0</c2>
  </c>
</root>');

insert into t1 values(
'<root docNo="2">
   <a>
     <a1>1</a1>
     <a2>5555</a2>
   </a>
   <b>
     <b1>2</b1>
     <b2>6666</b2>
   </b>
   <c>
     <c1>1</c1>
     <c2>2</c2>
   </c>
</root>');

To return <a2> if the value of <a1> equals the value of <c1>, and to return <b2> if <b1> equals <c2>, you could follow the exact same query pattern as in my previous post and write the following XQuery:

xquery
for $i in db2-fn:xmlcolumn('T1.DOC')/root
return <result>
         {$i/@docNo}
         {$i[ a/a1 = c/c1 ]/a/a2}
         {$i[ b/b1 = c/c2 ]/b/b2}
       </result>;

<result docNo="1"><a2>1111</a2><a2>90902</a2></result>
<result docNo="2"><a2>5555</a2><b2>6666</b2></result>

 2 record(s) selected.

However, we probably don’t want <a2>90902</a2> in the first result record, because its corresponding <a1> sibling does not match <c1>!

The problem is that the XPath expression $i[ a/a1 = c/c1 ]/a/a2  returns all occurrences of /a/a2 in the document if only one of  the a/a1 equals c/c1. We need to write an XPath expression that recognizes the fact that the elements <a1> and <a2> are pairs under a common <a> element. What we want is the following: if <a1> equals <c1>, then return the <a2> element that has the same parent as the <a1> that equals <c1>.

The following query does the trick. Note that the predicate in square brackets is now applied to $i/a instead of $i. In other words, we shift the predicate one level down in order to address the <a1> and <a2> under a common <a> element.

xquery
for $i in db2-fn:xmlcolumn('T1.DOC')/root
return <result>
         {$i/@docNo}
         {$i/a[a1 = ../c/c1]/a2}
         {$i/b[b1 = ../c/c2]/b2}
       </result>;

<result docNo="1"><a2>1111</a2></result>
<result docNo="2"><a2>5555</a2><b2>6666</b2></result>

 2 record(s) selected.

Or the same in the SQL/XML:

SELECT X.*
FROM t1, XMLTABLE('$DOC/root'
 COLUMNS
    docno      INTEGER PATH '@docNo',
    a1equalsc1 INTEGER PATH 'a[a1=../c/c1]/a2',
    b1equalsc2 INTEGER PATH 'b[b1=../c/c2]/b2' ) as X;


DOCNO       A1EQUALSC1  B1EQUALSC2
----------- ----------- -----------
          1        1111           -
          2        5555        6666

 2 record(s) selected.

It’s all about climbing up and down the branches of the XML document tree to get the correct twigs that we want. In fact, drawing the document as a tree on a piece of paper can be a great help!

 

 

 

A few weeks ago somebody asked me how to select information from one branch of an XML document based on search conditions in another branch of the same document. Writing such a query in XPath (or XQuery or SQL/XML) is actually not very complicated. It’s a nice example of a query that is very hiearchical and “navigational” in nature, and yet very easy to express in XPath.

Let’s look at a simple example, using the three XML documents and a simple table, as shown below. We want to query each of these documents as follows:

  • if the value of <a1> equals the value of <c1>, then return <b1>
  • similarly, if <a2> equals <c2>, then return <b2>
  • these two conditions are not mutually exclusive, i.e. both may be true in a given document
  • in any case, include the @docNo in the result
drop table t1;
create table T1(doc XML);

insert into t1 values(
'<root docNo="1">
   <a>
     <a1>1</a1>
     <a2>2</a2>
   </a>
   <b>
     <b1>1111</b1>
     <b2>2222</b2>
   </b>
   <c>
     <c1>1</c1>
     <c2>0</c2>
   </c>
</root>');

insert into t1 values(
'<root docNo="2">
   <a>
     <a1>1</a1>
     <a2>2</a2>
   </a>
   <b>
     <b1>3333</b1>
     <b2>4444</b2>
   </b>
   <c>
     <c1>0</c1>
     <c2>2</c2>
   </c>
</root>');

insert into t1 values(
'<root docNo="3">
   <a>
     <a1>1</a1>
     <a2>2</a2>
   </a>
   <b>
     <b1>5555</b1>
     <b2>6666</b2>
   </b>
   <c>
     <c1>1</c1>
     <c2>2</c2>
   </c>
</root>');

To query these document as outlined abobve, you could write an XQuery or a SQL/XML query, depending on your preference. The core pieces of XPath required are the same in either case.

Here is one possible solution in XQuery and its result:

xquery
for $i in db2-fn:xmlcolumn('T1.DOC')/root
return <result>
         {$i/@docNo}
         {$i[ a/a1 = c/c1 ]/b/b1}
         {$i[ a/a2 = c/c2 ]/b/b2}
       </result>;

<result docNo="1"><b1>1111</b1></result>
<result docNo="2"><b2>4444</b2></result>
<result docNo="3"><b1>5555</b1><b2>6666</b2></result>

 3 record(s) selected.

In this XQuery, the “for” clause iterates over the root elements of the three document. For each document it constructs a <result> element, whose child nodes are defined by three expressions. In each of these expressions, note that $i is bound to the <root> element of the current document.

  • The first expression (line 4), simply includes the @docNo attribute from the <root> element.
  • The second expression (line 5), applies the predicate [a/a1=c/c1] to $i, such that the element $i/b/b1 is returned only if a1=c1 in this document
  • The third expression (line 6), applies the predicate [a/a2=c/c2] to $i, so that the element $i/b/b2 is returned only if a2=c2.

The same XPath expression can be plugged into a SQL/XML statement. For example, you can use the XMLTABLE function to return the result in list form:

SELECT X.*
FROM t1, XMLTABLE('$DOC/root'
 COLUMNS
    docno      INTEGER PATH './@docNo',
    a1equalsc1 INTEGER PATH '.[a/a1=c/c1]/b/b1',
    a2equalsc2 INTEGER PATH '.[a/a2=c/c2]/b/b2' ) as X;

DOCNO       A1EQUALSC1  A2EQUALSC2
----------- ----------- -----------
          1        1111           -
          2           -        4444
          3        5555        6666

 3 record(s) selected.

And of course you can modify these queries if you want the results to be returned in a different way.

Now there is an interesting variation to this exercise, which is slightly more tricky then it looks at first sight. Let’s say we change the requirements for our query to the following:

  • if the value of <a1> equals the value of <c1>, then return <a2>
  • similarly, if <b1> equals <c2>, then return <b2>

How would you change the queries above? And what happens if there can be multiple <a> elements and multiple <b> elements in each document? I’ll post the answer in a few of days…