The beauty of a “hybrid” database system is that it allows you to manage both XML and relational data at the same time, even side-by-side in the same table if you want. You can choose which data to store in XML and which in relational format. Best practices for determining the “right” mix of XML and relational in a database schema is an interesting topic, but I’ll reserve that for a separate blog post in the near future.

Today, let’s look at how you can insert information from relational columns into an XML document. Let’s start with a simple table and a couple of rows. Each row holds an XML document in an XML column as well as a document_id and the date when the document was created:

CREATE TABLE documents(document_id INTEGER, created DATE, document XML);

INSERT INTO documents
VALUES(1, '2011-05-20',
'<mydoc>
   <header>
     <title>How to skin a cat</title>
     <author id="17">John Doe</author>
   </header>
   <body>
     <info>There are many ways...</info>
   </body>
 </mydoc>');

INSERT INTO documents
VALUES(2, '2011-05-21',
'<mydoc>
   <header>
     <title>Introduction to XML</title>
     <author id="23">Peter Pan</author>
     <author id="49">Susie Swing</author>
   </header>
   <body>
     <info>What is XML?</info>
   </body>
 </mydoc>');

Now let’s assume we want to retrieve these documents and include the “created” date from the relational column as an XML element <created> within the <header> element. This is not hard. You can simply include an XQuery update expression in your query that retrieves the documents:

SELECT XMLQUERY('copy $new := $doc
                 modify do insert <created>{$crdate}</created>
                           into $new/mydoc/header
                 return $new'
                 PASSING documents.document as "doc",
                         documents.created  as "crdate")
FROM documents;

The XMLQUERY function contains the XQuery update expression (in single quotes) and the PASSING clause that passes column information to the XQuery expression. Here we are passing the XML columns “document” and the DATE column “created”, so that they can be referenced as XQuery variables $doc and $crdate.

DB2 for Linux, UNIX, and Windows also allows an abbreviated syntax, i.e. you can omit the PASSING clause and refeence the column names directly (in upper case), as in the following equiavalent query:

SELECT XMLQUERY('copy $new := $DOCUMENT
                 modify do insert <created>{$CREATED}</created>
                           into $new/mydoc/header
                 return $new' )
FROM documents;

The XQuery update expression assigns the XML from the XML column ($DOCUMENT) to the variable $new and then it modifies and returns $new. The “modify” clause inserts the new XML element <created> into the /mydoc/header element, using the value from the relational DATE column that is referenced here as $CREATED.

As a result, both of the queries above return the following output, with the desired <created> elements:

<mydoc>
  <header>
    <title>How to skin a cat</title>
    <author id="17">John Doe</author>
    <created>2011-05-20</created>
  </header>
  <body>
    <info>There are many ways...</info>
  </body>
</mydoc>

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <created>2011-05-21</created>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

It’s that simple!

Note that these queries do not change the documents that are stored in the “documents” column. They only modify the documents on the fly while reading them out. If you want to permanently change the documents, you would use the exact same XQuery update expression in an SQL UPDATE statement rather than a SELECT statement, like this:

UPDATE documents
SET document =  XMLQUERY('copy $new := $DOCUMENT
                          modify do insert <created>{$CREATED}</created>
                                    into $new/mydoc/header
                          return $new' );

But, let’s stick with queries for now and extend the XQuery update expression to also merge the relational document_id into the XML. Specifically, assume we need to include the document_id as an attribute “docid” in the “mydoc” element. We can simply add another “insert” operation in the modify clause, which is easy because the modify clause can have a comma-separated list of operations, including insert, replace, and delete operations.

To add an attribute you have to specify “insert attribute” followed by the desired attribute name (docid) and the attribute value in curly brackets. The curly brackets can either contain a literal value in double quotes or -as in our case here- a reference, such as our reference to the relational DOCUMENT_ID column:

SELECT XMLQUERY('copy $new := $DOCUMENT
                 modify (do insert <created>{$CREATED}</created>
                            into $new/mydoc/header,
                         do insert attribute docid {$DOCUMENT_ID}
                            into $new/mydoc )
                 return $new' )
FROM documents;

And the result is as desired:

<mydoc docid="1">
  <header>
    <title>How to skin a cat</title>
    <author id="17">John Doe</author>
    <created>2011-05-20</created>
  </header>
  <body>
    <info>There are many ways...</info>
  </body>
</mydoc>

<mydoc docid="2">
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <created>2011-05-21</created>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

Going one step further, you might wonder how to insert data from relational columns in *other* tables into these XML documents. Well, you can do a join so that the desired values from the other table line up in the same row as your XML document, and then apply the same techniques as above.

It get’s slightly more interesting if you want add information from *multiple* rows into a single document, based on a one-to-many relationship. I can show you how to do this, but I’m about to board a flight to go  vacation, so you’ll find a couple of examples in my next blog post when I’m back, or many more examples in the DB2 pureXML Cookbook!

Intel has recently launched their new “Westmere-EX” CPUs that are now part of the Intel Xeon E7 processor family. For 5 years now Intel has a strong history of testing and showcasing many of their latest processors with the Transaction Processing over XML (TPoX) benchmark. So it comes as no big surprise that Intel has now also released TPoX benchmark results with their latest 10-core Xeon E7-4870 processor.

This benchmark uses TPoX Version 2.0 with 1TB of raw XML data, running on DB2 9.7 under SUSE Linux 10. DB2 compression reduces the 1TB of raw data to 537GB, which includes all required XML indexes.

The system under test is an Intel Xeon E7-4870 processor server with 4 CPUs, 40 cores, 80 threads, and a clock speed of 2.4GHz. Further details on the CPUs can be found here. The storage configuration consists of 54 Intel X-25E Solid State Drives (SSDs), 48 SSDs for the database tables and indexes plus 6 SSDs for the DB2 transaction log.

To showcase the maximum possible TPoX throughput on this hardware, Intel ran the TPoX workload driver with 440 concurrent connections (“users”) all of which issue a random stream of read and write transactions without think time. The result is a whopping 17,757 TTPS (TPoX Transactions per Second)!

This result constitutes 29% higher performance than on the previous generation CPU, the 8-core Nehalem-EX, which achieved 13,745 TTPS. Details on these TPoX benchmark results are available on the TPoX results page where you find a history of all TPoX results as well as examples of other TPoX usage.

The great thing about these benchmarks is that they continuously push hardware, operating system, and database to the edge, relentlessly exposing any inefficiencies that may exist and giving us the chance to continuously improve the technology. This latest TPoX benchmark is also one of the first TPoX tests that runs entirely on solid state drives!

Now it’s only one week until the International DB2 User Group (IDUG) holds their annual North American conference, this time in Anaheim, CA, on May 2 – 6. The conference agenda looks terrific, with more than 100 technical sessions, full-day seminars, panel discussions, vendor solution presentations, and of course lots of opportunity for networking.

The technical program also includes several sessions on XML:

Session G04: “Relational vs. pureXML: Battle of the Titans”
Konstantin Tadenev (UPS), Tuesday, May 3,  4:30pm

Session E08: “Real-World XML Application Development Patterns for DB2″
Fred Sobotka (FRS Consulting), Wednesday, May 4, 2:45pm

Session E14: “The XMLTABLE Function– A Swiss Army Knife for Application Developers!”
Matthias Nicola (IBM Silicon Valley Lab), Friday, May 6, 9:15am

I’m really looking forward these sessions.

Konstantin’s session should be particularly interesting, especially for those people who remain doubtful about XML performance. Konstantin and his colleagues have conducted a comparative performance study where they addressed a real-life business problem first with traditional relational database means and then with DB2 pureXML.  The results illustrate considerable throughput improvements and CPU time savings for the pureXML solution!

Fred’s session is certain to be very interesting as he will be sharing his rich experience from a variety of DB2 pureXML projects that he has been involved in. These include projects in e-commerce, government, and Fortune 500 companies where native XML storage, XML shredding, and construction of XML from relational data were required.

I hope to see you in Anaheim!

Since XML has become so pervasive in enterprise computing and service-oriented architectures, it is a given that XML capabilities are required in all parts of an IT environment. Closely related to the area of databases is ETL -  to extract, transform, and load information, e.g. to populate data warehouses or to integrate and connect disparate systems.

The good news is that version 8.5 of IBM InfoSphere DataStage has greatly enhanced XML capabilities. I first saw the new XML features in DataStage 8.5 in demo at the Information On Demand conference last October. And I was very impressed because the XML support goes far beyond the half-hearted XML handling that many tools offer.

For example, it’s easy to import and work with XML Schemas in DataStage 8.5. Many industry standard XML Schemas that are used in the financial sector, health care, insurance, government, retail, etc. are quite complex and consist of dozens or even hundreds of XSD files that comprise a single XML Schema. Examples include FpML, FIXML, HL7, IRS1120, OAGIS, and many others.

You might receive such a schema as a ZIP file that contains multiple folders with XSD files. DataStage 8.5 can simply read the entire ZIP file, which saves you the tedious job of importing all the XSD files separately or dealing with their import and include relationships.

Once the XML Schema is imported, DataStage understands the structure of your XML document and allows you define the transformations that you need.

The XML transformation capabilities certainly include some of the intuitive things. For example, you can:

  • compose new XML documents from relational tables or other sources
  • shred XML documents to a set of relational row sets (or tables)
  • extract selected pieces from each XML document and leave other parts of the XML unparsed and “as-is”
  • extend your XML processing by applying XSLT stylesheets to the incoming XML data

And there is also a powerful set of transformation steps that allow you to implement any custom XML transformation that you may need.

A big bonus is the ability to process even very large XML files very efficiently. When batches of XML documents are moved between systems, they are sometimes concatenated into a single large XML document that can be GBs in size. Such mega-documents typically contain many independent business objects and often need to be split by the consumer. DataStage 8.5 handles documents of 20GB or larger very efficiently and can even parse and process a single large document with multiple threads in parallel! This is very cool and a big win for performance.

There is a nice 2-part article on developerWorks that describes these XML capabilities in more detail:

http://www.ibm.com/developerworks/data/library/techarticle/dm-1103datastages/index.html

http://www.ibm.com/developerworks/data/library/techarticle/dm-1103datastages2/index.html

And the documentation of the “XML stage” starts here:

http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=/com.ibm.swg.im.iis.ds.stages.xml.usage.doc/topics/introduction_to_xml_stage.html

This is the continuation of my previous post, to see how we can write a query that achieves the following:

  • if the value of <a1> equals the value of <c1>, then return <a2>
  • similarly, if <b1> equals <c2>, then return <b2>

And here is the sample data that we want to use. Note the twist that the first document contains two <a> elements!

drop table t1;
create table T1(doc XML);

insert into t1 values(
'<root docNo="1">
   <a>
     <a1>1</a1>
     <a2>1111</a2>
   </a>
   <a>
     <a1>90901</a1>
     <a2>90902</a2>
   </a>
   <b>
     <b1>1111</b1>
     <b2>2222</b2>
  </b>
  <c>
     <c1>1</c1>
     <c2>0</c2>
  </c>
</root>');

insert into t1 values(
'<root docNo="2">
   <a>
     <a1>1</a1>
     <a2>5555</a2>
   </a>
   <b>
     <b1>2</b1>
     <b2>6666</b2>
   </b>
   <c>
     <c1>1</c1>
     <c2>2</c2>
   </c>
</root>');

To return <a2> if the value of <a1> equals the value of <c1>, and to return <b2> if <b1> equals <c2>, you could follow the exact same query pattern as in my previous post and write the following XQuery:

xquery
for $i in db2-fn:xmlcolumn('T1.DOC')/root
return <result>
         {$i/@docNo}
         {$i[ a/a1 = c/c1 ]/a/a2}
         {$i[ b/b1 = c/c2 ]/b/b2}
       </result>;

<result docNo="1"><a2>1111</a2><a2>90902</a2></result>
<result docNo="2"><a2>5555</a2><b2>6666</b2></result>

 2 record(s) selected.

However, we probably don’t want <a2>90902</a2> in the first result record, because its corresponding <a1> sibling does not match <c1>!

The problem is that the XPath expression $i[ a/a1 = c/c1 ]/a/a2  returns all occurrences of /a/a2 in the document if only one of  the a/a1 equals c/c1. We need to write an XPath expression that recognizes the fact that the elements <a1> and <a2> are pairs under a common <a> element. What we want is the following: if <a1> equals <c1>, then return the <a2> element that has the same parent as the <a1> that equals <c1>.

The following query does the trick. Note that the predicate in square brackets is now applied to $i/a instead of $i. In other words, we shift the predicate one level down in order to address the <a1> and <a2> under a common <a> element.

xquery
for $i in db2-fn:xmlcolumn('T1.DOC')/root
return <result>
         {$i/@docNo}
         {$i/a[a1 = ../c/c1]/a2}
         {$i/b[b1 = ../c/c2]/b2}
       </result>;

<result docNo="1"><a2>1111</a2></result>
<result docNo="2"><a2>5555</a2><b2>6666</b2></result>

 2 record(s) selected.

Or the same in the SQL/XML:

SELECT X.*
FROM t1, XMLTABLE('$DOC/root'
 COLUMNS
    docno      INTEGER PATH '@docNo',
    a1equalsc1 INTEGER PATH 'a[a1=../c/c1]/a2',
    b1equalsc2 INTEGER PATH 'b[b1=../c/c2]/b2' ) as X;


DOCNO       A1EQUALSC1  B1EQUALSC2
----------- ----------- -----------
          1        1111           -
          2        5555        6666

 2 record(s) selected.

It’s all about climbing up and down the branches of the XML document tree to get the correct twigs that we want. In fact, drawing the document as a tree on a piece of paper can be a great help!

 

 

 

A few weeks ago somebody asked me how to select information from one branch of an XML document based on search conditions in another branch of the same document. Writing such a query in XPath (or XQuery or SQL/XML) is actually not very complicated. It’s a nice example of a query that is very hiearchical and “navigational” in nature, and yet very easy to express in XPath.

Let’s look at a simple example, using the three XML documents and a simple table, as shown below. We want to query each of these documents as follows:

  • if the value of <a1> equals the value of <c1>, then return <b1>
  • similarly, if <a2> equals <c2>, then return <b2>
  • these two conditions are not mutually exclusive, i.e. both may be true in a given document
  • in any case, include the @docNo in the result

drop table t1;
create table T1(doc XML);

insert into t1 values(
'<root docNo="1">
   <a>
     <a1>1</a1>
     <a2>2</a2>
   </a>
   <b>
     <b1>1111</b1>
     <b2>2222</b2>
   </b>
   <c>
     <c1>1</c1>
     <c2>0</c2>
   </c>
</root>');

insert into t1 values(
'<root docNo="2">
   <a>
     <a1>1</a1>
     <a2>2</a2>
   </a>
   <b>
     <b1>3333</b1>
     <b2>4444</b2>
   </b>
   <c>
     <c1>0</c1>
     <c2>2</c2>
   </c>
</root>');

insert into t1 values(
'<root docNo="3">
   <a>
     <a1>1</a1>
     <a2>2</a2>
   </a>
   <b>
     <b1>5555</b1>
     <b2>6666</b2>
   </b>
   <c>
     <c1>1</c1>
     <c2>2</c2>
   </c>
</root>');

To query these document as outlined abobve, you could write an XQuery or a SQL/XML query, depending on your preference. The core pieces of XPath required are the same in either case.

Here is one possible solution in XQuery and its result:

xquery
for $i in db2-fn:xmlcolumn('T1.DOC')/root
return <result>
         {$i/@docNo}
         {$i[ a/a1 = c/c1 ]/b/b1}
         {$i[ a/a2 = c/c2 ]/b/b2}
       </result>;

<result docNo="1"><b1>1111</b1></result>
<result docNo="2"><b2>4444</b2></result>
<result docNo="3"><b1>5555</b1><b2>6666</b2></result>

 3 record(s) selected.

In this XQuery, the “for” clause iterates over the root elements of the three document. For each document it constructs a <result> element, whose child nodes are defined by three expressions. In each of these expressions, note that $i is bound to the <root> element of the current document.

  • The first expression (line 4), simply includes the @docNo attribute from the <root> element.
  • The second expression (line 5), applies the predicate [a/a1=c/c1] to $i, such that the element $i/b/b1 is returned only if a1=c1 in this document
  • The third expression (line 6), applies the predicate [a/a2=c/c2] to $i, so that the element $i/b/b2 is returned only if a2=c2.

The same XPath expression can be plugged into a SQL/XML statement. For example, you can use the XMLTABLE function to return the result in list form:

SELECT X.*
FROM t1, XMLTABLE('$DOC/root'
 COLUMNS
    docno      INTEGER PATH './@docNo',
    a1equalsc1 INTEGER PATH '.[a/a1=c/c1]/b/b1',
    a2equalsc2 INTEGER PATH '.[a/a2=c/c2]/b/b2' ) as X;

DOCNO       A1EQUALSC1  A2EQUALSC2
----------- ----------- -----------
          1        1111           -
          2           -        4444
          3        5555        6666

 3 record(s) selected.

And of course you can modify these queries if you want the results to be returned in a different way.

Now there is an interesting variation to this exercise, which is slightly more tricky then it looks at first sight. Let’s say we change the requirements for our query to the following:

  • if the value of <a1> equals the value of <c1>, then return <a2>
  • similarly, if <b1> equals <c2>, then return <b2>

How would you change the queries above? And what happens if there can be multiple <a> elements and multiple <b> elements in each document? I’ll post the answer in a few of days…

 

 

 

Recently I received an interesting question about a date that is not a really date. Take the following XML element as an exanple:

<StartTime>Monday Mar 23 2009 10:20:24.19</StartTime>

This element contains a value that describes a date and a time. The problem is that the string “Monday Mar 23 2009 10:20:24.19” is really only a string and not a valid representation of a date or a timestamp in the world of XML.

The XML Schema specification defines a specific data type for timestamps: xs:dateTime. According to that spec, literal values must have a specific format if they want to be of type xs:dateTime, or better: if you want to process them as xs:dateTime with a language such as XPath or XQuery. For example, the value 2002-10-10T12:00:00 is a valid xs:dateTime value.

Unless you deal with negative dates, xs:dateTime values must be of the form yyyy-mm-dd’T'hh.mm.ss, plus optional fractional seconds and an optional timezone which I am omitting here for simplicity. The character ‘T’ is required as a delimiter between the date and the time portion of the literal.

So, how can you convert the original string value into a valid xs:dateTime value? Also, how can can you convert that string into a proper SQL timestamp?

Let’s start with an XQuery solution to convert the string above to xs:dateTime, and then we look at a SQL/XML solution to produce a SQL timestamp. Both approaches assume, for simplicity, that the input data is a small document in an XML column:

create table mytest(doc XML);
insert into mytest values ('<root><StartTime>Monday Mar 23 2009 10:20:24.19</StartTime></root>');

XQuery

The main trick is to break the string into its pieces for year, months, day, time, and so on. This can be done with the fn:tokenize function that I wrote about earlier. Go back to my post about fn:tokenize if you need to refresh your memory on that function.

Here is a first XQuery that is not quite perfect yet, but a good start:

xquery 
for $t in db2-fn:xmlcolumn('MYTEST.DOC')/root/StartTime
let $pieces := fn:tokenize($t, " ")
let $timestamp := fn:concat($pieces[4],"-",$pieces[2],"-",$pieces[3],"T",$pieces[5])
return $timestamp;
----------------------------------
2009-Mar-23T10:20:24.19
 1 record(s) selected.

In the “for” clause,  the StartTime element gets assigned to the variable $t. In the first “let” clause, the function fn:tokenize splits the string value of StartTime along the spaces in the string. The second parameter of the fn:tokenize function specifies that the blank character is a delimiter in the string, and the string should be tokenized along that delimiter. The variable $piece now contains a sequence of tokens.

In the second “let” clause we use the function fn:concat to concatenate the tokens in a different order and with different delimiters. We start with the 4th token from the input string (year), append a dash, then append the 2nd token (month), then another dash, then the 3rd token (day), and so on.

The result is almost what we want. The remaining task is to convert the month “Mar” into a “03″. And of course we need to be able to perform this conversion for any of the 12 months that can possibly occur. For example, you could think of a 12-way if-then-else expression, but that would be rather ugly.

A more elegant approach is to build simple “month map” from which we pick the right month. Look at the two additional “let” clauses in bold below. The “let $map” defines a constant piece of XML that maps the three-letter month names to the numbers 01, 02, 03, and so on. Then the “let $month” performs a lookup in that map to pick the element whose name matches $piece[2], which is “Mar” in our example, and the /text() value of the element <Mar> is “03″. This “03″ is then used in the concat of the final result:

xquery 
for $t in db2-fn:xmlcolumn('MYTEST.DOC')/root/StartTime
let $pieces := fn:tokenize($t, " ")
let $map := <map><Jan>01</Jan><Feb>02</Feb><Mar>03</Mar><Apr>04</Apr>. . . </map>
let $month := $map/*[name()=$pieces[2]]/text()
let $timestamp := fn:concat($pieces[4],"-", $month,"-",$pieces[3],"T",$pieces[5])
return xs:dateTime($timestamp);
----------------------------------
2009-03-23T10:20:24.19 
 1 record(s) selected.

Also n0te that the cast to xs:dateTime in the “return” clause proves that we have produced a valid xs:dateTime value.

SQL/XML

Now that we have the XQuery solution above, it is trival to plug this into an SQL/XML query to produce a SQL timestamp:

SELECT XMLCAST ( XMLQUERY('for $t in $DOC/root/StartTime
   let $pieces := fn:tokenize($t, " ")
   let $map := <map><Jan>01</Jan><Feb>02</Feb><Mar>03</Mar><Apr>04</Apr> </map>
   let $month := $map/*[name()=$pieces[2]]/text()
   let $timestamp := fn:concat($pieces[4],"-", $month,"-",$pieces[3],"T",$pieces[5])
   return $timestamp')  AS TIMESTAMP)
FROM mytest;
--------------------------
2009-03-23-10.20.24.190000

 1 record(s) selected.

The XMLQUERY function contains the same XQuery expression as before to produce an xs:dateTime value, which the XMLCAST function can than cast to the SQL type TIMESTAMP. Simple as that.

As always, there are multiple solutions to every problem, so here is another one. This query use fn:tokenize to split the string into its five pieces and each piece is returned by the XMLTABLE function as a separate relational column. The expression in the SELECT clause concatenates the pieces in the right order, and uses a good old SQL CASE expression to map the three-latter month name to the equivalent two-digit number:

SELECT year || '-' || 
   (CASE
      WHEN month = 'Jan' THEN '01'
      WHEN month = 'Feb' THEN '02'
      WHEN month = 'Mar' THEN '03'
      WHEN month = 'Apr' THEN '04'
      WHEN month = 'May' THEN '05'
      WHEN month = 'Jun' THEN '06'
      WHEN month = 'Jul' THEN '07'
      WHEN month = 'Aug' THEN '08'
      WHEN month = 'Sep' THEN '09'
      WHEN month = 'Oct' THEN '10'
      WHEN month = 'Nov' THEN '11'
      WHEN month = 'Dec' THEN '12'        
      ELSE 'unknown'
    END)      
  || '-' || day  || '-' || time
FROM mytest, XMLTABLE('$DOC/root' 
 COLUMNS
   dayOfweek  VARCHAR(20)  PATH 'fn:tokenize(StartTime, " ")[1]',
   month      CHAR(3)      PATH 'fn:tokenize(StartTime, " ")[2]',
   day        INTEGER      PATH 'fn:tokenize(StartTime, " ")[3]',
   year       INTEGER      PATH 'fn:tokenize(StartTime, " ")[4]',
   time       VARCHAR(32)  PATH 'fn:tokenize(StartTime, " ")[5]'          
 ) T;

----------------------------------------------------------------
2009-03-23-10:20:24.19

 1 record(s) selected.

Which of these solutions do you like best?

TPoX stands for “Transaction Processing over XML” and is an XML database benchmark that Intel and IBM have developed several years ago and then released as open source. TPoX is an application-level benchmark, which means that it simulates an actual application that performs queries, inserts, updates and deletes in a concurrent multi-user workload.

Of course, no single benchmark can represent the characteristics of all types of database applications. In fact, a benchmark should not even try to do that because it would end up being an unrealistic workload. Take the traditional relational database benchmarks as an example: the TPC-C benchmark is an OLTP workload while the TPC-H benchmark mimicks a decision support scenario, two very different usage patterns for relational databases, hence two different benchmarks.

TPoX is designed to be an XML transaction processing (OLTP) benchmark with data-oriented XML structures, very large numbers of relatively small XML documents (1kb to 20kb), short read/write transactions, and a high degree of concurrency.

Many characteristics of real-world financial XML applications have gone into the design of TPoX. For example, the “order” messages are defined by the FIXML standard that is used in the financial industry. The actual FIXML schema is used for document validation in TPoX. To give you an idea of its complexity: FIXML Version 4.4 consists of 41 schema documents (XSD files) and defines more than 3600 elements and attributes that can appear in FIXML messages.

Obviously, TPoX does not intend to be a content-oriented XML benchmark where full-text search and other content-centric operations would be exercised. Another XML benchmark is needed to cover that space.

Since 2006 both Intel and IBM have published TPoX results for various different scale factors and hardware platforms. In 2009 and 2010 Intel has continued to run the TPoX benchmark internally to evaluate their ever evolving fleet of Xeon processors. Four benchmark results from that period have now been published on the TPoX web site, showcasing the performance of the Intel Xeon processors X7560, X5680, X5570, and X7460.

These benchmark results from Intel are vary valuable in several ways:

  • The results prove XML processing performance on different hardware choices, ranging from 2 x 4 cores all the way to 4 x 8 cores in a single server.
  • They quantify the performance gain that each new generation of Xeon processors provides over the previous.
  • These TPoX benchmarks continuously push hardware, operating system, and database server to higher levels of performance. This helps to identify and exploit opportunities for performance enhancements which ultimately drives innovation, advances technology, and improves products quality.
  • The range of TPoX results serves as yard stick for companies that need to choose hardware and software for their XML database needs.

Other software vendors (including Oracle) as well as Universities have also been using TPoX for a variety of purposes. Although some database vendors other than IBM have been using TPoX, to the best of my knowledge they haven’t disclosed the details of their testing, such as the exact results together with detailed information about how the hardware and database were configured.

Ok, the readers who are not from the mainframe world may wonder what “CICS” means. According to IBM’s official web site, CICS is defined as follows: “CICS (Customer Information Control System) is a family of application servers and connectors that provides industrial-strength, online transaction management and connectivity for mission-critical applications.” There you have it.

So, CICS is a transaction server (plus tools and extensions) that runs on IBM mainframes on the z/OS and z/VSE operating systems. It is commonly used for high-volume online transaction processing applications. CICS applications are typically written in COBOL, PL/1, or C.

CICS Version 3.1 has introduced support for web services and SOAP, which enables CICS applications to be web service providers or web service consumers (requester). The transport mechanism used is typically HTTP or MQ message queues, and the web service messages are (almost always) in XML format.

Applications that use web services, e.g. in service-oriented architectures, often have the need to store incoming and/or outgoing messages in a database. Sometimes this is needed for compliance reasons or simply to have a “log” of the information that is being exchanged. This log can be indexed and queried efficiently.

Since the web service message are in XML and their content and structure may vary, storing them in their original XML format is a natural choice and a very common use case for the XML databases, including DB2 pureXML.

With the pureXML support in DB2 9 for z/OS it becomes straight forward to combine CICS web services with XML management in DB2. For example, XML messages of varying format can be stored in the same XML column, exploiting the support for schema variability in DB2. Here are some examples of what you can do:

Receive and store web service messages in a CICS application (in-bound):

  • Store incoming SOAP messages as-is in an XML column in a DB2 table
  • Extract the payload from incoming SOAP messages and insert this payload into an XML column
  • Shred the payload of an incoming SOAP message to a set of relational tables, using SQL/XML

Produce and sent web service messages in a CICS application (out-bound):

  • Retrieve an XML document from an XML column in a DB2 table and send it out
  • Extract pieces from one or multiple XML documents in an XML column and combine them to an outbound message
  • Retrieve an XML document from an XML column and modify it with XQuery update expressions before sending it out (DB2 10 for z/OS)
  • Constructing outbound XML messages from traditional relational data, using SQL/XML functions (DB2 V8 for z/OS and higher)

All of the XML-related processing such as extracting, shredding, modifying, or constructing XML can be accomplished by DB2 pureXML features that can be invoked from the CICS application.

If you want to know how to put it all together, I’ll refer you to two articles written by people who know a lot more about CICS than I do:

Using CICS with DB2 pureXML, Part 1: Perform basic XML storage and retrieval through CICS Web services
http://www.ibm.com/developerworks/data/library/techarticle/dm-1004cicsdb2purexml/

Using CICS with DB2 pureXML, Part 2: Perform advanced XML storage and retrieval through CICS Web services
http://www.ibm.com/developerworks/data/library/techarticle/dm-1011cicsdb2purexml2/

These articles describe how to implement most of the scenarios that I have listed above.

“Extremely pureXML in DB2 10 for z/OS”, that’s the title of a brand-new IBM Redbook that got released just a couple of weeks ago. This book describes the pureXML features in DB2 9 for z/OS and DB2 10 for z/OS based on an application scenario that runs through all chapters of the book. The application scenario is about receiving and processing bank-to-customer statements that follow the UNIFI (ISO20022) standard XML format.

The book shows how to obtain such UNIFI messages from a Websphere MQ message queue and how to persist and manipulate them with SQL/XML stored procedures or through Java or COBOL applications. Detailed code samples in Java and COBOL illustrate how you can query and process the XML messages and exploit the pureXML capabilities in DB2 for z/OS.

Additionally, database administrators will find chapters 9 and 10 useful, which provide coverage of XML-related admin task and describe how DB2 z/OS utilities work with XML data.

The outline of the book is as follows:

Chapter 1. Introduction
Chapter 2. XML and DB2 for z/OS
Chapter 3. Application scenario
Chapter 4. Creating and adding XML data
Chapter 5. Validating XML data
Chapter 6. DB2 SQL/XML programming
Chapter 7. Using XML with Java
Chapter 8. Using XML with COBOL
Chapter 9. Utilities with XML
Chapter 10. XML-related tasks for the DBA
Chapter 11. Performance considerations
Appendix A. Application scenario documents
Appendix B. Additional material

You also find coverage of some of the new XML features in DB2 10 for z/OS, such as binary XML data flow for JDBC applications, node-level XML Updates, and XML manipulation in user-defined functions and stored procedures.

Although the book provides brief introductions to XML, XPath, XQuery, SQL/XML, etc., the book is less focused on providing comprehensive tutorials for these core technologies and much more focused on illustrating their integration with each other in the context of DB2 for z/OS applications and databases. The book also touches on several advanced topics, such as handling incoming XML messages through Change Data Capture (CDC) and multi-version concurrency control for XML.

The Redbook “Extremely pureXML in DB2 10 for z/OS” is available for online reading and as a PDF at http://www.redbooks.ibm.com/redpieces/abstracts/sg247915.html?Open.

Other material on XML DB2 for z/OS that you may find useful include:

Happy reading !

Extremely pureXML in DB2 10 for z/OS

Today’s topic is “Name/Value Pairs” – a data modeling approach that is also known as Key-Value Pairs (KVP) or Entity-Attribute-Value model (EAV).

In the relational database world, the Name/Value Pair approach is sometimes used when the number of required columns in a table is very large, unknown, or changing over time. In that case it is difficult to define a table with a fixed set of columns. That’s when some people resort to Name/Value Pairs.

To give an example, consider the following record from a traditional database table with columns id, firstname, lastname, company, and dateOfBirth:

(1, John, Doe, IBM, 1978-11-05)

In a Name/Value Pair table, the same information would be stored across four different rows with columns id, name, and value:

(1, firstname, John)
(1, lastname, Doe)
(1, company, IBM)
(1, dateOfBirth, 1978-11-05)

The flexibility of this table is wonderful (at first!), because you can store records of pretty much any format in such a table, even if the set of required fields is not known in advance or growing over time. Instead of defining fixed column names, you simply stick the field names into the column name and pair them up with a value.

BUT….  it’s well understood that such Name/Value Pair tables have serious problems that cause a lof of headache over time. In particular:

  • It’s very hard to define constraints for Name/Value Pair tables, because the logical meaning of the column value changes from row to row. Data quality is hard to enforce.
  • Writing business queries against Name/Value Pair tables is complex and often requires many self-joins. Reporting queries with many predicates are particularly difficult and inefficient.
  • Column statistics that usually aid database optimizers in the selection of good access plans often do not work well for Name/Value Pairs.
  • Mixing values of different data types in the column value causes further complications. The column value needs to be of type VARCHAR which means that correct numeric or DATE comparisons are not straight forward, may require casting, and it’s difficult to define adequate indexes. Note that the numeric comparison value > 5000  is not the same as the string comparison value > ’5000′. The latter would also be true if value contains the string ’7′. A workaround is to define multiple value columns, one for each required data type, but this leads to sparsely populated columns with lots of NULL values. It’s just not pretty.

Others have blogged extensively about the problems with Name/Value Pairs in relational databases long before me, and here are some examples:

What I actually want to write about is the use of Name/Value Pairs in XML.

But.. wait a minute! Why would anybody use Name/Value Pairs in XML? XML is extensible and flexible by design, so there isn’t really a problem with being constrained by a fixed set of fields. New XML elements and attributes can be invented and added as needed, specifically to deal with variable and evolving data formats. Many applications use XML exactly for this reason.

However, I have seen various cases where people have decided to model their XML with Name/Value Pairs. As a simple example, the following XML document describes a car with Name/Value Pairs:

<object type="car">
    <field name="brand" value="Honda"/>
    <field name="price" value="5000"/>
    <field name="year" value="1996"/>
</object>

This XML document uses generic and non-descriptive element and attributes names, such as object, field, name, and value. This defeats the purpose of XML as a markup language where tag names should be used to describe the meaning of the different data items. This is similar to column names of a relational table, which should describe the data in the columns.

In this example, the values “brand”, “price”, and “year” are metadata and should really be tag names, because they describe the intended meaning of the values “Honda”, “5000″, and “1996″. Here is what this XML document should look like:

<car>
    <brand>Honda</brand>
    <price>5000</price>
    <year>1996</year>
</car>

The problems with using Name/Value pairs in XML are similar as in relational tables:

  • For the XML document with Name/Value Pairs in red above, it is impossible to define an XML Schema that enforces, for example, the following rule: If the attribute “name”  has the value ‘price’, then the attribute “value” in the same “field” element must be numeric and greater than 0. In contrast, this rule is easy to define in an XML Schema for explicit tag names.
  • Let’s say you often search by “year” and want an XML index to support such searches. If you use Name/Value Pairs, most databases will not allow you to define an index on only those “value” attributes where the “name” attribute in the same “field” element has the value ‘year’. As a result you will likely index more data than needed, which means your index is larger and more expensive to maintain than with explicit tag names. Insert, update, delete performance will suffer.
  • Writing XPath expressions against Name/Value Pairs is complex and less efficient.

Not yet convinced? Try to write an XPath that retrieves the year of all the Honda cars that have a price greater than 5000.  For Name/Value Pair documents the query would look like this (yikes!):

/object[@type="car" and
        field[@name = "brand" and @value = "Honda"] and
        field[@name = "price" and @value >= 5000]
       ]/field[@name="year"]/data(@value) 

For corresponding documents with explicit tag names, the same query is much simpler and will be faster:

/car[brand="Honda" and price >= 5000]/year

I rest my case.

Eduardo Ciliendo/Switzerland/IBM@IBMCH, Eileen Lin/Santa Teresa/IBM@IBMUS, Mauro Tibolla/Switzerland/IBM@IBMCH

XML processing is at the core of many middleware systems, often to route, split, or modify XML messages, or to extract values from them. Frequently, such XML processing is implemented with XSLT stylesheets or in application code with DOM or SAX parsers.

But, when XML messages are stored in a database system, application designers have a choice whether to perform XML manipulation in the application code or in the database server with XQuery or SQL/XML.

When a native XML database stores XML in a parsed format, such as in DB2 pureXML, then most XQuery and SQL/XML operations can be performed without additional XML parsing. This can be a significant performance benefit of processing XML in the database layer.

Coding XML manipulation declaratively in XQuery or SQL/XML is typically shorter than writing procedural application code that uses DOM or SAX. Hence, there is also a difference in development cost.

We have performed some tests to compare the performance and development cost of using SQL/XML or XQuery in DB2 as opposed to using DOM or SAX in a Java application. We started by defining several common XML processing uses cases:

  • Use Case 1.1 (“Extract5″): Extract a small number of values from randomly selected XML documents (approx5% of the element values)
  • Use Case 1.2 (“Extract50″): Extract a medium number of values (50%).
  • Use Case 1.3 (“Extract100″): Extract most values (almost 100%).
  • Use Case 2.1 (“Split500″): Split a large XML document containing 500 concatenated XML fragments into individual XML documents. Concatenated XML documents are often used for download or FTP, but are not useful for individual processing.
  • Use Case 2.2 (“Split2500″): Split a document containing 2500 fragments.
  • Use Case 2.3 (“Split5000″): Split a document containing 5000 fragments.
  • Use Case 3 (“Modify”): Insert, change, and delete XML elements in each randomly selected XML document.

We have implemented these uses case in a Java application in two ways:

(a) using SQL/XML and XQuery to push the XML manipulation to a DB2 9.7 database, and
(b) using a DOM and/or SAX parser in the Java code itself.

Both options read or write XML to/from a DB2 database, but the location and implementation of the XML manipulation differs. In option (a), the XML manipulation happens in the database as part of the database read or write operations. Option (b) reads and writes full documents from/to the database, but XML specific operations happen exclusively in the Java code. We measured all tests on the same hardware to ensure an apples-to-apples comparison.

Let’s look at some of the results…

The following chart shows the throughput of the three “Extract” use cases in transactions per second. The SQL/XML-based value extraction in the database provides notably higher throughput than SAX parsing in Java, especially when the number of extracted values is small.

We observed similar performance benefits for performing “Split” operations (use cases 2.1, 2.2, and 2.3) in the database.

Modifications to XML document are often implemented with XSLT or a DOM parser in application code. However, we find that inserting, updating, or deleting XML elements and attributes in a document can be significantly faster when XQuery Update expressions are run inside the database. The next chart shows the throughput for a mix of common XML Update operations for 1, 5, 10, and 15 concurrent users:

Even more important than the performance benefit of in-database XML manipulation might be the reduction in development cost.

We counted the lines of code for both of our implementations, including the Java code to prepare and submit SQL/XML or XQuery statements and to bind extracted values to Java objects. The following chart shows that application-level XML processing can require 10x to 13x as many lines of code than database-level XML manipulation.

This is very significant since the number of lines of code is a common metric to estimate the software development cost, number of coding errors, and software maintenance cost over time. Although some (but not nearly all!) of the development labor can be reduced by using XML mapping frameworks (such as JAXB) XQuery and SQL/XML are still much shorter and typically less costly to develop and maintain.

For more details on this topic, see:

How to insert XML documents from the file system? Let’s discuss this common question for both, DB2 on z/OS and DB2 on Linux, UNIX, and Windows (LUW).

DB2 for z/OS offers two good options for moving XML documents from the file system into XML columns or LOB or varchar columns. The first option is to use the DB2 LOAD utility, which is described here. The second options is to use File Reference Variables in your application code to read LOB or XML files from the file system. This is a native solution and typically performs very well.

In a COBOl, PL/1, or C application, you can use a file reference variable of type BLOB_FILE, CLOB_FILE, or DBCLOB_FILE to insert a LOB or XML value from a file into a DB2 table. LOB file reference variables move LOB or XML values from an application to the database server without going through the application’s memory. Their usage is described here:

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.db29.doc.apsg/db2z_lobfilereferencevar.htm

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db29.doc.apsg/db2z_xmpdeclarefilereferencevar.htm

 

In DB2 LUW, a good way to move XML documents from a file system into a DB2 database is to use the DB2 LOAD or the DB2 IMPORT utility. Alternatively, applications can use custom code to read XML documents from their local file system and bind them to parameter markers of inserts statements that are then submitted to DB2.

In some cases you may want to insert XML files that reside in the file system of the DB2 database server – directly with insert statements, without using LOAD or IMPORT. If you have file access permissions on the database server (which might be restricted) then you can use simple user-defined functions (UDFs) in your insert statements to read one or multiple files from the file system. Here is a package “XMLFRomFile.zip” with examples of such UDFs.

The XMLfromFile functions in this package are described in section 4.1.2 of the DB2 pureXML Cookbook and also in the readme file that is included in the package. Some of most useful functions include:

  • blobFromFile: Reads a file from the DB2 server’s file system and returns the file contents as a BLOB. If this BLOB contains a well-formed XML document, it can be inserted into an XML column.
  • clobFromFile: Reads a file from the DB2 server’s file system and returns the file contents as a CLOB. If this CLOB contains a well-formed XML document, it can be inserted into an XML column.
  • directoryInfo: Returns a table with information about files in a directory in the DB2 server’s file system.
  • blobsFromZipURL/clobsFromZipURL: Table functions that read a ZIP file from a URL or file and return a table that contains each file from the ZIP in a separate row, as a BLOB or CLOB, respectively. Similar functions exist for gzip files.

And here are some examples of how you might use these functions.

Read the XML file “book07.xml” and insert it into the table called “shelf” where the column “bookinfo” is of type XML or BLOB:

INSERT INTO shelf(id, bookinfo)
VALUES(7,blobFromFile(‘/home/mydata/book07.xml’));

List all files that are located in the directory c:\xml\book:

SELECT filename, size, modtime
FROM TABLE(directoryInfo(‘c:\xml\book’)) ;

Insert all XML files from the directory c:\xml\book:

INSERT INTO shelf(name, bookinfo)
SELECT t.filename, blobFromFile(t.filename)
FROM TABLE(directoryInfo(‘c:\xml\book’)) t
WHERE isDirectory = 0 AND filename LIKE ‘%.xml’ ;

Insert all XML files from the ZIP file allbooks.zip:

INSERT INTO shelf(bookinfo)
SELECT doc FROM
TABLE(blobsFromZipURL(urlFromFile(‘c:\xml\book\allbooks.zip’)))
WHERE filename LIKE ‘%.xml’OR filename LIKE ‘%.XML’ ;

Note that the ZIP file does not get unzipped to the file system. Instead, the documents are read from the ZIP file and inserted directly into DB2 without uncompressing them in the file system.

XML Tools for DB2 pureXML

December 5, 2010

There is a new article on developerWorks that you may find interesting. It describes the XML capabilities in various database and software development tools, the different job roles that are impacted by having XML in the database, and the specific tasks involved. The article can help you choose the right tool for specific tasks and job roles.

After a very brief introduction to DB2 pureXML, the first part of the article lists a set of database tools, each with a screenshot, and lists their key features. These tools include:

  • DB2 Control Center
  • Data Studio and Optim Development Studio (ODS)
  • Infosphere Data Architect (IDA), formerly known as Rational Data Architect (RDA)
  • Rational Application Developer (RAD)
  • Rational Software Architect
  • InfoSphere Warehouse Design Studio

The second part of the article discusses different job roles (DBA, application developer, architect, etc.) and how they relate to XML-specific tasks in a project life-cycle. Subsequently, the article describes these tasks in more detail and maps them to available features in the above-mentioned tools. Towards the end of the article you find summary tables that show which tools contain which XML features, and which tools support which XML tasks. These tables alone are worth the article!

Although the article focuses mainly on IBM tools, it also points to non-IBM tools such as Altova XMLSPY, <oXygen/>Stylus Studio, and others.

Oh, before I forget, here is the link to the article:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1012xmltools/index.html#NonIBM_XML_tooling

Happy reading!

In my previous blog post I wrote about the new XML enhancements in DB2 10 for z/OS, the new release of IBM’s relational database for the mainframe. One of the exciting new XML features are node-level XML updates.

While DB2 9 for z/OS only supported full-document replacement, DB2 10 for z/OS allows you to insert, delete, or modify individual XML elements or attributes (“nodes”) within an XML document. This is based on the XQuery Update Facility, which defines the syntax and semantic of node-level XML updates.

In DB2 10 for z/OS, such XML updates are always embedded in an SQL Update statement, such as the following statement that updates the XML column “info” in the table “customer”:

UPDATE customer
SET info =  XMLMODIFY( <xquery update expression>)
WHERE ….

The WHERE clause of the UPDATE statement can contain relational and/or XML predicates to select zero, one, or multiple rows (documents) for update. For each qualifying row the XMLMODIFY function applies the XQuery Update expression to the respective XML document.

The XMLMODIFY function is new in DB2 10 for z/OS and can only be used on the right-hand side of a SET clause in an UPDATE statement.

The following update expressions are supported in the XMLMODIFY function:

Let’s look at some examples for each of those update expressions.

The following UPDATE statement modifies the document with the relational id “123″ and changes the value of the <city> element to “San Jose”:

UPDATE customer
SET info =  XMLMODIFY(‘replace value of node /customer/addr/city with “San Jose” ‘ )
WHERE customer.id = 123

If you want to update an element or attribute that occurs multiple times per document then you need to select exactly one of its occurrences. In other words, the target of the update expression must be a single node per document. It cannot be zero or multiple nodes per documents. Otherwise an error is returned.

Assuming that a customer document can have multiple <phone> elements, the following UPDATE statement changes the value of the phone element whose @type attribute has the value “home”:

UPDATE customer
SET info =  XMLMODIFY(‘replace value of node /customer/phone[@type = "home"]  with “408-463-4963″  ‘ )
WHERE customer.id = 123

The new value can also be provided via a parameter marker or host variable. For example, the next UPDATE provides the new phone number via the host variable :newnum. The second argument of the XMLMODIFY function contains the host variable and assigns an XQuery variable name (“new”) that can be referenced with a $-sign in the update expression itself. In this way the variable $new refers to the host variable:

UPDATE customer
SET info =  XMLMODIFY(‘replace value of node /customer/phone[@type = "home"]  with $new ‘, :newnum as “new )
WHERE customer.id = :cid

Next, let’s assume we need to replace the home phone number with a cell phone number. This can be done with a node replacement. The following UPDATE provides a new <phone> element (shown in green). It uses the XMLPARSE function to convert the text representation of the new element to data type XML,  and assigns the new element to the variable $new. This variable is used in the “replace node” expression to replace the selected existing phone element:

UPDATE customer
SET info =  XMLMODIFY(‘replace node /customer/phone[@type = "home"]  with $new ‘ ,
XMLPARSE(document ‘<phone type=”cell”>408-123-4567</phone>‘)  as “new”
)
WHERE customer.id = 123

Again, the new element could also have been provided  by a host variable.
You can also add and remove nodes from a document. The next UPDATE deletes the phone element with the cell phone number from the document:

UPDATE customer
SET info =  XMLMODIFY(‘delete nodes /customer/phone[@type = "cell"] )
WHERE customer.id = 123

Note: the “delete” expression is the only update expression that is allowed to affect zero, one, or multiple nodes per document. While the “replace” expression can only be applied to one node per document, the “delete” expression can remove multiple nodes at once!  The following UPDATE deletes all <phone> elements that appear on the path /customer/phone:

UPDATE customer
SET info =  XMLMODIFY(‘delete nodes /customer/phone )
WHERE customer.id = 123

To insert a new element, attribute, or document fragment you need to provide the new node and specify a target location where the new node should be placed in the document. For example, the next UPDATE statement inserts the new <phone> element (in green) into the document so that it appears after the existing  <email> element:

UPDATE customer
SET info =  XMLMODIFY(‘insert node $new  after /customer/email ,
XMLPARSE(document ‘<phone type=”cell”>408-123-4567</phone>‘)  as “new”)
WHERE customer.id = 123

Instead of using the keyword “after” you can also specify that a new node should be inserted “before” an existing element or as a child “into” an existing element. When you use the keyword “into” to add a node as a new child to an existing element, then you can also specify “as first into” or “as last into” to explicitly make it the first or the last child node.

The XPath the specifies the target location of the new node, such as /customer/email in the example above, must identify exactly one node, and must node lead to zero or multiple nodes.

And finally, if your XML data contains namespaces, you must declare the namespaces in the XMLMODIFY function just like you would in an XMLEXISTS or XMLQUERY function. This next update declares a default element namespace that applies to all elements in the XPath /customer/addr/city:

UPDATE customer
SET info =  XMLMODIFY(”declare default element namespace “http://www.example.com”;  replace value of node /customer/addr/city with “San Jose” ‘ )
WHERE customer.id = 123

From here on you should try it out for yourself, or let me know if you have any questions!

 

 


The release of DB2 10 for z/OS was one of the big topics at IBM’s Information on Demand conference in Las Vegas and at the IDUG Europe conference in Vienna a couple of weeks ago. DB2 10 for z/OS is the brand-new version of the DB2 database for mainframe computers running the z/OS operating system.

If you think the mainframe is dead – think again! Mainframe computers are still running many of the world’s most mission-critical databases and applications. Many leading companies in banking, insurance, telecommunications, manufacturing, logistics, and other industries still rely on the mainframe’s unsurpassed reliability and performance for transaction processing. And the wide-spread adoption of XML does not exclude the mainframe world.

DB2 9 for z/OS introduced pureXML with native XML storage, XML indexing, XML Schema support, SQL/XML queries, and XML support in various DB2 utilities such as load, unload, and others. Overall, the XML support in DB2 9 for z/OS is similar to that in DB2 for Linux, UNIX, and Windows (LUW).

The new XML features in DB2 10 for z/OS include:

  • Node-level XML updates: While DB2 9 for z/OS only supported full-document replacement, DB2 10 for z/OS allows you to insert, delete, or modify individual XML elements or attributes within an XML document. This is based on the XQuery Update Facility and is similar to the XML update support in DB2 LUW.
  • Stored procedures and user-defined functions (UDFs) can now have parameters and variables of data type XML. This allows for more flexible and more powerful XML application development.
  • Binary XML for data transmission between database server and client applications. This new binary encoding reduces the size of XML on the wire and the associated network latency. As a result, XML insert and retrieval operations are faster.
  • Concurrency control for XML is now based on a new multi-versioning approach, which allows for higher concurrency and performance. It avoids XML locking for readers. (Note: Multi-versioning does require new function mode.)
  • Increased support for native XML date and time data types.
  • One or multiple XML Schemas can be assigned to an XML column, as a so-called XML column type modifier. As a result, schema validation happens automatically for insert, update, and load operations. This XML column type modifier can be altered and supports schema evolution.
  • XML Schema validation can be offloaded 100% to zIIP and zAAP processors, which reduces CPU cost. This enhancement has also been added to DB2 9 for z/OS (PK90032, PK90040).
  • Enhanced XML index support: XML indexes can now be used to for case-insensitive search (if desired) and to check for the existence or absence of an element or attributes regardless of its value. This has been added to DB2 9 for z/OS too (PK80732, PK80735).
  • Support for XPath functions fn:matches, fn:replace, and fn:tokenize, which allow for powerful string manipulation.

    And there are even more XML enhancements in DB2 10 for z/OS that I haven’t listed above, such as enhancements in utilities like CHECK DATA, LOAD, and UNLOAD. Some of the items on the list above deserve a more detailed discussion and some examples, and I’ll try to get to that in subsequent blog posts. Let me know if you have any specific questions on any of these features.

    Next week in Vegas…

    October 21, 2010

    What happens in Vegas, stays in Vegas – or so they say. But, that’s not going to be quite true for IBM’s Information on Demand Conference on 24-28 October at the Mandalay Bay Hotel and Resort.

    The IOD conference is packed with a broad range of technical sessions, business sessions, hands-on labs, and lots of networking opportunities. More than 160 customer speakers will present their first-hand experiences with IBM Software. For example, one customer will talk about their migration of an XML database application from Oracle to DB2 9.7 (session TLU-2928A). And then there is the Expo, where an impressive set of business partners shows their solution and services.

    DB2 pureXML is going to be well represented at the conference, most notably with a set of hands-on labs for both DB2 on Linux, Unix, Windows and DB2 for z/OS. Some of the labs you can choose from include:

    HOL 1468 – PureXML for DBAs: Setup and Storage for OLTP and Warehouse
    HOL 1480 – Creating XML Solutions with DB2 pureXML
    HOL 1152 – Learning pureXML in DB2 for z/OS with IBM Optim Development Studio
    HOL 1170 – Learning SQL/XML with CLP and SPUFI
    HOL1391  – New Features and Enhancements in IBM DB2 10 for z/OS
    HOL1409 – Getting Started with IBM DB2 for z/OS pureXML for DBAs and Developers

    Whether you attend presentations, roam around in the Expo, or gain practical skills in the hands-on labs, I’m sure you’ll take a home a wealth of valuable information. So, a lot of what happens in Vegas next week is not going to stay in Vegas…

    You’ll most likely be able to meet some of your very favorite XML people at Information on Demand Conference on 24-28 October in Las Vegas, e.g.,:
    Lee Ackerman, Jane Man, Matthias Nicola, Bryan Patterson, Guogen Zhang
    in meet the expert sessions, in presentations (there is a session on migrating an Oracle XML application to DB2) or in Hands on Labs such as:
    HOL 1468 – PureXML for Database Professionals: Setup and Storage for OLTP and Warehouse
    HOL 1480 – Creating XML Solutions with DB2 pureXML
    HOL1391  - New Features and Enhancements in IBM DB2 10 for z/OS
    HOL1409 – Getting Started with IBM DB2 for z/OS pureXML for DBAs and Developers
    HOL 1391 – Learning Exciting new Features and Enhancements in DB2 10 for z/OS
    HOL 1152 – Learning pureXML in DB2 for z/OS with IBM Optim Development Studio
    HOL 1170 – Learning SQL/XML with CLP and SPUFI
    The Hands on Lab (HOL) numbers may change.
    You can register for IoD here http://www-01.ibm.com/software/data/2010-conference/

    On Tuesday, October 12, at 12noon US Eastern Time (9am US Pacific Time) there is an interesting webcast and discussion on using Rational Software Architect (RSA) and Rational Application Developer (RDA). The speakers are Lee Ackerman and Jean-Louis Marechaux, two high caliber experts in IBM’s Information Management and Rational division, respectively.

    With the increased prevalence of XML in industry solutions, SOA and Web 2.0, this webcast will look at how you can use Rational Software Architect/Rational Application Developer along with DB2 pureXML to work with and manage XML data. The session will briefly introduce the pureXML capabilities of DB2 and discuss how you can generally work with XML in software solutions. The webcast will then explain how Rational Software Architect/Rational Application Developer can help you model, generate, query and code XML-based software solutions. For example, you will learn about the XML-related editors and wizards in RSA and RDA, among many other things. RSA and RDA offer support for the manipulation of XML documents, XML Schemas, DTDs, XSLT, XQuery, SQL/XML, web services and WSDL, SOAP, and other SOA technologies.

    To attend the webcast, please call into: 1-877-848-7046 or 1-636-651-0036 with passcode 4258904. Also, please  contact Susan Malaika to be gain access to the web chat where you can ask questions during the session and download the slides.

    More information on Rational Software Architect and Rational Application Developer can be found here:

    Please call into: 1-877-848-7046 or 636-651-0036 – passcode: 4258904
    And please also join this Web chat http://webconf.soaphub.org/conf/room/xmldevotee When you enter the Web chat, you can use the settings option to change your name from anonymous.
    The charts will be made available in the Web chat room.

    Prior to  the emergence of native XML database technology, many XML applications have stored their XML documents in files systems, in LOB columns (large objects) in a database, or they have shredded the XML to relational tables.

    A common question is how to migrate XML documents from a LOB column to an XML column. The benefit of an XML column is that the XML is stored natively, i.e. in a parsed format, so it becomes possible t0 index, query, and update the XML efficiently. Migrating XML from LOB to XML columns is easy. Let’s take a look at how you can do this in DB2 for Linux, UNIX, and Windows:

    There are several different options, at least three:

    1. You can export the old table (with the LOB column) and then load into a new table where the LOB column has been replaced by an XML column. You can then drop the old table and rename the new table to the old table name.
    2. If you don’t want to export the table to the file system, you can use “load from cursor” to move the data directly from the old to the new table.
    3. You can add an XML column to the original table, copy the data from the LOB column to the XML column, and then drop the LOB column.

    To illustrate these options, let’s first create a table with a LOB column and insert a couple of XML documents:

    CREATE TABLE oldtable(id INT, doc CLOB);

    INSERT INTO oldtable VALUES(1, ‘<url>http://nativexmldatabase.com/</url>’);
    INSERT INTO oldtable VALUES(2, ‘<url>http://tinyurl.com/pureXML</url>’);

    Now we can try each of the three options.

    1. Export + Load

    The following commands export the “oldtable”, create a new table, and load the exported data into the new table:

    EXPORT TO old.del OF DEL MODIFIED BY LOBSINFILE SELECT * FROM oldtable;
    CREATE TABLE newtable1(id INT, doc XML);
    LOAD FROM old.del OF DEL INSERT INTO newtable1(id,doc);

    Yes, it’s really that simple.

    Be aware that CLOB or BLOB columns are very forgiving, i.e. they allow you to insert pretty much anything and they don’t check whether any inserted XML is well-formed or not. If your LOB column contains any XML documents that are not well-formed then those document will be rejected when you try to load them into the XML column. The LOAD command gives you a summary of how many rows (documents) were successfully loaded and how many were rejected. In our example the load summary looks like this:

    Number of rows read         = 2
    Number of rows skipped      = 0
    Number of rows loaded       = 2
    Number of rows rejected     = 0
    Number of rows deleted      = 0
    Number of rows committed    = 2

    After your new table is loaded, you could drop the old table and rename the new one (subject to certain restrictions on the rename command):

    DROP TABLE oldtable;
    RENAME TABLE newtable TO oldtable;

    2. Load from cursor

    Loading from a cursor means that you read data from one table and load it into another table. Instead of exporting the “oldtable” you simply declare a cursor for it, which the LOAD command can use as input. In fact, the cursor declaration can contain any SQL query, and the result of that query then defines the input to the LOAD command:

    CREATE TABLE newtable2(id INT, doc XML);
    DECLARE cur1 CURSOR FOR SELECT id, doc FROM oldtable;
    LOAD FROM cur1 OF CURSOR INSERT INTO newtable2(id,doc);

    You could potentially also use an SQL insert statement such as “INSERT INTO newtable2 SELECT * FROM oldtable” to move data from one table to another. However, this requires all the data to be logged and you may run into a log full error. Also, the LOAD utility can write data to a table faster than an insert statement can.

    Based on available system resources the LOAD utility automatically chooses an appropriate degree of parallelism and other tuning options. You can sometimes improve LOAD performance (and the degree of LOAD parallelism) if you increase DB2′s utility heap (util_heap_sz). For very large LOAD operations you may also want to review the general tuning guidelines for LOAD utility.

    3. Add XML column, drop LOB column

    You can use the following commands to add an XML column to the original table, copy the data from the LOB column to the XML column, and then drop the LOB column:

    ALTER TABLE oldtable ADD COLUMN xmldoc XML;
    UPDATE oldtable SET xmldoc = doc;
    ALTER TABLE oldtable DROP COLUMN doc;

    The potential problem here is that the UPDATE statement is a large transaction that writes a lot of data that needs to be logged. You might run out of log space.  To avoid this problem, you could:
    (a) use batch-wise updates, i.e. update 1000 rows, then commit, then update the next 1000 rows, and so on.
    (b) use ” alter table oldtable activate not logged initially;” to temporarily disable logging for the table. (Note that this statement has effect only for a single transaction, so you must issue it before to the UPDATE statement and in the same transaction. For this to work in the DB2 Command Line Processor, autocommit must be disabled, e.g. “db2 +c -t”)

    General migration considerations

    No matter whether you create and populate tables with XML columns from scratch or migrate XML documents from LOB to XML columns, you always have a set of physical design choices for your XML column and table. These include the page size (large pages tend to be better for XML performance), XML inlining, using a separate table space or buffer pool for the XML data, and so on. These decisions depend on the nature of your applications that use your database. When in doubt, keep it simple. However, it is worthwhile to point out that LOB columns do not use any buffer pool while the XML columns do. Hence, the migration from LOB to XML column means that more data will be buffered than before. Hence, you may want to increase the existing buffer pool for the table, if you can.

    Let me start this topic with a couple of disclaimers. First, using XML as a storage format is not always the right choice – there are still valid cases where it can make sense to convert XML to relational format. Second, there are more than five reasons for storing XML in a database. Here I am just listing five reasons –in no particular order- that seem to be quite common.

    1. When business records are represented as XML when they are processed or transmitted by applications.

    When applications have already decided to use XML as the data format of choice, e.g. in a service-oriented architecture (SOA), and are transmitting and/or processing business data in XML format anyway, then it is often a good idea to also store the business data in the same XML format. If the applications use XML anyway, it is typically beneficial to avoid the overhead of disassembling XML just for storage, and to avoid reconstructing the XML when applications need to read data.

    2. When the data format (schema) changes over time.

    Let’s face it: schema changes in a relational database can be a pain in the neck. Just imagine a 1-to-1 relationship that evolves to a 1-to-many relationship. For example, assume that customer records are no longer restricted to one telephone number but can contain multiple phone numbers. Following traditional relational database design principles, you would now have to normalize your existing tables, i.e. introduce a new table for phone numbers with a 1-to-many join relationship from the customer table. This is a massive change. Any SQL statements that retrieve phone numbers or have search conditions on phone numbers need to be changed. However, if the customer records are stored as XML documents in an XML column, then the table definition does not need to be changed to accommodate the evolution from 1-to-1 to a 1-to-many relationship. An XML element “phone” would simply be allowed to occur multiple times per document. An XML index on these phone elements would also remain unchanged. With XML, the impact of schema evolution is (not zero but) much lower than with a relational database schema. This enables applications to react to changing business requirements faster and at a lower cost.

    3. When the data format (schema) is complex and highly variable.

    In today’s world, business record are often very complex, typically more complex than 50 years ago. Think of financial trades, electronic medical records, insurance contracts and claims, or rich content such as office documents. The complexity and variability and the number of optional data fields in such business records can be staggering. Designing a relational schema for such complex objects can be very challenging and can lead to dozens or even hundreds of tables that are difficult to understand from an application perspective. In such cases, storing each business record as a separate XML document can often simplify both database design and application development.

    4. When you need to optimize object-centric data access.

    Continuing the previous thought, what happens when complex business objects (trades, medical records, insurance claims, etc.) are mapped to and stored in a complex and normalized relational database schema? Storing a single business object may require SQL inserts into many tables, and reading a single business object may require complex multi-way joins and/or a sequence of many SQL queries to read all pieces of an object. In such cases, experience has shown that inserting or retrieving a single XML document is significantly more efficient. For such object-centric data access, XML can outperform a relational solution.

    5. When you need the best of both worlds.

    The world is not black and white. In particular, many application scenarios are not either 100% XML or 100% relational. Often, some data is better represented in one format while other is better represented in another. This makes a hybrid database design very appealing, since it allows you to manage both XML documents and relational data in the same database and even in the same table. You can than query across XML documents and relational data at the same time, even in a single query. For example, you can keep the fixed and structured parts of a record in relational columns and the variable or semi-structured parts of the same record in an XML column.

    Follow

    Get every new post delivered to your Inbox.

    Join 28 other followers