IBM has released a new white paper on use cases and benefits of using XML as a data format in a data warehouse:

XML: Changing the data warehouse
ftp://ftp.software.ibm.com/common/ssi/sa/wh/n/imw14238usen/IMW14238USEN.PDF

The article describes cases where using XML in a warehouse can be beneficial and improve the flexibility and evolution of a data warehouse schema to react to new requirements in a simple and cost-effective manner. Typically this is not a question of either XML or relational, but a question of using the right mix of both.

I think there are at least three use cases for using DB2’s pureXML features in a warehousing environment:

  1. You have or receive XML documents and you want to use them in a warehouse (for reporting queries, etc.) without shredding them to relational tables. You will typically use SQL/XML queries, possibly plugged into reporting tools such as Cognos, or you can use relational views over the XML data.
  2. You can add XML columns to existing relational dimension tables to make these tables more flexible, especially when the number of dimensional attributes keeps increasing as a company keeps adding new products and services over time. The above-mentioned white paper elaborates on this use case.
  3. In some cases you may prefer to shred incoming XML documents partially or entirely to a relational warehouse schema. In that case you will use the pureXML features too, such as XMLTABLE functions to accomplish the shredding or even simple ETL-ish tasks. However, not all types of XML lend themselves to shredding, in which case you’re back to point 1. above.

If you want to dive deeper into the technical features in DB2 that support XML in a data warehouse, such as partitioning, clustering, and compression, you may want to look at the following article:

Enhance business insight and scalability of XML data with new DB2 9.7 pureXML features
http://www.ibm.com/developerworks/data/library/techarticle/dm-0904db297purexml/

A common question related to XML in a data warehouse is how to use reporting tools with XML data, or with a mix of XML and relational data. Here is an article that describes how to do this with Cognos on top of DB2 pureXML:

Create business reports for XML data with Cognos 8 BI and DB2 pureXML http://www.ibm.com/developerworks/data/library/techarticle/dm-0811saracco/index.html



And to round off the data warehouse topic, there is a two-part article on IBM InfoSphere DataStage and DB2 pureXML:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0908datastagepurexml1/
http://www.ibm.com/developerworks/data/library/techarticle/dm-0909datastagepurexml2/

tokenize it !

January 25, 2010

In my previous post I wrote about XPath and XQuery functions in DB2 vs. SQL Server 2008. One particularly powerful function is tokenize. The tokenize function takes a string as input and splits it into several substrings based on a repeating delimiter. The delimiter can be a single character or a pattern defined by a regular expression. The tokenize function is available in DB2 for z/OS (here) and DB2 for Linux, UNIX, and Windows (here).   

Here is a very simple example where the string “1-3-26-43-74” is split along the character “-“. This eample is executed in the Command Line Processor of DB2 for Linux, UNIX, and Windows:   


db2 => xquery tokenize("1-3-26-43-74", "-");

1
3
26
43
74
 
  5 record(s) selected.
 
db2 =>

You might sometimes see that XPath function names are prefixed with “fn:”, which is the namespace prefix for functions. However, the namespace for XPath functions is implicitly predeclared so that you can omit the prefix “fn:”. That is, you can simply use the function name tokenize instead of fn:tokenize, as I did above. 

Here is another example where the tokenize function comes in handy.  In this example an XML document describes (some of) the environment variables of a user account on a Linux machine. Some enviroment variables, such as PATH, CLASSPATH, etc. contain lists of paths delimited by the colon (:) sign.   


create table myenv(doc XML);

insert into myenv values('
<ENVIRONMENT os="Linux">
   <HOSTNAME>hermes</HOSTNAME>
   <PATH>/home/mnicola/bin:/usr/bin:/usr/local/bin:/opt/newapp/bin:
         /home/mnicola/sqllib/bin:/home/mnicola/sqllib/adm:
         /home/mnicola/sqllib/misc:/home/mnicola/sqllib/int</PATH>
   <MANPATH>/usr/local/man:/usr/share/man:/usr/X11R6/man</MANPATH>
   <CLASSPATH>.:/home/mnicola/sqllib/java/db2java.zip:/home/mnicola/sqllib/java/db2jcc.jar:
              /home/mnicola/sqllib/java/sqlj.zip:/home/mnicola/sqllib/java/runtime.zip:/test</CLASSPATH>
</ENVIRONMENT>');
 

Then you can use the following simple query in DB2 to read and tokenize the CLASSPATH and return each path on a separate row:

xquery
db2-fn:xmlcolumn("MYENV.DOC")/ENVIRONMENT[HOSTNAME="hermes"]/tokenize(CLASSPATH, ":");
 
--------------------------------
.
/home/mnicola/sqllib/java/db2java.zip
/home/mnicola/sqllib/java/db2jcc.jar
/home/mnicola/sqllib/java/sqlj.zip
/home/mnicola/sqllib/java/runtime.zip
/test
 
  6 record(s) selected.
 

To be precise, the tokenize function returns a sequence of strings. On a sequence you can apply positional predicates to pick selected items from the sequence. The following query uses the tokenize function with positional predicates [1], [2], and [3] to obtain the first, second, and third entry from the MANPATH enviroment variable: 


SELECT XMLQUERY('$d/ENVIRONMENT/tokenize(MANPATH, ":")[1] '
                 passing doc as "d") as path1,
       XMLQUERY('$d/ENVIRONMENT/tokenize(MANPATH, ":")[2] '
                 passing doc as "d") as path2,
       XMLQUERY('$d/ENVIRONMENT/tokenize(MANPATH, ":")[3] '
                 passing doc as "d") as path3
FROM myenv
WHERE XMLEXISTS('$d/ENVIRONMENT[HOSTNAME="hermes"]' PASSING doc as "d");
 

PATH1                PATH2                PATH3
-------------------- -------------------- --------------------
/usr/local/man       /usr/share/man       /usr/X11R6/man
 
 1 record(s) selected.
 

In the next example the XPath function count is wrapped around tokenize to obtain the number of entries of the CLASSPATH.


SELECT XMLQUERY('$d/ENVIRONMENT/count(tokenize(CLASSPATH, ":")  )' 
                 passing doc as "d")  as length_classpath
FROM myenv
WHERE XMLEXISTS('$d/ENVIRONMENT[HOSTNAME="hermes"]' passing doc as "d");

LENGTH_CLASSPATH 
-----------------------------
6 

 1 record(s) selected.
 

I’m sure you will find many more use cases where the tokenize function is helpful!

XQuery is a feature-rich language that allows you to code sophisticated queries against XML data. As you probably know, the XPath language is a subset of XQuery. XPath and XQuery share a common set of built-in functions that are defined by the W3C specification titled “XQuery 1.0 and XPath 2.0 Functions and Operators“. This spec describes dozens and dozens of functions that deal with strings, numbers, dates, timestamps, XML nodes and sequences, and so on.

The DB2 support for the XPath and XQuery functions is documented in the DB2 Information Center, and you might find corresponding information for other database systems on their respective web sites. However, other databases might not support as many of the XQuery functions as DB2. For example, DB2 9.7 for Linux, UNIX, and Windows supports 21 XQuery string functions while SQL Server 2008 seems to have just six. DB2 also supports over 30 XQuery date and time functions, SQL Server seems to support none of those.

Admittedly, some functions are more commonly used then others. XQuery functions such as “substring” or “count” are used quite often. In contrast, the XQuery functions “reverse” and “string-to-codepoints” tend to be used much less frequently.

But then there are some functions that you might not use every day, but when you do need them you really don’t want to miss them. For example, sometimes you may have to
• search for strings that start with a given prefix (starts-with)
• determine whether a string matches a given pattern (matches)
• remove unnecessary spaces from a string (normalize-space)
• compute the absolute value of a number (abs)
• convert a timestamp to a different time zone (adjust-dateTime-to-timezone)
• decide whether two XML documents are identical in all their structure and values (deep-equal)

These and other tasks are common enough. Yes, you can write application code to perform such tasks, but that is much less convenient and less efficient than using a built-in function that’s readily available. Being aware of the available built-in functions can make your life much easier when you develop XML applications. The functions listed above are available in DB2 pureXML, but these and many others don’t seem to be supported in SQL Server 2008.

Some time ago I received an interesting question about how to code a specific XPath expression. I think the question and the answer are worthwhile sharing here. 

The question: 

Consider the following XML document. The document describes an article that has a title, a publication date, and several sections. Each section has a body, and some sections also contain a reference to a figure. Admittedly, I contrived this document to be able to explain this issue in a simple way. 

 <article id="62083">
  <title>A short story about apple pie</title>
  <pubdate>2009-08-21T09:30:00</pubdate>
  <section>
      <body>Once upon a time...</body>
  </section>
  <section>
      <body>...</body>
      <figure>drawing15.jpg</figure>
  </section>
  <section>
      <body>...</body>
      <figure>screenshot85.tif</figure>
  </section>
  <section>
      <body>...</body>
  </section>
</article>

Now the question is: how can I retrieve the first or the last <figure> from the article? The answer is slightly more tricky than what you might think! You probably know that XPath allows for positional predicates that choose elements based on the order in which they appear. For example, if the variable $doc holds the document shown above, then the following XPath expressions retrieve the first and the last <section> of the article, respectively:

$doc/article/section[1]
$doc/article/section[last()]

But, the question was how to retrieve the first or last <figure>. Intuitively, one might try the following XPath expression to obtain the first <figure> element:

$doc/article/section/figure[1]

However, this XPath returns both <figure> elements from the document:    

    <figure>drawing15.jpg</figure>
    <figure>screenshot85.tif</figure> 

This result may seem unexpected, but it is perfectly correct. The XPath expression has asked for the first figure within each section, and that’s exactly what we got: the first figure from each section. It’s just not what we wanted.   

The solution:

There are several ways to retrieve the first <figure> of the entire document rather than the first <figure> of each section. One approach is to use parentheses in order to apply the positional predicate [1] to the sequence of all <figure> elements, rather than to the figures within each section:

($doc/article/section/figure)[1]

Another approach is to navigate to the first section that actually contains a <figure>, and then to return the first <figure> element from that section:

$doc/article/section[figure][1]/figure[1]

Note that the XPath step section[figure][1] uses two predicates. The predicate [figure] selects all sections that contain a figure, and the predicate [1] selects the first of those sections. Similarly, you can use either one of the following two XPath expressions to retrieve the last <figure> element from our sample document:

($doc/article/section/figure)[last()]
$doc/article/section[figure][last()]/figure[last()]

With the attached script you can try it out yourself – in DB2 for Linux, UNIX, and Windows or DB2 for z/OS. (Click on the “source” link below.)

 

create table mytest(mydoc XML);
 

insert into mytest(mydoc) values('<article id="62083">
 <title>A short story about apple pie</title>
 <pubdate>2009-08-21T09:30:00</pubdate>
 <section>
  <body>Once upon a time...</body>
 </section>
 <section>
  <body>...</body>
  <figure>drawing15.jpg</figure>
 </section>
 <section>
  <body>...</body>
  <figure>screenshot85.tif</figure>
 </section>
 <section>
  <body>...</body>
 </section>
</article>');
 

-- SQL/XML (DB2 for z/OS and DB2 for Linux, UNIX, and Windows):

SELECT XMLQUERY('$doc/article/section/figure[1]'
                 PASSING mydoc as "doc")
FROM mytest;
 

SELECT XMLQUERY('$doc/article/section/figure[last()]' 
                 PASSING mydoc as "doc")
FROM mytest;
 

SELECT XMLQUERY(' ($doc/article/section/figure)[1]'
                 PASSING mydoc as "doc")
FROM mytest;
 

SELECT XMLQUERY(' ($doc/article/section/figure)[last()]'
                 PASSING mydoc as "doc")
FROM mytest;
 

SELECT XMLQUERY('$doc/article/section[figure][1]/figure[1]'
                 PASSING mydoc as "doc")
FROM mytest;
 

-- XQuery (DB2 for Linux, UNIX, and Windows only):

xquery db2-fn:xmlcolumn("MYTEST.MYDOC")/article/section/figure[1];
xquery db2-fn:xmlcolumn("MYTEST.MYDOC")/article/section/figure[last()];
xquery (db2-fn:xmlcolumn("MYTEST.MYDOC")/article/section/figure)[1];
xquery (db2-fn:xmlcolumn("MYTEST.MYDOC")/article/section/figure)[last()];
xquery db2-fn:xmlcolumn("MYTEST.MYDOC")/article/section[figure][1]/figure[1];
xquery db2-fn:xmlcolumn("MYTEST.MYDOC")/article/section[figure][last()]/figure[last()];

In my previous post I have started discussing the XML features in mySQL and I have pointed out some of the key differences to DB2 pureXML. To complete this topic, let’s look at shredding and constructing XML data in mySQL.

Shredding with the LOAD_XML statement

The LOAD_XML command in mySQL enables you to convert XML to relational rows and to insert these rows into a table (shredding). The key restriction is that the input XML format has to be flat and regular so that the elements trivially map to a relational target table. The column names in the target table must match the tag names in the XML data. For example, if you want to shred name information to relational columns that are called “firstname” and “lastname”, then the XML input must have one of the following three forms:

–option 1:

<row firstname="John" lastname="Doe"/>

–option 2:

<row>
     <firstname>John</firstname>
     <lastname>Doe</lastname>
</row>

–option 3:

<row>
     <field name='firstname'>John</field>
     <field name='lastname'>Doe</field>
</row>

My experience is that in those XML applications where shredding is feasible, the tag names in the incoming XML data are typically not identical to the column names of the target tables. Hence, the LOAD_XML command seems too restrictive for many use cases. The shredding features in DB2 (described in the sample chapter of the DB2 pureXML Cookbook) don’t have any such restrictions and allow arbitrary mappings from XML to relational. Of course, in DB2 you may also prefer the use of XML columns instead of shredding.

Exporting relational data to a fixed XML format

The mysql and mysqldump clients support the command line option

--xml

With this option, any relational result set is tagged up in a default XML format. One <row> element is produced for each row of data, and one <field> element for each column in the result set. This is a fixed XML format and I didn’t find a way to customize it.

Constructing custom XML from relational tables

If you need to produce customized XML output with your own choice of tag names and nesting, you would normally use the XML construction functions that are part of the standard SQL language. These functions are available in DB2, but not in mySQL.

In mySQL you can try text concatenation instead, like this:

SELECT CONCAT('\n<myorder id="15">\n',
              GROUP_CONCAT('  <item>', iname, '</item>\n' SEPARATOR ''),
               '</myorder>') AS xmldoc
FROM order
WHERE oid = 15

–Result:

<myorder id="15">
  <item>Laptop</item>
  <item>Bag</item>
</myorder>

The function GROUP_CONCAT aggregates text from multiple rows, as a workaround for the lack of a proper XMLAGG function, which is defined in the SQL standard. The problem with text concatenation is that mySQL does not check whether the constructed XML is well-formed at all.

As an alternative you can use the 3rd-party library lib_mysqludf_xql. This library offers XML construction functions such as xql_element, xql_forest, xql_agg. These functions intend to mimick XML functions in the SQL language standard, such as XMLELEMENT, XMLFOREST, or XMLAGG. However, a closer look at the xql library reveals that the syntax and arguments of the xql functions differ so much from the SQL standard, that I would consider them proprietary and not open.

Neither the text concatenation functions nor the xql library enable you to explicitly construct namespace declarations and guarantee the correct usage of namespaces within constructed XML documents. In DB2, the SQL standard function XMLNAMESPACES is available to ensure correct construction of XML data with namespaces.

Conclusion

The mySQL community has begun only recently to add XML features to mySQL. Therefore the current XML support in mySQL is, in my personal opinion, still quite limited. Although this might change in the future, you may prefer to use DB2 for your XML applications. DB2 Express-C is freely available, just like mySQL. The rich XML features in DB2 enable you to avoid costly XML manipulation in application code, which leads to better performance, less code, and easier application maintenance over time.