You might have noticed that DB2 10 for Linux, UNIX, and Windows was announced earlier this month, and it contains a bunch of great new features. One of the new features that I am particularly excited about is Time Travel, which enables bitemporal data managenment in DB2.

DB2 10 also includes a variety of useful XML enhancements. Let’s take a quick look at the most important ones here:

New XML Index Data Types: Integer, Decimal

As you know, when you create an index on an XML element or attribute you must specify the data type that DB2 should use to interpret and index the specifies nodes. The reason is that the usage of XML Schema, which might define data types, is optional. And even if an XML Schema was used, it might not define a data type for each and every element and attribute.

Before DB2 10, an XML index for numeric values is specified with the data type DOUBLE, so that any type of numeric values can be indexed. In DB2 10 you can also specify the data types INTEGER and DECIMAL to define a numeric XML index.

If you know that a certain element or attribute can only contain INTEGER (or DECIMAL) values, then you can improve query performance by creating the XML index with the most appropriate type. For example, assume that dept is a table that has an XML column deptdoc. Then you can define the following indexes in DB2 10:

create unique index idx1 on dept(deptdoc) generate key using
xmlpattern ‘/dept/employee/@id’ as sql integer;

create index idx2 on dept(deptdoc) generate key using
xmlpattern ‘/dept/employee/salary’ as sql decimal(6,2);

You don’t need to change existing queries to use these new indexes!

XML Indexes for Case-Insensitive Search

DB2 10 allows you to define case-insensitive indexes and search very efficiently for string values regardless of their upper case or lower case spelling. For example, to enable case-insensitive search of employee names, you could create the following index:

create unique index idx1 on dept(deptdoc) generate key using
xmlpattern ‘/dept/employee/name/fn:upper-case(.)‘ as sql varchar(40);

This index stores all key values in upper-case. To find employees with names such as Peter Pan, PETER PAN,  Peter PAN, pEtEr pAn, etc. you could write the following query that can use the index above:

SELECT *
FROM dept
WHERE XMLEXISTS(‘$DEPTDOC/dept/employee[name/fn:upper-case(.) = "PETER PAN"]‘);

Index support for Prefix Search

Previously, the following query with the fn:starts-with() function could not use an XML index to find all employees whose name starts with “Jo”:

SELECT *
FROM dept
WHERE XMLEXISTS(‘$DEPTDOC/dept/employee[fn:starts-with(name, "Jo")]‘);

In DB2 10, this query can benefit from existing XML indexes of type VARCHAR(n) that may exist on the <name> element.

XML Indexes for Existential (Structural) Predicates

An existential predicate, sometimes called structural predicate, is a predicate that checks the existence of an XML element or attribute in an XML document, regardless of its value.

For example, imagine you want to find all documents that contain a <comment> element for an employee. One of the following queries comes to mind:

SELECT *
FROM dept
WHERE XMLEXISTS(‘$DEPTDOC/dept/employee[comment]‘);

SELECT *
FROM dept
WHERE XMLEXISTS(‘$DEPTDOC/dept/employee[fn:exists(comment)]‘);

To support such queries, DB2 10 allows you to create an XML index with the fn:exists function, like this:

create unique index idx1 on dept(deptdoc) generate key using
xmlpattern ‘/dept/employee/fn:exists(comment)‘ as sql varchar(1);

This index must be defined as VARCHAR(1) and each index key represents true or false to indicate the existence of the <comment> element in a given document.

Friendly XML query behavior

DB2 10 also includes several enhancements that make XML queries behave in a more friendly manner. Let’s look at just one of several cases. For example, consider the following query that has a numeric predicate on the @id attribute:

select *
from dept
where xmlexists(‘$DEPTDOC/dept[@id = 15]‘);

What happens if this query encounters an XML document where the @id attribute has a non-numeric string value such as “WW Sales”?  In DB2 9.x this situation would cause a type error at run-time and the query would fail and abort with an error message. Although this behavior was compliant with the language standard, it is not very useful. If you look for the department with id = 15, any department with non-numeric id can never be a match. Therefore, DB2 10 simply evaluates the predicate to FALSE in this case, i.e. ignores the document with the non-numeric id, and continues processing the query.

Binary XML client/server communication

With DB2 10, Java applications can choose the format in which XML is transmitted between the application and the DB2 server. You can choose between traditional textual XML and a new binary XML format that is a compressed on-the-wire format for XML.

On the application side, the encoding and decoding of XML to and from the binary format is performed by the JDBC driver (JCC 4.9 or higher). The JDBC driver can convert the binary XML format to and from a text,  DOM, SAX, or StAX representation. The use of binary XML provides the most notable performance gain for Java applications that already consume or produce XML in SAX format.

Applications can exchange XML with the DB2 server (through inserts, queries, etc.) and completely avoid the serialization of XML to text, i.e. neither the client nor the server needs to deal with textual XML for the purpose of insert or retrieving XML. Eliminating the parsing and serialization of textual XML can enable significantly better end-to-end performance.

Other IBM products that support this binary XML communication format include DB2 10 for z/OS, the WebSphere Application Server XML Feature Pack, and the WebSphere Datapower appliance.

Query Performance Enhancements

DB2 10 also includes a number of performance improvements for common types of XML queries, including many types of queries that use the XMLTABLE function.

What is an XML Schema? Some of you may already know this, others don’t. So before I’m going to share some more technical information about XML Schemas in subsequent blog posts, I better get some of the basics out of the way first.

When you process and manage information in XML format, you can choose to use an XML Schema with your XML documents. Roughly speaking, an XML Schema can be used to define what you want your XML documents to look like. For example, in an XML Schema you can define:

  • Which elements and attributes are allowed to occur in your XML documents
  • How the elements can be or must be nested, or the order in which the elements must appear
  • Which elements or attributes are mandatory vs. optional
  • The number of times a given element can be repeated within a document (e.g. to allow for multiple phone numbers per customer, multiple items per order, etc.)
  • The data types of the element and attribute values, such as xs:integer, xs:decimal, xs:string, etc.
  • The namespaces that the elements belong to
  • …and so on.

If you choose to create an XML Schema, it may define just some or all of the aspects listed above. The designer of the XML Schema can choose the degree to which the schema constraints the characteristics of the XML documents. For example, an XML Schema can be very loose and define only a few key features for your XML documents and allow for a lot of flexibility. Or it can be very strict to tightly control the XML data in every aspect. Or anything in between.

The use of an XML Schema is optional, i.e. an XML Schema is not required to store, index, query, or update XML documents. However, an XML Schema can be very useful to ensure that the XML documents that you receive or  produce are compliant with certain structural rules that allow applications to process the XML. In other words, XML Schemas help you to enforce data quality.

Validation

If an document complies with a given XML Schema, then the document is said to be valid for this schema. A document might be valid for one schema but invalid for another schema. The process of testing an XML document for compliance with an XML Schema is called validation.

When an XML document is parsed by an XML parser, validation can be enabled as an optional part of the parsing process. Full validation of an XML document always requires XML parsing. For many documents and schemas, validation typically incurs only a small delta cost (in terms of CPU usage) on top of the cost of XML parsing.

What does an an XML Schema look like?

An XML Schema itself is an XML document! But, a very special document that needs to comply with very specific rules that are defined by -you guessed it!- another XML Schema, i.e. the schema for schemas.

Large XML schemas can consist of multiple schema documents that reference each other through import and include relationships. This allows you to compose an XML Schema out of smaller building blocks in a modular fashion.

I don’t want to go into the syntax details of XML Schemas here, but there are some useful resources available:

When and why should I use an XML Schema?

Simply put, if you want to ensure data quality and detect XML documents that do not comply with an expected format, use an XML Schema and validate each document!

However, what if XML documents pass through multiple components of your IT infrastructure, such as a message queue, an application server, an enterprise service bus, and the database system? If these components do not modify the XML but merely read & route it, examine whether all of these components need to validate each document. For example, if the application server has already validated a document before insertion into a DB2 database, does the document need to be validated again in DB2? Maybe not, if you trust the application layer. Maybe yes, if you don’t.

An XML Schema is also often used as a “contract” between two or more parties that exchange XML documents. With this contract the parties agree on a specific format and structure of the XML messages that they send and receive, to ensure seamless operation.

Practically every vertical industry has defined XML Schemas to standardize XML message formats for the data processing in their industry. A good overview is given by the introduction of this article:

“Getting started with Industry Formats and Services with pureXML”: http://www.ibm.com/developerworks/data/library/techarticle/dm-0705malaika/

How can I validate XML documents in DB2?

Simple. First, you register one or multiple XML Schemas in the DB2 database. This can be done with CLP commands, stored procedures, or through API calls in the JDBC or .NET interface to DB2. After a schema is registered in DB2, you can use it to validate XML documents in DB2, typically when you insert, load, or update XML documents. You can enforce a single XML Schema for all XML documents in an XML column, or you can allow multiple XML Schemas per column. A database administrator can force automatic validation upon document insert, or allow applications to choose one of the previously registered schema for validation whenever a document inserted.

And… validation can also be done in SQL statements?

Yup. The SQL standard defines a function called XMLVALIDATE, which can be used for document validation in INSERT statement, UPDATE statements, triggers, stored procedures, and even in queries.

Here is a simple example of an INSERT statement that adds a row to a customer customer table, which consists of an integer ID column and an XML column called “doc”:

INSERT INTO customer(id, doc)
VALUES (?, XMLVALIDATE( ? ACCORDING TO XMLSCHEMA ID db2admin.custxsd) );

The id and the document are provided by parameter markers “?”, and the XMLVALIDATE function that is wrapped around the second parameter ensures validation against the XML Schema that has been regoistered under the identifier db2admin.custxsd.

If the inserted document is not compliant with the XML Schema, the INSERT statement fails with an appropriate error message. Similarly, the XMLVALIDATE function can also be used in the right-hand side of the SET clause of an UPDATE statement that modifies or replaces an XML document.

Ok, so much for now. In my next blog post we’ll go into more detail.

A common question is how to obtain a list of all the elements and attributes that occur in an XML document. Producing such a list is what I call “XML profiling” and in a previous blog post I have discussed several SQL/XML queries that can do this.

An extension of this question is how to get the paths of all the elements and attributes in a document. This seemingly simple task is -unfortunately- not nearly as simple as one would think! XPath and XQuery do not have a function that takes a given element or attribute as input and returns the full path to that node.

The solution is to write a query that traverses the XML document level by level to collect the element names at every level and concatenate them appropriately to construct the paths for every elements and attributes at every level.

There are many ways in which this can be done. You can use XQuery or SQL/XML and you can choose whether to use recursion or not. Let’s look at a few examples.

First, let’s create a simple table with a small document that we can use in the examples:

create table mytable(xmldoc XML);

insert into mytable values(
'<Message>
   <Type>Urgent</Type>
   <Person id ="123">
     <FirstName>Robert</FirstName>
     <LastName>Tester</LastName>
   </Person>
 </Message>');

A first and straightforward solution is to start at the root of the document, then at the first level of child nodes, and then at the children of each these child nodes, and so on. For each element or attribute we construct the path by concatenating the path from the parent with the name of the element or attribute. We do this for all nodes at a given level in the tree and then move to the next level of the document.:

xquery
for $L1 in db2-fn:xmlcolumn("MYTABLE.XMLDOC")/*
let $L1path := fn:string-join( ($L1/local-name() ),"/" )
return (
  $L1path,
  for $L2 in $L1/(*,@*)
  let $L2path := fn:string-join( ($L1path, $L2/local-name() ),"/" )
  return (
    $L2path,
    for $L3 in $L2/(*,@*)
    let $L3path := fn:string-join( ($L2path, $L3/local-name() ),"/" )
    return (

    $L3path,
    for $L4 in $L3/(*,@*)
    let $L4path := fn:string-join( ($L3path, $L4/local-name() ),"/" )
    return (

    $L4path,
    for $L5 in $L4/(*,@*)
    let $L5path := fn:string-join( ($L4path, $L5/local-name() ),"/" )
    return ($L5path)))));

Message
Message/Type
Message/Person
Message/Person/id
Message/Person/FirstName
Message/Person/LastName

6 record(s) selected.

The obvious shortcoming of this query is that it assumes a maximum of 5 levels in the document. If your documents are deeper than this, you can easily extend the query so that it goes down to 10 or 20 levels, whatever you need. That’s maybe not very elegant, but it works if you can define an upper bound on the depths of your XML documents, which is usually possible.

You probably notice that the path Message/Person/id should actually be Message/Person/@id because “id” is an XML attribute. The query can enhanced to take care of such details. In the last two sample queries of my XML profiling post you have seen how to use the self::attribute() test for this purpose.

If you prefer a more elegant solution that does not require any assumption about the maximum depths of the XML documents, then you need to code a recursive query, either in XQuery or in SQL/XML. Let’s try SQL/XML for a change.

You may already be familiar with how recursive SQL works. If not, you can look at several existing examples. The basic idea is to use a WITH clause, also called “common table expression”, that contains a UNION ALL between the start of the processing and a recursive reference back to the common table expression itself. The following  augments this approach with the XMLTABLE function that extracts nodes and node names from the XML:

WITH pathstable (name, node, xpath) AS (
  SELECT x.name AS name, x.node AS xmlnode,'/' || x.name AS xpath
  FROM mytable,
       XMLTABLE('$XMLDOC/*'
        COLUMNS
          name    varchar(30) PATH './local-name()',
          node    XML         PATH '.') AS x
  UNION ALL
  SELECT y.name AS name, y.node AS xmlnode, xpath|| '/' || y.name AS xpath
  FROM pathstable,
       XMLTABLE('$XMLNODE/(*,@*)' PASSING pathstable.node AS "XMLNODE"
        COLUMNS
         name    varchar(30) PATH 'local-name()',
         node    XML         PATH '.') AS y
) SELECT name, xpath
  FROM pathstable;

NAME                           XPATH
------------------------------ -------------------------------
Message                        /Message
Type                           /Message/Type
Person                         /Message/Person
id                             /Message/Person/id
FirstName                      /Message/Person/FirstName
LastName                       /Message/Person/LastName

6 record(s) selected

If you want to list the element and attribute values for each path, then you can easily modify this query as follows:

WITH pathstable (name, node, xpath, value) AS (
  SELECT x.name AS name, x.node AS xmlnode,
         '/' || x.name AS xpath, x.value as value
  FROM mytable,
       XMLTABLE('$XMLDOC/*'
        COLUMNS
          name    varchar(30) PATH './local-name()',
          value   varchar(20) PATH 'xs:string(.)',
          xmlnode XML         PATH '.') AS x
  UNION ALL
  SELECT y.name AS name, y.node AS xmlnode,
         xpath|| '/' || y.name AS xpath, y.value as value
  FROM pathstable,
       XMLTABLE('$XMLNODE/(*,@*)' PASSING pathstable.node AS "XMLNODE"
        COLUMNS
          name    varchar(30) PATH 'local-name()',
          value   varchar(20) PATH 'xs:string(.)',
          xmlnode XML         PATH '.') AS y
) SELECT xpath, value
  FROM pathstable;

XPATH                           VALUE
------------------------------- --------------------
/Message                        UrgentRobertTester
/Message/Type                   Urgent
/Message/Person                 RobertTester
/Message/Person/id              123
/Message/Person/FirstName       Robert
/Message/Person/LastName        Tester

6 record(s) selected

A few things to note:

  • The value of an XML element is defined at the concatenation of all text nodes in the subtree under that element. This explains the values that you see for /Message and /Message/Person in the example above.
  • For longer paths you may need to increase the length of the VARCHAR(n) in the XMLTABLE function.
  • In DB2 you may receive warning SQL0347W, which says that this query might recursively run into an infinite loop. But, this would only happen if your XML document was infinitely deep, which isn’t possible. So, you can safely ignore that warning.

 


					

Part 2 of our article “Data normalization reconsidered” is now available at

http://www.ibm.com/developerworks/data/library/techarticle/dm-1201normalizationpart2/index.html

The second part discusses alternatives to a traditional normalized relational representation of data. Such alternatives include for example XML, JSON, and RDF because they can often help you overcome normalization issues or improve schema flexibility, or both. In the 21st century, digitized business records are often created in XML to begin with, which makes XML an attractive choice as the database level storage format.

This article also contains a performance comparison between XML and relational data that was conducted for a real-world application scenario at a major international logistics company.

At the end of the article you find comparison tables that summarize the pros and cons of different data representations.

Normalization is a design methodology for relational database schemas and aims to minimize data redundancy and avoid data anomalies, such as update anomalies. The consequence of normalization is that business records (such as a purchase order, an insurance claim, a financial transaction, etc.) are split into pieces that are scattered over potentially many relational tables.

In addition to its benefits, normalization also introduces several drawbacks:

  • The insert of a single logical business record requires the insertion of multiple (often many) physical rows
  • The retrieval of a single business record requires complex multi-way joins or a series of separate queries
  • Business records undergo a potentially expensive conversion from their original representation outside the database to a normalized format and back
  • The normalized representation of business records is often difficult to understand because it is very different from the original format of the business record, such as a paper form or an XML message.

These issues raise the question whether normalization should be applied as categorically as some people believe. Indeed, there are several reasons for reconsidering normalization, such as:

  • Throughout history, humans have always stored their business records “intact”, and it was only the introduction of databases that has “required” normalization
  • Normalization was introduced when storage space was extremely scarce and expensive, which is not (or much less) the case anymore today
  • Today, business records are often much more complex than they used to be in the 1970s when normalization was introduced, and this complexity amplifies the disadvantages of normalization
  • De-normalization is becoming more and more popular, e.g. in star schemas for data warehousing, but also in emerging storage systems such as HBase, Google’s BigTable, etc.

Today, business records are often created and exchanged in a digital format, and this format is often XML. XML is a non-normalized data format that can provide several benefits:

  • A single business record often maps naturally to a single XML document
  • A single business record/XML document can be inserted (and retrieved) in an XML database as a single operation
  • If you store XML as XML (i.e. without conversion to relational), the representation of a business record is the same inside and outside the database, which is tremendously valuable

When business records already exist in XML format outside the database anyway, then it is usually best to also store them as XML and not to convert into a normalized relational schema.

My colleague Susan Malaika and I have collected our thoughts and observations on normalization in a 2-part article titled “Data Normalization Reconsidered“. The first part has recently been published on developerWorks and can be found here:

http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/index.html

The 2nd part will appear soon. Happy reading!

If you think that mainframe computers are old dinosaurs that only run ancient COBOL code – think again! Now mainframes also run XQuery!

While DB2 for Linux, UNIX, and Windows has been supporting XQuery and SQL/XML since Version 9.1 (released in 2006), DB2 9 for z/OS “only” supported XPath and SQL/XML.

I have put the word “only” in quotes because for many applications XPath and SQL/XML are fully sufficient. When you combine XPath expressions with SQL/XML functions such as XMLTABLE plus other SQL language constructs you can write very powerful XML queries and accomplish many of the same things that you can do with XQuery.

DB2 10 for z/OS has added a variety of new XML features such as node-level XML updates, XML-type parameters and variables in stored procedures and user-defined functions, and enhancements for XML Schemas and XML indexes.

With APARs PM47617 and PM47618, DB2 for z/OS now also supports XQuery within the SQL functions XMLTABLE, XMLQUERY, XMLEXISTS, and XMLMODIFY.

So what are the additional capabilities and benefits that XQuery provides? Examples include:

  • You can compose new XML structures using direct element constructors
  • You can use FLWOR expressions (for-let-where-order by-return) to iterate over and manipulate intermediate query results
  • You can join and combine information from multiple XML documents
  • You can use XQuery comma-expressions to construct and use new sequences of XML nodes
  • You can code if-then-else logic to implement conditional expressions
  • etc.

Let’s look at some examples.

Construct new XML structures with direct element and attribute constructors

The following query constructs a new order summary document from each order (XML document) that is selected from the “orders” table. New elements, such as <orderSummary> and <orderedItems> are constructed by providing the start and end tags explicitly. Similarly, the attribute orderNumber is also constructed explicitly. The content of the constructed elements and attributes is computed by XPath (or XQuery) expressions that extract selected information from each source document.

SELECT XMLQUERY('
          <orderSummary orderNumber="{$po/order/orderNo/text()}">
             <orderedItems>{$o/order/items/item/itemName}</orderedItems>
          </orderSummary>'
       PASSING orders.orderdoc AS "po")
FROM orders
WHERE ...

FLWOR expressions

The next query joins the tables “orders” and “items” on their XML columns “orderdoc” and “details”, respectively. The join predicate in the XMLEXISTS ensures that we find the items that match a given order and that we don’t produce a Cartesian product. For each pair of order document and item document, the FLWOR expression in the SELECT clause combines information from both documents into a new documents that contains the item name, the ordered quantity, and the item details.

SELECT XMLQUERY('for $o in $po/orders/items/item
                   for $i in $it/item
                 where $o/itemName = $i/name
                 return <orderdItem>
                           {$i/name}
                           {$o/item/quantity}
                           {$i/details}
                        </orderdItem>'
         PASSING orders.orderdoc AS "po", items.details as "it")
FROM orders, items
WHERE
XMLEXISTS('$po/order/items/item[itemName = $it/item/name]'
           PASSING orders.orderdoc AS "po", items.details as "it")

 

For more information on XQuery in DB2 10 for z/OS, here is a good place to continue reading:

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.xml/src/tpc/db2z_expover.htm

TPoX 2.1 has been released!

November 15, 2011

First, what is TPoX?  I have two answers to that question.

Answer 1:

TPoX, short for “Transaction Processing over XML”, is an XML database benchmark that executes and measures a multi-user XML workload. The workload contains XML queries (70%) as well as XML insert, update, and delete operations (30%). TPoX simulates a simple financial application that issues XQuery or SQL/XML transactions to stress XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other components of an XML database system.

The TPoX package contains:

TPoX has been developed by Intel and IBM, but is freely available and open source since 2007. A variety of TPoX performance results and other usage of TPoX have been reported.

Answer 2:

TPoX is a very flexible and extensible tool for performance testing of relational databases, XML databases, and other systems. For example, if you have a populated relational database you can use the TPoX workload driver to parameterize, execute, and measure plain old SQL transactions with hundreds of simulated database users. I have used TPoX for a lot of relational performance testing, because it’s so easy to setup and measure concurrent workloads. The workload driver reports throughput, min/ax/avg response times, percentiles and confidence intervals for response times, and other useful metrics. Oh, and by the way, TPoX happens to include an XML data generator and a set of sample XML transactions, in case you’re interested in XML database performance.

 

In the latest release, TPoX 2.1, we have further enhanced the extensibility of the TPoX Workload Driver. The XML data and XML transactions are still the same.

Some of the enhancements in TPoX 2.1 include:

  • higher precision in reported response times
  • proper handling and counting of deadlocks, if any
  • easier post-processing of results in Excel or other spreadsheets software
  • new types of workload parameters such as random dates, random timestamps, sequences, etc.
  • in addition to SQL, SQL/XML, and XQuery, transactions can now be also supplied as Java plugins, allowing you to run and measure anything (concurrently!) that you can code in Java, such as:
  • complex transactions that include application logic
  • calls to web services or message queues
  • obtaining data from RSS or ATOM feeds
  • transactions against databases or content repositories that do not have a JDBC interface

We have already found these extensions extremely valuable for some of our own performance testing, and we’re happy to share them. You can download TPoX 2.1 (free, open source) and find more detailed information in the release notes as well as the TPoX documentation that is included in the download.

XML in Las Vegas !

October 18, 2011

The annual Information on Demand conference in Las Vegas (Oct 23-27, 2011) is always a great venue to get updated on the latest topics in data management. The IOD conference offers a broad range of technical sessions, business sessions, hands-on labs, and lots of networking opportunities. More than 200 customer speakers will present their first-hand experiences with IBM Software.

The conference is in the Mandalay Bay conventation center. Various topics around DB2 pureXML are covered in presentations and hands-on labs. Some of them are listed here:

2088A – DB2 pureXML for Beginners
Mon, Oct 24, 2011, 3:45 PM – 5:00 PM

1120B – DB2 pureXML: Develop Your Application Prototype in Minutes, not Days
Tue, Oct 25, 2011, 9:45 AM – 12:45 PM (hands-on lab)

1930A – DB2 for z/OS SOA Solutions Powered by pureXML and DataPower
Wed, Oct 26, 2011, 3:15 PM – 4:15 PM

2691A – Architecting with IBM pureXML and Temporal Data in DB2 10 for z/OS
Thu, Oct 27, 2011, 8:15 AM – 9:30 AM

2087B – Get Connected: Publishing Relational Data as XML Messages
Thu, Oct 27, 2011, 3:30 PM – 4:30 PM

If you’re going to IOD, don’t miss these sessions!

If you look at existing documentation and examples of the XMLTABLE function, you find that most examples use simple XPath expressions inside the XMLTABLE function and rarely more complex XQuery expressions such as FLWOR expressions. This is also true for my 2 articles on XMLTABLE:

XMLTABLE By Example, Part1 and Part2:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0708nicola/
http://www.ibm.com/developerworks/data/library/techarticle/dm-0709nicola/

Why is that? Is it because XQuery expressions other than plain path expressions are never required in the XMLTABLE function? Well, let’s take a look at that question…

First, I do believe that many typical SQL/XML queries can indeed be coded with just regular XPath in the XMLTABLE function. I have seen that this is true in many real application deployments.

But, sometimes it can certainly be useful to embed more complex XQuery expressions than just XPath in an XMLTABLE function. For example, in my blog post “How to get a list of all elements and attributes” you saw that I used the XQuery comma operator to construct a sequence from two expressions. In that case the comma expression was //(*,@*) to get all elements and all attributes. In the same post you also saw the use of the XQuery if-then-else expression in the column definition of the XMLTABLE function.

Using FLWOR expression in the XMLTABLE function can be useful if you join and combine XML from two different XML columns. For example let’s consider the following two tables with te subsequent sample documents:


CREATE TABLE order(orderdetails XML);
CREATE TABLE product(description XML);

INSERT INTO order VALUES('
<order>
  <orderNo>123</orderNo>
  <item>
    <pid>24TX98</pid>
    <quantity>1</quantity>
    <price>29.95</price>
  </item> 
  <item>
    <pid>901V8</pid>
    <quantity>2</quantity>
    <price>5.00</price>
  </item> 
</order>');

INSERT INTO product VALUES('
<product pid="24TX98">
  <name>Outdoor paint</name>
  <description>
    <weight>3kg</weight>
    <color>white</color>
  </description>   
</product>');

INSERT INTO product VALUES('
<product pid="901V8">
  <name>Paint brush</name>
  <description>
    <weight>0.1kg</weight>
  </description>   
</product>');

The order table contains one XML document per order, and the product table one XML document per product. Each product has a @pid attribute as a unique identifier, and  orders have /order/item/pid elements to specify which products have been ordered. Naturally, this allows us to perform a join between orders and products.

For example, the following query is a join to retrieve the product information for all the products that have been ordered in order 123:


SELECT product.description
FROM order, product
WHERE XMLEXISTS('$ORDERDETAILS/order[orderNo = 123]')
  AND XMLEXISTS('$DESCRIPTION/product[ @pid/fn:string(.) =
                   $ORDERDETAILS/order/item/pid/fn:string(.) ]');

This query returns one row (one XML document) for each product that is referenced in order 123.

Now, what if you want to combine data from an order document and the associated product documents, and return this information in one relational row per product?  For example, assume you want rows that show the product ID and product weight (from the product documents) as well as the price and quantity of that product in the order document.

In that case you could use an XQuery FLWOR expression to perform the join and combine the product and order information in the XMLTABLE function, as shown in the following query.

Since the product and order tables appear in the FROM clause of the query, the FLWOR expression references their XML columns through the variables $DESCRIPTION and $ORDERDETAILS. The return clause of the FLWOR expression constructs XML fragments that combine the desired elements and attributes from each matching pair of product and order documents. The constructed <result> elements are then input to the COLUMNS clause where they are broken up into relational columns.


SELECT T.*
FROM order, product,
     XMLTABLE('for $o in $ORDERDETAILS/order[orderNo = 123]/item
                for $p in $DESCRIPTION/product
                where $p/@pid/fn:string(.) = $o/pid/fn:string(.)
                return
                    <result>
                        {$p/@pid}
                        {$p/description/weight}
                        {$o/quantity}
                        {$o/price}
                    </result>'
         COLUMNS
             prodid VARCHAR(15)  PATH '@pid',
             weight VARCHAR(5)   PATH 'weight',
             qty    INTEGER      PATH 'quantity',
             price  DECIMAL(6,2) PATH 'price')
         AS T;

   
PRODID          WEIGHT QTY         PRICE
--------------- ------ ----------- --------
24TX98          3kg              1    29.95
901V8           0.1kg            2     5.00

  2 record(s) selected.

It is worthwhile noting that you can produce the same result set without the use of FLWOR expressions, as shown in the next query. This query uses two XMLTABLE function, one for the desired order information and one for the desired product information. The key trick is that the two XMLTABLE function are joined on the product ID using the predicate [@pid = $p]. Without this predicate the two XMLTABLE functions would produce a Cartesian product, which is not desired.


SELECT T2.prodid, T2.weight, T1.qty, T1.price
FROM order, product,
     XMLTABLE('$ORDERDETAILS/order/item'
        COLUMNS
           prodid VARCHAR(15)  PATH 'pid',
           qty    INTEGER      PATH 'quantity',
           price  DECIMAL(6,2) PATH 'price')
     AS T1,
     XMLTABLE('$DESCRIPTION/product[@pid = $p]'
                                  PASSING T1.prodid AS "p"
        COLUMNS
           prodid VARCHAR(15)  PATH '@pid',
           weight VARCHAR(5)   PATH 'description/weight')
     AS T2;

 
                  
PRODID          WEIGHT QTY         PRICE
--------------- ------ ----------- --------
24TX98          3kg              1    29.95
901V8           0.1kg            2     5.00

  2 record(s) selected.

The join predicate between the two XMLTABLE functions can also be placed in the SQL WHERE clause:


SELECT T2.prodid, T2.weight, T1.qty, T1.price
FROM order, product,
     XMLTABLE('$ORDERDETAILS/order/item'
        COLUMNS
           prodid VARCHAR(15)  PATH 'pid',
           qty    INTEGER      PATH 'quantity',
           price  DECIMAL(6,2) PATH 'price')
     AS T1,
     XMLTABLE('$DESCRIPTION/product'
        COLUMNS
           prodid VARCHAR(15)  PATH '@pid',
           weight VARCHAR(5)   PATH 'description/weight')
     AS T2
WHERE T1.prodid = T2.prodid;       

However, this query applies the join predicate after the XMLTABLE functions have produced their rows, which can be slower than the previous query where the predicate was in the XMLTABLE function itself.

You can find many more examples of the XMLTABLE function, XML joins, and joins between XML and relational columns in Chapter 9 of the DB2 pureXML Cookbook.

Undeniably, XML is the message format of choice for many service-oriented architectures and application integration efforts. Also, many SOA and web service implementaions use REST as the protocol for accessing URL-addressable resources and services. REST stands for Representational State Transfer and is built on top of HTTP, which acts as the underlying transport layer.

With a new set user-defined functions (UDFs) in DB2, it has become very simple to issue REST requests directly from your SQL statements or DB2 stored procedures. This enables DB2 to easily interact with REST-based services and integrate information from the web or URL-based resources into the database.

The new REST UDFs allow you to receive and provide information in binary format (BLOB) or textual format (CLOB), which includes but is not limited to XML. The basic REST UDFs are scalar functions that perform the simple HTTP operations GET, POST, PUT, and DELETE. These UDFs are:

  • DB2XML.HTTPGETCLOB
  • DB2XML.HTTPPOSTCLOB
  • DB2XML.HTTPPUTCLOB
  • DB2XML.HTTPDELETECLOB
  • DB2XML.HTTPGETBLOB
  • DB2XML.HTTPPOSTBLOB
  • DB2XML.HTTPPUTBLOB
  • DB2XML.HTTPDELETEBLOB

Each of these functions take a URL and, optionally, an HTTP header as input paramaters. Additionally, the POST and PUT functions have a third parameter for the BLOB or CLOB data that you want to upload.

The REST UDFs are implemented in Java and support HTTP as well as HTTPS with SSL encryption.

When you use these UDFs in SQL statements to receive information, you can:

  • insert the information into DB2 tables
  • use the information as a parameter to search for related data in DB2
  • join the information with existing data in DB2
  • process the information in a stored procedure
  • or use the information in any other way that SQL provides

In particular, if the information is in XML format you can apply any DB2 pureXML functions to it, such as:

  • the XMLQUERY function to extract pieces from the XML message
  • the XMLTABLE function to split the XML into smaller XML pieces
  • the XMLTABLE function to shred the XML into relational tables
  • the XMLVALIDATE function to validate the XML against an XML Schema
  • etc.

If you are interested and want to try this for yourself, read the following article:
“Accessing HTTP and RESTful services from DB2: Introducing the REST user-defined functions for DB2″
http://www.ibm.com/developerworks/data/library/techarticle/dm-1105httprestdb2/

This article describes the REST functions for DB2 in more detail and presents several concrete usage examples. The UDFs themselves are available for download at the bottom of this article.

Enjoy your REST !

How much protein is in a steak? Well, a 6-ounce steak can contain about 38 grams worth of protein.

How much protein does salmon contain? Approximately 34 gram of protein in a 6 ounce piece of salmon.

How much protein can you find in lentils? A cup of cooked lentils has 18 rams of protein.

And how much protein can you find in DB2 pureXML? Tons! Hundreds of Gigabytes and soon over a 1TB of proteins!

It’s true, DB2 pureXML is an excellent source for proteins. DB2 can store, compress, and index the proteins so that you always find the right ones quickly when you need some.

What distinguishes DB2 from a steak is that DB2 stores all proteins in XML  or in a hybrid XML/relational format! This enables members of the biological research community to analyze and compare the structure and composition of protein molecules. The findings can help them explain diseases, develop new drugs, or understand the interactions between different proteins.

The protein data is publicly available from the Protein Data Bank (PDB) which is world-wide repository of structural protein data. To facilitate data exchange and flexibility, the data is available in XML.

Typically one XML document describes a single protein molecule. Such a document includes detailed information about all the atoms that the protein consists of, which can be hundreds or sometimes thousands of atoms. And to represent the structure of the protein, the 3-dimensional spatial coordinates of each atom are included as well, resulting in XML documents that can be hundreds of MB in size to describe a single protein.

Searching and analyzing such amounts of complex information is a challenge. To tackle this challenge, researchers in Germany decided to harness the hybrid XML/relational capabilities of DB2 to efficiently store and query the protein data. I was happy to assist them to get the most out of DB2 pureXML.

The article “Managing the Protein Data Bank with DB2 pureXML” describes the database design and optimization that facilitate protein data analysis in a scalable manner. Even if you’re not a biologist (I’m not!), the article is an interesting cases study in how a real-world data management problem has been mapped to the hybrid features of the DB2 database system.

If you are familiar with SQL you have probably seen quantified SQL predicates that compare a single value with a set of values. For example, the following SQL query returns all rows from table1 where col1 is greater than all values in col2 of table2.

SELECT *
FROM table1
WHERE col1  > ALL (SELECT col2 FROM table2)

And if you replace the keyword ALL with the keyword SOME, then the query above returns all rows from table1 where col1 is greater than at least one of the values in col2 of table2.

These comparisons are also known as universal quantification (“ALL”) and existential quantification (“SOME”), respectively.

Similar quantified comparisons are possible in XQuery too, but the syntax is slightly different. In XQuery you have the keywords “every … satisfies ….” for  universal quantification and “some … satisfies …” for existential quantification. Let’s use the following table and two simple sample documents to explore how they work:

create table customer(doc XML);

insert into customer values(‘
<customer id=”1″>
    <phone type=”cell”>408-516-4963</phone>
    <phone type=”cell”>408-087-1234</phone>
    <phone type=”cell”>408-111-222</phone>
</customer>
‘);

insert into customer values(‘
<customer id=”2″>
    <phone type=”home”>408-516-4963</phone>
    <phone type=”cell”>408-087-1234</phone>
    <phone type=”office”>408-111-222</phone>
</customer>
‘);

Now assume that we want to find customers where all their phone numbers are of type “cell”. We can code a corresponding query in XQuery or in SQL/XML notation as follows:

xquery
for $c in db2-fn:xmlcolumn(“CUSTOMER.DOC”)/customer
where every $p in $c/phone satisfies $p/@type = “cell”
return $c;

select *
from customer
where xmlexists (‘$DOC/customer[ every $p in ./phone satisfies $p/@type = "cell"]‘);

Both queries return just the first of the two customer documents. The second document is not selected because not all of its phone elements have a @type attribute with the value “cell”.

The quantified expression every $p in $c/phone satisfies $p/@type = “cell” means that $p iterates over all the items in the sequence $c/phone,  and for each such $p the predicates $p/@type = “cell” must be true.

If we want to return all documents where at least one of the phone number is of type “cell” (existential quantification!), then we can simply replace the keyword “every” with the keyword “some”, such as in the following query:

select *
from customer
where xmlexists (‘$DOC/customer[ some $p in ./phone satisfies $p/@type = "cell"]‘);

This query returns both our sample documents, because both documents contain at least one phone element where the @type attribute has the value “cell”. Since XPath uses existential quantification by default, you can often omit the “some … satisfies …” and obtain the same result with an ordinary path expression, like this:

select *
from customer
where xmlexists (‘$DOC/customer[phone/@type = "cell"]‘);

For more information, use the following resources.

DB2 Information Center – Quantified Expressions:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.xml.doc/doc/xqrquanexp.html

XQuery specification of quantified expressions:
http://www.w3.org/TR/xquery/#id-quantified-expressions

Generating XML from relational data is not difficult. The SQL standard contains a set of XML construction functions that take column names or other SQL expressions as input and return XML. The most prominent of these XML construction functions are:

  • XMLELEMENT (constructs an XML element)
  • XMLATTRIBUTES (constructs one or more XML attributes)
  • XMLFOREST (constructs a sequence of XML elements)
  • XMLTEXT (constructs an XML text node)
  • XMLAGG (aggregates data from multiple rows into a single XML fragment)
  • XMLNAMESPACES (constructs a namespace declaration)
  • etc.

Here is a simple example:

create table mydata(c1 int, c2 varchar(25));
insert into mydata values (1, ‘This is a test’);

select xmlelement(name “myroot”,
               xmlelement(name “col1″, c1),
               xmlelement(name “col2″, c2) )
from mydata;

<myroot><col1>1</col1><col2>This is a test</col2></myroot>

 1 record(s) selected.

The XML is typically returned in textual (serialized) format, as a single line without any line breaks. That’s perfectly fine, except that line breaks can sometimes be useful to make the XML easier to read for the human eye.

There are several ways in which you add add newline characters to the XML to make it more readable.

If the XML that you construct is less then 32KB in size, then you can use the SQL function REPLACE to replace every occurrence of ‘><’ with the hex string 3E0A3C. The hex character 3E is the same as ‘>’, the hex character 3C is the same as ‘<’, and the hex character 0A is a line feed in between them. Before we can apply the REPLACE function to the constructed XML, we need to explicitly convert the  XML to data type VARCHAR, which can be done with the function XMLSERIALIZE.

The following query is extended from the previous example by adding the functions XMLSERIALIZE and REPLACE:

select replace(
       xmlserialize(
          xmlelement(name “myroot”,
               xmlelement(name “col1″, c1),
               xmlelement(name “col2″, c2) ) AS VARCHAR(32000)),
         ‘><’,
         x’3E0A3C’)       
from mydata;

<myroot>
<col1>1</col1>
<col2>This is a test</col2>
</myroot>

 1 record(s) selected.

The replace function takes three parameters:

  • the constructed serialized XML
  • the string that should be searched and replaced, i.e. ‘><’
  • the new string that replaces the search string, i.e. x’3E0A3c’

However, this approach works only for smallish XML, up to 32KB.

If you generate larger XML, you have two options. First, you can perform the replace operation at the client that receives the XML. Second, you can explicitly construct XML text nodes that contain newline characters where you want them. For example:

select xmlelement(name “myroot”,
              xmltext(x’0A’),         
              xmlelement(name “col1″, c1),
              xmlelement(name “col2″, c2),
              xmltext(x’0A’) )
from mydata;

<myroot>
<col1>1</col1><col2>This is a test</col2>
</myroot>

 1 record(s) selected.

 

select xmlelement(name “myroot”,
              xmltext(x’0A’),         
              xmlelement(name “col1″, c1),
              xmltext(x’0A’),
              xmlelement(name “col2″, c2),
              xmltext(x’0A’) )
from mydata;

<myroot>
<col1>1</col1>
<col2>This is a test</col2>
</myroot>

 1 record(s) selected.

If you want a line break after every XML element then this approach can be tedious and makes the SQL statement quite a bit longer. Then again, the line breaks are really only needed for human readability, not for any client application that consumes the generated XML.

 

In my previous post I discussed the native XML capabilities provided by the IBM WebSphere Application Server V7.0 Feature Pack for XML. This Feature Pack allows you to manipulate XML in your Java application directly with XQuery, XPath, or XSLT.

If you use the WebSphere XML Feature Pack together with DB2 pureXML, one of the obvious questions is: when should I use XPath or XQuery in the application server and when should I push these operations to DB2 and run XQuery or SQL/XML in the database layer?

The best split between XML processing in one tier vs. the other certainly depends on the exact nature of your application and may vary on a case by case basis. Still, here are some general considerations and examples.

Example 1:
Assume your application needs to retrieve an XML document from the database based on one or multiple XML search conditions, but the decision about which information to extract from the document can only be made at runtime after the document has been retrieved. In this situation it is typically best to issue an XQuery or SQL/XML query to DB2 to retrieve the full document, exploiting DB2′s XML indexes to evaluate the search conditions efficiently. The extraction of specific elements is then performed with XPath or XQuery in the application server as part of the application’s business logic.

Example 2:
As a variation of Example 1, consider a hybrid database schema where some data is in relational columns and some in XML columns. Assume you need to retrieve XML documents based on relational and XML predicates, possibly involving joins between the table that holds the XML and other fully relational tables. It is typically recommended to use a database query that contains both relational and XML predicates to minimize the amount of data that is retrieved from the database and sent to the application. The extraction of XML elements would still happen in the application server, especially if it depends on post-retrieval application logic.

Example 3:
The application needs to retrieve specific elements from one or multiple XML documents, and these elements are fixed and known before the database query is submitted. In this case it is typically best to code the search and the extraction in a single XQuery or SQL/XML statement for DB2. Again, the general idea is to reduce data early and not ship more data around than necessary.

Example 4:
The application receives an XML document from a web service and needs to examine it with an XPath query before it can decide whether the document should be inserted into the database or passed to another service. Typically, this operation should be implemented in the application server instead of calling out to the database when it might be not needed.

Example 5:
The application receives a transient XML document from a web service and based on specific element values in that document the application needs to retrieve related documents from the DB2 database . Although you could pass the full input document as a parameter to a SQL/XML database query and perform the entire processing in DB2, it can be easier and more efficient to first extract the relevant values from the input document in the application layer and then send a simpler and more targeted query to DB2, for potentially better end-to-end performance.

Example 6:
The application needs to retrieve XML documents from the database and apply logic that has already been coded in XQuery modules and XQuery user-defined functions. Such modules are an optional feature of the XQuery language standard. Modules are supported in the WebSphere XML Feature Pack but not in DB2. Obviously, the choice is to retrieve the full documents from DB2 (similar to Examples 1 and 2) and then use the additional XQuery capabilities of the WebSphere XML Feature Pack on these documents.

 

These examples are certainly not an exhaustive of list potential scenarios. Also, they are not meant to be hard rules, but just illustrations of the trade-offs and considerations that can be involved in the decision process when you split XML processing between the application server and the database.

 

For concrete coding examples with the WebSphere XML Feature Pack and DB2 pureXML, see the following articles:

“Programming XML across the multiple tiers: Use XML in the middle tier for performance, fidelity, and development ease” – http://www.ibm.com/developerworks/library/x-xmlfeat1/

“Programming XML across the multiple tiers, Part 2: Write efficient Java EE applications that exploit an XML database server” – http://www.ibm.com/developerworks/xml/library/x-xmlfeat2/

 

Native XML support in a database such as DB2 allows you to store, index, query, validate, and update XML documents efficiently. In particular it’s the database support for XQuery and SQL/XML that enables powerful queries and updates of stored XML data, or XML and relational data combined.

However, XML capabilities in the database alone are often not sufficient to manage XML across the enterprise. Integrated XML support is also required in application development tools, ETL products such as DataStage, reporting tools like Cognos, and other critical parts of the IT landscape, such as the application server.

To that end, the WebSphere XML Feature Pack provides developers with key XML technologies to efficiently manipulate XML in WebSphere Java applications. It offers support for:

  • XPath 2.0
  • XQuery 1.0
  • XSLT 2.0
  • and earlier versions of XPath and XSLT, if needed

In essence, you can write Java applications for deployment in WebSphere Application Server and within your Java code use XPath, XQuery, or XSLT directly to work with XML, including transient XML messages. One of the key advantages is that you can apply XPath and XQuery expressions directly to your XML data instead of coding equivalent operations much more tediously via a low-level SAX or DOM API or other object models. This has several benefits:

  • Simplified XML application development
  • Increased developer productivity
  • XML operations are easier to understand and maintain over time

If you think about it, these are the same reasons why you implement many applications in Java (or similar) rather than Assembler: you use the comfort and expressiveness of a higher level language (XQuery) to avoid the complexity, tediousness, and “error proneness” of a lower-level approach.

A second set of benefits emerges when you look at the handling of XML across database and application server. On the database side, the benefits of storing and processing XML natively instead of shredding them to relational format are well-known:

  • Reduced complexity, because shredding any but simple XML documents is a complicated task
  • Increased flexibility, because native XML allows for document variability and schema evolution, whereas shredding relies on a static mapping that breaks easily when new document variations are encountered
  • Improved performance, because the conversion between two different data representations is avoided
  • Guaranteed document fidelity, because the original XML business record stays in its XML format

The very same benefits can be  realized in the middle tier by processing XML natively in the application server. With the XQuery, XPath, and XSLT support in the WebSphere XML Feature Pack you can manipulate XML in its original XML format without having to convert it to a different representation, such as an object structure.

With increasing complexity or variability of your XML documents, converting XML to Java objects becomes increasingly complicated, inflexible, and slow. For example, using JAXB to map XML to objects requires Java classes to be generated that constitute a fixed and inflexible mapping.

And finally, treating XML as XML in the database and the application server realizes all of the above-mentioned benefits across the middle tier and the database by eliminating any impedance mismatch and data conversion between the tiers.

Here are some resources where you can learn more about the IBM WebSphere Application Server V7.0 Feature Pack for XML and see some concrete examples of its use.

Product page and download:
http://www.ibm.com/websphere/was/xml

The IBM Red Paper: “Getting Started with the WebSphere Application Server Feature Pack for XML”, includes many examples:
http://www.redbooks.ibm.com/abstracts/redp4654.html?Open

The XML Feature Pack documentation:
http://publib.boulder.ibm.com/infocenter/wasinfo/fep/index.jsp?topic=/com.ibm.websphere.xmlfep.multiplatform.doc/info/ae/ae/welcome_fepxml.html

And a collection of links to further information:
http://webspherecommunity.blogspot.com/2009/11/websphere-application-server-feature.html

Relational databases have supported SQL stored procedures for a log time. The benefits of SQL procedures are well-known and include:

  • Reduced coding labor due to the creation of reusable processing modules
  • Richer processing capabilities in the databases by defining custom data processing logic
  • Improved performance and reduced network traffic because stored procedures are executed close to the data, i.e. in the database engine itself

With the introduction of XML support in relational databases, several database products (such as DB2) allow you to access and manipulate not only relational data but also XML in stored procedures. This is made possible through:

(a) the introduction of the XML data type in the SQL type system, and

(b) several XML functions in the SQL language such as XMLQUERY, XMLEXISTS, and XMLTABLE that can contain XPath or XQuery expressions.

The picture below provides an example of some of the basic capabilities for XML handling stored procedures:

  • Input and output parameters of type XML. In DB2, if an XML document already exists in a parsed format in the database then passing it to the XML parameter of the procedure does not require the document to be reparsed. XML procedures can operate directly on the parsed representation of the XML. Also, if an application calls a stored procedure and passes a textual XML document to the XML parameter, the document gets parsed only once upon entering the procedure regardless of how often the XML parameter is used in the body of the procedure.
  • XML type variables. XML variables can be assigned values of type XML, such as a document or some document fragment, even a single XML element or atomic value. XML variables hold XML in a parsed format, which is important for performance. For example, the second and last SET statement in the picture below uses an XQuery FLWOR expression to read a certain address from the input document and assigns it to the XML variable “address”.
  • XML value extraction. The XMLQUERY function can contain XPath or XQuery expression to extract values from a document. In the first SET statement in the picture below, the XMLQUERY function on the right-hand side takes the XML input parameter “custDoc” as context and uses an XPath to extract the @Cid attribute from the document. The XMLCAST function converts the attribute to the SQL type INTEGER, so that it can be assigned to the SQL variable “id”.
  • XML conditions. If-then-else statements in the SQL procedure language can use XML conditions in the IF-clause to decide what to do next in the control flow of the stored procedure.

(Click the picture for full size)

The examples above are certainly not an exhaustive list of what you can do with XML in stored procedure. For example, while the stored procedure above only operates on a single document that is passed in via the XML parameter, stored procedures can also use XQuery or SQL/XML to access sets of XML documents that are stored in XML columns.

Stored procedures can also compose, prepare, and execute XQuery or SQL/XML statements dynamically. You can define cursors over the result sequence of an XQuery and process each returned item one by one in a loop. You can also construct, split, shred, modify, merge, compare, or validate XML documents in stored procedures.

Of course, all the error and exception handling that is available for stored procedures can also be used with any XML operations. In summary, you can code sophisticated XML processing logic (or hybrid XML/relational processing logic) with the familiar concepts of SQL stored procesures.

Many more examples of XML processing in DB2 Stored Procedures can be found in Chapter 18 of the DB2 pureXML Cookbook.

XML is a flexible and extensible data format and is often used for messages or other document types whose structure can have multiple different variations or evolves over time. As a result, you might not be dealing with a single fixed document schema but with significant schema variability, or schema diversity.

Many XML database systems allow you to store documents for many different schemas (or different version of the same schema) in a single collection. This is also true for DB2, where a collection is an XML column that is not necessarily tied to a single XML schema. Hence, applications can store, index, query, validate, and update documents for different schemas in a single XML column.

When you validate different documents in an XML column against different XML schemas, DB2 remembers which document was validated against which schema. When you later retrieve a specific document, you can easily determine which schema it was validated against when it was inserted or last updated.

A common question is: How to effectively index XML documents when there is schema variability or schema diversity?

At a high level, DB2 provides three options:

  • Define various individual XML indexes for the various elements and attributes that you require to be indexed for good performance
  • Index all elements and/or all attributes
    • Creating an XML index on the path //text()   indexes all text nodes, i.e. basically all element values
    • Creating an XML index on the path //@*  indexes all attributes
  • Use a full-text index that indexes all elements and attributes, using DB2 Text search

The good thing is that DB2 allows you to “index everything” but it doesn’t force you to do so – unlike some other database systems. This is important, because indexing all elements and all attributes in all documents requires a lot of space and impacts the performance of insert, update, and delete operations. Therefore, indexing everything should be a conscious choice.

Although full-text indexing with DB2 Text search can address schema variability, I think the decision to use Text Search should be predominantly based on the type of searches that need to be performed. If the applications commonly perform full-text search for phrases and substrings, or maybe even require fuzzy search, stemming, and other fancy text features, then DB2 Text Search should be used. However, if the data is more structural than textual in nature, and if most predicates use regular =, >, >=, etc. operators, then I would usually prefer a set of regular XML indexes.

A key benefit of XML indexes in DB2 is that they contain index entries only for those document that they match. For example:

Let’s say we have 10 Million XML documents in an XML column, for 10 different schemas, approx. 1M documents per schema. If we define an XML index on an XML element (XPath) that exists in only 1 of the 10 schemas, then this index will hold entries only for 1M documents. If this element is an optional element that occurs in only 20% of the documents for this 1 schema, then this index will hold entries only for 200,000 documents. Thus, this index will be substantially smaller than an index that contains entries for all documents.

As a result, defining a number of specific XML indexes for various elements and attributes is very space efficient! So, if you can identify which nodes need to be indexed for good performance, you can index them in a space-conscious manner.

The index maintenance cost for insert/update/delete operations follows the same guideline as in the relational world: the more indexes you define the more CPU and I/O is being used for update the indexes. But, let’s say I have defined 50 XML indexes on an XML column, and the document that I am inserting contains elements and attributes that match only 5 of those indexes, then only those 5 indexes will be updated when the document is inserted.

A large number of XML indexes defined on a single XML column is not a concern for query performance. The DB2 query compiler is very efficient at determining quickly which index can (and should) be used for a given query.

Also note that XML indexes in DB2 can be defined with wildcards, which gives you a lot of option between defining an index for just one element and defining an XML for all elements. Some examples:

  • Let’s say you are indexing purchase orders and you know that all of most the customer address information should be indexed. If you know that the path to the customer address information is /purchaseorder/customer/address  but the subtree under the <address> element is subject to variability, you could define an index on /purchaseorder/customer/address//text(), to index all element values in the subtree under <address>.
  • If you know that there is only one level of elements under address, you could define an index on /purchaseorder/customer/address/* to index all of these elements, and maybe another index on /purchaseorder/customer/address/*/@* for the attributes of these elements, if any.
  • If you know that you need to index the <zipcode> element under <address>, but the <address> can be in different subtrees for different documents, you could define an index on //address/zipcode or  /purchaseorder//address/zipcode.

For more information on XML indexes, see

http://www.ibm.com/developerworks/data/library/techarticle/dm-0611nicola/

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.xml/src/tpc/db2z_indexxml.htm (DB2 for z/OS)

http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.xml.doc/doc/c0024039.html (DB2 for Linux, UNIX, Windows)

A short summary of why and how the New York State Department of Taxation & Finance is using DB2 pureXML is available here:

http://smarterquestions.org/2011/06/a-true-story-about-using-xml-forms-and-db2-to-process-millions-of-tax-returns/

Many of the reasons why XML is useful for forms processing also apply to other forms-based application, e.g. in healthcare, insurance, government, and other industries. Most cases of forms-processing share the following characteristics:

  • Many different types of forms are being used, each one having a different set of fields. (This is schema diversity!)
  • Some forms tend to change over time, usually to support new or changed business processes or regulatory requirements. Simply put, fields on the forms get changed, added, or removed. (This is schema evolution!)
  • When information is filled into any given form, there are typically many optional fields that can -and often will- remain blank. (This is sparsely populated data!)

As it turns out, XML is very well suited to handle schema diversity, schema evolution, and sparse data. Optional XML elements and attributes that may or may not appear in a given document are convenient to handle sparse data, whereas in a relational model sparse data often leads to large numbers of NULL values.

XML also supports schema flexibility, e.g. documents for different schemas (or different versions of the same schema) can be stored, indexed, and queried in a single XML column.

And finally, if most of the application operations touch one form at a time, e.g. insert a form, retrieve a form, update a form, validate a form, and so on, then an XML solution where each form is a single XML object typically also provides significant performance benefits over a normalized relational schema for the same information.

 

 

In DB2 for Linux, UNIX, and Windows, the quickest and easiest way to produce XML from relational tables is to use the function XMLROW or XMLGROUP. The following examples illustrates how this works.

Let’s first define a relational table and insert a couple of rows:

create table employee(id int, name varchar(30), dept char(3));

insert into employee values (1, ‘John’, ‘PX8′);
insert into employee values (2, ‘Mary’, ‘KH2′);

select id, name, dept  from employees;

ID             NAME                          DEPT
———– —————————— ————————
          1     John                               PX8
          2    Mary                               KH2

  2 record(s) selected.

Now, let’s select the same columns with the XMLROW function:

select xmlrow(id, name, dept) from employee;

<row><ID>1</ID><NAME>John</NAME><DEPT>PX8</DEPT></row>

<row><ID>2</ID><NAME>Mary</NAME><DEPT>KH2</DEPT></row>

  2 record(s) selected.

You can see that each row is returned as an XML document, using column names as element names. And you can further customize the XML if you like. For example, you can rename the root element for each row:

select xmlrow(id, name, dept OPTION ROW “EMPL”) from employee;

<EMPL><ID>1</ID><NAME>John</NAME><DEPT>PX8</DEPT></EMPL>

<EMPL><ID>2</ID><NAME>Mary</NAME><DEPT>KH2</DEPT></EMPL>

  2 record(s) selected.

You can also rename the elements for individual columns, if you want:

select xmlrow(id AS “EMP_NO”, name, dept OPTION ROW “EMPL”) from employee;

<EMPL><EMP_NO>1</EMP_NO><NAME>John</NAME><DEPT>PX8</DEPT></EMPL>

<EMPL><EMP_NO>2</EMP_NO><NAME>Mary</NAME><DEPT>KH2</DEPT></EMPL>

  2 record(s) selected.

And if you prefer to combine all selected rows into a single XML document, you would use XMLGROUP instead of XMLROW:

select xmlgroup(id, name, dept OPTION ROW “EMPL”) from employee;

<rowset>
  <EMPL><ID>1</ID><NAME>John</NAME><DEPT>PX8</DEPT></EMPL>
  <EMPL><ID>2</ID><NAME>Mary</NAME><DEPT>KH2</DEPT></EMPL>
</rowset>

 1 record(s) selected.

For more options and information on these functions, see:

XMLROW:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0050587.html

XMLGROUP:

http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0050588.html

SQL/XML publishing functions for constructing XML values:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.xml.doc/doc/c0023909.html

In my previous blog post I promised that I would provide some examples of how to insert information from multiple relational tables and multiple rows into an XML document.

So, let’s continue the example from my previous post where we had a table with relational INTEGER and DATE columns as well as an XML column. For simplicity, we now use only one XML document in this XML column, but it could be many more:

DROP TABLE documents;
CREATE TABLE documents(document_id INTEGER, created DATE, document XML);

INSERT INTO documentsVALUES(2, '2011-05-21',
'<mydoc>
   <header>
     <title>Introduction to XML</title>
     <author id="23">Peter Pan</author>
     <author id="49">Susie Swing</author>
   </header>
   <body>
     <info>What is XML?</info>
   </body>
 </mydoc>');

Let’s add a second table “document_reviews” that contains a rating and review for each document. Documents and reviews are related to each other and can be joined based on the document_id column:


CREATE TABLE document_reviews(document_id INTEGER,
                              rating CHAR(2),
                              review VARCHAR(32000) );

INSERT INTO document_reviews VALUES(2, 'A+', 'This books is...');

Now we want to to read the XML document from the “documents” table and insert the corresponding rating information from the “document_reviews” table into the XML. This task is quite easy, using the techniques described in my previous post and simply using a join between the two tables:

SELECT XMLQUERY('copy $new := $DOCUMENT
                 modify do insert <rating>{$RATING}</rating>
                           into $new/mydoc/header
                 return $new' )
FROM documents, document_reviews
WHERE documents.document_id = document_reviews.document_id;

Note that the XQuery variable $RATING refers to the “rating” column in the document_reviews table. And the result of this query is the original document with the desired <rating> element inserted into the <header>:

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <rating>A+</rating>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

1 record(s) selected.

Now it gets more interesting. What would the query above return if there were *multiple* review rows for this one document? Well, let’s find out! Let’s insert a second review for the same document_id and run the query again.

INSERT INTO document_reviews VALUES(2, ‘C’, ‘I did not like…’);

After inserting this additional review, our query above returns the XML document twice, once for each of the two reviews:

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <rating>A+</rating>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <rating>C</rating>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

2 record(s) selected.

The first document in the result set contains a <rating> element with the rating from the first review, and the second document contains a <rating> element according to the second review. The original XML document is duplicated for each rating, due to the nature of the relational join between the two tables.

Although this form of output might be useful in some cases, chances are that you would like all ratings for this document to be included in a single copy of the document. This can be achieved if we convert the join into a correlated subquery that produces a sequence of <rating> elements that are all inserted into a single copy of the document:

SELECT XMLQUERY('copy $new := $DOCUMENT
                 modify do insert db2-fn:sqlquery("
                             SELECT XMLELEMENT(NAME ""rating"", rating)
                             FROM document_reviews
                             WHERE document_id = parameter(1)" ,
                             $docid            )
                           into $new/mydoc/header
                 return $new'
                 PASSING documents.document_id as "docid" )
FROM documents;

In this query, the modify clause contains the function db2-fn:sqlquery that takes an SQL query and optional parameters as input. This SQL query reads the rating column from the document_reviews tables and constructs a sequence of <rating> elements.

The WHERE clause of this subquery is parameterized, using $docid as the parameter value (line 5). The $docid is passed into the XQuery context from the documents.document_id column (line 9), and ensures that the subquery only reads ratings for correct document. As a result, a single document with two <rating> elements is returned:

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <rating>A+</rating>
    <rating>C</rating>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

1 record(s) selected.

I think these example provided a good idea of the powerful capabilities that exist to combine XML and relational data.

Follow

Get every new post delivered to your Inbox.

Join 28 other followers