It’s been a little bit over one year now since we published the DB2 pureXML Cookbook. Many copies have been sold and we have received a lot of positive feedback from many readers (thanks!).

When you write a book with more than 700 coded examples of SQL/XML, XQuery, updates, XML Schemas, indexes, stored procedures, triggers, and application code samples in Java, C, COBOL, etc., it’s inevitable that some errors creep in and remain undetected before the book goes into print.

Luckily we have some very observant readers that are using this book in their DB2 pureXML projects and are paying a lot of attention to detail. They have discovered several mistakes in the book. Fortunately, most of these errata are quite small and syntax-related glitches rather than major disasters.

The errata that we currently know about are listed below, and we will continue to list any further errata on the web page at
http://tinyurl.com/pureXML.

By the way, IBM Press has a promotion where this and other books are currently available at a 35% discount.

Errata:

Page 216/217:
On the bottom of page 216, in Figure 8.41, 3rd line: “AS id” should be “AS idstatus” to match the query output on the top of page 217.

Page 283:
If we want the query in Figure 10.24 to return the exact same result as the query in Figure 10.23, two attribute names would have to be changed in Figure 10.24. That is, it should be

XMLATTRIBUTES(po.poid as “oid”),

instead of:

XMLATTRIBUTES(po.poid as “id”),

and

XMLATTRIBUTES(promoprice as “promoprice”),

instead of:

XMLATTRIBUTES(promoprice as “lowprice”),

Page 370:
The two queries in Figure 13.12 can be improved. The first query should cast to Varchar(500) instead of Varchar(50), to be on the safe side in case some titles are long. The second query in Figure 13.12 should cast the string-length to INTEGER instead of Varchar(500).

Page 388:
In Table 13.6 and 13.7 the paths for the attribute “@isbn” are wrong. The @isbn attribute is an attribute of the book element, not of the title element. So the correct path is /book/@isbn .

Page 637:
In Figure 21.15, the parameter index in the 2nd SQLBindParameter call should be 2, not 1. So, the bind call should be

SQLBindParameter(stmt, 2, SQL_PARAM_INPUT, SQL_C_BINARY, SQL_XML, 0, 0, xmldoc, 32000, &length);

instead of:

SQLBindParameter(stmt, 1, SQL_PARAM_INPUT, SQL_C_BINARY, SQL_XML, 0, 0, xmldoc, 32000, &length);

Page 642:
In Figure 21.19, the “PASSING” clause in the XMLEXISTS predicate is incomplete. It should be

PASSING info AS “i”, CAST(:cid AS INTEGER) AS “c”)

instead of:

PASSING CAST(:cid AS INTEGER) AS “c”)

If you discover any other mistakes that we made in the book, please do let us know. Thanks!

Many enterprises hold significant amounts of relational data in their databases. New applications and integration scenarios often require this data to be communicated in the form of XML messages, e.g. in service-oriented architectures (SOA), web services, application integration (A2A), or data exchange with internal and external parties (B2B). Fortunately, the SQL language contains a set of functions that let you convert relational data into XML, right in your SQL queries that read the required data. This is easier and more efficient than writing complex application code.

Some of the most important SQL/XML construction functions include:

  • XMLELEMENT, to construct an XML element from a relational value
  • XMLATTRIBUTES, to add attributes to a constructed element
  • XMLFOREST, to construct a sequence of elements
  • XMLNAMESPACES, to add a namespace to an element
  • XMLAGG, to aggregate multiple XML elements that stem from separate relational rows into a single block of XML

In many cases it is particularly convenient to encapsulate XML construction in a function or a view that applications can simply use in their SQL statements without having to be concerned with the details of the XML construction itself.

Let’s look at a few simple examples. First, assume that we have the following sample table with 3 rows of data:

CREATE TABLE employee(firstname varchar(30),
                      lastname  varchar(30),
                      dept      varchar(5));

INSERT INTO employee(firstname, lastname, dept)
 VALUES  ('John', 'Doe', 'A01'),
         ('Mary', 'Jones', 'B02'),
         ('Peter', 'Pan', 'A01');

In a real application, each employee record would have many more columns but for simplicity let’s just work with these 3 columns f0r now. Assume that we need to retrieve each employee record in a specific XML format with a root element called “emp”, an attribute for the department numbers, and two child elements “first” and “last” for the first name and last, respectively. We can define the following user-defined scalar function. This function takes several relational column values as input returns a result of type XML.

CREATE FUNCTION emplxml(first VARCHAR(30), last VARCHAR(30), dept VARCHAR(5))
RETURNS XML
RETURN XMLELEMENT(name "emp",
                  XMLATTRIBUTES(dept AS "dept"),
                  XMLFOREST(first, last) ) ;

This user-defined function (UDF) performs the desired XML construction for a given employee record. An application can then use the UDF to easily retrieve the name and department information in XML format:

SELECT emplxml(firstname, lastname, dept)
FROM employee;

<emp dept="A01"><FIRST>John</FIRST><LAST>Doe</LAST></emp>
<emp dept="B02"><FIRST>Mary</FIRST><LAST>Jones</LAST></emp>
<emp dept="A01"><FIRST>Peter</FIRST><LAST>Pan</LAST></emp>

3 record(s) selected.

Now let’s assume we need one XML document per department instead of one XML per employee. That means that we need to group the employee rows by department and aggregate the rows for each department into a single XML document. We can probably do that in a table function, but it can also be done conveniently in a view:

CREATE VIEW deptxmlview(deptID, deptdoc)
  AS SELECT dept,
            XMLELEMENT(NAME "Department",
                       XMLATTRIBUTES (dept AS "name" ),
                       XMLAGG(  XMLELEMENT(NAME "emp", lastname)  )  )
FROM employee
GROUP BY dept;

Again, an application does not need to be concerned with details of the XML aggregation and construction. It just needs to know that it can read one XML document per department from the view:

SELECT deptdoc
FROM deptxmlview
WHERE deptID = 'A01' or deptID = 'B02';

<Department name="A01"><emp>Doe</emp><emp>Pan</emp></Department>
<Department name="B02"><emp>Jones</emp></Department>

2 record(s) selected.

If you need to build more complex XML documents from your relational tables, you can certainly join multiple tables and apply the SQL/XML construction functions to columns produced by the join. You can also encapsulate pieces of XML construction in views and UDFs and use them as building blocks to construct bigger pieces of XML in a modular fashion.

I have been asked many times: “What is faster, XML or Relational?”. Of course, this question oversimplifies a complex issue, and so the only valid answer is “It depends!”. Sometimes people ask the same question in a slightly different way: “If I have a relational table and convert each row into a small XML document and store these documents in a separate table with 1 XML column, what’s the performance difference between the two tables (for inserts/updates/queries)?”. But, in most cases such a conversion is not recommended and this type of comparison is, again, too simplistic.

Let’s say you want store, index, and query 1 million addresses and each address has a first name, last name, street, city, state, and zip code. That’s a simple and fixed structure and it’s the same for all records. It’s a perfect fit for a relational database and can be stored in a single table. Relational databases have been optimized for decades to handle such fixed records very efficiently. However, if the application needs to convert the address data to XML format anyway, it can often be faster to store the data permanently in XML and avoid the repeated conversion from relational format to XML format in the application.

Now consider a scenario where the business objects of interest are a lot more complex and variable than simple addresses. For example, derivative trades in the financial industry are modeled in XML with an XML Schema called FpML (financial products markup language). It defines more than 7000 fields (many are optional) with hundreds of 1-to-many relationships between them. Designing a relational schema to represent such objects is very hard and leads to hundreds of tables. The process of inserting (normalizing) and reading (denormalizing) a single object into such a relational schema can easily be 10x or 100x slower than inserting and reading a corresponding XML document in a native XML column (e.g. in DB2 pureXML).

So, any performance comparison of XML versus relational depends heavily on which data you choose for the comparison, and what type of operations you measure.

A large DB2 customer recently compared XML to relational performance because their business objects are currently mapped to 12 relational tables. Their application executes at least 12 SQL statements to retrieve all the relational data that comprises one of the logical business objects. Then the application reassembles the original business object. An alternative is to store these business objects as XML so that each object is stored as a single document. Instead of 12 tables, only 1 table with 1 XML column is then needed. In multi-user tests for data retrieval the the company found that the XML-based solution allows them to retrieve objects with 55% higher throughput than the existing relational SQL-based solution. The reasons for the performance benefit include fewer JDBC interactions between application and database as well as fewer distinct pages that need to be read when one logical business object is represented as one XML document (and not scattered over 12 tables). These tests were later repeated and verified at an IBM lab.

Another important considertion in the question of XML versus relational is the data format in which the data is produced and consumed outside the database. If the data is produced and/or consumed in XML format anyway, it is often better to also store the data as XML in the database.

So, the question “What is faster, XML or Relational?” is somewhat like asking “What is faster, a truck or a ship?”, because XML and relational are meant for different purposes, and either one can outperform the other depending on which use case you look at. And there are also use cases (with high schema complexity and schema variability over time) that cannot reasonably be implemented in a relational data model. (If you need to go to Hawaii, the boat always beats the truck!).

The beauty of a hybrid database system such as DB2 is that you can use both native XML and relational capabilitues side by side, in a tightly integrated manner. Some data is better representated in relational tables, other data is better represented in XML, and you can manage both in the same database or even the same table, and with the same APIs and utilities.

How to order query results based on XML values…. this seems like a simple question, doesn’t it? It’s indeed not hard, but there are a few useful things to know about it.

Assume we have a set of XML documents that describe “employee” information, and we want to return the employee names (/emp/name) ordered by the employees’ ID numbers (/emp/@id).

This is very easy to write in XQuery:

for $i in db2-fn:xmlcolumn("EMPLOYEE.DOC")/emp
order by $i/@id
return $i/name/text()

In this example the employee documents are stored in the XML column “DOC” in the table “EMPLOYEE”. The order by clause reorders the sequence of “emp” elements based on the @id attribute, and that’s the order in which the names. At the end of the order by clause you can add the keyword ascending or descending to choose a sort order.

Now, let’s assume we are querying a mix of XML and relational data, which is a common requirement, and want to do the same ordering in a SQL/XML query. As a first (but incorrect) attempt we might write the following query:

SELECT XMLQUERY('$i/emp/name/text()' PASSING doc AS "i")
FROM employee
ORDER BY XMLQUERY('$i/emp/@id' PASSING doc AS "i");

SQL20353N An operation involving comparison cannot use operand
"Ordering column 1" defined as data type "XML". SQLSTATE=42818

The error message SQL20353N produced by DB2 tells us that we cannot order on a column of type XML. In this example, the XML column that we try sort by is produced by the XMLQUERY function in the SQL order by clause. The XMLQUERY function always produces values of data type XML.

The SQL/XML standard defines that values of type XML cannot be compared to determine whether one is greater than the other. This restriction makes sense, because an individual value of type XML can be a full document, a document fragment, a sequences of multiple values, or a single atomic value. In SQL/XML there is no well-defined way to decide, for example, whether one XML document is greater than another, or whether an XML document is greater than 5.

Since comparison is at the heart of ordering, ordering on XML values in an SQL order by clause is not possible – UNLESS you cast the XML values to a data type that SQL can reasonably compare and order. The following query converts the extracted @id attribute to INTEGER, which allows the sorting as you would expect:

SELECT XMLQUERY('$i/emp/name/text()' PASSING doc AS "i")
FROM employee
ORDER BY XMLCAST( XMLQUERY('$i/emp/@id' PASSING doc AS "i") as INTEGER);

Another option is to sort on a column produced by the XMLTABLE function:

SELECT empid, empname
FROM employee,
     XMLTABLE('$i/emp' PASSING doc AS "i"
        COLUMNS
           empname VARCHAR(25) PATH 'name',
           empid   INTEGER     PATH '@id')
ORDER BY empid ;

Happy sorting!

A couple of months ago I wrote about the updated version 9.7.2 of the free DB2 Express, which includes all of DB2′s base features including DB2 pureXML. Now there is a DB2 Express-C Light version for Windows.

“Light” means that  the DB2 user interface is only in English (other languages cannot be chosen during install), and that the DB2 Control Center, DB2 Text Search, the Configuration Assistant, and the “First Steps” tutorial components are not included. This reduces the size of the download image by 44%!

The pureXML features are still included in the light version. If you do want a graphical user-interface you may want to get the free Version of Data Studio. Yes, that’s a separate download with its own footprint, but it contains many features -including XML features- that aren’t in the Control Center.

Once you get into the download dialog of the DB2 Express-C web page, you can choose either the full or the light version, as shown below:

Structure-agnostic XML indexes are indexes that you can define without knowledge about the structure and format of the XML documents that will be stored and queried. That is, the XML element and attribute names and their nesting is not known in advance. In such cases, the future queries and their search conditions are often also unknown.

Whether you need structure-agnostic XML indexes or not depends on your XML application. Let’s consider two cases:

1. Known XML Schema (often good for path-specific indexes):

Many XML applications have an XML Schema that defines the expected and permissible XML document format, i.e. you do know the XML element/attribute names and their nesting ahead of time. With some knowledge of the application you can then make an educated guess which elements or attributes are likely to be used in search conditions or join predicates.

With this knowledge you can define path-specific XML indexes, to index just those elements and attributes that likely appear in predicates. This is similar to the common practice for relational database tables, where you don’t index all columns but only those that are frequently constrained in queries.

For example, if you expect predicates on the zipcode of customer address information, and you know that the path is /customer/addr/zipcode, then you can define the foll0wing index in DB2:


create index zipIdx on customer(xmldoc) generate keys using
xmlpattern '/customer/addr/zipcode' as sql varchar(20);

DB2 also offers an XML index advisor that looks at an XML workload (a set of XML queries) and recommends a set of beneficial XML indexes for specific elements and attributes.

Path-specific indexes have the benefit that you index only what needs to be indexed, which conserves storage space and supports high insert/update/delete performance for XML documents.

Interestingly, Microsoft SQL Server does not support path-specific XML indexes. It forces you to index ALL elements and ALL attributes, even if you require only some of them to be indexed.

2. Unknown XML Schema (structure-agnostic indexes):

If the schema and expected queries are unknown, then you can choose to “index everything”. For example, in DB2 you can define the following two indexes. The first index covers the text nodes of all XML elements, and the second index covers all XML attributes.


create index elemIdx on customer(xmldoc) generate keys using
xmlpattern '//text()' as sql varchar(7985);

create index attrIdx on customer(xmldoc) generate keys using
xmlpattern '//@*' as sql varchar(7985);

Subsequently, query predicates on an element’s text node or on any attribute can use one of these indexes.

These two indexes are limited to element and attribute values of up to 7985 bytes per value, which is more than sufficient for most applications. Alternatively you can also define these indexes as VARCHAR HASHED, to index strings of unlimited length via hashing. Either way, these indexes are always maintained synchronously, i.e. upon or before the commit of any insert, update, or delete operation on the collection of XML documents.

Another option for structure-agnostic XML indexing is the DB2 Full-Text search feature, which I wrote about in February.

Structure-agnostic indexes that “index everything” are somewhat more expensive than selective path-specific indexes in terms of space and insert/update/delete performance. But, that’s the price for complete flexibility and for supporting unknown queries over unknown documents.

For more information on XML indexing in DB2, see the article on index exploitation or Chapter 13 in the DB2 pureXML Cookbook.



In my previous post I promised to write more about the 10TB XML data warehouse benchmark that IBM and Intel performed earlier this year.

Before I get into the details of that benchmarking exercise, let me talk about how and why we got there.

The first wave of XML database applications were (and continue to be) operational systems, often performing message-based transaction processing with XML as the message format. Corresponding to the adoption and growth of such operational XML applications, IBM and Intel have collaborated to verify that current hardware (such as Intel servers) and software (such as DB2) were able to meet the demands of XML transaction processing systems. This started with moderate 50GB and 100GB XML benchmarks in 2006, all the way to the industry’s first 1TB XML benchmark in late 2008.

As companies accumulate large amounts of XML data in their operational system, they realize that they are sitting on a goldmine of information and are eager to use it for reporting and business intelligence purposes. Accustomed to the benefits of mature relational data warehouses, many companies now require and expect the same capabilities for XML data. They want to run complex analytical queries over XML. Therefore we want to establish performance and scalability proofpoints to show that this is indeed possible.

Benchmark goals and methodology

The goal of the 10TB benchmark was to demonstrate linear scalability when an XML data warehouse grows in size. Many data warehouses grow over time as new data keeps being added to the existing data. Running analytical queries over larger amounts of historical data can provide a more accurate understanding of the business and allow more accurate predictions for the future.

The value of linear scalability is the following: you can increase the data volume in your warehouse, add computing resources (CPUs, disks, memory) proportionally to the data, and hence keep the performance of your workload constant. For example, if a query analyzes 2x as much data and is given 2x as much resources, then the elapsed time should be the same.

An equivalent way to think about linear scalability is that you keep the data volume constant and e.g. double the hardware resources to reduce the response times by half.

In this benchmark we ran a decision support workload -consisting of 16 complex analytical SQL/XML queries- on two databases:

(A) 3.33TB of raw XML data and using 1/3 of the available computing resources
(B) 10TB of raw XML data and using all of the computing resources

We expect to see the same performance (response times, throughput) on both databases. On the 10TB database each query processes 3x as much data as on the 3.33TB database. But, with 3x as much resources, this will be accomplished in the same amount of time as on the 3.33TB database. That’s linear scalability.

The 10TB benchmark configuration

DB2 9.7 Fixpack 1, using pureXML and compression, was run on a cluster of three Intel Nehalem EX Servers (Xeon 7500). Each of the servers had four 8-core CPUs and 128GB of memory. The operating system was Linux RHEL 5.4. The storage subsystem was an IBM DS8700 system, housing 48 RAID5 arrays of 8 disks each. The DB2 database partitioning feature was used, running 48 database partitions on this cluster, i.e. 16 on each of the three servers.

The 3.33TB database used only one of the three servers in the clusters, 16 database partitions, and only 16 of the 48 RAID5 arrays.

Data Volume and Workload

We used the XML data generator of the TPoX 2.0 benchmark.

Database A: 3.33 TB of raw data, about 1.83 Billion XML documents:

  • 1,666,500,000 orders
  • 166,500,000 customers with their accounts
  • 20,833 securities

Database B: 10TB of raw data, about 5.5 Billion XML documents:

  • 5,000,000,000 orders
  • 500,000,000 customers with their accounts
  • 20,833 securities

Each of the two databases contained three partitioned tables, one for order documents, one for customer-accounts, and one for securities. About 20 XML indexes were defined to support the decision support workload. The 16 SQL/XML queries included full table scans, grouping and aggregation, OLAP functions, joins across two or all three XML tables, and various combinations of XML predicates. The same 16 queries were run on both databases. The selectivity of each query (in terms of percentage of XML documents touched) is the same in both databases.

Results

The following chart (click it!) compares the response times of the 16 queries in the 3.33TB database (yellow bars) and the 10Tb database (red bars). For each query we can see that the response is approximately the same on both databases, which confirms the linear scalability that is so important in data warehousing.

I have purposefully omitted absolute numbers from the vertical axis in this chart, because this result and this benchmark was not about maximizing absolute performance. It was about scalability, i.e. relative performance between two different scale factors and configurations. For example, some queries could have been even faster -in both databases- if we had used twice as many DB2 database partitions per server.

A summary article on this and other XML database benchmarks can be found in the IBM Data Management Magazine.

Earlier this year I wrote about XML and Data Warehousing, and I continue to see use cases where companies integrate XML data in their warehouses.

Some companies collect XML messages from transactional systems and use DB2′s pureXML features to extract a subset of the XML elements from each message and integrate these values into the relational tables of their warehouse. In other cases this mapping from XML to relational is avoided and XML documents are stored “as-is” in the warehouse.

There is a very nice podcast that explains in 5 minutes and 47 seconds why and how companies are integrating XML into their data warehouses:
http://www.ibm.com/podcasts/software/data/infosphere/index.rss

If you are interested in another discussion of this topic, you might want to attend this upcoming free Webinar offered by The Data Warehousing Institute (TDWI) :

Title: XML – The Key to the Next Generation Data Warehouse,
Date: Tuesday, June 29, 2010
Time: 9am US Pacific Time, 12noon US Eastern Time
Registration: online

Enjoy!

Earlier this year I was involved in an XML data warehouse benchmark with 10TB worth of XML documents. I will share some details in my next blog post…



An XQuery Cheat Sheet

June 17, 2010

Lee Ackermann has just published a nice article on developerWorks titled “Writing XQuery and SQL/XML queries for DB2 pureXML“. I’m calling it a Cheat Sheet because it’s not a tutorial for how to write XML queries, but a useful list of the most common types of of  XML queries and updates. So, if you have a general idea how to write queries but just can’t seem to get the syntax right, or need an example here and there to refresh your memory, then this article is for you!



In case you haven’t heard about DB2 Express-C: It’s a free of version of the DB2 database server. The “C” stands for “Community Edition”.

And “free” means just that: there is no charge regardless of the number of installations, no matter whether you use it for trial, development, or production purposes. All of the core DB2 features are available, including all XML, SQL/XML, and XQuery features. This makes DB2 Express-C not only an attractive alternative to mySQL, but also to free XML databases such as eXist and others – especially if you need to manage both XML documents and relational data at the same time.

The version number 9.7.2 means that the previous DB2 Express 9.7 has been refreshed with the second Fixpack for DB2 9.7. This also includes a variety of useful enhancements that you can read about in Antonio’s blog and DB2 Express-C team blog.

DB2 Express-C is available on Linux, Windows, Solaris for Intel, and as a beta version also on Mac OS. Also, there is no limit on the database size or the number of users and connections that are supported. The only restriction is that this free version of DB2 will not exploit more than 2GB RAM or more than 2 CPU cores.

Personally I’m using DB2 Express-C on my laptop so that I can prototype XML applications and test XML queries or XML-based stored procedures wherever I go.  And… I’m going to download DB2 Express-C 9.7.2 now so that I have the latest version with all fixes and all the new goodies!

Recently I got asked several times how to get a list of all elements and attributes from an XML document, or from all documents in an XML column. Here are some ideas of how you can get this information with SQL/XML queries.

As a simple example, let’s assume we have the following XML document stored in the XML column PORDER of the table “purchaseorder”:

<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
 <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
 </item>
 <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>5</quantity>
    <price>49.99</price>
 </item>
</PurchaseOrder>

You can run the following query to get a list of all elements in the (depth-first) order in which they appear in the document:

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//*'
      COLUMNS
        seq       FOR ORDINALITY,
        element   VARCHAR(20) PATH 'name(.)',
        parent    VARCHAR(20) PATH '../name(.)') AS T
WHERE poid = 5000;

SEQ    ELEMENT              PARENT
------ -------------------- --------------------
     1 PurchaseOrder
     2 item                 PurchaseOrder
     3 partid               item
     4 name                 item
     5 quantity             item
     6 price                item
     7 item                 PurchaseOrder
     8 partid               item
     9 name                 item
    10 quantity             item
    11 price                item

 11 record(s) selected.

How does that work? The XMLTABLE function produces one row for each XML node that is produced by the so-called row-generating expression, which in this case is //*. The variable $PORDER simply refers to the XML column in the table. The // navigates exhaustively through every branch and every level of the document, and the * matches any element that is encountered. Thus, one row is produced for each element.

Then the COLUMNS clause defines the content of each row in the result set. The column “seq” is defined with the keywords FOR ORDINALITY to simply number the rows sequentially. The column “name” uses the the XPath function name() to retrieve the name of each element. In the third column, “parent”, the XPath ../name() retrieves the name of the parent of each element, which provides us with a sense of the document structure.

If your element names are longer than 20 characters then you may need to increase the length of the VARCHAR(20) columns in the query above.

You probably noted that the query result above did not list any attributes that occur in the original XML document. To get all attributes, you can use //@* instead of //* :

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//@*'
      COLUMNS
       seq         FOR ORDINALITY,
       attribute   VARCHAR(20) PATH 'name(.)',
       parent      VARCHAR(20) PATH '../name(.)') AS T
WHERE poid = 5000;

SEQ   ATTRIBUTE            PARENT
----- -------------------- --------------------
    1 PoNum                PurchaseOrder
    2 OrderDate            PurchaseOrder
    3 Status               PurchaseOrder

 3 record(s) selected.

But, most likely you would like to get a combined list of elements and attributes. To achieve this, combine the expressions //* and //@* as shown in the next query. The expression //(*, @*) means that you want to traverse the entire document and match any element or attribute that is encountered:

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//(*, @*)'
      COLUMNS
       seq    FOR ORDINALITY,
       node   VARCHAR(20) PATH 'name(.)',
       parent VARCHAR(20) PATH '../name(.)') AS T
WHERE poid = 5000;

SEQ   NODE                 PARENT
----- -------------------- --------------------
    1 PurchaseOrder
    2 PoNum                PurchaseOrder
    3 OrderDate            PurchaseOrder
    4 Status               PurchaseOrder
    5 item                 PurchaseOrder
    6 partid               item
    7 name                 item
    8 quantity             item
    9 price                item
   10 item                 PurchaseOrder
   11 partid               item
   12 name                 item
   13 quantity             item
   14 price                item

 14 record(s) selected.

At this point you are probably wondering how to distinguish elements from attributes in the list above. Ok, we can extend the previous query to provide this extra information. The next query has the additional column “type” whose value is defined by an XQuery if-then-else expression that checks whether the current node is an attribute. If not, then we simply assume it’s an element:

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//(*, @*)'
      COLUMNS
       seq    FOR ORDINALITY,
       node   VARCHAR(20) PATH 'name(.)',
       type   VARCHAR(15) PATH 'if (self::attribute())
                                then "Attribute"
                                else "Element"',
       parent VARCHAR(20) PATH '../name(.)' ) AS T
WHERE poid = 5000;

SEQ   NODE                 TYPE            PARENT
----- -------------------- --------------- --------------------
    1 PurchaseOrder        Element
    2 PoNum                Attribute       PurchaseOrder
    3 OrderDate            Attribute       PurchaseOrder
    4 Status               Attribute       PurchaseOrder
    5 item                 Element         PurchaseOrder
    6 partid               Element         item
    7 name                 Element         item
    8 quantity             Element         item
    9 price                Element         item
   10 item                 Element         PurchaseOrder
   11 partid               Element         item
   12 name                 Element         item
   13 quantity             Element         item
   14 price                Element         item

 14 record(s) selected.

And finally, wouldn’t it be nice to also display the value of each element and attribute in this list? The next query includes the column “value” for this purpose. The condition “if (not(./*))” checks whether a node in the document has no children. If this is true, then it’s typically an attribute or a leaf element for which it makes sense to report a value:

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//(*, @*)'
      COLUMNS
        seq    FOR ORDINALITY,
        node   VARCHAR(20) PATH 'name(.)',
        value  VARCHAR(20) PATH 'if (not(./*))
                                 then substring(.,1,20) else ()',
        type   VARCHAR(15) PATH 'if (self::attribute())
                                 then "Attribute"
                                 else "Element"',
        parent VARCHAR(20) PATH '../name(.)') AS T
WHERE poid = 5000;

SEQ NODE           VALUE               TYPE       PARENT
--- -------------- ------------------- ---------- -------------
  1 PurchaseOrder  -                   Element
  2 PoNum          5000                Attribute  PurchaseOrder
  3 OrderDate      2006-02-18          Attribute  PurchaseOrder
  4 Status         Unshipped           Attribute  PurchaseOrder
  5 item           -                   Element    PurchaseOrder
  6 partid         100-100-01          Element    item
  7 name           Snow Shovel, Basic  Element    item
  8 quantity       3                   Element    item
  9 price          9.99                Element    item
 10 item           -                   Element    PurchaseOrder
 11 partid         100-103-01          Element    item
 12 name           Snow Shovel, Super  Element    item
 13 quantity       5                   Element    item
 14 price          49.99               Element    item

 14 record(s) selected.

Ok, you probably get the picture. You can further refine and adjust these queries to your needs. Let me know if you have any questions on any of this…

How and why do companies use XML? In which cases do they decide to store, index, and query XML in a databases? What are some of the common patterns for such XML usage?

We will shed some light on these questions in a Webcast on May 18, at 11am US Eastern Time (16:00 in the UK, 17:00 in Germany, 20:30 in India, etc.).

We will describe XML use cases that are applicable to a variety of industries, such as finance, insurance, healthcare, and government, and we’ll discuss the benefits that XML brings to these application scenarios.

The webcast will be of interest to users who are new to managing XML in a database as well as to experienced XML practitioners. It will also benefit IT architects who have an interest in XML strategies. Basic understanding of databases and of XML will be useful for this webcast.

Detail information such as the URL and telephone numbers for this webcast are available on a separate Wiki page.

XML is a very flexible data and document format, and the content of a document may not always be what you expect. As an example, look at the <order_date> elements in the following four documents. You see that

  • in the 1st document the order_date element contains a valid timestamp
  • in the 2nd document the order_date element is empty
  • in the 3rd document the order_date element is missing
  • in the 4th document the order_date element contains an invalid timestamp value

create table mytest(doc XML);

insert into mytest values('
<order id="1">
 <customer>A</customer>
 <order_date>2010-04-21T22:09:03.75</order_date>
</order>');

insert into mytest values('
<order id="2">
 <customer>B</customer>
 <order_date></order_date>
</order>');

insert into mytest values('
<order id="3">
 <customer>C</customer>
</order>');

insert into mytest values('
<order id="4">
 <customer>D</customer>
 <order_date>May 15</order_date>
</order>');

If each order must have a valid timestamp in the order_date element then you can validate each document against an XML Schema and reject (or otherwise handle) those documents that do no contain a valid order_date.

However, there can be cases where you might prefer (or have to) to insert all documents without validation against a fixed schema. Then you need to handle the “dirty data” appropriately in your queries to avoid errors.

For example, the following query tries to return the order ID, customer ID, and order data from each document in our sample table. But, the query fails -as expected- when it encounters documents 2 or 4, because neither the empty string nor the string “May 15″ is a valid timestamp. Hence, casting the order_date to the SQL type “timestamp” in the XMLTABLE function fails:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'order_date') as T;

SQL16061N  The value "" cannot be constructed as, or cast (using an
implicit or explicit cast) to the data type "xs:dateTime".
Error QName=err:FORG0001.SQLSTATE=10608

The next query contains a predicate to only read orders 1 and 3. The query succeeds because the missing order_date element in order 3 automatically leads to a NULL:

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       -

 2 record(s) selected.

If you prefer the missing element to produce a default value other than NULL, use the DEFAULT clause in the XMLTABLE function. Note that the default value that you choose must match the column date type, in this case timestanp. Here are a couple of examples:

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp  DEFAULT '1900-01-01' path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       1900-01-01-00.00.00.000000

 2 record(s) selected.

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp  DEFAULT current_timestamp path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       2010-05-09-15.24.14.156000

 2 record(s) selected.

Now how do we deal with the invalid timestamp values in documents 2 and 4? If the XML hasn’t been validated against a schema at insert time, a proper type validation at run time can be done with the XQuery construct castable. In the following example, the XQuery if-then-else expression returns the element order_date if its value is a proper timestamp, and otherwise it returns the empty sequence – which is the same as a missing order_date element and leads to a NULL or default value:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'if (order_date castable as xs:dateTime)
                              then order_date
                              else ()'        ) as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
2           B       -
3           C       -
4           D       -

 4 record(s) selected.

This query can be written in a more compact fashion, avoiding the explicit if-then-else construct. You can simply use castable in a XPath predicate in square brackets, in which the dot refers to the current element. The expression order_date[. castable as xs:dateTime] returns the order_date element if its castable to a timestamp (xs:dateTime), otherwise it returns the empty sequence.

select T.*
from mytest, XMLTABLE('$DOC/order'
        COLUMNS
          oid        integer     path '@id',
          cust_id    char(3)     path 'customer',
          order_date timestamp   path 'order_date[. castable as xs:dateTime]') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
2           B       -
3           C       -
4           D       -

 4 record(s) selected.

If you want to return the value of the order_date element regardless of whether it’s a valid timestamp or not, you can simply change the column type in the XMLTABLE function to VARCHAR, as in the next query. However, depending on your application this may or may not be desirable. Note that the order_date column in the result set contains an empty string in the 2nd row, but a NULL value in the 3rd row:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer       path '@id',
 cust_id    char(3)       path 'customer',
 order_date varchar(25)   path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- -------------------------
1           A       2010-05-09T22:09:03.75
2           B
3           C       -
4           D       May 15

 4 record(s) selected.

You can read more about the XMLTABLE function in the developerWorks article “XMLTABLE By Example“, or in Chapter 7 and Chapter 9 of the DB2 pureXML Cookbook.



If you -or your colleagues- are new to managing and querying XML data in DB2, then this 5-part article series “Get off to a fast start with DB2 9 pureXML” might be a good starting point. These articles were originally written for DB2 9.1, but recently all five parts have been updated for DB2 9.5 and 9.7. Here they are:

Part 1: XML to the core (Introduction)

Part 2: Create and populate a DB2 XML database

Part 3: Query DB2 XML data with SQL

Part 4: Query DB2 XML data with XQuery

Part 5: Develop Java applications for DB2 XML data

If you want to get a little bit deeper from here, I can recommend a few additional resources:

For example, if your interest is in querying XML data, take a look at “pureXML in DB2 9: Which way to query your XML data?” or “XMLTABLE by Example“.

For a deep dive on XML updates and transformations, see “Updating XML in DB2 9.5“.

In case you’re a Java developer and interested in the new XML features in JBDC 4.0, read my previous blog post on “XML Support in JDBC 4.0: The SQLXML Interface“.

XML performance topics are covered in “15 Best Practices for pureXML Performance in DB2 9” and “Exploit XML Indexes for XML Query Performance in DB2 9“. These two articles provide tuning and configuration guidelines for XML performance in DB2. But, if you’re more interested in performance results and measurements, see “A Performance Comparison of DB2 9 pureXML with CLOB and Shredded XML Storage or the results of the TPoX benchmark.

If these articles don’t provide what you’re looking for, browse through the DB2 pureXML Wiki or the DB2 pureXML Cookbook. – or post a comment to this blog with any requests or questions you might have!

Happy Reading…

http://www.ibm.com/developerworks/db2/library/techarticle/dm-0710nicola/

If you are using DB2 on the z/OS platform and haven’t had a chance yet to explore its pureXML features, consider attending the DB2 for z/OS pureXML Hands-on Lab at this year’s DB2 Users Group conference (IDUG).

In this 2.5-hour lab you will create tables with XML columns, examine the underlying storage objects, define XML indexes, load XML documents, and use SQL/XML functions -such as XMLTABLE- to query and manipulate XML.

The IDUG 2010 North America conference is on May 10-14 in Tampa/Florida and offers more than 100 technical sessions from DB2 users, experts from third party vendors, DB2 Gold Consultants, and DB2 experts from IBM.

I can highly recommend this IDUG event. I have attended the IDUG conference many times and found it very valuable. There is so much useful information in the sessions, and the networking within the DB2 community is simply priceless (and fun!). The main focus areas are DB2 for Linux, UNIX, Windows and DB2 for z/OS as well as application development and tools for these systems.

If you want to read up on DB2 for z/OS pureXML, here are some good resources:

If you are or will be using XML in a relational database, chances are that you will be using SQL/XML to query XML and relational data in an integrated manner. Most XML applications are not black or white, meaning they do not use XML exclusively without any relationship to any current or existing relational data. More often than not, XML documents somehow relate to some structured data for which the relational data model might be the better choice. This makes the combined management of XML documents and relational data in a single database -often even in a single table!- so appealing.

One of the most popular and most versatile functions in the SQL/XML language standard is the XMLTABLE function. Let’s look a simple example.

First, let’s a create a simple database table that contains a mix of relational data and XML:

create table products(id integer, name varchar(20), details XML);

insert into products
values(1, 'Snow Shovel',
'<product>
  <fullname>Snow Shovel, Deluxe Edition</fullname>
  <category>S5-D</category>
  <color>Silver</color>
</product>');

insert into products
values(2, 'Gloves',
'<product>
  <category>K16-F</category>
  <color>Blue</color>
  <size range="Medium">7</size>
</product>');

Now assume we want to retrieve the id, name, category, color, and size for some or all of the products. We can issue a SQL/XML query such as the following:

select id, name, T.*
from products,
  XMLTABLE('$DETAILS/product'
    COLUMNS
      cat   VARCHAR(10) PATH 'category',
      color VARCHAR(10) PATH 'color',
      size  VARCHAR(10) PATH 'size/@range') as T;

ID          NAME                 CAT        COLOR      SIZE
----------- -------------------- ---------- ---------- ----------
 1          Snow Shovel          S5-D       Silver     -
 2          Gloves               K16-F      Blue       Medium

 2 record(s) selected.

The XMLTABLE function in the “from” clause is a table function that produces zero, one, or multiple rows for each XML document. In this example it produces one row for each XML document in the “product” table. Implicitly there is an inner join between the XMLTABLE function and the “product” table.

Now what happens if there is a row whose XML column is NULL? Let’s insert such a row and repeat the previous query:

insert into products values(3, 'Ice Scraper', NULL);

select id, name, T.*
from products,
  XMLTABLE('$DETAILS/product'
    COLUMNS
      cat   VARCHAR(10) PATH 'category',
      color VARCHAR(10) PATH 'color',
      size  VARCHAR(10) PATH 'size/@range') as T;

ID          NAME                 CAT        COLOR      SIZE
----------- -------------------- ---------- ---------- ----------
 1          Snow Shovel          S5-D       Silver     -
 2          Gloves               K16-F      Blue       Medium

 2 record(s) selected.

You may be surprised that the ID and NAME of the newly inserted product do not show up in this query result. A corresponding row is missing from the result set, because for this row there is no XML document on the XML-side of the inner join.

What we need in this case is a left outer join, so that the relational columns from the base table are returned even if the XML column (the other side of the implicit join) is NULL:

select id, name, T.*
from products LEFT OUTER JOIN
 XMLTABLE('$DETAILS/product'
   COLUMNS
     cat   VARCHAR(10) PATH 'category',
     color VARCHAR(10) PATH 'color',
     size  VARCHAR(10) PATH 'size/@range') as T ON 1=1;

ID          NAME                 CAT        COLOR      SIZE
----------- -------------------- ---------- ---------- ----------
 1          Snow Shovel          S5-D       Silver     -
 2          Gloves               K16-F      Blue       Medium
 3          Ice Scraper          -          -          -

 3 record(s) selected.

The join condition of this left outer join is trivial: 1=1, which always evaluates to true. It means that we want this left outer join to be applied to every row in this table.

You can read more about the XMLTABLE function in the developerWorks article “XMLTABLE By Example“, or in Chapter 7 and Chapter 9 of the DB2 pureXML Cookbook.

The XMLTABLE function is part of the SQL standard. DB2 supports it, Oracle does too, but as far as I know the XMLTABLE function is not available in SQL Server 2008.

More on XML ETL

April 5, 2010

To follow up on my previous post about XML ETL, the second part of the paper on using the InfoSphere Design Studio for ETL with XML has been published:

Using IBM InfoSphere Warehouse Design Studio with pureXML data, Part 2: Create a control flow for multiple ETL jobs involving XML

The first part of this article explained how to define and run an individual ETL data flow that uses XML data as input to populate relational and/or XML columns in a data warehouse. This second part describes how to build a single control flow that calls multiple data flows. Each data flow can extract, transform, and load XML data. A control flow allows you to orchestrate multiple data flows that have to be executed in a specific order. You can develop and test the individual data flows independently and use them as building block to build larger ETL processes, by using a control flow.



XML ETL

March 26, 2010

The handling of XML data has become increasingly important in ETL (extract, transform, and load) processes. Here are some typical use cases:

For example,  you might be accumulating XML messages  in a  transactional database and may need to extract selected values from these messages and add them to a relational data warehouse.

In other cases you might have to move XML data from one system to another and wish to transform the XML documents in the process. Depending on the predominant usage patterns in the target system, you might want to cut large documents into smaller XML fragments, extract only specific sections from each XML document, or rename XML elements. The source as well as the target of such transformations can be XML files in the file system or tables (with XML columns) in a DB2 database.

You can certainly code many useful transformations in SQL/XML, but you might prefer to use a visual tool to design ETL data flows. One  such tool is the IBM InfoSphere Warehouse Design Studio.

A new article on developerworks describes how to use the InfoSphere Design Studio to design ETL flows that involve XML data. As a concrete example, the article walks you through the steps of building a data flow that reads from an XML source table and populates two target tables in a data warehouse. One of the target tables contains only relational data, while the other contains both relational and XML data. Read the article here:

Using IBM InfoSphere Warehouse Design Studio with pureXML data, Part 1: Create an ETL data flow to populate a hybrid data warehouse.

This article is a two-part series and I’ll let you know when part 2 is published.

Due to popular demand there will be several additional DB2 pureXML bootcamps.

A DB2 pureXML bootcamp is a free 2-day (or 3-day) course consisting of presentations and hands-on labs where attendees can gain or deepen their DB2 pureXML skills. It’s for application developers, architects, DBAs, and other interested parties.

One bootcamp will be held in London, UK, on the 27th and 28th of April 2010. You can find more information about the pureXML bootcamp in London, including a link to register, here:
http://www.ibm.com/developerworks/wikis/display/db2xml/BootCampsUK

Also, there have been requests to hold a DB2 pureXML bootcamp in Dallas, Texas.  The tentative time frame is July or August. If you are interested in attending a pureXML bootcamp in Dallas or have any questions, please send me an email or contact my colleague Mary at desisto@us.ibm.com.

When you develop XML applications, you want good tools to design and debug XML artifacts such as XML documents, XML schemas, XQuery queries, SQL/XML statements, XSL style sheets, and other things. On occasions I have seen people use plain text editors such as vi or Notepad for this, but that’s not everyone’s cup of tea. I confess: when I write XQuery or SQL/XML I often use a simple text editor too, because it’s quick and simple. But, if I need to edit an XML Schema or XML document I certainly want a tool that’s XML-aware.

There are varies XML tools that work well with DB2 pureXML. For example, the IBM Data Studio IDE includes features for editing XML Schemas or documents and for designing XML queries. Third party tools that integrate well with DB2 pureXML include Altova XMLSpy, Stylus Studio, and <oXygen/>.

For example, XMLSpy’s integration with DB2 pureXML includes the following capabilities:

  • Edit, debug, and profile XQuery statements against XML data in DB2 databases. Query results are then available for further manipulation in XMLSpy
  • Visualize the database structure and query DB2 tables using SQL, SQL/XML, and XQuery
  • Read XML data from DB2, edit it, and store it back in DB2 with optional schema validation
  • Manage XML Schemas in DB2’s XML Schema Repository. For example, you can design new schemas in XMLSpy and register them in DB2, or read existing XML Schemas from DB2, edit them, and save them back into DB2
  • Transform XML data for use in other applications
  • and more…

If you want to learn more about using XMLSpy with DB2 pureXML, I recommend the following resources:

Website: Altova Tools for DB2 pureXML:
http://www.altova.com/xmlspy/db2-xml-editor.html
http://www.altova.com/solutions/ibm-db2-tools.html
http://www.altova.com/IBM-AltovaPartnership_120506.html

White Paper: Integration of Altova Tools with IBM DB2 pureXML:
http://www.altova.com/whitepapers/ibm.pdf

Tutorial: Using the Altova Tools with IBM DB2 pureXML:
http://www.ibm.com/developerworks/db2/library/long/dm-0712kogan/

Follow

Get every new post delivered to your Inbox.

Join 28 other followers