Recently I got asked several times how to get a list of all elements and attributes from an XML document, or from all documents in an XML column. Here are some ideas of how you can get this information with SQL/XML queries.

As a simple example, let’s assume we have the following XML document stored in the XML column PORDER of the table “purchaseorder”:

<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
 <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
 </item>
 <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>5</quantity>
    <price>49.99</price>
 </item>
</PurchaseOrder>

You can run the following query to get a list of all elements in the (depth-first) order in which they appear in the document:

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//*'
      COLUMNS
        seq       FOR ORDINALITY,
        element   VARCHAR(20) PATH 'name(.)',
        parent    VARCHAR(20) PATH '../name(.)') AS T
WHERE poid = 5000;

SEQ    ELEMENT              PARENT
------ -------------------- --------------------
     1 PurchaseOrder
     2 item                 PurchaseOrder
     3 partid               item
     4 name                 item
     5 quantity             item
     6 price                item
     7 item                 PurchaseOrder
     8 partid               item
     9 name                 item
    10 quantity             item
    11 price                item

 11 record(s) selected.

How does that work? The XMLTABLE function produces one row for each XML node that is produced by the so-called row-generating expression, which in this case is //*. The variable $PORDER simply refers to the XML column in the table. The // navigates exhaustively through every branch and every level of the document, and the * matches any element that is encountered. Thus, one row is produced for each element.

Then the COLUMNS clause defines the content of each row in the result set. The column “seq” is defined with the keywords FOR ORDINALITY to simply number the rows sequentially. The column “name” uses the the XPath function name() to retrieve the name of each element. In the third column, “parent”, the XPath ../name() retrieves the name of the parent of each element, which provides us with a sense of the document structure.

If your element names are longer than 20 characters then you may need to increase the length of the VARCHAR(20) columns in the query above.

You probably noted that the query result above did not list any attributes that occur in the original XML document. To get all attributes, you can use //@* instead of //* :

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//@*'
      COLUMNS
       seq         FOR ORDINALITY,
       attribute   VARCHAR(20) PATH 'name(.)',
       parent      VARCHAR(20) PATH '../name(.)') AS T
WHERE poid = 5000;

SEQ   ATTRIBUTE            PARENT
----- -------------------- --------------------
    1 PoNum                PurchaseOrder
    2 OrderDate            PurchaseOrder
    3 Status               PurchaseOrder

 3 record(s) selected.

But, most likely you would like to get a combined list of elements and attributes. To achieve this, combine the expressions //* and //@* as shown in the next query. The expression //(*, @*) means that you want to traverse the entire document and match any element or attribute that is encountered:

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//(*, @*)'
      COLUMNS
       seq    FOR ORDINALITY,
       node   VARCHAR(20) PATH 'name(.)',
       parent VARCHAR(20) PATH '../name(.)') AS T
WHERE poid = 5000;

SEQ   NODE                 PARENT
----- -------------------- --------------------
    1 PurchaseOrder
    2 PoNum                PurchaseOrder
    3 OrderDate            PurchaseOrder
    4 Status               PurchaseOrder
    5 item                 PurchaseOrder
    6 partid               item
    7 name                 item
    8 quantity             item
    9 price                item
   10 item                 PurchaseOrder
   11 partid               item
   12 name                 item
   13 quantity             item
   14 price                item

 14 record(s) selected.

At this point you are probably wondering how to distinguish elements from attributes in the list above. Ok, we can extend the previous query to provide this extra information. The next query has the additional column “type” whose value is defined by an XQuery if-then-else expression that checks whether the current node is an attribute. If not, then we simply assume it’s an element:

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//(*, @*)'
      COLUMNS
       seq    FOR ORDINALITY,
       node   VARCHAR(20) PATH 'name(.)',
       type   VARCHAR(15) PATH 'if (self::attribute())
                                then "Attribute"
                                else "Element"',
       parent VARCHAR(20) PATH '../name(.)' ) AS T
WHERE poid = 5000;

SEQ   NODE                 TYPE            PARENT
----- -------------------- --------------- --------------------
    1 PurchaseOrder        Element
    2 PoNum                Attribute       PurchaseOrder
    3 OrderDate            Attribute       PurchaseOrder
    4 Status               Attribute       PurchaseOrder
    5 item                 Element         PurchaseOrder
    6 partid               Element         item
    7 name                 Element         item
    8 quantity             Element         item
    9 price                Element         item
   10 item                 Element         PurchaseOrder
   11 partid               Element         item
   12 name                 Element         item
   13 quantity             Element         item
   14 price                Element         item

 14 record(s) selected.

And finally, wouldn’t it be nice to also display the value of each element and attribute in this list? The next query includes the column “value” for this purpose. The condition “if (not(./*))” checks whether a node in the document has no children. If this is true, then it’s typically an attribute or a leaf element for which it makes sense to report a value:

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//(*, @*)'
      COLUMNS
        seq    FOR ORDINALITY,
        node   VARCHAR(20) PATH 'name(.)',
        value  VARCHAR(20) PATH 'if (not(./*))
                                 then substring(.,1,20) else ()',
        type   VARCHAR(15) PATH 'if (self::attribute())
                                 then "Attribute"
                                 else "Element"',
        parent VARCHAR(20) PATH '../name(.)') AS T
WHERE poid = 5000;

SEQ NODE           VALUE               TYPE       PARENT
--- -------------- ------------------- ---------- -------------
  1 PurchaseOrder  -                   Element
  2 PoNum          5000                Attribute  PurchaseOrder
  3 OrderDate      2006-02-18          Attribute  PurchaseOrder
  4 Status         Unshipped           Attribute  PurchaseOrder
  5 item           -                   Element    PurchaseOrder
  6 partid         100-100-01          Element    item
  7 name           Snow Shovel, Basic  Element    item
  8 quantity       3                   Element    item
  9 price          9.99                Element    item
 10 item           -                   Element    PurchaseOrder
 11 partid         100-103-01          Element    item
 12 name           Snow Shovel, Super  Element    item
 13 quantity       5                   Element    item
 14 price          49.99               Element    item

 14 record(s) selected.

Ok, you probably get the picture. You can further refine and adjust these queries to your needs. Let me know if you have any questions on any of this…

How and why do companies use XML? In which cases do they decide to store, index, and query XML in a databases? What are some of the common patterns for such XML usage?

We will shed some light on these questions in a Webcast on May 18, at 11am US Eastern Time (16:00 in the UK, 17:00 in Germany, 20:30 in India, etc.).

We will describe XML use cases that are applicable to a variety of industries, such as finance, insurance, healthcare, and government, and we’ll discuss the benefits that XML brings to these application scenarios.

The webcast will be of interest to users who are new to managing XML in a database as well as to experienced XML practitioners. It will also benefit IT architects who have an interest in XML strategies. Basic understanding of databases and of XML will be useful for this webcast.

Detail information such as the URL and telephone numbers for this webcast are available on a separate Wiki page.

XML is a very flexible data and document format, and the content of a document may not always be what you expect. As an example, look at the <order_date> elements in the following four documents. You see that

  • in the 1st document the order_date element contains a valid timestamp
  • in the 2nd document the order_date element is empty
  • in the 3rd document the order_date element is missing
  • in the 4th document the order_date element contains an invalid timestamp value
create table mytest(doc XML);

insert into mytest values('
<order id="1">
 <customer>A</customer>
 <order_date>2010-04-21T22:09:03.75</order_date>
</order>');

insert into mytest values('
<order id="2">
 <customer>B</customer>
 <order_date></order_date>
</order>');

insert into mytest values('
<order id="3">
 <customer>C</customer>
</order>');

insert into mytest values('
<order id="4">
 <customer>D</customer>
 <order_date>May 15</order_date>
</order>');

If each order must have a valid timestamp in the order_date element then you can validate each document against an XML Schema and reject (or otherwise handle) those documents that do no contain a valid order_date.

However, there can be cases where you might prefer (or have to) to insert all documents without validation against a fixed schema. Then you need to handle the “dirty data” appropriately in your queries to avoid errors.

For example, the following query tries to return the order ID, customer ID, and order data from each document in our sample table. But, the query fails -as expected- when it encounters documents 2 or 4, because neither the empty string nor the string “May 15” is a valid timestamp. Hence, casting the order_date to the SQL type “timestamp” in the XMLTABLE function fails:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'order_date') as T;

SQL16061N  The value "" cannot be constructed as, or cast (using an
implicit or explicit cast) to the data type "xs:dateTime".
Error QName=err:FORG0001.SQLSTATE=10608

The next query contains a predicate to only read orders 1 and 3. The query succeeds because the missing order_date element in order 3 automatically leads to a NULL:

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       -

 2 record(s) selected.

If you prefer the missing element to produce a default value other than NULL, use the DEFAULT clause in the XMLTABLE function. Note that the default value that you choose must match the column date type, in this case timestanp. Here are a couple of examples:

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp  DEFAULT '1900-01-01' path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       1900-01-01-00.00.00.000000

 2 record(s) selected.

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp  DEFAULT current_timestamp path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       2010-05-09-15.24.14.156000

 2 record(s) selected.

Now how do we deal with the invalid timestamp values in documents 2 and 4? If the XML hasn’t been validated against a schema at insert time, a proper type validation at run time can be done with the XQuery construct castable. In the following example, the XQuery if-then-else expression returns the element order_date if its value is a proper timestamp, and otherwise it returns the empty sequence – which is the same as a missing order_date element and leads to a NULL or default value:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'if (order_date castable as xs:dateTime)
                              then order_date
                              else ()'        ) as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
2           B       -
3           C       -
4           D       -

 4 record(s) selected.

This query can be written in a more compact fashion, avoiding the explicit if-then-else construct. You can simply use castable in a XPath predicate in square brackets, in which the dot refers to the current element. The expression order_date[. castable as xs:dateTime] returns the order_date element if its castable to a timestamp (xs:dateTime), otherwise it returns the empty sequence.

select T.*
from mytest, XMLTABLE('$DOC/order'
        COLUMNS
          oid        integer     path '@id',
          cust_id    char(3)     path 'customer',
          order_date timestamp   path 'order_date[. castable as xs:dateTime]') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
2           B       -
3           C       -
4           D       -

 4 record(s) selected.

If you want to return the value of the order_date element regardless of whether it’s a valid timestamp or not, you can simply change the column type in the XMLTABLE function to VARCHAR, as in the next query. However, depending on your application this may or may not be desirable. Note that the order_date column in the result set contains an empty string in the 2nd row, but a NULL value in the 3rd row:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer       path '@id',
 cust_id    char(3)       path 'customer',
 order_date varchar(25)   path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- -------------------------
1           A       2010-05-09T22:09:03.75
2           B
3           C       -
4           D       May 15

 4 record(s) selected.

You can read more about the XMLTABLE function in the developerWorks article “XMLTABLE By Example“, or in Chapter 7 and Chapter 9 of the DB2 pureXML Cookbook.