How to deal with dirty data

May 9, 2010

XML is a very flexible data and document format, and the content of a document may not always be what you expect. As an example, look at the <order_date> elements in the following four documents. You see that

  • in the 1st document the order_date element contains a valid timestamp
  • in the 2nd document the order_date element is empty
  • in the 3rd document the order_date element is missing
  • in the 4th document the order_date element contains an invalid timestamp value
create table mytest(doc XML);

insert into mytest values('
<order id="1">
 <customer>A</customer>
 <order_date>2010-04-21T22:09:03.75</order_date>
</order>');

insert into mytest values('
<order id="2">
 <customer>B</customer>
 <order_date></order_date>
</order>');

insert into mytest values('
<order id="3">
 <customer>C</customer>
</order>');

insert into mytest values('
<order id="4">
 <customer>D</customer>
 <order_date>May 15</order_date>
</order>');

If each order must have a valid timestamp in the order_date element then you can validate each document against an XML Schema and reject (or otherwise handle) those documents that do no contain a valid order_date.

However, there can be cases where you might prefer (or have to) to insert all documents without validation against a fixed schema. Then you need to handle the “dirty data” appropriately in your queries to avoid errors.

For example, the following query tries to return the order ID, customer ID, and order data from each document in our sample table. But, the query fails -as expected- when it encounters documents 2 or 4, because neither the empty string nor the string “May 15″ is a valid timestamp. Hence, casting the order_date to the SQL type “timestamp” in the XMLTABLE function fails:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'order_date') as T;

SQL16061N  The value "" cannot be constructed as, or cast (using an
implicit or explicit cast) to the data type "xs:dateTime".
Error QName=err:FORG0001.SQLSTATE=10608

The next query contains a predicate to only read orders 1 and 3. The query succeeds because the missing order_date element in order 3 automatically leads to a NULL:

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       -

 2 record(s) selected.

If you prefer the missing element to produce a default value other than NULL, use the DEFAULT clause in the XMLTABLE function. Note that the default value that you choose must match the column date type, in this case timestanp. Here are a couple of examples:

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp  DEFAULT '1900-01-01' path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       1900-01-01-00.00.00.000000

 2 record(s) selected.

select T.*
from mytest, XMLTABLE('$DOC/order[@id = (1,3)]'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp  DEFAULT current_timestamp path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
3           C       2010-05-09-15.24.14.156000

 2 record(s) selected.

Now how do we deal with the invalid timestamp values in documents 2 and 4? If the XML hasn’t been validated against a schema at insert time, a proper type validation at run time can be done with the XQuery construct castable. In the following example, the XQuery if-then-else expression returns the element order_date if its value is a proper timestamp, and otherwise it returns the empty sequence – which is the same as a missing order_date element and leads to a NULL or default value:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer     path '@id',
 cust_id    char(3)     path 'customer',
 order_date timestamp   path 'if (order_date castable as xs:dateTime)
                              then order_date
                              else ()'        ) as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
2           B       -
3           C       -
4           D       -

 4 record(s) selected.

This query can be written in a more compact fashion, avoiding the explicit if-then-else construct. You can simply use castable in a XPath predicate in square brackets, in which the dot refers to the current element. The expression order_date[. castable as xs:dateTime] returns the order_date element if its castable to a timestamp (xs:dateTime), otherwise it returns the empty sequence.

select T.*
from mytest, XMLTABLE('$DOC/order'
        COLUMNS
          oid        integer     path '@id',
          cust_id    char(3)     path 'customer',
          order_date timestamp   path 'order_date[. castable as xs:dateTime]') as T;

OID         CUST_ID ORDER_DATE
----------- ------- --------------------------
1           A       2010-04-21-22.09.03.750000
2           B       -
3           C       -
4           D       -

 4 record(s) selected.

If you want to return the value of the order_date element regardless of whether it’s a valid timestamp or not, you can simply change the column type in the XMLTABLE function to VARCHAR, as in the next query. However, depending on your application this may or may not be desirable. Note that the order_date column in the result set contains an empty string in the 2nd row, but a NULL value in the 3rd row:

select T.*
from mytest, XMLTABLE('$DOC/order'
 COLUMNS
 oid        integer       path '@id',
 cust_id    char(3)       path 'customer',
 order_date varchar(25)   path 'order_date') as T;

OID         CUST_ID ORDER_DATE
----------- ------- -------------------------
1           A       2010-05-09T22:09:03.75
2           B
3           C       -
4           D       May 15

 4 record(s) selected.

You can read more about the XMLTABLE function in the developerWorks article “XMLTABLE By Example“, or in Chapter 7 and Chapter 9 of the DB2 pureXML Cookbook.


About these ads

One Response to “How to deal with dirty data”

  1. Bryan Says:

    Matthias,
    This will be very useful to customers using XMLTABLE with uncertain data.

    Combining your ideas for checking for a valid XML data type value and issuing a DEFAULT value when the data isn’t valid, one could use either the following:

    order_date timestamp default ’2000-01-01′ path ‘if (order_date castable as xs:dateTime) then order_date else ()’
    or
    order_date timestamp default ’2000-01-01′ path ‘order_date[. castable as xs:dateTime]‘

    One final alternative would be to put the default dateTime value into the ELSE clause of the first example above.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 50 other followers

%d bloggers like this: