Recently I received an interesting question about a date that is not a really date. Take the following XML element as an exanple:

<StartTime>Monday Mar 23 2009 10:20:24.19</StartTime>

This element contains a value that describes a date and a time. The problem is that the string “Monday Mar 23 2009 10:20:24.19” is really only a string and not a valid representation of a date or a timestamp in the world of XML.

The XML Schema specification defines a specific data type for timestamps: xs:dateTime. According to that spec, literal values must have a specific format if they want to be of type xs:dateTime, or better: if you want to process them as xs:dateTime with a language such as XPath or XQuery. For example, the value 2002-10-10T12:00:00 is a valid xs:dateTime value.

Unless you deal with negative dates, xs:dateTime values must be of the form yyyy-mm-dd’T’hh.mm.ss, plus optional fractional seconds and an optional timezone which I am omitting here for simplicity. The character ‘T’ is required as a delimiter between the date and the time portion of the literal.

So, how can you convert the original string value into a valid xs:dateTime value? Also, how can can you convert that string into a proper SQL timestamp?

Let’s start with an XQuery solution to convert the string above to xs:dateTime, and then we look at a SQL/XML solution to produce a SQL timestamp. Both approaches assume, for simplicity, that the input data is a small document in an XML column:

create table mytest(doc XML);
insert into mytest values ('<root><StartTime>Monday Mar 23 2009 10:20:24.19</StartTime></root>');

XQuery

The main trick is to break the string into its pieces for year, months, day, time, and so on. This can be done with the fn:tokenize function that I wrote about earlier. Go back to my post about fn:tokenize if you need to refresh your memory on that function.

Here is a first XQuery that is not quite perfect yet, but a good start:

xquery 
for $t in db2-fn:xmlcolumn('MYTEST.DOC')/root/StartTime
let $pieces := fn:tokenize($t, " ")
let $timestamp := fn:concat($pieces[4],"-",$pieces[2],"-",$pieces[3],"T",$pieces[5])
return $timestamp;
----------------------------------
2009-Mar-23T10:20:24.19
 1 record(s) selected.

In the “for” clause,  the StartTime element gets assigned to the variable $t. In the first “let” clause, the function fn:tokenize splits the string value of StartTime along the spaces in the string. The second parameter of the fn:tokenize function specifies that the blank character is a delimiter in the string, and the string should be tokenized along that delimiter. The variable $piece now contains a sequence of tokens.

In the second “let” clause we use the function fn:concat to concatenate the tokens in a different order and with different delimiters. We start with the 4th token from the input string (year), append a dash, then append the 2nd token (month), then another dash, then the 3rd token (day), and so on.

The result is almost what we want. The remaining task is to convert the month “Mar” into a “03”. And of course we need to be able to perform this conversion for any of the 12 months that can possibly occur. For example, you could think of a 12-way if-then-else expression, but that would be rather ugly.

A more elegant approach is to build simple “month map” from which we pick the right month. Look at the two additional “let” clauses in bold below. The “let $map” defines a constant piece of XML that maps the three-letter month names to the numbers 01, 02, 03, and so on. Then the “let $month” performs a lookup in that map to pick the element whose name matches $piece[2], which is “Mar” in our example, and the /text() value of the element <Mar> is “03”. This “03” is then used in the concat of the final result:

xquery 
for $t in db2-fn:xmlcolumn('MYTEST.DOC')/root/StartTime
let $pieces := fn:tokenize($t, " ")
let $map := <map><Jan>01</Jan><Feb>02</Feb><Mar>03</Mar><Apr>04</Apr>. . . </map>
let $month := $map/*[name()=$pieces[2]]/text()
let $timestamp := fn:concat($pieces[4],"-", $month,"-",$pieces[3],"T",$pieces[5])
return xs:dateTime($timestamp);
----------------------------------
2009-03-23T10:20:24.19 
 1 record(s) selected.

Also n0te that the cast to xs:dateTime in the “return” clause proves that we have produced a valid xs:dateTime value.

SQL/XML

Now that we have the XQuery solution above, it is trival to plug this into an SQL/XML query to produce a SQL timestamp:

SELECT XMLCAST ( XMLQUERY('for $t in $DOC/root/StartTime
   let $pieces := fn:tokenize($t, " ")
   let $map := <map><Jan>01</Jan><Feb>02</Feb><Mar>03</Mar><Apr>04</Apr> </map>
   let $month := $map/*[name()=$pieces[2]]/text()
   let $timestamp := fn:concat($pieces[4],"-", $month,"-",$pieces[3],"T",$pieces[5])
   return $timestamp')  AS TIMESTAMP)
FROM mytest;
--------------------------
2009-03-23-10.20.24.190000

 1 record(s) selected.

The XMLQUERY function contains the same XQuery expression as before to produce an xs:dateTime value, which the XMLCAST function can than cast to the SQL type TIMESTAMP. Simple as that.

As always, there are multiple solutions to every problem, so here is another one. This query use fn:tokenize to split the string into its five pieces and each piece is returned by the XMLTABLE function as a separate relational column. The expression in the SELECT clause concatenates the pieces in the right order, and uses a good old SQL CASE expression to map the three-latter month name to the equivalent two-digit number:

SELECT year || '-' || 
   (CASE
      WHEN month = 'Jan' THEN '01'
      WHEN month = 'Feb' THEN '02'
      WHEN month = 'Mar' THEN '03'
      WHEN month = 'Apr' THEN '04'
      WHEN month = 'May' THEN '05'
      WHEN month = 'Jun' THEN '06'
      WHEN month = 'Jul' THEN '07'
      WHEN month = 'Aug' THEN '08'
      WHEN month = 'Sep' THEN '09'
      WHEN month = 'Oct' THEN '10'
      WHEN month = 'Nov' THEN '11'
      WHEN month = 'Dec' THEN '12'        
      ELSE 'unknown'
    END)      
  || '-' || day  || '-' || time
FROM mytest, XMLTABLE('$DOC/root' 
 COLUMNS
   dayOfweek  VARCHAR(20)  PATH 'fn:tokenize(StartTime, " ")[1]',
   month      CHAR(3)      PATH 'fn:tokenize(StartTime, " ")[2]',
   day        INTEGER      PATH 'fn:tokenize(StartTime, " ")[3]',
   year       INTEGER      PATH 'fn:tokenize(StartTime, " ")[4]',
   time       VARCHAR(32)  PATH 'fn:tokenize(StartTime, " ")[5]'          
 ) T;

----------------------------------------------------------------
2009-03-23-10:20:24.19

 1 record(s) selected.

Which of these solutions do you like best?

TPoX stands for “Transaction Processing over XML” and is an XML database benchmark that Intel and IBM have developed several years ago and then released as open source. TPoX is an application-level benchmark, which means that it simulates an actual application that performs queries, inserts, updates and deletes in a concurrent multi-user workload.

Of course, no single benchmark can represent the characteristics of all types of database applications. In fact, a benchmark should not even try to do that because it would end up being an unrealistic workload. Take the traditional relational database benchmarks as an example: the TPC-C benchmark is an OLTP workload while the TPC-H benchmark mimicks a decision support scenario, two very different usage patterns for relational databases, hence two different benchmarks.

TPoX is designed to be an XML transaction processing (OLTP) benchmark with data-oriented XML structures, very large numbers of relatively small XML documents (1kb to 20kb), short read/write transactions, and a high degree of concurrency.

Many characteristics of real-world financial XML applications have gone into the design of TPoX. For example, the “order” messages are defined by the FIXML standard that is used in the financial industry. The actual FIXML schema is used for document validation in TPoX. To give you an idea of its complexity: FIXML Version 4.4 consists of 41 schema documents (XSD files) and defines more than 3600 elements and attributes that can appear in FIXML messages.

Obviously, TPoX does not intend to be a content-oriented XML benchmark where full-text search and other content-centric operations would be exercised. Another XML benchmark is needed to cover that space.

Since 2006 both Intel and IBM have published TPoX results for various different scale factors and hardware platforms. In 2009 and 2010 Intel has continued to run the TPoX benchmark internally to evaluate their ever evolving fleet of Xeon processors. Four benchmark results from that period have now been published on the TPoX web site, showcasing the performance of the Intel Xeon processors X7560, X5680, X5570, and X7460.

These benchmark results from Intel are vary valuable in several ways:

  • The results prove XML processing performance on different hardware choices, ranging from 2 x 4 cores all the way to 4 x 8 cores in a single server.
  • They quantify the performance gain that each new generation of Xeon processors provides over the previous.
  • These TPoX benchmarks continuously push hardware, operating system, and database server to higher levels of performance. This helps to identify and exploit opportunities for performance enhancements which ultimately drives innovation, advances technology, and improves products quality.
  • The range of TPoX results serves as yard stick for companies that need to choose hardware and software for their XML database needs.

Other software vendors (including Oracle) as well as Universities have also been using TPoX for a variety of purposes. Although some database vendors other than IBM have been using TPoX, to the best of my knowledge they haven’t disclosed the details of their testing, such as the exact results together with detailed information about how the hardware and database were configured.