Good and bad XML design

December 4, 2012

Whenever people design things, such as houses, software systems, database schemas, or even XML structures, there are good and not so good design options. Roughly speaking, a design is usually considered “good” if it allows for easy and (cost-)efficient use and implementation of whatever is being built.

For XML documents we have seen various cases of “good” and “bad” designs over the years. Previously I wrote about name/value pair design for XML documents, which often is not a good choice.

Recently I received a question regarding another design choice, which I would like to share here in a simplified form.

The question was about an XML message format to transmit a list of order numbers and the corresponding order system identifiers. Two alternative XML structures were proposed:

Option A:

<OrderList>
  <Order>
    <OrderNumber>A0000001</OrderNumber>
    <OrderSystem>ABC</OrderSystem>
  </Order>
  <Order>
    <OrderNumber>B0000001</OrderNumber>
    <OrderSystem>XP1</OrderSystem>
  </Order>
  <Order>
    <OrderNumber>C0000001</OrderNumber>
    <OrderSystem>Q-9</OrderSystem>
  </Order>
</OrderList>

Option B:

<OrderList>
    <OrderNumber>A0000001</OrderNumber>
    <OrderSystem>ABC</OrderSystem>
    <OrderNumber>B0000001</OrderNumber>
    <OrderSystem>XP1</OrderSystem>
    <OrderNumber>C0000001</OrderNumber>
    <OrderSystem>Q-9</OrderSystem>
</OrderList>

The motivation for option B might be a smaller message size. One might argue that XML elements are by definition ordered, and if we know that the message consists of pairs of order number and order system, then XML format B is as easy to interpret as format A. Or, is it not?

The problem that I see with option B is that the absence of the <Order> elements gives up the explicit structure of the data and instead requires implicit knowledge, i.e. one must know that the OrderNumber at position i belongs to the OrderSystem at position i+1.

In my view, this design is against the principles of XML. In XML, the tags and their nesting should describe the data explicitly. XML is meant to be a hierarchical data format so that data items that belong together are grouped under the same parent (as in option A). After all, that is the benefit of XML over a flat file format.

I also think that using the explicit structure of option A is less error-prone than using option B, which requires implicit knowledge based on element positions.
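To make the point about implicit knowledge concrete, here is a minimal Python sketch (not DB2 code) of the check a consumer of option B would have to write by hand to confirm that the elements really alternate as expected. The function name positional_errors and the EXPECTED_TAGS tuple are made up for this illustration. With option A, no such check on element positions is needed because each <Order> element already groups its own fields.

```python
import xml.etree.ElementTree as ET

# The implicit rule of option B, written down explicitly (illustrative names).
EXPECTED_TAGS = ("OrderNumber", "OrderSystem")

def positional_errors(xml_text):
    """Return a list of problems found in an option-B style message."""
    children = list(ET.fromstring(xml_text))
    errors = []
    if len(children) % len(EXPECTED_TAGS) != 0:
        errors.append(
            f"{len(children)} child elements is not a multiple of {len(EXPECTED_TAGS)}"
        )
    for index, child in enumerate(children):
        expected = EXPECTED_TAGS[index % len(EXPECTED_TAGS)]
        if child.tag != expected:
            errors.append(
                f"position {index + 1}: expected <{expected}>, found <{child.tag}>"
            )
    return errors

GOOD_B = """<OrderList>
  <OrderNumber>A0000001</OrderNumber><OrderSystem>ABC</OrderSystem>
  <OrderNumber>B0000001</OrderNumber><OrderSystem>XP1</OrderSystem>
</OrderList>"""

# Same message, but the first OrderSystem was lost somewhere along the way.
BROKEN_B = """<OrderList>
  <OrderNumber>A0000001</OrderNumber>
  <OrderNumber>B0000001</OrderNumber><OrderSystem>XP1</OrderSystem>
</OrderList>"""

good_errors = positional_errors(GOOD_B)
broken_errors = positional_errors(BROKEN_B)
for problem in broken_errors:
    print(problem)
```

A single missing element shifts every later position, so the check reports a problem for each element that follows the gap.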

How about queryability?

It turns out that XML format A is considerably easier to query than option B. Let’s assume you need to code a SQL/XML query to list the order numbers and systems. For that exercise we assume that the messages are stored in a table “mytable” with XML column “doc”.

The query for XML option A is very simple, using the XMLTABLE function (http://www.ibm.com/developerworks/data/library/techarticle/dm-0708nicola/) in its most basic form. The row-generating XPath expression $DOC/OrderList/Order iterates over the orders, and the column expressions extract the order number and system.

-- Query for XML option A:

SELECT OrderNumber, OrderSystem
FROM mytable,
     XMLTABLE ('$DOC/OrderList/Order'
       COLUMNS
         OrderSystem VARCHAR(20) PATH 'OrderSystem',
         OrderNumber VARCHAR(20) PATH 'OrderNumber' );

ORDERNUMBER          ORDERSYSTEM
-------------------- --------------------
A0000001             ABC
B0000001             XP1
C0000001             Q-9

3 record(s) selected.

How can you produce the same result set from XML format B? This is remarkably tricky! You need to iterate over the child elements and keep track of which element is at position i versus i+1 to produce the proper pairs in the output.

There are likely several ways to code the query for XML option B, and one solution is shown below. It seems hard to query the elements by name alone because the important structural information is by position. Hence, this query iterates over the positions 1 to N, where N is the number of elements. For each even position 2, 4, 6, etc. the query produces the pair of the previous ($pos - 1) and the current ($pos) element to construct the proper output.

-- Query for XML option B:

SELECT OrderNumber, OrderSystem
FROM mytable,
     XMLTABLE('for $pos in (1 to count($DOC/OrderList/*))
               where $pos mod 2 = 0
               return  <pair>
                         {$DOC/OrderList/*[$pos - 1]}
                         {$DOC/OrderList/*[$pos]}
                       </pair>'
       COLUMNS
         OrderNumber VARCHAR(20) PATH 'OrderNumber',
         OrderSystem VARCHAR(20) PATH 'OrderSystem' );

ORDERNUMBER          ORDERSYSTEM
-------------------- --------------------
A0000001             ABC
B0000001             XP1
C0000001             Q-9

3 record(s) selected.

This query is not only more complex but also potentially less efficient. Note that the return clause of the FLWOR expression constructs the explicit pairing that is missing in XML format B.
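For readers who find a general-purpose language easier to follow, the positional logic of the option B query can be sketched in Python. This is only a loose analog of the XQuery, not something that runs in DB2, and the function name build_pairs is made up for the illustration. It walks the positions 1 to N, and at each even position it wraps the previous and current element in a new pair element, which can then be read by name.

```python
import xml.etree.ElementTree as ET

OPTION_B = """<OrderList>
    <OrderNumber>A0000001</OrderNumber>
    <OrderSystem>ABC</OrderSystem>
    <OrderNumber>B0000001</OrderNumber>
    <OrderSystem>XP1</OrderSystem>
    <OrderNumber>C0000001</OrderNumber>
    <OrderSystem>Q-9</OrderSystem>
</OrderList>"""

def build_pairs(xml_text):
    """Rebuild the grouping that option B leaves implicit."""
    children = list(ET.fromstring(xml_text))
    pairs = []
    # Positions are 1-based here to mirror the XQuery; Python lists are 0-based.
    for pos in range(1, len(children) + 1):
        if pos % 2 == 0:
            pair = ET.Element("pair")
            pair.append(children[pos - 2])  # element at position pos - 1
            pair.append(children[pos - 1])  # element at position pos
            pairs.append(pair)
    return pairs

# Once the pairs exist, the fields can be extracted by name, as in option A.
rows = [
    (pair.findtext("OrderNumber"), pair.findtext("OrderSystem"))
    for pair in build_pairs(OPTION_B)
]
for order_number, order_system in rows:
    print(order_number, order_system)
```

In effect the code converts format B into format A before it can do anything useful with the data.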

Let me know if you can think of a simpler query to produce the same result from XML format B.

Next, imagine a message that doesn’t have just 2 fields per order but maybe 10 or 100 fields per order. Then option B becomes increasingly ugly.

Or worse, what if there is a schema change that allows one order to have a different number of fields than the next? Then the query that we coded for XML format B will no longer work because it was based on a specific structural assumption that is no longer true. In contrast, our query for XML format A will work even if you add additional fields to some of the orders in the message.
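The effect of such a schema change can be sketched in Python, again as an analog rather than DB2 code. The OrderDate field and the function names rows_by_name and rows_by_position are assumptions made up for this example. The first order gains an extra field, and the two reading strategies are applied to the corresponding messages.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema change: the first order carries an extra OrderDate field.
OPTION_A_V2 = """<OrderList>
  <Order>
    <OrderNumber>A0000001</OrderNumber>
    <OrderDate>2012-12-04</OrderDate>
    <OrderSystem>ABC</OrderSystem>
  </Order>
  <Order>
    <OrderNumber>B0000001</OrderNumber>
    <OrderSystem>XP1</OrderSystem>
  </Order>
</OrderList>"""

OPTION_B_V2 = """<OrderList>
  <OrderNumber>A0000001</OrderNumber>
  <OrderDate>2012-12-04</OrderDate>
  <OrderSystem>ABC</OrderSystem>
  <OrderNumber>B0000001</OrderNumber>
  <OrderSystem>XP1</OrderSystem>
</OrderList>"""

def rows_by_name(xml_text):
    """Option A style: one row per Order, fields looked up by name."""
    root = ET.fromstring(xml_text)
    return [
        (order.findtext("OrderNumber"), order.findtext("OrderSystem"))
        for order in root.findall("Order")
    ]

def rows_by_position(xml_text):
    """Option B style: pair up neighbours, then look up fields by name."""
    children = list(ET.fromstring(xml_text))
    rows = []
    for i in range(0, len(children) - 1, 2):
        pair = {child.tag: child.text for child in children[i:i + 2]}
        rows.append((pair.get("OrderNumber"), pair.get("OrderSystem")))
    return rows

rows_a = rows_by_name(OPTION_A_V2)
rows_b = rows_by_position(OPTION_B_V2)
print(rows_a)
print(rows_b)
```

The by-name reader simply ignores the new field. The positional reader loses the system of the first order, attaches that system to the second order, and drops XP1 entirely, all without raising an error.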

In summary, the benefits for XML option A include the following:

  • Explicit structure rather than implicit positional format
  • Easier and more efficient to query (query by name rather than query by position)
  • More resilient to schema changes

I think that these benefits far outweigh the concern that XML format A is larger. Also, you could choose to send or store the XML in a compressed format to greatly alleviate space concerns.
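A rough way to get a feel for the size argument is to generate both formats and compress them. The Python sketch below uses zlib on synthetic messages of 1000 orders. The helper build_message, the message size, and the generated order numbers are all assumptions for illustration, and such regular synthetic data compresses better than real messages usually would, so the numbers it prints are only indicative.

```python
import zlib

def build_message(order_count, wrap_in_order):
    """Generate a synthetic message in format A (wrapped) or format B (flat)."""
    parts = ["<OrderList>"]
    for i in range(1, order_count + 1):
        fields = (
            f"<OrderNumber>A{i:07d}</OrderNumber>"
            "<OrderSystem>ABC</OrderSystem>"
        )
        parts.append(f"<Order>{fields}</Order>" if wrap_in_order else fields)
    parts.append("</OrderList>")
    return "".join(parts).encode("utf-8")

option_a = build_message(1000, wrap_in_order=True)
option_b = build_message(1000, wrap_in_order=False)
compressed_a = zlib.compress(option_a)
compressed_b = zlib.compress(option_b)

print("raw bytes:        A =", len(option_a), " B =", len(option_b))
print("compressed bytes: A =", len(compressed_a), " B =", len(compressed_b))
```

The repeated <Order> and </Order> tags are exactly the kind of redundancy that a general-purpose compressor removes, so the gap between the two formats shrinks considerably once the messages are compressed.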


2 Responses to “Good and bad XML design”

  1. Stephan H. Wissel Says:

    I would actually use a structure like this (comment would eat angle brackets, so using square instead):
    [OrderList]
    [Order number="A0000001"]
    [OrderSystem]ABC[/OrderSystem]
    [/Order]
    [/OrderList]

    (presuming there is more than OrderSystem in an Order). Is using attributes an issue for pureXML? What would the queries look like?


    • Hi Stephan,

      you are right, the decision when to use attributes vs. elements is yet another design choice that I didn’t go into. Maybe subject for a future blog post.

      The structure you propose has many of the same benefits as option A above.

      DB2 pureXML handles XML attributes very easily, just like elements. This applies to defining indexes, performing XML queries, XML updates, validation, and all other XML capabilities in DB2.

      Here is what the query would look like:


      SELECT OrderNumber, OrderSystem
      FROM mytable,
           XMLTABLE ('$DOC/OrderList/Order'
             COLUMNS
               OrderSystem VARCHAR(20) PATH 'OrderSystem',
               OrderNumber VARCHAR(20) PATH '@number' );

      The only change here is the XPath to the attribute @number. The query result is the same as before.

      Thanks,

      Matthias

