What is an XML Schema and why should I care?
February 29, 2012
What is an XML Schema? Some of you may already know this, others may not. So before I share more technical information about XML Schemas in subsequent blog posts, I had better get some of the basics out of the way first.
When you process and manage information in XML format, you can choose to use an XML Schema with your XML documents. Roughly speaking, an XML Schema can be used to define what you want your XML documents to look like. For example, in an XML Schema you can define:
- Which elements and attributes are allowed to occur in your XML documents
- How the elements can be or must be nested, or the order in which the elements must appear
- Which elements or attributes are mandatory vs. optional
- The number of times a given element can be repeated within a document (e.g. to allow for multiple phone numbers per customer, multiple items per order, etc.)
- The data types of the element and attribute values, such as xs:integer, xs:decimal, xs:string, etc.
- The namespaces that the elements belong to
- …and so on.
If you choose to create an XML Schema, it may define just some or all of the aspects listed above. The designer of the XML Schema can choose the degree to which the schema constrains the characteristics of the XML documents. For example, an XML Schema can be very loose and define only a few key features for your XML documents and allow for a lot of flexibility. Or it can be very strict to tightly control the XML data in every aspect. Or anything in between.
The use of an XML Schema is optional, i.e. an XML Schema is not required to store, index, query, or update XML documents. However, an XML Schema can be very useful to ensure that the XML documents that you receive or produce are compliant with certain structural rules that allow applications to process the XML. In other words, XML Schemas help you to enforce data quality.
Validation
If a document complies with a given XML Schema, then the document is said to be valid for this schema. A document might be valid for one schema but invalid for another schema. The process of testing an XML document for compliance with an XML Schema is called validation.
When an XML document is parsed by an XML parser, validation can be enabled as an optional part of the parsing process. Full validation of an XML document always requires XML parsing. For many documents and schemas, validation typically incurs only a small delta cost (in terms of CPU usage) on top of the cost of XML parsing.
What does an XML Schema look like?
An XML Schema itself is an XML document! But it is a very special document that needs to comply with very specific rules that are defined by (you guessed it!) another XML Schema, i.e. the schema for schemas.
Large XML schemas can consist of multiple schema documents that reference each other through import and include relationships. This allows you to compose an XML Schema out of smaller building blocks in a modular fashion.
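To make the point that a schema is itself XML a bit more concrete, here is a minimal sketch in Python. The schema for a <customer> element is hypothetical and only illustrates a few of the features listed earlier (nesting, optional and repeating elements, data types, a mandatory attribute). The code merely reads the schema with an ordinary XML parser and lists its declarations; it does not validate anything, because Python's standard library has no XML Schema validator.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: a <customer> with a name, any number of phone
# numbers, and a mandatory integer id attribute. Illustration only.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="phone" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def declared_names(schema_text):
    """Return (kind, name, type) for each element or attribute declaration."""
    root = ET.fromstring(schema_text)  # a schema parses like any XML document
    found = []
    for node in root.iter():  # document order, including the root itself
        if node.tag in (f"{{{XS}}}element", f"{{{XS}}}attribute"):
            kind = node.tag.split("}")[1]  # drop the namespace part of the tag
            found.append((kind, node.get("name"), node.get("type")))
    return found


DECLARATIONS = declared_names(SCHEMA)
for kind, name, type_name in DECLARATIONS:
    print(kind, name, type_name)
```

The customer element has no type attribute because its type is defined inline by the nested xs:complexType, so the sketch reports None for it.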
I don’t want to go into the syntax details of XML Schemas here, but there are some useful resources available:
- The XML Schema Primer:
http://www.w3.org/TR/xmlschema-0/
- A tutorial:
http://www.w3schools.com/schema/default.asp
- Best practices for XML Schema design:
http://www.xfront.com/BestPracticesHomepage.html
When and why should I use an XML Schema?
Simply put, if you want to ensure data quality and detect XML documents that do not comply with an expected format, use an XML Schema and validate each document!
However, what if XML documents pass through multiple components of your IT infrastructure, such as a message queue, an application server, an enterprise service bus, and the database system? If these components do not modify the XML but merely read & route it, examine whether all of these components need to validate each document. For example, if the application server has already validated a document before insertion into a DB2 database, does the document need to be validated again in DB2? Maybe not, if you trust the application layer. Maybe yes, if you don’t.
An XML Schema is also often used as a “contract” between two or more parties that exchange XML documents. With this contract the parties agree on a specific format and structure of the XML messages that they send and receive, to ensure seamless operation.
Practically every vertical industry has defined XML Schemas to standardize XML message formats for the data processing in their industry. A good overview is given by the introduction of this article:
“Getting started with Industry Formats and Services with pureXML”:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0705malaika/
How can I validate XML documents in DB2?
Simple. First, you register one or multiple XML Schemas in the DB2 database. This can be done with CLP commands, stored procedures, or through API calls in the JDBC or .NET interface to DB2. After a schema is registered in DB2, you can use it to validate XML documents in DB2, typically when you insert, load, or update XML documents. You can enforce a single XML Schema for all XML documents in an XML column, or you can allow multiple XML Schemas per column. A database administrator can force automatic validation upon document insert, or allow applications to choose one of the previously registered schemas for validation whenever a document is inserted.
And… validation can also be done in SQL statements?
Yup. The SQL standard defines a function called XMLVALIDATE, which can be used for document validation in INSERT statements, UPDATE statements, triggers, stored procedures, and even in queries.
Here is a simple example of an INSERT statement that adds a row to a customer table, which consists of an integer ID column and an XML column called “doc”:
INSERT INTO customer(id, doc)
VALUES (?, XMLVALIDATE( ? ACCORDING TO XMLSCHEMA ID db2admin.custxsd) );
The id and the document are provided by parameter markers “?”, and the XMLVALIDATE function that is wrapped around the second parameter ensures validation against the XML Schema that has been registered under the identifier db2admin.custxsd.
If the inserted document is not compliant with the XML Schema, the INSERT statement fails with an appropriate error message. Similarly, the XMLVALIDATE function can also be used in the right-hand side of the SET clause of an UPDATE statement that modifies or replaces an XML document.
Ok, so much for now. In my next blog post we’ll go into more detail.
How to list the paths of all elements in an XML document?
February 4, 2012
A common question is how to obtain a list of all the elements and attributes that occur in an XML document. Producing such a list is what I call “XML profiling” and in a previous blog post I have discussed several SQL/XML queries that can do this.
An extension of this question is how to get the paths of all the elements and attributes in a document. This seemingly simple task is -unfortunately- not nearly as simple as one would think! XPath and XQuery do not have a function that takes a given element or attribute as input and returns the full path to that node.
The solution is to write a query that traverses the XML document level by level, collects the element names at every level, and concatenates them appropriately to construct the path for every element and attribute at every level.
There are many ways in which this can be done. You can use XQuery or SQL/XML and you can choose whether to use recursion or not. Let’s look at a few examples.
First, let’s create a simple table with a small document that we can use in the examples:
create table mytable(xmldoc XML);
insert into mytable values(
'<Message>
<Type>Urgent</Type>
<Person id ="123">
<FirstName>Robert</FirstName>
<LastName>Tester</LastName>
</Person>
</Message>');
A first and straightforward solution is to start at the root of the document, then look at the first level of child nodes, then at the children of each of these child nodes, and so on. For each element or attribute we construct the path by concatenating the path from the parent with the name of the element or attribute. We do this for all nodes at a given level in the tree and then move to the next level of the document:
xquery
for $L1 in db2-fn:xmlcolumn("MYTABLE.XMLDOC")/*
let $L1path := fn:string-join( ($L1/local-name() ),"/" )
return (
  $L1path,
  for $L2 in $L1/(*,@*)
  let $L2path := fn:string-join( ($L1path, $L2/local-name() ),"/" )
  return (
    $L2path,
    for $L3 in $L2/(*,@*)
    let $L3path := fn:string-join( ($L2path, $L3/local-name() ),"/" )
    return (
      $L3path,
      for $L4 in $L3/(*,@*)
      let $L4path := fn:string-join( ($L3path, $L4/local-name() ),"/" )
      return (
        $L4path,
        for $L5 in $L4/(*,@*)
        let $L5path := fn:string-join( ($L4path, $L5/local-name() ),"/" )
        return ($L5path)))));
Message
Message/Type
Message/Person
Message/Person/id
Message/Person/FirstName
Message/Person/LastName
6 record(s) selected.
The obvious shortcoming of this query is that it assumes a maximum of 5 levels in the document. If your documents are deeper than this, you can easily extend the query so that it goes down to 10 or 20 levels, whatever you need. That’s maybe not very elegant, but it works if you can define an upper bound on the depths of your XML documents, which is usually possible.
You have probably noticed that the path Message/Person/id should actually be Message/Person/@id because “id” is an XML attribute. The query can be enhanced to take care of such details. In the last two sample queries of my XML profiling post you have seen how to use the self::attribute() test for this purpose.
If you prefer a more elegant solution that does not require any assumption about the maximum depths of the XML documents, then you need to code a recursive query, either in XQuery or in SQL/XML. Let’s try SQL/XML for a change.
You may already be familiar with how recursive SQL works. If not, you can look at several existing examples. The basic idea is to use a WITH clause, also called “common table expression”, that contains a UNION ALL between the start of the processing and a recursive reference back to the common table expression itself. The following augments this approach with the XMLTABLE function that extracts nodes and node names from the XML:
WITH pathstable (name, node, xpath) AS (
SELECT x.name AS name, x.node AS xmlnode,'/' || x.name AS xpath
FROM mytable,
XMLTABLE('$XMLDOC/*'
COLUMNS
name varchar(30) PATH './local-name()',
node XML PATH '.') AS x
UNION ALL
SELECT y.name AS name, y.node AS xmlnode, xpath|| '/' || y.name AS xpath
FROM pathstable,
XMLTABLE('$XMLNODE/(*,@*)' PASSING pathstable.node AS "XMLNODE"
COLUMNS
name varchar(30) PATH 'local-name()',
node XML PATH '.') AS y
) SELECT name, xpath
FROM pathstable;
NAME XPATH
------------------------------ -------------------------------
Message /Message
Type /Message/Type
Person /Message/Person
id /Message/Person/id
FirstName /Message/Person/FirstName
LastName /Message/Person/LastName
6 record(s) selected
If you want to list the element and attribute values for each path, then you can easily modify this query as follows:
WITH pathstable (name, node, xpath, value) AS (
SELECT x.name AS name, x.node AS xmlnode,
'/' || x.name AS xpath, x.value as value
FROM mytable,
XMLTABLE('$XMLDOC/*'
COLUMNS
name varchar(30) PATH './local-name()',
value varchar(20) PATH 'xs:string(.)',
node XML PATH '.') AS x
UNION ALL
SELECT y.name AS name, y.node AS xmlnode,
xpath|| '/' || y.name AS xpath, y.value as value
FROM pathstable,
XMLTABLE('$XMLNODE/(*,@*)' PASSING pathstable.node AS "XMLNODE"
COLUMNS
name varchar(30) PATH 'local-name()',
value varchar(20) PATH 'xs:string(.)',
node XML PATH '.') AS y
) SELECT xpath, value
FROM pathstable;
XPATH VALUE
------------------------------- --------------------
/Message UrgentRobertTester
/Message/Type Urgent
/Message/Person RobertTester
/Message/Person/id 123
/Message/Person/FirstName Robert
/Message/Person/LastName Tester
6 record(s) selected
A few things to note:
- The value of an XML element is defined as the concatenation of all text nodes in the subtree under that element. This explains the values that you see for /Message and /Message/Person in the example above.
- For longer paths you may need to increase the length of the VARCHAR(n) in the XMLTABLE function.
- In DB2 you may receive warning SQL0347W, which says that this query might recursively run into an infinite loop. But, this would only happen if your XML document was infinitely deep, which isn’t possible. So, you can safely ignore that warning.
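For readers who want to experiment with the same idea outside the database, the recursive traversal can be sketched in a few lines of Python using only the standard library. This is not what DB2 does internally; it is just a minimal illustration, under a few assumptions: the function name paths_and_values is made up for this sketch, the sample document is the one from the examples above written without whitespace between tags (ElementTree would otherwise keep that whitespace in the element values), and the document uses no namespaces. Unlike the queries above, the sketch prefixes attribute names with “@”.

```python
import xml.etree.ElementTree as ET

# Same sample document as above, without whitespace between the tags.
DOC = (
    '<Message><Type>Urgent</Type>'
    '<Person id="123"><FirstName>Robert</FirstName>'
    '<LastName>Tester</LastName></Person></Message>'
)


def paths_and_values(elem, parent_path=""):
    """Yield (path, value) for an element, its attributes, and all descendants."""
    path = f"{parent_path}/{elem.tag}"
    # Element value: all text nodes in the subtree, concatenated.
    yield path, "".join(elem.itertext())
    # Attributes get an "@" prefix so they can be told apart from elements.
    for name, value in elem.attrib.items():
        yield f"{path}/@{name}", value
    # Recurse into the child elements, passing the path built so far.
    for child in elem:
        yield from paths_and_values(child, path)


ROWS = list(paths_and_values(ET.fromstring(DOC)))
for path, value in ROWS:
    print(f"{path:30} {value}")
```

As in the recursive SQL/XML query, each call extends the parent's path by one step, and the recursion ends naturally at elements that have no children.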
Business Records in the 21st Century
January 21, 2012
Part 2 of our article “Data normalization reconsidered” is now available at
http://www.ibm.com/developerworks/data/library/techarticle/dm-1201normalizationpart2/index.html
The second part discusses alternatives to a traditional normalized relational representation of data. Such alternatives include, for example, XML, JSON, and RDF, because they can often help you overcome normalization issues or improve schema flexibility, or both. In the 21st century, digitized business records are often created in XML to begin with, which makes XML an attractive choice as the database-level storage format.
This article also contains a performance comparison between XML and relational data that was conducted for a real-world application scenario at a major international logistics company.
At the end of the article you will find comparison tables that summarize the pros and cons of different data representations.
Data Normalization Reconsidered
January 8, 2012
Normalization is a design methodology for relational database schemas and aims to minimize data redundancy and avoid data anomalies, such as update anomalies. The consequence of normalization is that business records (such as a purchase order, an insurance claim, a financial transaction, etc.) are split into pieces that are scattered over potentially many relational tables.
In addition to its benefits, normalization also introduces several drawbacks:
- The insert of a single logical business record requires the insertion of multiple (often many) physical rows
- The retrieval of a single business record requires complex multi-way joins or a series of separate queries
- Business records undergo a potentially expensive conversion from their original representation outside the database to a normalized format and back
- The normalized representation of business records is often difficult to understand because it is very different from the original format of the business record, such as a paper form or an XML message.
These issues raise the question whether normalization should be applied as categorically as some people believe. Indeed, there are several reasons for reconsidering normalization, such as:
- Throughout history, humans have always stored their business records “intact”, and it was only the introduction of databases that has “required” normalization
- Normalization was introduced when storage space was extremely scarce and expensive, which is not (or much less) the case anymore today
- Today, business records are often much more complex than they used to be in the 1970s when normalization was introduced, and this complexity amplifies the disadvantages of normalization
- De-normalization is becoming more and more popular, e.g. in star schemas for data warehousing, but also in emerging storage systems such as HBase, Google’s BigTable, etc.
Today, business records are often created and exchanged in a digital format, and this format is often XML. XML is a non-normalized data format that can provide several benefits:
- A single business record often maps naturally to a single XML document
- A single business record/XML document can be inserted (and retrieved) in an XML database as a single operation
- If you store XML as XML (i.e. without conversion to relational), the representation of a business record is the same inside and outside the database, which is tremendously valuable
When business records already exist in XML format outside the database anyway, then it is usually best to also store them as XML and not to convert into a normalized relational schema.
My colleague Susan Malaika and I have collected our thoughts and observations on normalization in a 2-part article titled “Data Normalization Reconsidered”. The first part has recently been published on developerWorks and can be found here:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/index.html
The 2nd part will appear soon. Happy reading!
XQuery support in DB2 10 for z/OS
November 30, 2011
If you think that mainframe computers are old dinosaurs that only run ancient COBOL code – think again! Now mainframes also run XQuery!
While DB2 for Linux, UNIX, and Windows has been supporting XQuery and SQL/XML since Version 9.1 (released in 2006), DB2 9 for z/OS “only” supported XPath and SQL/XML.
I have put the word “only” in quotes because for many applications XPath and SQL/XML are fully sufficient. When you combine XPath expressions with SQL/XML functions such as XMLTABLE plus other SQL language constructs you can write very powerful XML queries and accomplish many of the same things that you can do with XQuery.
DB2 10 for z/OS has added a variety of new XML features such as node-level XML updates, XML-type parameters and variables in stored procedures and user-defined functions, and enhancements for XML Schemas and XML indexes.
With APARs PM47617 and PM47618, DB2 for z/OS now also supports XQuery within the SQL functions XMLTABLE, XMLQUERY, XMLEXISTS, and XMLMODIFY.
So what are the additional capabilities and benefits that XQuery provides? Examples include:
- You can compose new XML structures using direct element constructors
- You can use FLWOR expressions (for-let-where-order by-return) to iterate over and manipulate intermediate query results
- You can join and combine information from multiple XML documents
- You can use XQuery comma-expressions to construct and use new sequences of XML nodes
- You can code if-then-else logic to implement conditional expressions
- etc.
Let’s look at some examples.
Construct new XML structures with direct element and attribute constructors
The following query constructs a new order summary document from each order (XML document) that is selected from the “orders” table. New elements, such as <orderSummary> and <orderedItems> are constructed by providing the start and end tags explicitly. Similarly, the attribute orderNumber is also constructed explicitly. The content of the constructed elements and attributes is computed by XPath (or XQuery) expressions that extract selected information from each source document.
SELECT XMLQUERY('
<orderSummary orderNumber="{$po/order/orderNo/text()}">
<orderedItems>{$po/order/items/item/itemName}</orderedItems>
</orderSummary>'
PASSING orders.orderdoc AS "po")
FROM orders
WHERE ...
FLWOR expressions
The next query joins the tables “orders” and “items” on their XML columns “orderdoc” and “details”, respectively. The join predicate in the XMLEXISTS ensures that we find the items that match a given order and that we don’t produce a Cartesian product. For each pair of order document and item document, the FLWOR expression in the SELECT clause combines information from both documents into a new document that contains the item name, the ordered quantity, and the item details.
SELECT XMLQUERY('for $o in $po/order/items/item
for $i in $it/item
where $o/itemName = $i/name
return <orderedItem>
{$i/name}
{$o/quantity}
{$i/details}
</orderedItem>'
PASSING orders.orderdoc AS "po", items.details as "it")
FROM orders, items
WHERE
XMLEXISTS('$po/order/items/item[itemName = $it/item/name]'
PASSING orders.orderdoc AS "po", items.details as "it")
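If the FLWOR logic is unfamiliar, it may help to see the same join written as a loop in a general-purpose language. The following Python sketch is only an approximation of what the query expresses, not how DB2 evaluates it. The sample documents, the function name ordered_items, and the element contents are made up for illustration; the sketch assumes that each order document has an <order> root with <items>/<item> children and that each item document has an <item> root.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like the paths used in the query.
ORDER_DOC = (
    "<order><orderNo>1001</orderNo><items>"
    "<item><itemName>Widget</itemName><quantity>3</quantity></item>"
    "<item><itemName>Gadget</itemName><quantity>1</quantity></item>"
    "</items></order>"
)
ITEM_DOCS = [
    "<item><name>Widget</name><details>Blue, 10 cm</details></item>",
    "<item><name>Sprocket</name><details>Steel</details></item>",
]


def ordered_items(order_xml, item_xml):
    """Combine one order document with one item document, FLWOR-style."""
    po = ET.fromstring(order_xml)  # root element is <order>
    i = ET.fromstring(item_xml)    # root element is <item>
    results = []
    for o in po.findall("./items/item"):            # for $o in .../items/item
        if o.findtext("itemName") == i.findtext("name"):  # where clause
            out = ET.Element("orderedItem")          # return <orderedItem>...
            for source, tag in ((i, "name"), (o, "quantity"), (i, "details")):
                for node in source.findall(tag):
                    out.append(copy.deepcopy(node))  # copy, leave inputs intact
            results.append(ET.tostring(out, encoding="unicode"))
    return results


# Pair the order with every item document, like the join of the two tables.
RESULTS = [xml for item_xml in ITEM_DOCS
           for xml in ordered_items(ORDER_DOC, item_xml)]
for xml in RESULTS:
    print(xml)
```

Only the Widget pair produces a result. The Gadget order line and the Sprocket item document have no partner, which is the same effect the join predicate has in the SQL/XML query.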
For more information on XQuery in DB2 10 for z/OS, here is a good place to continue reading:
TPoX 2.1 has been released!
November 15, 2011
First, what is TPoX? I have two answers to that question.
Answer 1:
TPoX, short for “Transaction Processing over XML”, is an XML database benchmark that executes and measures a multi-user XML workload. The workload contains XML queries (70%) as well as XML insert, update, and delete operations (30%). TPoX simulates a simple financial application that issues XQuery or SQL/XML transactions to stress XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other components of an XML database system.
The TPoX package contains:
- an XML data generator
- an extensible Workload Driver, written in Java
- three XML Schemas that define the XML structures
- a set of predefined transactions, which can be changed easily
TPoX was developed by Intel and IBM, and has been freely available and open source since 2007. A variety of TPoX performance results and other usage of TPoX have been reported.
Answer 2:
TPoX is a very flexible and extensible tool for performance testing of relational databases, XML databases, and other systems. For example, if you have a populated relational database you can use the TPoX workload driver to parameterize, execute, and measure plain old SQL transactions with hundreds of simulated database users. I have used TPoX for a lot of relational performance testing, because it’s so easy to set up and measure concurrent workloads. The workload driver reports throughput, min/max/avg response times, percentiles and confidence intervals for response times, and other useful metrics. Oh, and by the way, TPoX happens to include an XML data generator and a set of sample XML transactions, in case you’re interested in XML database performance.
In the latest release, TPoX 2.1, we have further enhanced the extensibility of the TPoX Workload Driver. The XML data and XML transactions are still the same.
Some of the enhancements in TPoX 2.1 include:
- higher precision in reported response times
- proper handling and counting of deadlocks, if any
- easier post-processing of results in Excel or other spreadsheet software
- new types of workload parameters such as random dates, random timestamps, sequences, etc.
- in addition to SQL, SQL/XML, and XQuery, transactions can now also be supplied as Java plugins, allowing you to run and measure anything (concurrently!) that you can code in Java, such as:
  - complex transactions that include application logic
  - calls to web services or message queues
  - obtaining data from RSS or ATOM feeds
  - transactions against databases or content repositories that do not have a JDBC interface
We have already found these extensions extremely valuable for some of our own performance testing, and we’re happy to share them. You can download TPoX 2.1 (free, open source) and find more detailed information in the release notes as well as the TPoX documentation that is included in the download.
XML in Las Vegas !
October 18, 2011
The annual Information on Demand conference in Las Vegas (Oct 23-27, 2011) is always a great venue to get updated on the latest topics in data management. The IOD conference offers a broad range of technical sessions, business sessions, hands-on labs, and lots of networking opportunities. More than 200 customer speakers will present their first-hand experiences with IBM Software.
The conference is in the Mandalay Bay convention center. Various topics around DB2 pureXML are covered in presentations and hands-on labs. Some of them are listed here:
2088A – DB2 pureXML for Beginners
Mon, Oct 24, 2011, 3:45 PM – 5:00 PM
1120B – DB2 pureXML: Develop Your Application Prototype in Minutes, not Days
Tue, Oct 25, 2011, 9:45 AM – 12:45 PM (hands-on lab)
1930A – DB2 for z/OS SOA Solutions Powered by pureXML and DataPower
Wed, Oct 26, 2011, 3:15 PM – 4:15 PM
2691A – Architecting with IBM pureXML and Temporal Data in DB2 10 for z/OS
Thu, Oct 27, 2011, 8:15 AM – 9:30 AM
2087B – Get Connected: Publishing Relational Data as XML Messages
Thu, Oct 27, 2011, 3:30 PM – 4:30 PM
If you’re going to IOD, don’t miss these sessions!

