Native XML support in a database such as DB2 allows you to store, index, query, validate, and update XML documents efficiently. In particular it’s the database support for XQuery and SQL/XML that enables powerful queries and updates of stored XML data, or XML and relational data combined.

However, XML capabilities in the database alone are often not sufficient to manage XML across the enterprise. Integrated XML support is also required in application development tools, ETL products such as DataStage, reporting tools like Cognos, and other critical parts of the IT landscape, such as the application server.

To that end, the WebSphere XML Feature Pack provides developers with key XML technologies to efficiently manipulate XML in WebSphere Java applications. It offers support for:

  • XPath 2.0
  • XQuery 1.0
  • XSLT 2.0
  • and earlier versions of XPath and XSLT, if needed

In essence, you can write Java applications for deployment in WebSphere Application Server and within your Java code use XPath, XQuery, or XSLT directly to work with XML, including transient XML messages. One of the key advantages is that you can apply XPath and XQuery expressions directly to your XML data instead of coding equivalent operations much more tediously via a low-level SAX or DOM API or other object models. This has several benefits:

  • Simplified XML application development
  • Increased developer productivity
  • XML operations are easier to understand and maintain over time

If you think about it, these are the same reasons why you implement many applications in Java (or similar) rather than Assembler: you use the comfort and expressiveness of a higher level language (XQuery) to avoid the complexity, tediousness, and “error proneness” of a lower-level approach.

A second set of benefits emerges when you look at the handling of XML across database and application server. On the database side, the benefits of storing and processing XML natively instead of shredding them to relational format are well-known:

  • Reduced complexity, because shredding any but simple XML documents is a complicated task
  • Increased flexibility, because native XML allows for document variability and schema evolution, whereas shredding relies on a static mapping that breaks easily when new document variations are encountered
  • Improved performance, because the conversion between two different data representations is avoided
  • Guaranteed document fidelity, because the original XML business record stays in its XML format

The very same benefits can be  realized in the middle tier by processing XML natively in the application server. With the XQuery, XPath, and XSLT support in the WebSphere XML Feature Pack you can manipulate XML in its original XML format without having to convert it to a different representation, such as an object structure.

With increasing complexity or variability of your XML documents, converting XML to Java objects becomes increasingly complicated, inflexible, and slow. For example, using JAXB to map XML to objects requires Java classes to be generated that constitute a fixed and inflexible mapping.

And finally, treating XML as XML in the database and the application server realizes all of the above-mentioned benefits across the middle tier and the database by eliminating any impedance mismatch and data conversion between the tiers.

Here are some resources where you can learn more about the IBM WebSphere Application Server V7.0 Feature Pack for XML and see some concrete examples of its use.

Product page and download:
http://www.ibm.com/websphere/was/xml

The IBM Red Paper: “Getting Started with the WebSphere Application Server Feature Pack for XML”, includes many examples:
http://www.redbooks.ibm.com/abstracts/redp4654.html?Open

The XML Feature Pack documentation:
http://publib.boulder.ibm.com/infocenter/wasinfo/fep/index.jsp?topic=/com.ibm.websphere.xmlfep.multiplatform.doc/info/ae/ae/welcome_fepxml.html

And a collection of links to further information:
http://webspherecommunity.blogspot.com/2009/11/websphere-application-server-feature.html

Advertisements

Relational databases have supported SQL stored procedures for a log time. The benefits of SQL procedures are well-known and include:

  • Reduced coding labor due to the creation of reusable processing modules
  • Richer processing capabilities in the databases by defining custom data processing logic
  • Improved performance and reduced network traffic because stored procedures are executed close to the data, i.e. in the database engine itself

With the introduction of XML support in relational databases, several database products (such as DB2) allow you to access and manipulate not only relational data but also XML in stored procedures. This is made possible through:

(a) the introduction of the XML data type in the SQL type system, and

(b) several XML functions in the SQL language such as XMLQUERY, XMLEXISTS, and XMLTABLE that can contain XPath or XQuery expressions.

The picture below provides an example of some of the basic capabilities for XML handling stored procedures:

  • Input and output parameters of type XML. In DB2, if an XML document already exists in a parsed format in the database then passing it to the XML parameter of the procedure does not require the document to be reparsed. XML procedures can operate directly on the parsed representation of the XML. Also, if an application calls a stored procedure and passes a textual XML document to the XML parameter, the document gets parsed only once upon entering the procedure regardless of how often the XML parameter is used in the body of the procedure.
  • XML type variables. XML variables can be assigned values of type XML, such as a document or some document fragment, even a single XML element or atomic value. XML variables hold XML in a parsed format, which is important for performance. For example, the second and last SET statement in the picture below uses an XQuery FLWOR expression to read a certain address from the input document and assigns it to the XML variable “address”.
  • XML value extraction. The XMLQUERY function can contain XPath or XQuery expression to extract values from a document. In the first SET statement in the picture below, the XMLQUERY function on the right-hand side takes the XML input parameter “custDoc” as context and uses an XPath to extract the @Cid attribute from the document. The XMLCAST function converts the attribute to the SQL type INTEGER, so that it can be assigned to the SQL variable “id”.
  • XML conditions. If-then-else statements in the SQL procedure language can use XML conditions in the IF-clause to decide what to do next in the control flow of the stored procedure.

(Click the picture for full size)

The examples above are certainly not an exhaustive list of what you can do with XML in stored procedure. For example, while the stored procedure above only operates on a single document that is passed in via the XML parameter, stored procedures can also use XQuery or SQL/XML to access sets of XML documents that are stored in XML columns.

Stored procedures can also compose, prepare, and execute XQuery or SQL/XML statements dynamically. You can define cursors over the result sequence of an XQuery and process each returned item one by one in a loop. You can also construct, split, shred, modify, merge, compare, or validate XML documents in stored procedures.

Of course, all the error and exception handling that is available for stored procedures can also be used with any XML operations. In summary, you can code sophisticated XML processing logic (or hybrid XML/relational processing logic) with the familiar concepts of SQL stored procesures.

Many more examples of XML processing in DB2 Stored Procedures can be found in Chapter 18 of the DB2 pureXML Cookbook.

XML is a flexible and extensible data format and is often used for messages or other document types whose structure can have multiple different variations or evolves over time. As a result, you might not be dealing with a single fixed document schema but with significant schema variability, or schema diversity.

Many XML database systems allow you to store documents for many different schemas (or different version of the same schema) in a single collection. This is also true for DB2, where a collection is an XML column that is not necessarily tied to a single XML schema. Hence, applications can store, index, query, validate, and update documents for different schemas in a single XML column.

When you validate different documents in an XML column against different XML schemas, DB2 remembers which document was validated against which schema. When you later retrieve a specific document, you can easily determine which schema it was validated against when it was inserted or last updated.

A common question is: How to effectively index XML documents when there is schema variability or schema diversity?

At a high level, DB2 provides three options:

  • Define various individual XML indexes for the various elements and attributes that you require to be indexed for good performance
  • Index all elements and/or all attributes
    • Creating an XML index on the path //text()   indexes all text nodes, i.e. basically all element values
    • Creating an XML index on the path //@*  indexes all attributes
  • Use a full-text index that indexes all elements and attributes, using DB2 Text search

The good thing is that DB2 allows you to “index everything” but it doesn’t force you to do so – unlike some other database systems. This is important, because indexing all elements and all attributes in all documents requires a lot of space and impacts the performance of insert, update, and delete operations. Therefore, indexing everything should be a conscious choice.

Although full-text indexing with DB2 Text search can address schema variability, I think the decision to use Text Search should be predominantly based on the type of searches that need to be performed. If the applications commonly perform full-text search for phrases and substrings, or maybe even require fuzzy search, stemming, and other fancy text features, then DB2 Text Search should be used. However, if the data is more structural than textual in nature, and if most predicates use regular =, >, >=, etc. operators, then I would usually prefer a set of regular XML indexes.

A key benefit of XML indexes in DB2 is that they contain index entries only for those document that they match. For example:

Let’s say we have 10 Million XML documents in an XML column, for 10 different schemas, approx. 1M documents per schema. If we define an XML index on an XML element (XPath) that exists in only 1 of the 10 schemas, then this index will hold entries only for 1M documents. If this element is an optional element that occurs in only 20% of the documents for this 1 schema, then this index will hold entries only for 200,000 documents. Thus, this index will be substantially smaller than an index that contains entries for all documents.

As a result, defining a number of specific XML indexes for various elements and attributes is very space efficient! So, if you can identify which nodes need to be indexed for good performance, you can index them in a space-conscious manner.

The index maintenance cost for insert/update/delete operations follows the same guideline as in the relational world: the more indexes you define the more CPU and I/O is being used for update the indexes. But, let’s say I have defined 50 XML indexes on an XML column, and the document that I am inserting contains elements and attributes that match only 5 of those indexes, then only those 5 indexes will be updated when the document is inserted.

A large number of XML indexes defined on a single XML column is not a concern for query performance. The DB2 query compiler is very efficient at determining quickly which index can (and should) be used for a given query.

Also note that XML indexes in DB2 can be defined with wildcards, which gives you a lot of option between defining an index for just one element and defining an XML for all elements. Some examples:

  • Let’s say you are indexing purchase orders and you know that all of most the customer address information should be indexed. If you know that the path to the customer address information is /purchaseorder/customer/address  but the subtree under the <address> element is subject to variability, you could define an index on /purchaseorder/customer/address//text(), to index all element values in the subtree under <address>.
  • If you know that there is only one level of elements under address, you could define an index on /purchaseorder/customer/address/* to index all of these elements, and maybe another index on /purchaseorder/customer/address/*/@* for the attributes of these elements, if any.
  • If you know that you need to index the <zipcode> element under <address>, but the <address> can be in different subtrees for different documents, you could define an index on //address/zipcode or  /purchaseorder//address/zipcode.

For more information on XML indexes, see

http://www.ibm.com/developerworks/data/library/techarticle/dm-0611nicola/

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.xml/src/tpc/db2z_indexxml.htm (DB2 for z/OS)

http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.xml.doc/doc/c0024039.html (DB2 for Linux, UNIX, Windows)