XML ETL

March 26, 2010

The handling of XML data has become increasingly important in ETL (extract, transform, and load) processes. Here are some typical use cases:

For example,  you might be accumulating XML messages  in a  transactional database and may need to extract selected values from these messages and add them to a relational data warehouse.

In other cases you might have to move XML data from one system to another and wish to transform the XML documents in the process. Depending on the predominant usage patterns in the target system, you might want to cut large documents into smaller XML fragments, extract only specific sections from each XML document, or rename XML elements. The source as well as the target of such transformations can be XML files in the file system or tables (with XML columns) in a DB2 database.

You can certainly code many useful transformations in SQL/XML, but you might prefer to use a visual tool to design ETL data flows. One  such tool is the IBM InfoSphere Warehouse Design Studio.

A new article on developerworks describes how to use the InfoSphere Design Studio to design ETL flows that involve XML data. As a concrete example, the article walks you through the steps of building a data flow that reads from an XML source table and populates two target tables in a data warehouse. One of the target tables contains only relational data, while the other contains both relational and XML data. Read the article here:

Using IBM InfoSphere Warehouse Design Studio with pureXML data, Part 1: Create an ETL data flow to populate a hybrid data warehouse.

This article is a two-part series and I’ll let you know when part 2 is published.

Advertisements

Due to popular demand there will be several additional DB2 pureXML bootcamps.

A DB2 pureXML bootcamp is a free 2-day (or 3-day) course consisting of presentations and hands-on labs where attendees can gain or deepen their DB2 pureXML skills. It’s for application developers, architects, DBAs, and other interested parties.

One bootcamp will be held in London, UK, on the 27th and 28th of April 2010. You can find more information about the pureXML bootcamp in London, including a link to register, here:
http://www.ibm.com/developerworks/wikis/display/db2xml/BootCampsUK

Also, there have been requests to hold a DB2 pureXML bootcamp in Dallas, Texas.  The tentative time frame is July or August. If you are interested in attending a pureXML bootcamp in Dallas or have any questions, please send me an email or contact my colleague Mary at desisto@us.ibm.com.

When you develop XML applications, you want good tools to design and debug XML artifacts such as XML documents, XML schemas, XQuery queries, SQL/XML statements, XSL style sheets, and other things. On occasions I have seen people use plain text editors such as vi or Notepad for this, but that’s not everyone’s cup of tea. I confess: when I write XQuery or SQL/XML I often use a simple text editor too, because it’s quick and simple. But, if I need to edit an XML Schema or XML document I certainly want a tool that’s XML-aware.

There are varies XML tools that work well with DB2 pureXML. For example, the IBM Data Studio IDE includes features for editing XML Schemas or documents and for designing XML queries. Third party tools that integrate well with DB2 pureXML include Altova XMLSpy, Stylus Studio, and <oXygen/>.

For example, XMLSpy’s integration with DB2 pureXML includes the following capabilities:

  • Edit, debug, and profile XQuery statements against XML data in DB2 databases. Query results are then available for further manipulation in XMLSpy
  • Visualize the database structure and query DB2 tables using SQL, SQL/XML, and XQuery
  • Read XML data from DB2, edit it, and store it back in DB2 with optional schema validation
  • Manage XML Schemas in DB2’s XML Schema Repository. For example, you can design new schemas in XMLSpy and register them in DB2, or read existing XML Schemas from DB2, edit them, and save them back into DB2
  • Transform XML data for use in other applications
  • and more…

If you want to learn more about using XMLSpy with DB2 pureXML, I recommend the following resources:

Website: Altova Tools for DB2 pureXML:
http://www.altova.com/xmlspy/db2-xml-editor.html
http://www.altova.com/solutions/ibm-db2-tools.html
http://www.altova.com/IBM-AltovaPartnership_120506.html

White Paper: Integration of Altova Tools with IBM DB2 pureXML:
http://www.altova.com/whitepapers/ibm.pdf

Tutorial: Using the Altova Tools with IBM DB2 pureXML:
http://www.ibm.com/developerworks/db2/library/long/dm-0712kogan/

The SQL language standard includes XML as one of the SQL data types, which allows you to have columns of type XML in tables and result sets. The latest JDBC standard (JDBC 4.0) now includes an interface and data type called SQLXML that correspond to the XML data type in the database server.   

Of course you can use JDBC 3.0 to send and retrieve XML data to/from a database as a String, Clob, BinaryStream, CharacterStream, etc. with the traditional getter and setter methods. In addition to that, the new SQLXML interface in JDBC 4.0 enables you to use a DOM or SAX representation of your XML data when you send or receive it to/from the database. This can be very helpful if your application already uses the DOM or SAX API to manipulate XML.  

The SQLXML interface in JDBC 4.0 works for both SQL/XML and XQuery. So even if you use XQuery without any SQL involved, you can use JDBC to conveniently retrieve and process the result sequence that is produced by an XQuery. And if you use SQL/XML to insert, update, or retrieve XML and relational data at the same time, then JDBC is also very well suited for your needs. 

Let’s look at an example. First, be aware that JDBC 4.0 requires Java 6. To use JDBC 4.0 with DB2 you also need JCC4 (Java Common Client 4, db2jcc4.jar).  

 Assume you have a target table such as the following:  

create table mycustomer(info XML)

With JDBC 3.0 you would insert serialized XML from your Java application, using setBinaryStream(), setBlob(), or setString, etc. to bind your XML document to the parameter marker of an INSERT statement, like this:  

String sql = "INSERT INTO mycustomer(info) VALUES (?)";
PreparedStatement stmt = connection.prepareStatement(sql);
stmt.setBinaryStream(....);
stmt.execute();

But, assume your application already holds an XML document as a DOM in a variable called “xmldocumentDOM”. Then you can bind the DOM to the parameter marker of the insert statement as follows:  

String sql = "INSERT INTO mycustomer(info) VALUES (?)";
PreparedStatement stmt = connection.prepareStatement(sql); 

SQLXML sqlxml = con.createSQLXML();
DOMResult domResult=sqlxml.setResult(DOMResult.class)
domResult.setNode(xmldocumentDOM); 

stmt.setSQLXML(1, sqlxml);
stmt.execute();

You can also use the SQLXML interface to fetch an XML column value (a document) from a JDBC resultSet directly into a DOM or SAX for further manipulation in your application, if needed.  You can call the getSQLXML() method on the resultSet object and then use the DOM or SAX API to operate on the retrieved XML document: 

ResultSet resultSet
    = statement.executeQuery("SELECT info FROM mycustomer WHERE ....");
resultSet.next(); 

// retrieve an SQLXML object from the ResultSet:
SQLXML sqlxml = resultSet.getSQLXML(1); // 1 is the column index 
// obtain DOM tree from SQLXML object:
DOMSource source = sqlxml.getSource(DOMSource.class); 

// create document object from DOMSource:
Document document = (Document) source.getNode(); 

// now you can process the DOM tree as usual...

You can find more information and examples here: