XML ETL with DataStage 8.5
April 13, 2011
Since XML has become so pervasive in enterprise computing and service-oriented architectures, it is a given that XML capabilities are required in all parts of an IT environment. Closely related to the area of databases is ETL - to extract, transform, and load information, e.g. to populate data warehouses or to integrate and connect disparate systems.
The good news is that version 8.5 of IBM InfoSphere DataStage has greatly enhanced XML capabilities. I first saw the new XML features in DataStage 8.5 in a demo at the Information On Demand conference last October, and I was very impressed because the XML support goes far beyond the half-hearted XML handling that many tools offer.
For example, it’s easy to import and work with XML Schemas in DataStage 8.5. Many industry-standard XML Schemas used in the financial sector, health care, insurance, government, retail, etc. are quite complex: a single XML Schema can be spread across dozens or even hundreds of XSD files. Examples include FpML, FIXML, HL7, IRS1120, OAGIS, and many others.
You might receive such a schema as a ZIP file that contains multiple folders with XSD files. DataStage 8.5 can simply read the entire ZIP file, which saves you the tedious job of importing all the XSD files separately or dealing with their import and include relationships.
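To give a feel for what "dealing with import and include relationships" means, here is a minimal Python sketch that opens a schema ZIP and lists which XSD files reference which others. It only illustrates the problem and is not how DataStage works internally. The function names and the tiny demo archive are made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

def schema_dependencies(zip_bytes):
    """Map each .xsd path in the archive to the (kind, schemaLocation) pairs it references."""
    deps = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xsd"):
                continue  # skip READMEs, samples, and other non-schema files
            root = ET.fromstring(archive.read(name))
            refs = []
            for kind in ("import", "include"):
                for el in root.iter(f"{{{XSD_NS}}}{kind}"):
                    location = el.get("schemaLocation")
                    if location:
                        refs.append((kind, location))
            deps[name] = refs
    return deps

def build_demo_zip():
    """Build a tiny in-memory archive (hypothetical content) to try the function on."""
    main = ('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
            '<xs:include schemaLocation="types/common.xsd"/>'
            '</xs:schema>')
    common = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("main.xsd", main)
        archive.writestr("types/common.xsd", common)
        archive.writestr("README.txt", "not a schema")
    return buf.getvalue()

if __name__ == "__main__":
    for path, refs in schema_dependencies(build_demo_zip()).items():
        print(path, refs)
```

With a real industry schema the resulting map can have hundreds of entries, which is exactly the bookkeeping you would rather leave to the tool.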
Once the XML Schema is imported, DataStage understands the structure of your XML documents and allows you to define the transformations that you need.
The XML transformation capabilities cover the operations you would expect. For example, you can:
- compose new XML documents from relational tables or other sources
- shred XML documents to a set of relational row sets (or tables)
- extract selected pieces from each XML document and leave other parts of the XML unparsed and “as-is”
- extend your XML processing by applying XSLT stylesheets to the incoming XML data
And there is also a powerful set of transformation steps that allow you to implement any custom XML transformation that you may need.
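If "shredding" is a new term for you, the following Python sketch shows the idea on a small scale: one XML document is turned into two relational row sets, with the parent key repeated in the child rows. In DataStage you would define this mapping in the tool rather than in code. The sample document, its element names, and the `shred` function are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: orders with nested line items.
DOC = """<orders>
  <order id="1" customer="ACME">
    <item sku="A1" qty="2"/>
    <item sku="B7" qty="1"/>
  </order>
  <order id="2" customer="Globex">
    <item sku="A1" qty="5"/>
  </order>
</orders>"""

def shred(xml_text):
    """Return (order_rows, item_rows); item rows carry the order id as a foreign key."""
    root = ET.fromstring(xml_text)
    order_rows, item_rows = [], []
    for order in root.findall("order"):
        order_id = int(order.get("id"))
        order_rows.append((order_id, order.get("customer")))
        for item in order.findall("item"):
            item_rows.append((order_id, item.get("sku"), int(item.get("qty"))))
    return order_rows, item_rows

if __name__ == "__main__":
    orders, items = shred(DOC)
    print(orders)
    print(items)
```

Composing XML from relational data is the same mapping run in the opposite direction: group the child rows by their foreign key and nest them under the parent element.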
A big bonus is the ability to process even very large XML files efficiently. When batches of XML documents are moved between systems, they are sometimes concatenated into a single large XML document that can be gigabytes in size. Such mega-documents typically contain many independent business objects and often need to be split by the consumer. DataStage 8.5 handles documents of 20 GB or larger and can even parse and process a single large document with multiple threads in parallel! This is very cool and a big win for performance.
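To illustrate what splitting such a mega-document involves, here is a small single-threaded Python sketch that streams through a document and emits each business object as its own XML fragment without loading the whole file into memory. It is not DataStage's implementation and does not attempt the parallel parsing mentioned above. The function name `iter_records`, the `batch`/`trade` element names, and the assumption that records are direct children of the root element are all mine.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, record_tag):
    """Yield every record_tag element as a standalone XML string, one at a time.

    Assumes the records are direct children of the root element.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # ignore whitespace between records
            yield ET.tostring(elem, encoding="unicode")
            root.clear()  # release finished records so memory use stays bounded

if __name__ == "__main__":
    doc = (b"<batch>"
           b"<trade id='1'><amt>10</amt></trade>"
           b"<trade id='2'><amt>20</amt></trade>"
           b"</batch>")
    for fragment in iter_records(io.BytesIO(doc), "trade"):
        print(fragment)
```

Because each fragment is independent of the others, the fragments could in principle be handed to a pool of workers for further processing.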
There is a nice 2-part article on developerWorks that describes these XML capabilities in more detail:
And the documentation of the “XML stage” starts here: