Webcast: A primer for storing and retrieving XML data
August 7, 2008
If you are looking for an introduction to the topics of storing and retrieving XML data, this free on-demand webcast is hopefully a good place to start:
On demand Webcast: A primer for storing and retrieving XML data
Native XML Storage Reduces Development Costs
August 5, 2008
I have mentioned quite a few times how native XML storage makes life easier for database administrators and application developers. I thought I’d take a few moments to go into a little more detail.
By storing the XML directly in the database, you avoid the need to shred the XML into a relational schema. Shredding is also know as decomposition. Before shredding, you need to design a relational schema for the data. This can quite often be a laborious process. Sometimes this can be automated with off-the-shelf tools. However, you should keep in mind that the resulting tables will almost certainly need to be carefully examined, and possibly optimized. After designing the relational schema, you then need to set up the environment that actually maps the XML to the relational schema. And finally, you will need to develop and test code for using the data, which is typically quite complex because of the need for unwieldy SQL statements with multiple JOIN statements.
It sounds like there is a significant overhead for shredding XML data into a relational schema. But this is only a part of the story. You also need to consider what happens when the XML schema changes. And XML schema changes are an unfortunate reality for many of us. When the schema changes, it can play havoc with your relational schema, your mapping process, and the code for your applications that use the data. It is in dealing with these updates that many organizations are realizing the greatest gains from adopting native XML storage.
If you want a real-world example of the potential impact of dealing with relational tables for XML data, consider the FpML industry standard. With native XML storage, dealing with FpML messages is as straightforward as storing and retrieving the message. However, if you use shredding to store FpML messages, in some implementations you need to work with more than 475 separate database tables.
Some organizations have measured the impact of native XML storage on the workloads of database administrators and application developers. Here are some of the findings that I am aware of:
- A large European financial services company use DB2 pureXML (pureXML is IBM DB2’s native XML storage feature) to reduce the number of lines of code for writing to and reading from their database by 65%. This reduces the amount of code they must develop, test, and maintain and allows their developers to be more productive.
- This same company reduced the amount of time needed to develop services by 75%, thus allowing them to take on additional development projects and improve the services they provide.
- They also determined that schema changes are now a breeze. In the past with shredding, adding a field took a day of work (development and test), and a week of real time because of the processes involved with database changes. Now all that is needed is to change the pointer to the schema in a DB2 XML configuration file, which takes 5 minutes.
- A leading chemical company determined that by using the IBM database tools and pureXML technology they are saving 50-75% on development costs, while improving developer productivity by 25-50%.
- And finally, a major American financial services company studied their environment and determined that by using the native XML storage in DB2 pureXML they reduce their resource requirements by 30%.
In any business, these are improvements that cannot be ignored.
When to Store in XML Format
August 4, 2008
XML is unchallenged as a data exchange format. Most industries have developed standards that use XML as the data format. The list of standards is seemingly endless: ACORD, FIXML, FpML, MISMO, HL7, NIEM, OTA, SVG, TAX1120, UBL, XBRL, and so on.
Your immediate inclination is likely to store such information in its native XML format. After all, when the information is exchanged, it is expressed in XML format. Therefore it seems natural that it would also be stored in a native XML format. Often this is the case. However, it is not always the case. Sometimes it makes more sense to store the information in relational form. This blog post discusses when to use the two approaches.
The key question that you need to ask yourself is whether the information is best expressed in a relational format or in an XML format. Because XML is so popular as a platform-neutral and transparent data exchange format, data is often sent as XML even when XML is not the ideal format for the information. For instance, information is sometimes force-fitted into a hierarchical data structure even when it has no hierarchical characteristics. If this is the case, examine the data to see if it should in fact be stored in a relational format. The general rule is that you should consider a relational structure when the data is better described in tabular format. This is the characteristic that most readily defines the relational schema.
There are also situations when you may be compelled to use a relational schema. For example, if the application that consume this information work only with relational data, then you need to take this into consideration.
You should store information in XML format when:
- The information is best represented in a hierarchal format.
When you store hierarchical information in a relational database, you often require a large number of tables. Creating and maintaining such systems is complex and costly. Multiple joins result in complicated SQL statements. In such cases, you should instead use native XML storage.
- The schema is subject to change.
Because relational tables are highly-structured, changing the schema is both difficult and costly. Updating a schema for natively stored XML, on the other hand, is a relatively painless exercise. If you expect the schema to change, you should use native XML storage.
- There are many null values in the relational tables.
Relational database tables set aside storage for each column. If your information does not populate all of these columns, you can have a large number of null values in your database. This, in turn, affects both storage requirements and performance. Again, you should consider native XML storage in such situations.
Getting Started with Native XML Databases
July 27, 2008
Many people come here while searching for the best native XML database. Being employed by a vendor with a leading native XML database, I cannot make impartial judgements in this regard. So I’m not going to try. I can, however, make sure you you know how easy it is to evaluate my employer’s native XML capabilities, so you can make up your own mind.
Not only does IBM offer a production-quality native XML database for no charge. It also provides publication-quality books about it for free. This makes it very easy for you to get started with XML databases.
DB2 9 Express-C has no data storage limits. You can store as much data as you like in the database. DB2 9 Express-C has no evaluation time limit. You can use it as long as you like. The only limits are that you use a server with a maximum of 2 cores and 2GB of RAM. Download it from the DB2 Express-C Web page.
IBM has published two books about native XML storage. These books are available for purchase. However, you can also freely view these books as HTML or download the PDF version for offline viewing and printing. Click on images below to see these books:
![]() |
![]() |
|
| DB2 9 pureXML Guide | DB2 9: pureXML Overview and Fast Start |
Good luck with your evaluations.
XQuery versus SQL/XML, Part Deux
July 25, 2008
Some of you have been asking for more information about XQuery versus SQL/XML. In particular, it appears that you are interested in understanding the levels of support for common operations. I’ll take a few moments to compare both XQuery and SQL/XML in IBM DB2 9. However, please note that not all vendors provide the same levels of support. For instance, all vendors do not support sub-document update, and those that do support it may not implement the XQuery standard. So please, before making any decisions, verify the levels of support provided by your vendor.
| Operation | XQuery | SQL/XML | Comments |
| Inserting an XML document | No | Yes | You use SQL to insert an entire XML document. |
| Retrieving an XML document | Yes | Yes | |
| Retrieving part of an XML document | Yes | Yes | |
| Using predicates with relational data | Yes | Yes | XQuery does not support relational predicates. However, IBM DB2 supports SQL in XQuery, allowing predicates with relational data. |
| Using predicates with XML data | Yes | Yes | |
| Deleting an XML document | No | Yes | You use SQL to delete an entire XML document. |
| Updating an XML document | Yes | Yes | |
| Updating part of an XML document | Yes | Yes | |
| Joining XML data | Yes | Yes | Using XQuery is the easier approach. Using SQL/XML is typically difficult to code. |
| Joining XML wth relational data | Yes | Yes | XQuery does not support joins to relational data. However, IBM DB2 supports SQL in XQuery, allowing joins to relational data. |
| Transforming XML | Yes | Yes | Using XQuery is the easier approach. Using SQL/XML is typically difficult to code. |
| Aggregating XML data | Yes | Yes | Using SQL/XML is the easier approach. Using XQuery is possible with embedded SQL, but is typically difficult to code. |
| Calling external functions | No | Yes | |
| Passing parameter markers | No | Yes |
At first glance, it may appear that SQL/XML has more extensive support. However, this is in part because logically-speaking certain tasks do not belong in XQuery. Also note that some tasks are easier to code with XQuery. This ease of coding can make a significant difference in some environments.
Questions for XML Database Vendors
July 11, 2008
Are you evaluating XML database vendors? If so, here is a list of questions that can help you when you evaluate vendors. Of course, some questions may not apply to your situation. For instance, update capabilities may not be necessary in audit and logging systems. You can weed questions out that do not apply to you.
Performance:
- Ask about the performance when inserting XML data into the repository. I came across one customer who unfortunately went with a vendor that could not keep up with with their database ingest needs and had to switch to IBM. XML query performance alone is often not a sufficient measure. In some systems you may have to use heavy indexing for good query performance, but these indexes then lead to significant overhead for insert, update, and delete. In fact, you should also probably ask about the overhead incurred when working with indexes.
- Ask about the query performance. Different types of queries have vastly different characteristics, so make sure that the performance proof points they give match your situation. For instance, are the performance proof points using the same kind of data you use, are they at the same granularity as your typical queries, and are they working across a similar data set to yours.
- Ask if their XML performance proof points are publicly available with sufficient detail about the test data, workload, hardware used, etc to verify their claims.
- Ask about any restrictions there are on their XML indexes. Can all data types be indexed?
- Ask about the scalability limits for the database.
- Ask for proof points on the reliability of the database.
Schema Support
- If you work with multiple XML schema, ask about their schema handling capabilities. For instance, do you need it to support multiple schema? Or do you need it to support different versions of the same schema in the same column? It is important to clarify these requirements up-front.
- Schema updates are inevitable. Ask what is involved when you have new versions of schema. Is it a seamless experience, or does it require a significant migration effort?
- In fact, I would recommend validating that the vendor can work with your schema up-front, especially if it is complex. I have heard of situations where people have had issues with other vendors in this regard.
- Or perhaps you have schema-less documents. If so, do all of their features support such documents?
Language Support
- If you plan to use SQL, make sure that their SQL/XML function can meet your needs.
- Similarly, if you plan to use XQuery, make sure their XQuery implementation can meet your needs.
- Confirm whether XQuery is embedded in SQL, or whether XQuery can be used standalone and via an API. This may be important to you.
- Are XML updates important to you? If so, ask about their support for update capabilities. And ask if there are any limitations in their update support. I understand that certain vendors have limitations in this regard. In particular, you will want to ask if they support the XQuery Update Facility that is being standardized by the W3C.
Miscellaneous
- Ask if there is a no-charge version of the database for pilot projects. XML features can look wonderful on paper but may be more difficult to use than you expect. Limitations in functionality and usability are not obvious from documentation and white papers, but are revealed when start doing hands-on work with your own XML data.
- If you need your reports and applications to also work with legacy relational data, ask how easy it is to work with XML and legacy relational data.
- Do you need to work with digital signatures? If so confirm that the digital signatures can be validated against retrieved documents.
- Ask to what extent they are standards-compliant.
- If high availability is important, ask if they offer high availability features.
Thanks to Matthias Nicola with his help in compiling this list. And good luck with your selection process. Cheers!
Viral Video - Wednesday
July 1, 2008
And now, the third installment. I hope you enjoy…
XQuery versus SQL/XML
June 26, 2008
XQuery and SQL/XML are two standards-based languages for retrieving information from XML. Many XML storage vendors support for both standards. Although, as is typical for standards implementation, those vendors have varying degrees of support for the standards.
Recently, some people asked me whether XQuery or SQL/XML will win the XML retrieval wars. This question surprised me. You see, I believe there is a valid need for products to support both.
XQuery is a W3C Recommendation. It is supported by vendors like IBM, Oracle, and Microsoft. The language includes features like variables, data types, operators, conditional expressions, and functions. It uses XPath expressions to select information from XML. So XQuery is, in essence, a new language for many people to learn.
SQL/XML, on the other hand, is a set of extensions to SQL. It consists of the XML data type, a collection of XML publishing functions, conversion functions, schema validation functions, and more. SQL/XML was developed by INCITS H2.3, with participation from Oracle, IBM, Microsoft, and others. So SQL/XML is, in essence, an extension to an existing language.
There is a valid need for both.
Some developers are already comfortable with XML development. These developers will easily adapt to XQuery. Also, in many circumstances, XQuery offers developers a strong combination of programming power and ease of use. Finally, XQuery offers strong performance for many XML tasks. Although, each implementation of XQuery and its accompanying database are different, so please verify your vendor’s performance in this regard.
SQL/XML, on the other hand, is ideal for environments where developers are comfortable with SQL programming. When you consider the maturity of the SQL language, together with the strong API support and domain knowledge for query optimization, you realize that SQL/XML is ideal for certain environments. It may take more effort to code certain XML tasks in SQL/XML, but that may be acceptable in some environments.
So I believe that certain environments will favor one language over the other, that there is no reason why both languages can’t exist, and that it is good to allow people to choose the language that suits them best. What do you think?
Viral Video - Tuesday
June 20, 2008
Here is part 2. Enjoy…
Viral Video - Monday
June 16, 2008
This is the first in a series of three viral videos that show off IBM’s sense of humor, showcase the pureXML capabilities, and highlight IBM’s focus on green technology. I’ll post episodes two and three when they become available. I hope you enjoy.
Maybe this will inspire you to enter the video contest over at IDUG
Learn to Use XML with Databases and win Prizes!
June 13, 2008
The International DB2 Users Group (IDUG) recently announced the Search for the XML Superstar contest. IDUG is an independent, not-for-profit, user-run organization who promote the effective use of the DB2 family of products. Naturally the focus of the contest is to teach people about databases and the storage and retrieval of XML data.
The contest consists of education, followed by a series of quizzes to test your understanding. It then proceeds to a fully-blown programming contest. There is also a video contest. Cool prizes include laptops, Nintendo Wii, and the Segway i2 personal transport system!
If nothing else, this is a great opportunity to learn about XML and get some goodies at the same time. All details of the contest do not appear to be available as I write this post. The organizers say they will post them to the Channel DB2 Web site soon.
XML-only Databases
May 13, 2008
Some database vendors offer XML-only databases. That is, these vendors offer native XML databases that do not support other types of data.
Such databases work well for isolated XML data. You can consider such databases if you do not expect your XML data to work with other information in your organization. However, if you choose this route, and later want to analyze or use the XML data with other related information, you may at best encounter a costly and troublesome integration effort. Also, keep in mind that there will be performance issues resulting from the data being in non-integrated databases.
Many people say that the recent high tech past was about business automation, and that the impending future is about business optimization. In other words, now that you have used technology to automate many aspects of your business, you will focus on optimizing these systems to gain competitive advantage. To do this, you will want to integrate and leverage all information assets in your environment and thus obtain maximum value from those assets. This is a long-winded way of saying that it is highly likely that you will need to harness all information in your environment–including both XML data and the existing data in other formats. In addition, this existing data is not, and probably never will be, stored in XML. If you use an XML-only database, integrating the XML data with other information assets will likely pose a significant problem.
On a parting note, consider the practicalities of working with multiple interfaces for different databases, multiple data management tools, different hardware and software systems, different maintenance schedules, as well as the lack of advanced data management features in XML-only databases, it all adds up to an unwieldy and expensive proposition.
When you take all this into consideration, is an XML-only database a wise choice?
All Native XML Databases are not Equal
May 12, 2008
The basic unit of storage in a native XML database is the XML data that is being stored. In other words, XML data is stored “as is” in a native XML database. What a native XML database does is straightforward. How it does this is not so straightforward. Each vendor implements its native XML database in a different way, with very different performance characteristics and capabilities.
When choosing a native XML database, you should carefully consider the implications of choosing one vendor’s database over another. The performance, scalability, and capabilities of the applications you are building will vary depending on your choice of database.
Depending on which forums you read, there are differing opinions regarding which vendor’s implementation offers the best technology. Many will argue that IBM offers the most “native” implementation, followed by Microsoft, and then followed by Oracle. The actual answer for you probably depends on the nature of the applications that you are building on top of the database. The data for each application and the query characteristics for each application are different. Also, an application may have specific requirements like XML schema flexibility. So keep these in mind as you evaluate the different options.
While I cannot authoritatively say that one implementation is better than another, I can say that IBM is confident that its DB2 pureXML implementation offers the best performance, as well as the richest set of XML capabilities. IBM has published the results of the TPoX benchmark for DB2 pureXML. Transaction Processing over XML (TPoX) is a benchmark for XML database systems that is an open source project on SourceForge. Interestingly, the other major database vendors have thus far not published TPoX results for their systems.
Native XML Reduces Storage Costs
April 8, 2008
There are several published examples of native XML storage reducing storage costs. In the past, many people claimed that ever-decreasing storage costs marginalize the value of storage savings. However, these people were considering only the cost of the actual storage devices. When you start taking factors like physical location management and power consumption into account, reducing storage costs can quickly become a very important consideration.
The first example I’d like to quote comes from a major Wall Street firm. They achieved greater than 5x storage savings by moving to the native storage of XML data together with the use of the accompanying compression capabilities. In other words, after adopting IBM DB2 pureXML, their data required 20% of the storage space it previously required.
A major North American bank achieved a 3.9x storage savings thanks to native XML storage alone. When DB2 compression was used, it had a 6x storage savings impact. By combining the native XML storage with compression, that North American bank realized a head-turning total of 11x storage savings.
Finally, an example from the telecommunications world. A prominent Chinese telecommunications company achieved 4x storage savings thanks to native XML storage.
When applied to a large scale deployment, these storage savings can have a significant impact of the operational costs of such organizations.
How to Choose Solution Areas
March 27, 2008
I recently described how vertical solutions are the best way to bring emerging technology to market. Today, I provide some guidelines for determining vertical solution areas. See the presentation embedded below for details (if you subscribe to email updates, you need to view the post in a browser to see the presentation).
Because the criteria for evaluating vertical solution areas depend on a vendor’s development capabilities and routes to market, I cannot propose a definitive list of vertical solutions. If you are interested in doing so, I will leave it as an exercise to you…
More information about Storebrand
March 25, 2008
- Order processing time has been reduced. For example, an application for a license to implement a pension plan previously took up to three weeks. It can now be completed in 10 minutes.
- Faster processing gives Storebrand the ability to handle five times the number of customer orders. Much of the manual data re-entry done by individual departments has also been eliminated, leading to fewer mistakes, higher quality and more efficient customer service.
- In the past, when storing XML in a CLOB, certain reports took between 24-36 hours to run. Now with DB2 pureXML, these same queries take between 20 seconds to 10 minutes to run.
- The time it took programmers to prepare for a search process shortened from one week to one half day.
- In the past, programming the search processes took 2 hours with shredding and 8 hours with CLOB-stored data. Now with DB2 pureXML, these same programming tasks take 30 minutes.
- Updating the schema is also much quicker. Adding a field to the shredded data took a day of work (development and test), but a week of real time because of the processes involved with database changes. For DB2 pureXML all that is needed is to change the pointer to the schema in an XML configuration file, which takes 5 minutes.
- Storebrand also achieved a 65 percent reduction in the amount of I/O code by converting 20 of its services to pureXML.
Productivity Improvements at Storebrand
March 23, 2008
Click on the image below to link to a very short video about the value a Norwegian financial services company called Storebrand is realizing by switching to a native XML database. They now have shorter development cycles and more efficient reporting cycles. But most important of all, they are leveraging this technology to realize business benefits like reducing order processing time.
How to Market New Technology
March 22, 2008
Often when someone brings new technology to market, they try to sell the technology. This is often the wrong thing to do. See the presentation embedded below for details (if you subscribe to email updates, you need to view the post in a browser to see the presentation).
So what does this mean for native XML database technology? Well, first of all, it is not database administrators who will drive the adoption of native XML databases. And it is not application developers and application architects who will drive the adoption of native XML databases. It is Line Of Business (LOB) executives who can leverage this technology for competitive advantage who will drive its adoption. In a few days, I will describe how to determine the LOB solutions that native XML databases need to initially target.
When to use a Native XML Database
March 21, 2008
I recently discovered an excellent treatise on XML and Databases by Ronald Bourret. It is certainly worth reading. Mr. Bourret describes the various options for storing XML in databases. He claims that the most important factor in choosing a database is whether the information is data-centric or document-centric. I will take some liberties and paraphrase his descriptions:
- Data-centric information uses XML solely as a data transport mechanism. It is not important that the data is, for some length of time, in XML format. Examples are sales orders, flight schedules, scientific data, and stock quotes. For such cases, he recommends using a traditional database, such as a relational, object-oriented, or hierarchical database.
- For document-centric information, XML fulfills a greater role. In this case, the XML elements, attributes, and structure are meaningful and have value. Examples are books, email, advertisements, and hand-written XHTML documents. For such cases, he recommends using a native XML database or a content management system.
He does say that these rules are not absolute, without going into further detail. Let me describe some reasons why you may want to store data-centric information in a native XML database:
- Performance improvements. See my previous post for details regarding the performance for queries.
- Greater business agility. Respond quickly to dynamic conditions by easily accommodating changes to data and schemas.
- Lower development costs. Reduced code and development complexities lead to shorter development cycles when updating your systems or adding new applications.
- Improved business insight. Gain competitive advantage through better and quicker access to business insights.
- Space savings. Databases like DB2 use inlining and compression to realize between 3x and 6x space savings when storing XML data.
I will go into more detail on some of these reasons in subsequent posts.
Native XML versus CLOB and Shredding
March 20, 2008
Matthias Nicola and Vitor Rodrigues wrote an excellent paper comparing the performance of IBM’s native XML storage (called pureXML) to non-native storage. It makes for very interesting reading. I will include some highlights here.
Traditionally, there were two approaches to storing XML data in a relational database: using a CLOB or shredding.
If XML data is stored in a character large object (CLOB) field, the data is typically inserted as unparsed text. This avoids XML parsing at insert time, however it requires XML parsing at query execution time. This leads to low search and extract performance. For instance, look at the following graph (lower numbers are better).

This table shows the results for five types of queries:
-
Select*, which is full document retrieval of all documents, no predicate
-
1Pred1Doc, which is full document retrieval of one document matching one predicate
-
5PredSome, which is full document retrieval of documents matching multiple predicates
-
PartialAll, which is partial retrieval of all documents
-
PartialSome, which is partial retrieval of all documents matching certain criteria
For most queries, native XML storage significantly outperforms CLOB storage. Only full document retrieval that ignores the XML structure quickly reads XML data from CLOB fields.
With shredding, the XML data is mapped to a relational structure (which is then stored in a relational database). Here you can see that XQuery (which is used for the native XML) outperforms SQL querying with XML conversion (which is used for shredding) for most types of query. However, because searching over relational data is faster, the query that retrieves part of the data across all records is faster for shredded data.

These findings show that, in most circumstances, native XML storage provides significant performance gains over CLOB storage and shredding. However, I show here only part of the story. For instance, I did not include performance information for the ingestion of data. I encourage you to read the full paper at A performance comparison of DB2 9 pureXML and CLOB or shredded XML storage.




My name is Conor O'Mahony. I lead XML product strategy for Data Management at IBM. These postings are my opinions and do not necessarily represent IBM’s positions, strategies, or opinions.