XML versus Relational Database Performance

August 22, 2010

I have been asked many times: “What is faster, XML or Relational?”. Of course, this question oversimplifies a complex issue, and so the only valid answer is “It depends!”. Sometimes people ask the same question in a slightly different way: “If I have a relational table and convert each row into a small XML document and store these documents in a separate table with 1 XML column, what’s the performance difference between the two tables (for inserts/updates/queries)?”. But, in most cases such a conversion is not recommended and this type of comparison is, again, too simplistic.

Let’s say you want store, index, and query 1 million addresses and each address has a first name, last name, street, city, state, and zip code. That’s a simple and fixed structure and it’s the same for all records. It’s a perfect fit for a relational database and can be stored in a single table. Relational databases have been optimized for decades to handle such fixed records very efficiently. However, if the application needs to convert the address data to XML format anyway, it can often be faster to store the data permanently in XML and avoid the repeated conversion from relational format to XML format in the application.

Now consider a scenario where the business objects of interest are a lot more complex and variable than simple addresses. For example, derivative trades in the financial industry are modeled in XML with an XML Schema called FpML (financial products markup language). It defines more than 7000 fields (many are optional) with hundreds of 1-to-many relationships between them. Designing a relational schema to represent such objects is very hard and leads to hundreds of tables. The process of inserting (normalizing) and reading (denormalizing) a single object into such a relational schema can easily be 10x or 100x slower than inserting and reading a corresponding XML document in a native XML column (e.g. in DB2 pureXML).

So, any performance comparison of XML versus relational depends heavily on which data you choose for the comparison, and what type of operations you measure.

A large DB2 customer recently compared XML to relational performance because their business objects are currently mapped to 12 relational tables. Their application executes at least 12 SQL statements to retrieve all the relational data that comprises one of the logical business objects. Then the application reassembles the original business object. An alternative is to store these business objects as XML so that each object is stored as a single document. Instead of 12 tables, only 1 table with 1 XML column is then needed. In multi-user tests for data retrieval the the company found that the XML-based solution allows them to retrieve objects with 55% higher throughput than the existing relational SQL-based solution. The reasons for the performance benefit include fewer JDBC interactions between application and database as well as fewer distinct pages that need to be read when one logical business object is represented as one XML document (and not scattered over 12 tables). These tests were later repeated and verified at an IBM lab.

Another important considertion in the question of XML versus relational is the data format in which the data is produced and consumed outside the database. If the data is produced and/or consumed in XML format anyway, it is often better to also store the data as XML in the database.

So, the question “What is faster, XML or Relational?” is somewhat like asking “What is faster, a truck or a ship?”, because XML and relational are meant for different purposes, and either one can outperform the other depending on which use case you look at. And there are also use cases (with high schema complexity and schema variability over time) that cannot reasonably be implemented in a relational data model. (If you need to go to Hawaii, the boat always beats the truck!).

The beauty of a hybrid database system such as DB2 is that you can use both native XML and relational capabilitues side by side, in a tightly integrated manner. Some data is better representated in relational tables, other data is better represented in XML, and you can manage both in the same database or even the same table, and with the same APIs and utilities.

About these ads

26 Responses to “XML versus Relational Database Performance”

  1. Bsmallah Says:

    which is better xml or traditional DB in asmall one?


  2. Well, it depends. What I was trying to convey in this blog post was that there is no simple answer to this question. The answer depends heavily on the type of data and workload that needs to be supported. Do you have a specific application scenario in mind?

  3. Chris Says:

    Hello Matthias.
    First of all I want to congratulate you on this post.
    I am trying to decide what I should use, XML or db, for the next situation: I need to make a chat bot that supports a few thousands answers and questions, and I am not sure whether to use AIML, based on XML, or a db to store everything. Maybe you can help me with an opinion.
    Thank you in advance.


    • Hi Chris,

      when you say you’re not sure whether to use use AIML (based on XML) or a db to store everything, my immediate response would be that using an XML format (such as AIML) and using a database are not mutually exclusive. You can do both. You can use the XML format that is best fit for your scenario, and also use a database to store and query this XML.

      The first question is which data format is most suitable for your application. The data format should provide all the flexibility and extensibility that you might need, and should allow you to model all application data in a meaningful way, so that application development becomes easy. If that turns out to be an XML format – great! If your application needs to manipulate or produce the data in XML format anyway, then using XML also as a storage format is often beneficial because it avoids the cost of data transformations to/from XML format.

      The second question is where you will store that data, and this question is orthogonal to the first. Do you use a database (such as the free edition of DB2) or simply a file in the file system? This depends on how much data you will have, how efficient the search needs to be, and whether you need transactional consistency, support for concurrent updates, logging and recovery, and all the other advantages that a database provides over a file in the file system.

      If the answer are “yes, I want to use XML”, and “yes, I need a database”, then that’s what you should do.

      Matthias


  4. Hi,
    We have setup environment to benchmark the xml performance of DB2, SQLSERVER and relational model in SQLSERVER.
    We have 10000 transactions and retrieving 100 on the basis of filtering. In-order to test the worst case scenario, we managed to clear the buffer cache for SqlServer and restarted DB2 prior each execution.
    This query took 80 seconds in Sql Server, while it took .5 second for relational data model and in db2 it took 2 seconds. Why this query is taking 2 seconds in DB2, it should be less than a second?

    We have indexed the 3 elements of XML column, used in the where clause, as well as other fields in the where.

    Table Name Number of Records in table Number of records returned
    ———- ————————– ————————–
    ApplicationData 10000
    tmptblStage 8
    tmpuser 3

    SELECT
    A9931.ApplicationInstanceId as c_ApplicationInstanceId,
    A9931.RecordID as c_RecordID,
    A9931.CurrentStage as c_CurrentStage,
    Stg.CurrentStageName AS Current__987__Stage,
    U.FullName as Originator,
    AT.FullName as Assign__987__To,
    AT.UserID as Assign__987__ToID, A9931.CreatedBy as CreatedBy,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/RecordID/text()’) as RecordID,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Name/text()’) as Name,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Phone__987__Number/text()’) as Phone__987__Number,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Email__987__Address/text()’) as Email__987__Address,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Experience/Row/Years__987__of__987__service/text()’) as Years__987__of__987__service,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Experience/Row/Job__987__Title/text()’) as Job__987__Title,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Name75/text()’) as Name75,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Name76/text()’) as Name76,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Name77/text()’) as Name77,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Name78/text()’) as Name78,
    XMLQUERY( ‘$APPINSTANCEDATA/ApplicationData/Name79/text()’) as Name79
    from ApplicationData A9931,tmptblStage Stg,tmpuser u,tmpuser AT
    WHERE
    (
    XMLEXISTS(‘$APPINSTANCEDATA/ApplicationData[Name[1] cast as xs:string ? = xs:string(“Applicant1″) ]’)
    And XMLEXISTS(‘$APPINSTANCEDATA/ApplicationData/Experience/Row[Job__987__Title[1] = “Software Engineer”]’)
    And XMLEXISTS(‘$APPINSTANCEDATA/ApplicationData/Experience/Row[Years__987__of__987__service[1] = 10]’)
    )
    AND A9931.ApplicationDefinitionID=9931 AND A9931.CurrentStage ‘s_84b7708a944a4595a0e11de829fef61e’AND A9931.CurrentStage = Stg.CurrentStageID
    AND A9931.Status = 1
    AND A9931.CreatedBy = U.UserID
    AND A9931.Assigned_To_User =AT.UserID
    and (A9931.CurrentStage in (‘s_1d325a5bd2a34e148f6af55393ab8013′,’s_d5c1c2f03aa34a229024d25c3bca0e3e’,’s_332f79e0d1c54057ad35b96eec740b58′,’s_332f79e0d1c54057ad35b96eec740b58′,’s_c3deda94df1a40769e7b5baf003ee0dc’,’s_c3deda94df1a40769e7b5baf003ee0dc’,’s_ca28f4b047af48f59f028eef57316e49′,’s_70f2ff64c7fa4d5e9ee7e8fc3f783cae’,’s_1d325a5bd2a34e148f6af55393ab8013′,’s_d5c1c2f03aa34a229024d25c3bca0e3e’,’s_84b7708a944a4595a0e11de829fef61e’,’s_332f79e0d1c54057ad35b96eec740b58′,’s_c3deda94df1a40769e7b5baf003ee0dc’,’s_ca28f4b047af48f59f028eef57316e49′,’s_70f2ff64c7fa4d5e9ee7e8fc3f783cae’))
    ;


    • Hi Kashif,
      thanks for your question. So, I understand that you are comparing the performance of this query in 3 environments:
      1. XML data in SQL Server –> 80 seconds
      2. XML data in DB2 –> 2 seconds
      3. relational data in SQL Server –> 0.5 seconds
      Have I summarized this correctly? To further improve the DB2 performance, it would be good to look at the execution plan to verify whether it is the most optimal plan and to see which of the indexes have been used. I’ll be happy to help you with this investigation. Could you capture the execution plan (ideally with the db2exfmt tool) and send the output file from db2exfmt to me at mnicola@us.ibm.com? Let me know in case you’d like instructions for db2exfmt. I have a few more ideas of what we could look at, but let’s start with the execution plan. I’m looking forward to receiving an email from you!

      Thanks,

      - Matthias


      • Hi Matthias,
        Thanks for your quick response. Yes you have summarized the analysis correctly.

        The execution plan of the query has been emailed to you.

        – Kashif Irshad

  5. Albus Says:

    Hii Dr. Nicola,

    We are working with huge (say about 4K tags) and complex xmls , though the queries aren’t that complex . Do you think we should go for a document based database or a xml based database.


    • Hi Albus,
      I would need to see a lot more information before I could possibly give you a reliable recommendation. For example, what exactly do you mean by “document based database”? Do you mean a Content Management repository that has limited XML capabilities?

      You should make a list of the most important capabilities that you need (e.g. indexing, support for XQuery or maybe SQL/XML, support for XML updates, high availability, scalability, etc.) and check which systems comes closest to your needs. The choice will also depend on the volume of documents and the number of users and queries that you will need to support.
      Due to the complexity and the large number of (distinct?) tags, it will likely be very hard to map your XML to a traditional relational database schema. So, a pure relational storage might not be the best choice.

      Thanks,

      Matthias

      • Albus Says:

        Hi !!
        Thank you for replying .
        What i meant from a document based db was something like mongoDB or cassandra.

        The db would be used for generating reports from the xmls stored, without many users , so scalability would not be an issue.

        The tags will be distinct and i need to write queries to fetch data from them.

        thanks

  6. Enom Says:

    Hi Matthias, I work in an enterprise environment as UI/UX. Some of our developers prefer to develop in a manner where data is stored in a DB (Oracle 10g) then generating XML files that are then stored on the web server as files (not in DB). The claim is quicker access and will result in no broken applications should the db server be overloaded or go down. I’ve experienced multiple breakages on the front end due to the XML not being parsed properly, thus the end result is my front end interfaces being broken. Is this a good strategy to use XML in this way, versus simply doing db calls?


  7. Hi Enom,

    there might be exceptions, but as a general design concept I would not recommend replicating data from a database server to the file system of the application server or web server. There are very good reasons why a database server is being used in the first place! (Certainly, an app server or web server may do its own caching of data, but that’s a different story.)

    I also don’t quite agree with the motivation, i.e. to protect the application from database outages or overload. There are better ways to address such issues. Database systems have proper features for high availability, fault tolerance, and for performance and load management.

    And most of all, storing data as XML files in the file system means that XML parsing is required to access this data. XML parsing is an expensive operation and a drag on performance. The benefit of native XML storage in a database is that the XML is stored in a parsed format and therefore accessible without incuring XML parsing. Excessive XML parsing on a web server or application server can easily lead to performance problems or out-of-memory situations on these servers. I know many applications that have chosen native XML storage in a database to relieve the application server from repeated XML parsing, to reduce CPU and memory consumption, and to improve end-to-end system performance.

    Matthias

  8. pradeep Says:

    i’m doing a project on XML before i start off my project i want to know that in which situation we are going to use XML database and why?


  9. Hi Pradeep,

    for starters, here are 5 reasons / use cases for storing XML in a database:

    http://nativexmldatabase.com/2010/09/28/5-reasons-for-storing-xml-in-a-database/

    - Matthias

  10. pradeep Says:

    when we store the data in xml there is a possibitly of ambiguity occurs when we try to search witha keyword this happens only when text node and text value are same…ryt??

  11. pradeep Says:

    Towards an effective XML keyword search is my project which is an IEEE paper 2010


  12. You mean ambiguity between text that is XML tag name vs. the value of an XML element or attribute? The DB2 Text Search feature for XML distinguishes between tag names and actual values, so there is no ambiguity.

    • pradeep Says:

      Q1)When we use XML for database as a storage then what is the need of DB2 . Is DB2 is used for jus resolving the ambiguity incase for the XML ?i knew DB2 is a kind of Data processing unit for the database.

      Q2)I have been asked many times: “What is faster, XML or Relational?”. Of course, this question oversimplifies a complex issue, and so the only valid answer is “It depends!” as said by you.Could you gimme an example for this i mean for the small data is XML faster?

      Do you mind that i’m asking you questions many times?
      because i don’t have assistance or help from any1.


      • Hi Pradeep,

        re Q1): Note that XML itself is not a database. XML is just a data format. Where do you store the XML? Storing the XML in a database like DB2 gives you significant benefits over storing XML in the file system, such as full support for XQuery and SQL/XML to query the XML, XML indexing for performance, XQuery Update Facility, efficient XML Schema validation, full logging and recoverability, and options for replication, high availability, data partitioning and clustering, and so on. Many of the same reason why you store relational data in a database and not in a text file.

        re Q2) On example where storage data as XML is faster than normalized relational storage is when you are dealing with complex business records, such as financial trades, patient records, tax returns, and so. Storing them in a normalized relational database schema often requires dozens if not hundreds of tables. This implies that a business record needs to broken into many pieces that need to be inserted across the tables, and later these pieces need to be retrieved to reassemble the original business record. This can be complicated and slow. If each business record is represented as a single XML document, then insert and retrieval can often be much faster.

        In this context you may also find the following article interesting:
        http://www.ibm.com/developerworks/data/library/techarticle/dm-1201normalizationpart2/index.html

        Matthias

  13. Adriana Says:

    Hi.
    I like your article. I am writing my final thesis at the university and I needed advice. Do you know an application EBX? I compare XML and relational repository for the platform. I did not find any literature on this subject and I really needed your help. Thank you Adriana

  14. Basel Shbeb Says:

    Actually this post is reasonable and i liked the concept of “it depends” that is used above. Now i have a problem for more than 4 month and i tried so many ways but nothing is useful i would like to present the case to you hoping you could give me some notes or help me in any way.the case is as following:
    I am programming a website for “livescore” i used PHP,JavaScript,and HTML I have more than 500 XML files (1.5MB each) that i should request them all in some point in my website,this process take time more than you ever imagine i tried to download those files everyday automatically and request them using $ajax locally from my server that is reduced the time very much but it still tooooo long (it needs more than 3 minutes to load those 500 xml pages locally). Because I know that it is not efficient to call 500 huge & slow file into website (it is even not reasonable) I am trying to find a way to solve this huge problem so i thought about converting XML (which is from third party) to database records after using a parser and then cal the information from my database
    i used SAX parser with XML and save the information in MySql tables but unfortunately the php code that i used to parse the XML is stopping (not finishing all the file) it is not parsing issue and not sql error i think it is memory error or resource error. Now i am done with it i try every single thing that i could do with it and nothing useful then i came by your site i hope you could help me with that i want to know is it possible to write a code to convert my 500 XML (Mega-Size files) to (sql records it will be millions of records) and how? and if it is possible is it efficient and fast beacuse my main problem is slow ? i hope you could help me with that thanks


    • Hi Basel,
      I probably don’t know enough about your application to give precise recommendations.
      1.5 MB per XML file is somewhat larhge, but not very large and should not make a SAX parser run out of memory. The problem you have seen may depend on the specific parser you are using, or on how you are using it.

      Certainly, such XML files can be converted to relational format, which is a decision that is subject to various trade-offs. The general pros and cons are discussed in the free sample chapter (section 11.1) of the pureXML Cookbook:
      http://www.ibmpressbooks.com/content/images/9780138150471/samplepages/0138150478_Sample.pdf

      If you decide that converting the XML to relational format is the right choice for your case, I would suggest that you let the database do the work for you, so that you don’t have to deal with XML parsing at all. You should only need to define the declarative mapping from XML to relational tables, and then let the database do the rest. That’s much easier and less error-prone that writing procedural code to crunch the XML.

      You could check whether your database (mySQL) offers such in-built XML capabilities. You could also use the free version of DB2 and use its XML features. The DB2 features to convert XML to relational are describes in the sample chapter mentioned above. Ingesting XML files of up to 2GB works very well. The performance is approximately linear in the size of the documents.
      Good luck.

  15. anwar Says:

    how do you monitor performance of an xml database, say in a load test scenario, what should we monitor? just hardware resouces?


  16. Hi Anwar,

    sorry for the delay in responding.

    What to monitor really depends on what you are after and what the goals are of the load test. Yes, I would certainly monitor system resources such as CPU utilization, I/O wait time, and IOPS, typically using tools such as vmstat and iostat.

    Additionally you may want to monitor worklaod level metrics, such as response times and/or throughput of the queries or transactions that the load test is executing.

    And then for database tuning purposes you may also want to monitor database internal counters such as the buffer pool hit ratio, lock waits (if any), time spent on log writes (if any), number of page reads and writes, and s on.

    Matthias

  17. leobel Says:

    Hi. In process mining environment, I have to identify the best way to store and manage the event logs, assuming that these are modeled by XES format (eXtensible Event Stream), which represents a standard for managing relationships and attributes traces and events in an event log. The decision taken should be based on the characterization of this model only, as there is no real data to measure the performance of several implementations. Because of this I need a way to correlational variables present in the XES format, so that for a proposed solution, to predict their performance. In specific I mean the implementation either in a relational database, XML-native, hybrid, etc.. It is important to note that the process mining algorithms are the users, namely that processing tool must be taken into account with these layers interact efficiently….


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 51 other followers

%d bloggers like this: