January 21, 2011
Today’s topic is “Name/Value Pairs” – a data modeling approach that is also known as Key-Value Pairs (KVP) or Entity-Attribute-Value model (EAV).
In the relational database world, the Name/Value Pair approach is sometimes used when the number of required columns in a table is very large, unknown, or changing over time. In that case it is difficult to define a table with a fixed set of columns. That’s when some people resort to Name/Value Pairs.
To give an example, consider the following record from a traditional database table with columns id, firstname, lastname, company, and dateOfBirth:
(1, John, Doe, IBM, 1978-11-05)
In a Name/Value Pair table, the same information would be stored across four different rows with columns id, name, and value:
(1, firstname, John)
(1, lastname, Doe)
(1, company, IBM)
(1, dateOfBirth, 1978-11-05)
The flexibility of this table is wonderful (at first!), because you can store records of pretty much any format in such a table, even if the set of required fields is not known in advance or growing over time. Instead of defining fixed column names, you simply stick the field names into the column name and pair them up with a value.
BUT… it’s well understood that such Name/Value Pair tables have serious problems that cause a lot of headaches over time. In particular:
- It’s very hard to define constraints for Name/Value Pair tables, because the logical meaning of the column value changes from row to row. Data quality is hard to enforce.
- Writing business queries against Name/Value Pair tables is complex and often requires many self-joins. Reporting queries with many predicates are particularly difficult and inefficient.
- Column statistics that usually aid database optimizers in the selection of good access plans often do not work well for Name/Value Pairs.
- Mixing values of different data types in the value column causes further complications. The value column needs to be of type VARCHAR, which means that correct numeric or DATE comparisons are not straightforward, may require casting, and adequate indexes are difficult to define. Note that the numeric comparison value > 5000 is not the same as the string comparison value > ‘5000’. The latter is also true if value contains the string ‘7’. A workaround is to define multiple value columns, one for each required data type, but this leads to sparsely populated columns with lots of NULL values. It’s just not pretty.
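To make the self-join and data-type problems concrete, here is a minimal sketch using SQLite in Python. The table and column names are illustrative, not from any particular system; the point is that reconstructing one logical record takes one self-join per attribute, and that string-typed values compare lexicographically:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person_eav (id INTEGER, name TEXT, value TEXT)")
con.executemany(
    "INSERT INTO person_eav VALUES (?, ?, ?)",
    [(1, "firstname", "John"), (1, "lastname", "Doe"),
     (1, "company", "IBM"), (1, "dateOfBirth", "1978-11-05")],
)

# Reconstructing the original row requires one self-join per attribute.
row = con.execute("""
    SELECT fn.value, ln.value, co.value
    FROM person_eav fn
    JOIN person_eav ln ON ln.id = fn.id AND ln.name = 'lastname'
    JOIN person_eav co ON co.id = fn.id AND co.name = 'company'
    WHERE fn.name = 'firstname'
""").fetchone()
print(row)  # ('John', 'Doe', 'IBM')

# String comparison pitfall: compared as text, '7' sorts after '5000'.
print('7' > '5000')  # True
```

With a conventional table, the same record comes back with a single-row SELECT and no joins at all.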
Others have blogged extensively about the problems with Name/Value Pairs in relational databases long before me.
What I actually want to write about is the use of Name/Value Pairs in XML.
But… wait a minute! Why would anybody use Name/Value Pairs in XML? XML is extensible and flexible by design, so there isn’t really a problem with being constrained by a fixed set of fields. New XML elements and attributes can be invented and added as needed, specifically to deal with variable and evolving data formats. Many applications use XML for exactly this reason.
However, I have seen various cases where people have decided to model their XML with Name/Value Pairs. As a simple example, the following XML document describes a car with Name/Value Pairs:
<object type="car">
  <field name="brand" value="Honda"/>
  <field name="price" value="5000"/>
  <field name="year" value="1996"/>
</object>
This XML document uses generic and non-descriptive element and attribute names, such as object, field, name, and value. This defeats the purpose of XML as a markup language, where tag names should describe the meaning of the different data items. This is similar to the column names of a relational table, which should describe the data in the columns.
In this example, the values “brand”, “price”, and “year” are metadata and should really be tag names, because they describe the intended meaning of the values “Honda”, “5000”, and “1996”. Here is what this XML document should look like:
<car>
  <brand>Honda</brand>
  <price>5000</price>
  <year>1996</year>
</car>
The problems with using Name/Value Pairs in XML are similar to those in relational tables:
- For the XML document with Name/Value Pairs above, it is impossible to define an XML Schema that enforces, for example, the following rule: If the attribute “name” has the value ‘price’, then the attribute “value” in the same “field” element must be numeric and greater than 0. In contrast, this rule is easy to define in an XML Schema for explicit tag names.
- Let’s say you often search by “year” and want an XML index to support such searches. If you use Name/Value Pairs, most databases will not allow you to define an index on only those “value” attributes where the “name” attribute in the same “field” element has the value ‘year’. As a result you will likely index more data than needed, which means your index is larger and more expensive to maintain than with explicit tag names. Insert, update, delete performance will suffer.
- Writing XPath expressions against Name/Value Pairs is complex and less efficient.
Not yet convinced? Try to write an XPath expression that retrieves the year of all the Honda cars that have a price of at least 5000. For Name/Value Pair documents the query would look like this (yikes!):
/object[@type = "car"
        and field[@name = "brand" and @value = "Honda"]
        and field[@name = "price" and @value >= 5000]
]/field[@name = "year"]/data(@value)
For corresponding documents with explicit tag names, the same query is much simpler and will be faster:
/car[brand="Honda" and price >= 5000]/year
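The same gap shows up in plain application code. Here is a sketch using Python’s ElementTree, with the two car documents from above wrapped in a made-up root element so there are multiple cars to filter. The Name/Value Pair version needs a lookup helper for every attribute access, while the explicit-tag version reads the structure directly:

```python
import xml.etree.ElementTree as ET

nvp_doc = """<objects>
  <object type="car">
    <field name="brand" value="Honda"/>
    <field name="price" value="5000"/>
    <field name="year" value="1996"/>
  </object>
  <object type="car">
    <field name="brand" value="Honda"/>
    <field name="price" value="8900"/>
    <field name="year" value="2003"/>
  </object>
</objects>"""

plain_doc = """<cars>
  <car><brand>Honda</brand><price>5000</price><year>1996</year></car>
  <car><brand>Honda</brand><price>8900</price><year>2003</year></car>
</cars>"""

# Name/Value Pairs: every field access needs an indirection helper.
def field(obj, name):
    return obj.find(f"field[@name='{name}']").get("value")

nvp_years = [
    field(o, "year")
    for o in ET.fromstring(nvp_doc).findall("object[@type='car']")
    if field(o, "brand") == "Honda" and int(field(o, "price")) >= 5000
]

# Explicit tag names: the markup itself carries the meaning.
plain_years = [
    c.findtext("year")
    for c in ET.fromstring(plain_doc).findall("car")
    if c.findtext("brand") == "Honda" and int(c.findtext("price")) >= 5000
]

print(nvp_years, plain_years)  # ['1996', '2003'] ['1996', '2003']
```

Both versions return the same years, but the Name/Value Pair version pays the indirection tax on every single field reference.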
I rest my case.
January 8, 2011
XML processing is at the core of many middleware systems, often to route, split, or modify XML messages, or to extract values from them. Frequently, such XML processing is implemented with XSLT stylesheets or in application code with DOM or SAX parsers.
But, when XML messages are stored in a database system, application designers have a choice whether to perform XML manipulation in the application code or in the database server with XQuery or SQL/XML.
When a native XML database stores XML in a parsed format, such as in DB2 pureXML, then most XQuery and SQL/XML operations can be performed without additional XML parsing. This can be a significant performance benefit of processing XML in the database layer.
Coding XML manipulation declaratively in XQuery or SQL/XML is typically shorter than writing procedural application code that uses DOM or SAX. Hence, there is also a difference in development cost.
We have performed some tests to compare the performance and development cost of using SQL/XML or XQuery in DB2 as opposed to using DOM or SAX in a Java application. We started by defining several common XML processing use cases:
- Use Case 1.1 (“Extract5”): Extract a small number of values from randomly selected XML documents (approximately 5% of the element values).
- Use Case 1.2 (“Extract50”): Extract a medium number of values (50%).
- Use Case 1.3 (“Extract100”): Extract most values (almost 100%).
- Use Case 2.1 (“Split500”): Split a large XML document containing 500 concatenated XML fragments into individual XML documents. Concatenated XML documents are often used for download or FTP, but are not useful for individual processing.
- Use Case 2.2 (“Split2500”): Split a document containing 2500 fragments.
- Use Case 2.3 (“Split5000”): Split a document containing 5000 fragments.
- Use Case 3 (“Modify”): Insert, change, and delete XML elements in each randomly selected XML document.
We have implemented these use cases in a Java application in two ways:
(a) using SQL/XML and XQuery to push the XML manipulation to a DB2 9.7 database, and
(b) using a DOM and/or SAX parser in the Java code itself.
Both options read or write XML to/from a DB2 database, but the location and implementation of the XML manipulation differs. In option (a), the XML manipulation happens in the database as part of the database read or write operations. Option (b) reads and writes full documents from/to the database, but XML specific operations happen exclusively in the Java code. We measured all tests on the same hardware to ensure an apples-to-apples comparison.
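As an illustration of what the “Split” use cases (2.1–2.3) look like on the application side, here is a minimal sketch in Python’s ElementTree. The `<batch>`/`<order>` element names are invented for the example, not taken from the study; the idea is simply to serialize each child fragment of a concatenated document as a standalone document:

```python
import xml.etree.ElementTree as ET

# A "batch" document with concatenated fragments under one wrapper root.
batch = ET.fromstring(
    "<batch>"
    "<order id='1'><item>pen</item></order>"
    "<order id='2'><item>ink</item></order>"
    "<order id='3'><item>pad</item></order>"
    "</batch>"
)

# Split: serialize each child fragment as its own document.
documents = [ET.tostring(fragment, encoding="unicode") for fragment in batch]
for doc in documents:
    print(doc)
```

In option (a), the equivalent splitting is expressed declaratively in XQuery and runs inside the database, so the fragments never have to be materialized and re-parsed in the application layer.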
Let’s look at some of the results…
The following chart shows the throughput of the three “Extract” use cases in transactions per second. The SQL/XML-based value extraction in the database provides notably higher throughput than SAX parsing in Java, especially when the number of extracted values is small.
We observed similar performance benefits for performing “Split” operations (use cases 2.1, 2.2, and 2.3) in the database.
Modifications to XML documents are often implemented with XSLT or a DOM parser in application code. However, we find that inserting, updating, or deleting XML elements and attributes in a document can be significantly faster when XQuery Update expressions are run inside the database. The next chart shows the throughput for a mix of common XML update operations for 1, 5, 10, and 15 concurrent users:
Even more important than the performance benefit of in-database XML manipulation might be the reduction in development cost.
We counted the lines of code for both of our implementations, including the Java code to prepare and submit SQL/XML or XQuery statements and to bind extracted values to Java objects. The following chart shows that application-level XML processing can require 10x to 13x as many lines of code as database-level XML manipulation.
This is very significant, since the number of lines of code is a common metric to estimate software development cost, the number of coding errors, and software maintenance cost over time. Although some (but not nearly all!) of the development labor can be reduced by using XML mapping frameworks (such as JAXB), XQuery and SQL/XML are still much shorter and typically less costly to develop and maintain.
For more details on this topic, see: