How to query CDATA sections in XML
May 24, 2012
First, what is a CDATA section in XML? CDATA stands for character data, and a CDATA section in an XML document defines a piece of text that should not be parsed by an XML parser.
Normally, an XML parser examines every character in an XML document, even the characters that make up the text value of an XML element. The reason for this behavior is that one element could contain other elements, and so the parser must look for the “<“-character that would denote the beginning of another element.
However, if you want the parser to ignore certain portions of text then you can enclose this text in a CDATA section. A CDATA section starts with <![CDATA[ and ends with ]]>.
A CDATA section can be useful if the text value of an XML element contains any reserved characters. For example, the characters “<” and “&” have a special meaning in XML and must not appear in the text value of an element. (The “&” marks the beginning of an entity reference in XML.) But, if text is declared as a CDATA section then its characters are ignored, including any illegal characters.
Let’s look at an example where I’m inserting the string “Food & Wine” with the reserved character “&” in it.
The first insert fails because the “&” is not properly escaped and not included in a CDATA section. The other 4 inserts are successful. In these 4 cases the CDATA section spans different portions of the string but always includes the “&”:
create table cdata(id int, doc XML); insert into cdata values(0, '<doc><name>Food & Wine</name></doc>'); insert into cdata values(1, '<doc><name><![CDATA[Food & Wine]]></name></doc>'); insert into cdata values(2, '<doc><name><![CDATA[Food &]]> Wine</name></doc>'); insert into cdata values(3, '<doc><name>Food <![CDATA[& Wine]]></name></doc>'); insert into cdata values(4, '<doc><name>Fo<![CDATA[od & Wi]]>ne</name></doc>');
When you insert and later retrieve an XML document in DB2, the CDATA sections are not preserved. Instead, any offending characters in the CDATA section are escaped according to the XML standard. Upon insert the CDATA sections are removed, and upon retrieval the “&” is escaped as & as it should be. Similary, a “<” sign in the CDATA section would be escaped to < .
select * from cdata; 1 <doc><name>Food & Wine</name></doc> 2 <doc><name>Food & Wine</name></doc> 3 <doc><name>Food & Wine</name></doc> 4 <doc><name>Food & Wine</name></doc> 4 record(s) selected.
To be precise, the escaping of reserved characters happens upon XML serialization to text.
Now what happens if we want to index and query such XML that has (had) CDATA sections in it? Well, let’s give it a try. First, let’s create an index and then query for “Food & Wine” with the “&” in it:
create index idx1 on cdata(doc) generate keys using xmlpattern '/doc/name' as sql varchar(35); select id from cdata where xmlexists('$DOC/doc/name[ . = "Food & Wine"]'); SQL16002N An XQuery expression has an unexpected token "&" following "Food ".
The query fails, as expected, because the “&” is a reserved character. We need to follow XQuery rules and escape the “&”:
select id from cdata where xmlexists('$DOC/doc/name[ . = "Food & Wine"]'); ID ----------- 1 2 3 4 4 record(s) selected.
As expected, all 4 rows are returned.If you check the execution plan you will also find that the index can be used.
Conclusions: The use of CDATA sections does not impair the queryability and indexability of XML elements in DB2. Also, it doesn’t matter whether you wrap the CDATA section just around the offending characters or around a larger piece of text in the element value.
Of course, in this simple example we wouldn’t necessarily have to use a CDATA section to escape the “&” in the input documents. We could have escaped the “&” as “&” before inserting the documents. But, if an an element contains a longer piece of text that comes from an XML-agnostic source (i.e. various illegal characters may be present), then it’s much easier to make it a CDATA section than to find and escape the offending characters one by one.