How to query CDATA sections in XML

May 24, 2012

First, what is a CDATA section in XML? CDATA stands for character data, and a CDATA section in an XML document defines a piece of text that should not be parsed by an XML parser.

Normally, an XML parser examines every character in an XML document, even the characters that make up the text value of an XML element. The reason for this behavior is that one element could contain other elements, and so the parser must look for the “<”-character that would denote the beginning of another element.

However, if you want the parser to ignore certain portions of text then you can enclose this text in a CDATA section. A CDATA section starts with <![CDATA[  and ends with ]]>.

A CDATA section can be useful if the text value of an XML element contains any reserved characters. For example, the characters “<” and “&” have a special meaning in XML and must not appear in the text value of an element. (The “&” marks the beginning of an entity reference in XML.) But, if text is declared as a CDATA section then its characters are ignored, including any illegal characters.

Let’s look at an example where I’m inserting the string “Food & Wine” with the reserved character “&” in it.

The first insert fails because the “&” is not properly escaped and not included in a CDATA section. The other 4 inserts are successful. In these 4 cases the CDATA section spans different portions of the string but always includes the “&”:

create table cdata(id int, doc XML);

insert into cdata
values(0, '<doc><name>Food & Wine</name></doc>');

insert into cdata
values(1, '<doc><name><![CDATA[Food & Wine]]></name></doc>');

insert into cdata
values(2, '<doc><name><![CDATA[Food &]]> Wine</name></doc>');

insert into cdata
values(3, '<doc><name>Food <![CDATA[& Wine]]></name></doc>');

insert into cdata
values(4, '<doc><name>Fo<![CDATA[od & Wi]]>ne</name></doc>');

When you insert and later retrieve an XML document in DB2, the CDATA sections are not preserved. Instead, any offending characters in the CDATA section are escaped according to the XML standard. Upon insert the CDATA sections are removed, and upon retrieval the “&” is escaped as &amp; as it should be. Similary, a “<” sign in the CDATA section would be escaped to &lt; .

select * from cdata;

1      <doc><name>Food &amp; Wine</name></doc>
2      <doc><name>Food &amp; Wine</name></doc>
3      <doc><name>Food &amp; Wine</name></doc>
4      <doc><name>Food &amp; Wine</name></doc>

4 record(s) selected.

To be precise, the escaping of reserved characters happens upon XML serialization to text.

Now what happens if we want to index and query such XML that has (had) CDATA sections in it? Well, let’s give it a try. First, let’s create an index and then query for “Food & Wine” with the “&” in it:

create index idx1 on cdata(doc) generate keys using xmlpattern '/doc/name' as sql varchar(35);

select id
 from cdata
 where xmlexists('$DOC/doc/name[ . = "Food & Wine"]');

SQL16002N  An XQuery expression has an unexpected token "&" following "Food ".

The query fails, as expected, because the “&” is a reserved character. We need to follow XQuery rules and escape the “&”:

select id
 from cdata
 where xmlexists('$DOC/doc/name[ . = "Food &amp; Wine"]');

ID
 -----------
 1
 2
 3
 4

4 record(s) selected.

As expected, all 4 rows are returned.If you check the execution plan you will also find that the index can be used.

Conclusions: The use of CDATA sections does not impair the queryability and indexability of XML elements in DB2. Also, it doesn’t matter whether you wrap the CDATA section just around the offending characters or around a larger piece of text in the element value.

 

Of course, in this simple example we wouldn’t necessarily have to use a CDATA section to escape the “&” in the input documents. We could have escaped the “&” as “&amp;” before inserting the documents. But, if an an element contains a longer piece of text that comes from an XML-agnostic source (i.e. various illegal characters may be present), then it’s much easier to make it a CDATA section than to find and escape the offending characters one by one.

 

 

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 51 other followers

%d bloggers like this: