
java: remove cdata tag from xml

Question:

XPath is nice for parsing XML files, but it's not working for data inside a CDATA section:

<![CDATA[ Some Text <p>more text and tags</p> ... ]]>

My solution: get the content of the XML first and remove "<![CDATA[" and "]]>". After that I would run XPath to reach everything in the XML file. Is there a better solution? If not, how can I do it with a regular expression?

Answer1:

The point of the CDATA markers is that everything inside them is <strong>pure text</strong>, not something to be interpreted directly as XML. The document fragment in your question could equivalently be written as

<pre class="lang-xml prettyprint-override"> Some Text &lt;p&gt;more text and tags&lt;/p&gt;...

(with a leading and trailing space).

If you really want to interpret this as XML, extract the text from your document, and submit it to an XML parser again.
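A minimal Java sketch of that idea, assuming the JDK's built-in DOM and XPath APIs. The class name, the element name content and the synthetic wrapper root are hypothetical; the wrapper is only there because the CDATA text may hold several top-level nodes and a document needs a single root.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataReparse {

    /** Parses an XML string into a DOM document. */
    public static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Evaluates an XPath expression and returns its string value. */
    public static String xpath(Document doc, String expression) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            return xp.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    /**
     * Reads the text at cdataPath (the CDATA content), wraps it in a
     * hypothetical <wrapper> element so the fragment has a single root,
     * and submits it to the parser again.
     */
    public static Document reparse(Document outer, String cdataPath) {
        String fragment = xpath(outer, cdataPath);
        return parse("<wrapper>" + fragment + "</wrapper>");
    }

    public static void main(String[] args) {
        String xml = "<root><content><![CDATA[ Some Text <p>more text and tags</p> ... ]]></content></root>";
        Document outer = parse(xml);
        Document inner = reparse(outer, "/root/content");
        System.out.println(xpath(inner, "/wrapper/p")); // prints "more text and tags"
    }
}
```

This only works when the CDATA text is itself well-formed; otherwise the second parse throws.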

Answer2:

I needed to accomplish the same task. I solved it with two XSLT stylesheets.

Just let me stress that this will only work if the CDATA is <a href="http://www.w3.org/TR/REC-xml/#sec-well-formed" rel="nofollow" title="well-formed xml">well-formed xml</a>.

To be complete, let me add to your example xml a root element:

<pre class="lang-xml prettyprint-override"><root> <well-formed-content><![CDATA[ Some Text <p>more text and tags</p> ]]> </well-formed-content> </root>

<strong>Fig 1.- Starting xml</strong>

<hr />

First step

In the first transformation step, I wrap all text nodes in a newly introduced XML element, old_text:

<pre class="lang-xml prettyprint-override"><?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" version="1.0" encoding="UTF-8" standalone="yes" />
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
    </xsl:copy>
  </xsl:template>
  <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
  <xsl:template match="@*|comment()|processing-instruction()">
    <xsl:copy-of select="." />
  </xsl:template>
  <!-- Text-nodes: Wrap them in a new node without escaping it. -->
  <!-- (note precondition: CDATA should be valid xml.) -->
  <xsl:template match="text()">
    <xsl:element name="old_text">
      <xsl:value-of select="." disable-output-escaping="yes" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

<strong>Fig 2.- First xslt (wrapping CDATA in "old_text" elements)</strong>

If you apply this transformation to the starting xml this is what you get (I'm not reformatting it to avoid confusion about who does what):

<pre class="lang-xml prettyprint-override"><?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text> </old_text><well-formed-content><old_text> Some Text <p>more text and tags</p> </old_text></well-formed-content><old_text> </old_text></root>

<strong>Fig 3.- Transformed xml (first step)</strong>

<hr />

Second step

You now need to clean up the introduced old_text elements and re-escape the text that didn't create new nodes:

<pre class="lang-xml prettyprint-override"><?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" version="1.0" encoding="UTF-8" standalone="yes" />
  <!-- Element-nodes: Process nodes and their children -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|text()|@*|comment()" />
    </xsl:copy>
  </xsl:template>
  <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
  <xsl:template match="@*|comment()">
    <xsl:copy-of select="." />
  </xsl:template>
  <!-- 'Wrapper'-node: remove the wrapper element but process its children.
       With this matcher, the "old_text" is cleaned, but the originally CDATA
       well-formed nodes surface in the resulting xml. -->
  <xsl:template match="old_text">
    <xsl:apply-templates select="*|text()" />
  </xsl:template>
  <!-- Text-nodes: Text here comes from original CDATA and must be now escaped.
       Note that the previous rule has extracted all the existing nodes in the CDATA. -->
  <xsl:template match="text()">
    <xsl:value-of select="." disable-output-escaping="no" />
  </xsl:template>
</xsl:stylesheet>

<strong>Fig 4.- 2nd xslt (cleaned-up artificially-introduced elements)</strong>

<hr />

Result

This is the final result, with the nodes that were originally in CDATA expanded in your new xml file:

<pre class="lang-xml prettyprint-override"><?xml version="1.0" encoding="UTF-8" standalone="yes"?><root> <well-formed-content> Some Text <p>more text and tags</p> </well-formed-content> </root>

<strong>Fig 5.- Final xml</strong>
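The two steps can be chained from Java roughly as follows. This is a sketch assuming the standard javax.xml.transform API with whatever XSLT 1.0 processor the JRE provides; the class and method names are hypothetical. Each step is serialized to a string on purpose, because disable-output-escaping only takes effect when the processor writes the output itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TwoStepTransform {

    /** Applies one stylesheet to an XML string and returns the serialized result. */
    public static String transform(String xml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            // Write to a stream, not a DOM: the unescaped markup must be
            // serialized so that the next step can parse it as real elements.
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    /** Runs the first stylesheet, then feeds its serialized output to the second. */
    public static String unwrapCdata(String xml, String firstXslt, String secondXslt) {
        String intermediate = transform(xml, firstXslt);
        return transform(intermediate, secondXslt);
    }
}
```

Calling unwrapCdata with the starting xml from Fig 1 and the stylesheets from Fig 2 and Fig 4, each loaded as a string, should reproduce the two steps in one go.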

<hr />

Caveat

If your CDATA contains html character entities not supported in xml (for examples, see this <a href="http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references" rel="nofollow" title="XML and HTML character entities">Wikipedia article about character entities</a>), you need to declare those entities in your intermediate xml. Let me show this with an example:

<pre class="lang-xml prettyprint-override"><root> <well-formed-content> <![CDATA[ Some Text <p>more text and tags</p>, now with a non-breaking-space before the stop:&nbsp;.]]> </well-formed-content> </root>

<strong>Fig 6.- Added character entity &nbsp; to xml in Fig 1</strong>

The original xslt from <strong>Fig 2</strong> will convert the xml into this:

<pre class="lang-xml prettyprint-override"><?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text> </old_text><well-formed-content><old_text> Some Text <p>more text and tags</p>, now with a non-breaking-space before the stop:&nbsp;. </old_text></well-formed-content><old_text> </old_text></root>

<strong>Fig 7.- Result of a first try to convert the xml in Fig 6 (Not well-formed!)</strong>

The problem with this file is that it is not well-formed, and thus cannot be processed further with an XSLT processor:

The entity "nbsp" was referenced, but not declared. XML checking finished.

<strong>Fig 8.- Result of the well-formedness checking for the xml in Fig 7</strong>

This workaround does the trick (the match="/" template adds the &nbsp; entity):

<pre class="lang-xml prettyprint-override"><?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" version="1.0" encoding="UTF-8" standalone="yes" />
  <!-- Add an html entity to the xml character entities declaration. -->
  <xsl:template match="/">
    <xsl:text disable-output-escaping="yes"><![CDATA[<!DOCTYPE root [ <!ENTITY nbsp "&#160;"> ]> ]]> </xsl:text>
    <xsl:apply-templates select="*" />
  </xsl:template>
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
    </xsl:copy>
  </xsl:template>
  <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
  <xsl:template match="@*|comment()|processing-instruction()">
    <xsl:copy-of select="." />
  </xsl:template>
  <!-- Text-nodes: Wrap them in a new node without escaping it. -->
  <!-- (note precondition: CDATA should be valid xml.) -->
  <xsl:template match="text()">
    <xsl:element name="old_text">
      <xsl:value-of select="." disable-output-escaping="yes" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

<strong>Fig 9.- The xslt creates the entity declaration</strong>

Now, after applying this xslt to the <strong>Fig 6</strong> source xml, this is the intermediate xml:

<pre class="lang-xml prettyprint-override"><?xml version="1.0" encoding="UTF-8" standalone="yes"?><!DOCTYPE root [ <!ENTITY nbsp "&#160;"> ]> <root><old_text> </old_text><well-formed-content><old_text> Some Text <p>more text and tags</p>, now with a non-breaking-space before the stop:&nbsp;. </old_text></well-formed-content><old_text> </old_text></root>

<strong>Fig 10.- Intermediate xml (xml from Fig 3 plus entity declaration)</strong>

You can use the xslt transformation from <strong>Fig 4</strong> to produce the final xml:

<pre class="lang-xml prettyprint-override"><?xml version="1.0" encoding="UTF-8" standalone="yes"?><root> <well-formed-content> Some Text <p>more text and tags</p>, now with a non-breaking-space before the stop: . </well-formed-content> </root>

<strong>Fig 11.- Final xml with html entities converted to UTF-8</strong>
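To see the effect of the entity declaration outside an XSLT run, here is a small Java check. It is a sketch assuming the JDK's default DOM parser; the class name, method name and the shortened sample strings are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityCheck {

    /** Returns true if the JDK's default DOM parser accepts the string as well-formed XML. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String body = "<root><old_text>stop:&nbsp;.</old_text></root>";
        String doctype = "<!DOCTYPE root [ <!ENTITY nbsp \"&#160;\"> ]>";
        System.out.println(isWellFormed(body));           // false: nbsp is not declared
        System.out.println(isWellFormed(doctype + body)); // true: nbsp is declared
    }
}
```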

<hr /><h3>Notes</h3>

For these examples I used the XSLT processor built into NetBeans 7.1.2 (com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl, the default JRE XSLT processor).

<em>Disclaimer: I'm not an XML expert. I have the feeling that this should be even easier...</em>

Answer3:

To strip the CDATA and keep the tags as tags, you could use XSLT.

<strong>Given this XML input:</strong>

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
  <child>Here is some text.</child>
  <child><![CDATA[Here is more text <p>with tags</p>.]]></child>
</root>

<strong>Using this XSLT:</strong>

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*" />
      <xsl:value-of select="text()" disable-output-escaping="yes"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

<strong>Will return the following XML:</strong>

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <child>Here is some text.</child>
  <child>Here is more text <p>with tags</p>.</child>
</root>

(Tested with Saxon HE 9.3.0.5 in oXygen 12.2)

Then you could use XPath to extract the contents of the p element:

/root/child/p

Answer4:

You can remove the CDATA markers from the XML with a regular expression that matches the opening and closing delimiters.

For example:

String s = "<sn><![CDATA[poctest]]></sn>";
// Remove every "<![CDATA[" and "]]>" delimiter, keeping the enclosed text.
s = s.replaceAll("<!\\[CDATA\\[|\\]\\]>", "");

The result will be:

<sn>poctest</sn>

Please check whether this solves your issue.

Answer5:

Try this:

// Requires java.util.regex.Matcher and java.util.regex.Pattern.
public static String removeCDATA(String text) {
    String resultString = "";
    Pattern regex = Pattern.compile("(?<!(<!\\[CDATA\\[))|((.*)\\w+\\W)");
    Matcher regexMatcher = regex.matcher(text);
    while (regexMatcher.find()) {
        resultString += regexMatcher.group();
    }
    return resultString;
}

When I call this method with your test input <![CDATA[ Some Text <p>more text and tags</p> ... ]]>, it returns Some Text <p>more text and tags</p>

But I think a method without regular expressions will be more reliable. Something like this:

public static String removeCDATA(String text) {
    String s = text.trim();
    if (s.startsWith("<![CDATA[")) {
        // Skip the 9 characters of the opening "<![CDATA[" delimiter.
        s = s.substring(9);
        int i = s.indexOf("]]>");
        if (i == -1) {
            throw new IllegalStateException("argument starts with <![CDATA[ but cannot find pairing ]]>");
        }
        s = s.substring(0, i);
    }
    return s;
}
