Test if document is well formed before parsing

Question:

I need to analyze a few thousand XML documents to see whether any of them contain a certain construct. The problem is that some of the documents are not well-formed XML.

The basic idea was to use fn:collection() and search inside the nodes it returns. But this only works if all documents in the collection are well formed.

Is it possible to do something similar but parse only the well-formed documents?

This is my XSLT, simplified, which works if all documents in $dir are well formed:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
                xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text"/>
  <xsl:variable name="dir" as="xs:string">file:/c:/path/to/files/</xsl:variable>
  <xsl:variable name="files" select="concat($dir, '?select=*.xml')" as="xs:string"/>
  <xsl:template match="/">
    <xsl:variable name="docs" select="collection($files)"/>
    <xsl:variable name="names" select="
      for $i in $docs
      return distinct-values($i//*[exists(@an-attribute-to-find)]/local-name())"/>
    <xsl:value-of select="distinct-values($names)" separator="&#x0a;"/>
  </xsl:template>
</xsl:stylesheet>

Would it be possible to do something like this without manually sorting out the malformed documents before the transformation starts? Or is there a better approach?

Answer1:

At present this is best done outside XSLT.

It can be done in XSLT if you pass the transformation an external parameter (<xsl:param>) listing all the filenames to be processed. The transformation can then call the standard XPath 2.0 function doc-available() on each URI and load, with doc(), only the documents for which it returns true.
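
A minimal sketch of that approach follows. It assumes the caller supplies a parameter, here given the hypothetical name file-list, holding a whitespace-separated list of absolute document URIs; the parameter name and the list format are illustrative choices, not something the answer specifies. The attribute test is taken from the question's stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="2.0">
  <xsl:output method="text"/>

  <!-- Hypothetical parameter: whitespace-separated absolute URIs,
       built outside XSLT (for example by a script that lists the directory). -->
  <xsl:param name="file-list" as="xs:string" required="yes"/>

  <xsl:template match="/">
    <!-- Split the list into individual URIs. -->
    <xsl:variable name="uris" as="xs:string*"
                  select="tokenize(normalize-space($file-list), ' ')"/>

    <!-- doc-available() returns false when a URI cannot be retrieved or
         parsed as well-formed XML, so those URIs are skipped and only
         the parsable documents are loaded. -->
    <xsl:variable name="docs" as="document-node()*"
                  select="for $u in $uris[doc-available(.)] return doc($u)"/>

    <!-- Same search as in the question, over the documents that loaded. -->
    <xsl:value-of
        select="distinct-values($docs//*[exists(@an-attribute-to-find)]/local-name())"
        separator="&#x0a;"/>
  </xsl:template>
</xsl:stylesheet>
```

As in the question, the template matches "/", so the transformation still needs some well-formed source document to start from. Relative URIs in the list would be resolved against the stylesheet's base URI, which is why absolute URIs are assumed here.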

Answer2:

You could use the doc-available function to tell you if a document is well-formed.

Answer3:

You could use TagSoup (http://vrici.lojban.org/~cowan/XML/tagsoup/) to ensure that all of the documents are well-formed.

If you are using Saxon, you can make TagSoup your parser. The TagSoup documentation (http://vrici.lojban.org/~cowan/XML/tagsoup/tsaxon/) describes the option:

"...you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath."
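
With Saxon there may also be a way to stay with collection(). The sketch below assumes Saxon's processor-specific query parameters on a directory collection URI: on-error=ignore is meant to skip files that fail to parse, and parser= is meant to name an alternative XMLReader class such as TagSoup. Neither parameter is mentioned in the answers above and neither is part of standard XPath, so check both against the documentation for your Saxon version before relying on them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="2.0">
  <xsl:output method="text"/>

  <xsl:variable name="dir" as="xs:string">file:/c:/path/to/files/</xsl:variable>

  <!-- Assumption: Saxon-specific collection URI parameters, separated by ';'.
       on-error=ignore asks Saxon to drop files that cannot be parsed.
       To route files through TagSoup instead, the assumed alternative is
       '?select=*.xml;parser=org.ccil.cowan.tagsoup.Parser'
       with TagSoup on the classpath. -->
  <xsl:variable name="files" as="xs:string"
                select="concat($dir, '?select=*.xml;on-error=ignore')"/>

  <xsl:template match="/">
    <!-- Same search as in the question, over the documents that parsed. -->
    <xsl:value-of
        select="distinct-values(collection($files)//*[exists(@an-attribute-to-find)]/local-name())"
        separator="&#x0a;"/>
  </xsl:template>
</xsl:stylesheet>
```

The two variants behave differently: ignoring errors leaves the malformed files out of the search, while a repairing parser includes them after TagSoup has rewritten their structure, so element names in the result may not match the original markup exactly.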
