xpath querying when xml format varies

I have a series of variable types like:

abc1A, abc1B, abc3B, ... xyz1A, xyz2A, xyz3C, ... data1C, data2A, ...

Stored in a variety of xml formats:

<area name="DataMap">
  <int name="number" nullable="true">
    <case var="abc2,abc3,abc5">11</case>
    <case var="abc4,abc6*">8</case>
    <case var="data1,xyz7,xyz8">22</case>
    <case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>
    <case var="xyz{6,4A,4B,4C}">20</case>
    <case var="other01">15</case>
  </int>
</area>

I'm hoping to query what an instance like xyz5A, for example, maps to. The query should return 24, but I don't know ahead of time if its reference in the xml node is explicit as in "xyz4A", or via a wildcard like "xyz4*", or in curly braces like above.

This queries for strings on that line and will return a hit successfully:

xpath '/area[@name="DataMap"]/int[@name="number"]/case[contains(@var,"xyz")][contains(@var,"5A")]'

But the same approach also returns a hit for data5A, which is not correct, because "data" and "5A" both occur somewhere in the same var attribute:

xpath '/area[@name="DataMap"]/int[@name="number"]/case[contains(@var,"data")][contains(@var,"5A")]'

Are there xpath/other query constructs that parse the inconsistent (but I assume valid) xml above? I only seem to be able to query against explicit string matches vs. the wildcard and curly braced formats.

Answer1:

Being in bash/perl you are likely bound to libxml. libxml doesn't support XPath 2.0. There are many questions on SO about XPath/XSLT 2.0 with libxml/libxslt and Perl.

XPath 1.0 has a variety (a small one, I have to admit) of string functions, and you could try to stack them together. I experimented for a bit, and I neither liked the result nor succeeded in covering all possible cases. You would have "ugly" constructs like:

... or (contains(@var, ',xyz{') and contains(substring-before(substring-after(@var, ',xyz{'), '}'), '5A') and (contains(substring-before(substring-after(@var, ',xyz{'), '}'), ',5A,') or starts-with(substring-after(@var, ',xyz{'), '5A,') or starts-with(substring-after(@var, ',xyz{'), '5A}') or substring-after(substring-before(substring-after(@var, ',xyz{'), '}'), ',5A') = '')) or ...

And then you would realize that substring-* functions work off of the first occurrence of the matching string and you need even more layers of ands and ors to handle cases like yours:

<case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>

where there are multiple xyz{ and the one you need is not known to be the first one.

I think this is the case where you forget you have an XML and just do what Perl is good for and treat it as text. As much as I like XML-aware tools for XML processing and data extraction, you will likely be better off with regexp and string manipulations in the language that was designed for it.

Answer2:

I guess the smartest thing would be to iterate over all variables and programmatically find the matches, not asking XPath to do it.
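
A minimal sketch of that iteration, written in Python rather than Perl purely for illustration. It assumes the semantics the question implies: prefix{a,b} stands for prefixa or prefixb, and * matches any run of characters. The helper names split_top_level, token_to_regex, and lookup are hypothetical, not part of any library.

```python
import re
import xml.etree.ElementTree as ET

XML = """<area name="DataMap">
  <int name="number" nullable="true">
    <case var="abc2,abc3,abc5">11</case>
    <case var="abc4,abc6*">8</case>
    <case var="data1,xyz7,xyz8">22</case>
    <case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>
    <case var="xyz{6,4A,4B,4C}">20</case>
    <case var="other01">15</case>
  </int>
</area>"""


def split_top_level(var):
    """Split on commas that are not inside curly braces."""
    return re.findall(r"(?:[^,{}]|\{[^{}]*\})+", var)


def token_to_regex(token):
    """Translate one token to a regex: {a,b} -> (?:a|b), * -> .* (assumed meaning)."""
    out = []
    for part in re.split(r"(\{[^{}]*\}|\*)", token):
        if part == "*":
            out.append(".*")
        elif part.startswith("{") and part.endswith("}"):
            alts = part[1:-1].split(",")
            out.append("(?:" + "|".join(re.escape(a) for a in alts) + ")")
        else:
            out.append(re.escape(part))
    return re.compile("".join(out))


def lookup(root, name):
    """Return the text of the first <case> whose var list matches name, else None."""
    for case in root.findall("./int[@name='number']/case"):
        for token in split_top_level(case.get("var", "")):
            if token_to_regex(token).fullmatch(name):
                return case.text
    return None


root = ET.fromstring(XML)
print(lookup(root, "xyz5A"))   # 24
print(lookup(root, "data5A"))  # None
print(lookup(root, "abc6Z"))   # 8
```

Because each token is matched as a whole, data5A no longer produces the false hit that the stacked contains() predicates gave.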

Barring that, I have at least a few thoughts on the braces; unfortunately, they probably don't help all that much for the * question.

It seems that there are perl XPath implementations where you could write .../case[@var =~ /some_regex/], maybe .../case["xyz4A" =~ to_regex(@var)], and maybe even .../case[explode_braces(@var) =~ /(^|,)xyz4A(,|$)/] (with a suitably written explode_braces function, of course). See http://www.perlmonks.org/?node_id=831612, for example. I would expect the explode_braces way to be much, much easier than the first alternative, and I do use regular expressions quite a lot. Then again, you seem to use bash-regexes, and transforming those to a perl regex should also be relatively straightforward, so if the second idea works, you may be good to go.

If that does not work, maybe hook into your XML parser or right before it and fix this horrible XML design by expanding the braces?

$input =~ s/\bvar="([^"]*)"/'var="' . explode_braces($1) . '"'/eg;

(Or something very similar, sorry, I haven't written much perl in recent years. Also, this assumes your xml only uses one type of attribute quotes, which should be easy to fix, and that the only place where var=" is found is in these attributes, which may be a much harder limitation.)
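
For illustration, here is one way such an explode_braces helper and the attribute rewrite could look, sketched in Python rather than Perl. It assumes braces only ever appear at the end of a token (prefix{a,b,c}) and are never nested; normalize is a hypothetical name, and the same caveats about quoting and stray var=" text apply.

```python
import re


def explode_braces(var):
    """Expand each prefix{a,b,c} group into prefixa,prefixb,prefixc."""
    def expand(match):
        prefix, body = match.group(1), match.group(2)
        return ",".join(prefix + item for item in body.split(","))
    return re.sub(r"([^,{}]*)\{([^{}]*)\}", expand, var)


def normalize(xml_text):
    """Rewrite every var="..." attribute with its braces expanded."""
    return re.sub(
        r'\bvar="([^"]*)"',
        lambda m: 'var="' + explode_braces(m.group(1)) + '"',
        xml_text,
    )


line = '<case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>'
print(normalize(line))
# <case var="data3A,xyz9,xyz5A,xyz5B,xyz5C">24</case>

# After expansion, an exact-token test like the one above works on the list.
expanded = explode_braces("data3A,xyz{9},xyz{5A,5B,5C}")
print(bool(re.search(r"(^|,)xyz5A(,|$)", expanded)))   # True
print(bool(re.search(r"(^|,)data5A(,|$)", expanded)))  # False
```

Wildcards such as abc6* are left untouched by this step and would still need separate handling.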
