Python lxml: using iterparse to edit and output XML

I've been experimenting with the lxml library for a while, and I may be misunderstanding it or missing something, but I can't figure out how to edit the document when I hit a certain xpath and then write the result back out as XML while I'm parsing element by element.

Say we have this xml as an example:

    <xml>
      <items>
        <pie>cherry</pie>
        <pie>apple</pie>
        <pie>chocolate</pie>
      </items>
    </xml>

What I would like to do while parsing is, whenever I hit the xpath "/xml/items/pie", wrap that pie in a new item element, so the result looks like this:

    <xml>
      <items>
        <item id="1"><pie>cherry</pie></item>
        <item id="2"><pie>apple</pie></item>
        <item id="3"><pie>chocolate</pie></item>
      </items>
    </xml>

The output would need to be written to a file incrementally, as I hit each tag, with the XML edited at certain xpaths. I could print the start tag, the text, the attributes if any, and the end tag myself by hard-coding certain parts, but that would be very messy, and it would be nice to avoid that if possible.

Here's my guess at the code:

    from lxml import etree

    path = []
    count = 0
    context = etree.iterparse(file, events=('start', 'end'))
    for event, element in context:
        if event == 'start':
            path.append(element.tag)
            if '/' + '/'.join(path) == '/xml/items/pie':
                itemnode = etree.Element('item', id=str(count))
                itemnode.text = ""
                element.addprevious(itemnode)  # Not the right way to do it of course
                # write/print out xml here.
        else:
            element.clear()
            path.pop()

Edit: Also, I need to run through fairly big files, so I have to use iterparse.

Answer1:

Here's a solution using iterparse(). The idea is to catch all tag "start" events, remember the parent (items) tag, then for every pie tag create an item tag and put the pie into it:

    from io import BytesIO
    from lxml import etree
    from lxml.etree import Element

    data = b"""<xml>
      <items>
        <pie>cherry</pie>
        <pie>apple</pie>
        <pie>chocolate</pie>
      </items>
    </xml>"""

    stream = BytesIO(data)
    context = etree.iterparse(stream, events=("start", ))

    for action, elem in context:
        if elem.tag == 'items':
            # remember the parent and start numbering
            items = elem
            index = 1
        elif elem.tag == 'pie':
            # swap the pie for a new item, then move the pie inside it
            item = Element('item', {'id': str(index)})
            items.replace(elem, item)
            item.append(elem)
            index += 1

    print(etree.tostring(context.root).decode())

prints:

    <xml>
      <items>
        <item id="1"><pie>cherry</pie></item>
        <item id="2"><pie>apple</pie></item>
        <item id="3"><pie>chocolate</pie></item>
      </items>
    </xml>
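Note that the loop above keeps the whole tree in memory until the final print. For files too large for that, one possible variation is to write each wrapped element out as soon as it is complete and then drop it from the tree. The following is only a sketch under some assumptions: the name `wrap_and_stream` and its parameters are made up for illustration, the opening and closing tags of container elements are written by hand (their attributes and text are ignored), and namespaces are not handled. It sticks to calls shared by `lxml.etree` and the standard library's `xml.etree.ElementTree`, so it falls back to the latter if lxml isn't installed.

```python
import io

try:
    from lxml import etree as ET          # the library discussed here
except ImportError:
    import xml.etree.ElementTree as ET    # fallback with the same basic calls


def wrap_and_stream(source, sink, target="pie", wrapper="item"):
    """Copy XML from source to sink, wrapping each <target> in <wrapper id="n">.

    source and sink are binary file objects. Returns the number of wrapped elements.
    """
    stack = []    # currently open elements, outermost first
    inside = 0    # > 0 while we are somewhere inside a <target> subtree
    index = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if inside or elem.tag == target:
                inside += 1
            else:
                # Container element: emit its opening tag by hand
                # (simplification: attributes and text are dropped).
                sink.write(("<%s>" % elem.tag).encode("utf-8"))
            stack.append(elem)
        else:
            stack.pop()
            if inside:
                inside -= 1
                if inside == 0:
                    # A complete <target>: detach it, wrap it, write it out.
                    index += 1
                    if stack:
                        stack[-1].remove(elem)
                    elem.tail = None
                    item = ET.Element(wrapper, {"id": str(index)})
                    item.append(elem)
                    sink.write(ET.tostring(item))
            else:
                sink.write(("</%s>" % elem.tag).encode("utf-8"))
    return index


if __name__ == "__main__":
    src = io.BytesIO(b"<xml><items><pie>cherry</pie><pie>apple</pie></items></xml>")
    out = io.BytesIO()
    wrap_and_stream(src, out)
    print(out.getvalue().decode())
```

For real files, `source` and `sink` would be opened in binary mode ('rb' and 'wb'). Because each pie is removed from its parent once written, the tree should stay small no matter how many pies the input contains.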

Answer2:

There is a cleaner way to make the modifications you need:

  • iterate over the pie elements
  • make an item element
  • use replace() to replace a pie element with the item

    replace(self, old_element, new_element)

    Replaces a subelement with the element passed as second argument.

    from lxml import etree
    from lxml.etree import XMLParser, Element

    data = """<xml>
      <items>
        <pie>cherry</pie>
        <pie>apple</pie>
        <pie>chocolate</pie>
      </items>
    </xml>"""

    tree = etree.fromstring(data, parser=XMLParser())
    items = tree.find('.//items')

    # xpath() returns a list, so replacing children while looping is safe
    for index, pie in enumerate(items.xpath('.//pie'), start=1):
        item = Element('item', {'id': str(index)})
        items.replace(pie, item)   # put the new item where the pie was
        item.append(pie)           # then move the pie inside it

    print(etree.tostring(tree, pretty_print=True).decode())

prints:

    <xml>
      <items>
        <item id="1"><pie>cherry</pie></item>
        <item id="2"><pie>apple</pie></item>
        <item id="3"><pie>chocolate</pie></item>
      </items>
    </xml>

Answer3:

I would suggest using an XSLT template, as it seems a better fit for this task. XSLT is a little tricky until you get used to it, but if all you want is to generate some output from an XML document, it is a great tool.
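As a rough illustration of what such a template might look like, here is a sketch that applies a small stylesheet through lxml's `etree.XSLT`. The stylesheet and the helper name `wrap_pies` are assumptions made up for this example, not something from the question. Keep in mind that the transform works on a fully parsed tree, so on its own it doesn't address the large-file constraint.

```python
from lxml import etree

# Hypothetical stylesheet: copy everything as-is, but wrap each pie in an item.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy nodes and attributes unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Wrap each pie in an item, numbered by its position among pie siblings. -->
  <xsl:template match="/xml/items/pie">
    <item id="{count(preceding-sibling::pie) + 1}">
      <xsl:copy-of select="."/>
    </item>
  </xsl:template>
</xsl:stylesheet>
"""


def wrap_pies(xml_bytes):
    """Return the transformed document as bytes."""
    transform = etree.XSLT(etree.XML(STYLESHEET))
    result = transform(etree.XML(xml_bytes))
    return etree.tostring(result)


if __name__ == "__main__":
    data = b"<xml><items><pie>cherry</pie><pie>apple</pie></items></xml>"
    print(wrap_pies(data).decode())
```

The numbering uses `count(preceding-sibling::pie) + 1` rather than `position()`, because the identity template also visits the whitespace text nodes between the pies, which would throw off the position count.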
