Upper-case HTML tags encoded in lxml

Question:

I am parsing an HTML file using lxml.html. The file contains tags in both lower case and upper case. Part of my code is shown below:

import urllib2
from lxml import etree

# link is the URL of the page to parse (defined elsewhere)
response = urllib2.urlopen(link)
html = response.read().decode('cp1251')
content_html = etree.HTML(html)
first_link_xpath = content_html.xpath('//TR')
print(first_link_xpath)

A small part of my HTML file is shown below:

<TR>
  <TR vAlign="top" align="left">
    <!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
    <TD></TD>
  </TR>
</TR>

When I run the code above on this HTML sample, it returns an empty list. I then tried first_link_xpath = content_html.xpath('//tr/node()'), and all the upper-case tags were represented as '\r\n\t\t\t\t' in the output. What is the reason behind this?

Note: If the question is unclear, please let me know and I will modify it.

Answer1:

To follow up on unutbu's answer, I suggest you compare lxml's XML and HTML parsers, especially how they represent documents, by asking for a serialized representation of the tree using lxml.etree.tostring(). You can then see the tags, their case, and the hierarchy (which may differ from what a human would expect ;).

$ python
>>> import lxml.etree
>>> doc = """<TR>
... <TR vAlign="top" align="left">
... <!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
... <TD></TD>
... </TR>
... </TR>"""
>>> xmldoc = lxml.etree.fromstring(doc)
>>> xmldoc
<Element TR at 0x1e79b90>
>>> htmldoc = lxml.etree.HTML(doc)
>>> htmldoc
<Element html at 0x1f0baa0>
>>> lxml.etree.tostring(xmldoc)
'<TR>\n <TR vAlign="top" align="left">\n <!--<TD><B onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>-->\n <TD/>\n </TR>\n </TR>'
>>> lxml.etree.tostring(htmldoc)
'<html><body><tr/><tr valign="top" align="left"><!--<TD><B onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>--><td/>\n </tr></body></html>'
>>>

You can see that the HTML parser created enclosing html and body tags, and that there is an empty tr node at the beginning, since in HTML a tr cannot be nested directly inside another tr (the HTML fragment you provided is broken, either because of a typo or because the original document is also broken).

Then, again as suggested by unutbu, you can try out the different XPath expressions:

>>> xmldoc.xpath('//tr')
[]
>>> xmldoc.xpath('//TR')
[<Element TR at 0x1e79b90>, <Element TR at 0x1f0baf0>]
>>> xmldoc.xpath('//TR/node()')
['\n ', <Element TR at 0x1f0baf0>, '\n ', <!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, '\n ', <Element TD at 0x1f0bb40>, '\n ', '\n ']
>>>
>>> htmldoc.xpath('//tr')
[<Element tr at 0x1f0bbe0>, <Element tr at 0x1f0bc30>]
>>> htmldoc.xpath('//TR')
[]
>>> htmldoc.xpath('//tr/node()')
[<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, <Element td at 0x1f0bbe0>, '\n ']
>>>

And indeed, as unutbu stressed, XPath expressions for HTML should use lower-case tag names to select elements.
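If the same XPath has to work whether the document went through the XML parser (case preserved) or the HTML parser (tags lower-cased), one option is to compare a lower-cased local-name(). The following is only a sketch in Python 3: the helper name find_tag_any_case and the shortened SAMPLE string are made up for illustration, and the translate() trick only covers ASCII letters.

```python
from lxml import etree

# Shortened version of the fragment from the question (comment removed).
SAMPLE = """<TR>
<TR vAlign="top" align="left">
<TD></TD>
</TR>
</TR>"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def find_tag_any_case(root, tag):
    """Return all elements whose local name equals `tag`, ignoring ASCII case.

    Hypothetical helper: XPath 1.0 has no lower-case() function, so
    translate() is used to fold the element name before comparing.
    """
    expr = "//*[translate(local-name(), $upper, $lower) = $tag]"
    return root.xpath(expr, upper=UPPER, lower=LOWER, tag=tag.lower())


html_root = etree.HTML(SAMPLE)       # HTML parser: tag names become lower case
xml_root = etree.fromstring(SAMPLE)  # XML parser: tag names keep their case

html_tags = [el.tag for el in find_tag_any_case(html_root, "TR")]
xml_tags = [el.tag for el in find_tag_any_case(xml_root, "tr")]

print(html_tags)
print(xml_tags)
```

For a document that is always parsed with the HTML parser, plain '//tr' is simpler and is the better choice; the helper is only worth it when both parsers are in play.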

To me, the '\r\n\t\t\t\t' output is not an error; it is simply the whitespace between the various tr and td tags. If you don't want this whitespace in the text content, you can use lxml.etree.tostring(element, method="text", encoding=unicode).strip(), where element comes from an XPath result, for example. This removes leading and trailing whitespace. Note that the method argument is important: by default, tostring() outputs the markup representation, as tested above.

>>> map(lambda element: lxml.etree.tostring(element, method="text", encoding=unicode), htmldoc.xpath('//tr'))
[u'', u'\n ']
>>>

And you can verify that the text representation is all whitespace.
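If the goal is to skip those whitespace-only text nodes in the XPath result itself, a predicate on node() can filter them out. Here is a rough sketch for Python 3, where encoding="unicode" takes the place of the Python 2 encoding=unicode used above; the SAMPLE string, including the "Meta Data:" cell text, is an assumption made up for the demonstration.

```python
from lxml import etree

# Made-up sample with tabs and newlines between the tags, similar in
# spirit to the whitespace reported in the question.
SAMPLE = """<TR>
\t<TR vAlign="top" align="left">
\t\t<TD>Meta Data:</TD>
\t</TR>
</TR>"""

root = etree.HTML(SAMPLE)

# Keep every non-text node (elements, comments), and keep a text node
# only if it still has content after whitespace normalization.
nodes = root.xpath("//tr/node()[not(self::text()) or normalize-space()]")

# Text content of each row, with leading/trailing whitespace stripped.
row_texts = [
    etree.tostring(tr, method="text", encoding="unicode").strip()
    for tr in root.xpath("//tr")
]

print(nodes)
print(row_texts)
```

The predicate only drops text nodes that are entirely whitespace; text with real content keeps its surrounding whitespace, so strip() is still useful on the extracted strings.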

Answer2:

The HTML parser converts all tag names to lower case. This is why xpath('//TR') returns an empty list.

I'm not able to reproduce the second problem, where upper case tags get printed as \r\n\t\t\t\t'. Can you modify the code below to demonstrate the problem?

import lxml.etree as ET

content = '''\
<TR>
<TR vAlign="top" align="left">
<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
<TD></TD>
</TR>
</TR>'''
root = ET.HTML(content)
print(root.xpath('//TR'))
# []
print(root.xpath('//tr/node()'))
# [<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, <Element td at 0xb77463ec>, '\n ']
print(root.xpath('//tr'))
# [<Element tr at 0xb77462fc>, <Element tr at 0xb77463ec>]
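The lower-casing is not limited to tag names: in the serialized HTML tree shown in the first answer, vAlign came back as valign. So attribute tests in XPath presumably need the lower-case spelling as well. A small sketch (Python 3; the one-row content string is a shortened, made-up variant of the sample):

```python
from lxml import etree

# Shortened, made-up variant of the sample: a single row with a
# mixed-case attribute name.
content = '<TR vAlign="top" align="left"><TD></TD></TR>'
root = etree.HTML(content)

# Attribute name written in lower case, as the HTML parser stores it.
lower_hits = root.xpath('//tr[@valign="top"]')

# Attribute name written as in the source markup.
mixed_hits = root.xpath('//tr[@vAlign="top"]')

print(len(lower_hits), len(mixed_hits))
```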
