
Question:
I am parsing an HTML file using lxml.html. The HTML file contains tags in both lowercase and uppercase letters. Part of my code is shown below:
response = urllib2.urlopen(link)
html = response.read().decode('cp1251')
content_html = etree.HTML(html)
first_link_xpath = content_html.xpath('//TR')
print (first_link_xpath)
A small part of my HTML file is shown below:
<TR>
<TR vAlign="top" align="left">
<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
<TD></TD>
</TR>
</TR>
So when I run the code above on this HTML sample, it gives an empty list. Then I tried running first_link_xpath = content_html.xpath('//tr/node()'), and all the upper-case tags were represented as '\r\n\t\t\t\t' in the output. What is the reason behind this issue?
Note: If the question is unclear, please let me know and I will modify it.
Answer1: To follow up on unutbu's answer, I suggest you compare the lxml XML and HTML parsers, especially how they represent documents, by asking for a representation of the tree back using lxml.etree.tostring(). You can see the different tags, tag case, and hierarchy (which may differ from what a human would expect ;)
$ python
>>> import lxml.etree
>>> doc = """<TR>
... <TR vAlign="top" align="left">
... <!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
... <TD></TD>
... </TR>
... </TR>"""
>>> xmldoc = lxml.etree.fromstring(doc)
>>> xmldoc
<Element TR at 0x1e79b90>
>>> htmldoc = lxml.etree.HTML(doc)
>>> htmldoc
<Element html at 0x1f0baa0>
>>> lxml.etree.tostring(xmldoc)
'<TR>\n <TR vAlign="top" align="left">\n <!--<TD><B onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>-->\n <TD/>\n </TR>\n </TR>'
>>> lxml.etree.tostring(htmldoc)
'<html><body><tr/><tr valign="top" align="left"><!--<TD><B onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>--><td/>\n </tr></body></html>'
>>>
You can see that the HTML parser created enclosing html and body tags, and that there is an empty tr node at the beginning, since in HTML a tr cannot be nested directly inside another tr (the HTML fragment you provided is broken, either through a typo or because the original document is also broken).
Then, again as suggested by unutbu, you can try out the different XPath expressions:
>>> xmldoc.xpath('//tr')
[]
>>> xmldoc.xpath('//TR')
[<Element TR at 0x1e79b90>, <Element TR at 0x1f0baf0>]
>>> xmldoc.xpath('//TR/node()')
['\n ', <Element TR at 0x1f0baf0>, '\n ', <!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, '\n ', <Element TD at 0x1f0bb40>, '\n ', '\n ']
>>>
>>> htmldoc.xpath('//tr')
[<Element tr at 0x1f0bbe0>, <Element tr at 0x1f0bc30>]
>>> htmldoc.xpath('//TR')
[]
>>> htmldoc.xpath('//tr/node()')
[<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, <Element td at 0x1f0bbe0>, '\n ']
>>>
And indeed, as unutbu stressed, for HTML, XPath expressions should use lower-case tags to select elements.
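If the same query has to work against both parsers, one option is to compare tag names case-insensitively inside the XPath expression itself. The following is only a sketch: the helper name find_tag_any_case is made up for illustration, the sample document is a shortened version of the fragment above, and it relies on the XPath 1.0 translate() function because XPath 1.0 has no lower-case() function.

```python
import lxml.etree

# Shortened version of the fragment from the question (comment removed).
doc = """<TR>
  <TR vAlign="top" align="left">
    <TD></TD>
  </TR>
</TR>"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def find_tag_any_case(root, name):
    """Hypothetical helper: select elements whose tag matches name, ignoring ASCII case."""
    # translate() maps each upper-case ASCII letter of the tag name to lower case.
    expr = "//*[translate(local-name(), $upper, $lower) = $name]"
    # lxml passes keyword arguments to the expression as XPath variables.
    return root.xpath(expr, upper=UPPER, lower=LOWER, name=name.lower())


xmldoc = lxml.etree.fromstring(doc)   # XML parser: tag case is preserved
htmldoc = lxml.etree.HTML(doc)        # HTML parser: tag names are lower-cased

xml_hits = find_tag_any_case(xmldoc, "tr")
html_hits = find_tag_any_case(htmldoc, "TR")

print([e.tag for e in xml_hits])   # ['TR', 'TR']
print([e.tag for e in html_hits])  # every tag here is the lower-case 'tr'
```

For documents parsed with the HTML parser alone, plain lower-case expressions such as '//tr' remain the simpler choice.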
To me, the '\r\n\t\t\t\t' output is not an error; it is simply the whitespace between the various tr and td tags. For text content, if you don't want this whitespace, you can use lxml.etree.tostring(element, method="text", encoding=unicode).strip(), where element comes from an XPath result, for example (this removes leading and trailing whitespace).
(Note that the method argument is important: by default, tostring() outputs the element's markup, tags included, as seen above.)
>>> map(lambda element: lxml.etree.tostring(element, method="text", encoding=unicode), htmldoc.xpath('//tr'))
[u'', u'\n ']
>>>
And you can verify that the text representation is all whitespace.
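The session above is Python 2. As a rough Python 3 sketch of the same idea, the snippet below drops whitespace-only text nodes from an XPath result and extracts stripped text from each cell. The sample table and the variable names are assumptions made up for this illustration, not taken from the question's page; in Python 3 the encoding argument becomes the string "unicode" (or str) because the unicode builtin no longer exists.

```python
import lxml.etree

# Made-up sample document, for illustration only.
doc = """<table>
<tr>
    <td>Meta Data:</td>
    <td> value </td>
</tr>
</table>"""

root = lxml.etree.HTML(doc)

# node() returns elements mixed with the whitespace strings that sit between tags.
nodes = root.xpath('//tr/node()')

# Keep everything except strings that consist only of whitespace.
meaningful = [n for n in nodes if not (isinstance(n, str) and not n.strip())]

# method="text" serializes only the text content; strip() removes the
# leading and trailing whitespace, including the tail after each cell.
texts = [
    lxml.etree.tostring(td, method="text", encoding="unicode").strip()
    for td in root.xpath('//td')
]

print([n.tag for n in meaningful])  # ['td', 'td']
print(texts)                        # ['Meta Data:', 'value']
```

Exactly which whitespace nodes survive parsing can depend on the underlying libxml2 version, so filtering after the query is more robust than counting on a fixed number of text nodes.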
Answer2: The HTML parser converts all tag names to lower case. This is why xpath('//TR') returns an empty list.
I'm not able to reproduce the second problem, where upper-case tags get printed as '\r\n\t\t\t\t'. Can you modify the code below to demonstrate the problem?
import lxml.etree as ET
content = '''\
<TR>
<TR vAlign="top" align="left">
<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
<TD></TD>
</TR>
</TR>'''
root = ET.HTML(content)
print(root.xpath('//TR'))
# []
print(root.xpath('//tr/node()'))
# [<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, <Element td at 0xb77463ec>, '\n ']
print(root.xpath('//tr'))
# [<Element tr at 0xb77462fc>, <Element tr at 0xb77463ec>]