Merge multiple <br /> tags into a single one with Python lxml


I have a Python script that cleans scraped HTML content. It uses BeautifulSoup4 and works pretty well. Recently I decided to learn lxml, but I find the tutorials harder (for me) to follow. For example, I use the following code to merge multiple <br /> tags into one, i.e. if there are two or more consecutive <br /> tags, remove all but one:

    from bs4 import BeautifulSoup, Tag

    data = 'foo<br /><br>bar. <p>foo<br/><br id="1"><br/>bar'
    soup = BeautifulSoup(data)
    for br in soup.find_all("br"):
        # remove every <br> that directly follows this one
        while isinstance(br.next_sibling, Tag) and br.next_sibling.name == 'br':
            br.next_sibling.extract()
    print(soup)
    # -> <html><body>




How do I achieve something similar in lxml? Thanks.


You could try the .drop_tag() method to remove consecutive duplicate occurrences of the <br/> tag:

    from lxml import html

    doc = html.fromstring(data)
    for br in doc.findall('.//br'):
        if br.tail is None:  # no text immediately after <br> tag
            for dup in br.itersiblings():
                if dup.tag != 'br':  # don't merge if there is another tag in between
                    break
                dup.drop_tag()
                if dup.tail is not None:  # don't merge if there is text in between
                    break

    print(html.tostring(doc))
    # -> <div>





