Merge multiple <br /> tags into a single one with Python lxml

Question:

I have a Python script that cleans scraped HTML content. It uses BeautifulSoup4 and works pretty well. Recently I decided to learn lxml, but I find the tutorials harder (for me) to follow. For example, I use the following code to merge multiple <br /> tags into one, i.e. if there are several consecutive <br /> tags, remove all but one:

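    from bs4 import BeautifulSoup, Tag

    data = '<p>foo<br /><br>bar.</p><p>foo<br/><br id="1"><br/>bar</p>'
    # Parser named explicitly; the <html><body> wrapper in the output matches lxml.
    soup = BeautifulSoup(data, "lxml")
    for br in soup.find_all("br"):
        # Remove every <br> that directly follows this one.
        while isinstance(br.next_sibling, Tag) and br.next_sibling.name == 'br':
            br.next_sibling.extract()
    print(soup)
    # -> <html><body><p>foo<br/>bar.</p><p>foo<br/>bar</p></body></html>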

How do I achieve something similar with lxml? Thanks.

Answer1:

You could try the .drop_tag() method to remove consecutive duplicate occurrences of the <br/> tag:

    from lxml import html

    doc = html.fromstring(data)
    for br in doc.findall('.//br'):
        if br.tail is None:  # no text immediately after the <br> tag
            for dup in br.itersiblings():
                if dup.tag != 'br':  # don't merge if there is another tag in between
                    break
                dup.drop_tag()
                if dup.tail is not None:  # don't merge if there is text in between
                    break

    # encoding='unicode' makes tostring() return str rather than bytes
    print(html.tostring(doc, encoding='unicode'))
    # -> <div><p>foo<br>bar.</p><p>foo<br>bar</p></div>
