HTML tag replacement using regex and python

Question:

I have a Python script that will look at an HTML file that has the following format:

<DOC> <HTML> ... </HTML> </DOC> <DOC> <HTML> ... </HTML> </DOC>

How do I remove all HTML tags (replace the tags with '') except the opening and closing DOC tags, using regex in Python? Also, if I want to retain the alt text of an <img> tag, what should the regular expression look like?

Answer1:

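Search and replace with this regex. Search for: <.*?> Replace with: '' (the empty string). Note that this pattern matches every tag, including the opening and closing DOC tags.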
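One way to spare the DOC tags and keep the alt text is a negative lookahead plus a separate pass over <img> tags. The following is a minimal sketch, not a robust HTML parser: the names IMG_ALT, NON_DOC_TAG and strip_tags are made up for illustration, and it assumes quoted alt attributes and no stray '>' characters inside other attribute values.

```python
import re

# Hypothetical helper patterns (not from the original answer).
# Matches an <img ...> tag and captures the quoted alt text in group 2.
IMG_ALT = re.compile(r'<img\b[^>]*?\balt\s*=\s*(["\'])(.*?)\1[^>]*>', re.IGNORECASE)
# Matches any tag except <DOC> and </DOC> (negative lookahead after '<').
NON_DOC_TAG = re.compile(r'<(?!/?DOC\s*>)[^>]*>', re.IGNORECASE)


def strip_tags(text):
    """Replace <img> tags with their alt text, then drop every non-DOC tag."""
    text = IMG_ALT.sub(lambda m: m.group(2), text)
    return NON_DOC_TAG.sub('', text)


sample = '<DOC> <HTML> <p>Hi <img src="a.png" alt="logo"> there</p> </HTML> </DOC>'
print(strip_tags(sample))
```

Images without an alt attribute are simply removed by the second pass. For anything beyond simple, well-formed markup, a real parser is likely the safer choice.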

Answer2:

For what you are trying to accomplish I would use BeautifulSoup rather than regex.

http://www.crummy.com/software/BeautifulSoup/
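As a rough illustration of how that might look (a sketch assuming the bs4 package is installed; strip_tags_keep_doc is a hypothetical helper name, not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup


def strip_tags_keep_doc(markup):
    """Unwrap every tag except <doc>, replacing images with their alt text."""
    # Python's built-in html.parser lowercases tag names, so <DOC> becomes <doc>.
    soup = BeautifulSoup(markup, "html.parser")
    # Swap each <img> for its alt text (empty string if there is none).
    for img in soup.find_all("img"):
        img.replace_with(img.get("alt", ""))
    # unwrap() removes a tag but leaves its children in place.
    for tag in soup.find_all(True):
        if tag.name != "doc":
            tag.unwrap()
    return str(soup)


sample = '<DOC><HTML><p>Hi <img alt="logo" src="x.png"> there</p></HTML></DOC>'
print(strip_tags_keep_doc(sample))
```

Because the parser normalizes tag names, the DOC tags come back in lowercase; if the original casing matters, that would need a final replacement step.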

Answer3:

Check out lxml (http://codespeak.net/lxml/), a really nice Python library for dealing with XML. You can use drop_tag to accomplish what you are looking for.

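    from lxml import html

    h = html.fragment_fromstring('<doc>Hello <b>World!</b></doc>')
    # drop_tag() removes the first child element's tag but keeps its text
    h.find('*').drop_tag()
    # encoding='unicode' returns a str on Python 3 (Python 2 used encoding=unicode)
    print(html.tostring(h, encoding='unicode'))
    # Output: <doc>Hello World!</doc>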
