HTML tag replacement using regex and Python


I have a Python script that will look at an HTML file that has the following format:

<DOC> <HTML> ... </HTML> </DOC> <DOC> <HTML> ... </HTML> </DOC>

How do I remove all HTML tags (replace the tags with '') except the opening and closing DOC tags, using regex in Python? Also, if I want to retain the alt text of an <img> tag, what should the regular expression look like?


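Search and replace with this regex: search for <.*?> and replace with '' (the empty string). Note that this pattern on its own also removes the DOC tags; to keep them, exclude them with a negative lookahead, for example <(?!/?DOC>).*?>.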
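A minimal sketch of the regex approach in Python, using only the standard `re` module. The names `TAG_RE`, `IMG_ALT_RE`, and `strip_tags` are made up for illustration, and the patterns assume reasonably well-formed markup (quoted alt attributes, no stray `>` inside attribute values); regexes are not a general HTML parser.

```python
import re

# Any tag except <DOC> and </DOC>; the negative lookahead skips those two.
# IGNORECASE means <doc> and </doc> are kept as well.
TAG_RE = re.compile(r"<(?!/?DOC\s*>)[^>]*>", re.IGNORECASE)

# An <img> tag with a quoted alt attribute; group 2 captures the alt text.
IMG_ALT_RE = re.compile(
    r"""<img\b[^>]*?\balt\s*=\s*(["'])(.*?)\1[^>]*>""",
    re.IGNORECASE | re.DOTALL,
)


def strip_tags(text):
    """Remove all tags except DOC, keeping the alt text of <img> tags."""
    # First swap each <img ... alt="..."> for its alt text.
    text = IMG_ALT_RE.sub(lambda m: m.group(2), text)
    # Then drop every remaining tag that is not <DOC> or </DOC>.
    return TAG_RE.sub("", text)


sample = '<DOC><HTML><p>Hi <img src="a.png" alt="a cat"> there</p></HTML></DOC>'
print(strip_tags(sample))
```

Images without an alt attribute are simply removed by the second substitution, like any other tag.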


For what you are trying to accomplish I would use BeautifulSoup rather than regex.

<a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow">http://www.crummy.com/software/BeautifulSoup/</a>
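As a rough sketch of how that could look, assuming the `bs4` package (Beautiful Soup 4) is installed and using Python's built-in `html.parser` backend. The function name `strip_tags_keep_doc` is hypothetical. Note that this parser lowercases tag names, so `<DOC>` comes back as `<doc>`.

```python
from bs4 import BeautifulSoup


def strip_tags_keep_doc(markup):
    """Unwrap every tag except <doc>, keeping the alt text of <img> tags."""
    soup = BeautifulSoup(markup, "html.parser")

    # Replace each <img> with its alt text (empty string if there is none).
    for img in soup.find_all("img"):
        img.replace_with(img.get("alt", ""))

    # Unwrap all remaining tags except <doc>; unwrap() removes the tag
    # itself but leaves its text and children in place.
    for tag in soup.find_all(True):
        if tag.name != "doc":
            tag.unwrap()

    return str(soup)


sample = '<DOC><HTML><p>Hi <img src="a.png" alt="a cat"> there</p></HTML></DOC>'
print(strip_tags_keep_doc(sample))
```

Compared with a regex, a parser-based approach tends to cope better with attributes that contain `>` or with tags split across lines.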


Check out <a href="http://codespeak.net/lxml/" rel="nofollow">lxml</a>, a really nice Python library for dealing with XML and HTML. You can use drop_tag() to accomplish what you are looking for: it removes a tag but keeps its text and children in place.

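    from lxml import html

    h = html.fragment_fromstring('<doc>Hello <b>World!</b></doc>')
    # Drop the first child tag (<b>) but keep its text.
    h.find('*').drop_tag()
    # On Python 3, pass encoding='unicode' to get a str back.
    print(html.tostring(h, encoding='unicode'))
    # Output: <doc>Hello World!</doc>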


