I have a Python script that will look at an HTML file that has the following format:
<DOC> <HTML> ... </HTML> </DOC> <DOC> <HTML> ... </HTML> </DOC>
How do I remove all HTML tags (replacing them with '') except the opening and closing DOC tags, using a regex in Python? Also, if I want to retain the alt text of an <img> tag, what should the regex look like?
Answer1:
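Search and replace with this regex: search for <.*?> and replace with '' (the empty string). Note that this pattern matches every tag, so it removes the DOC tags as well.
Answer2: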
For what you are trying to accomplish I would use BeautifulSoup rather than regex.
<a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow">http://www.crummy.com/software/BeautifulSoup/</a>
Answer3:
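Check out <a href="http://codespeak.net/lxml/" rel="nofollow">lxml</a>, a really nice Python library for dealing with XML. You can use drop_tag to accomplish what you are looking for:
    from lxml import html
    h = html.fragment_fromstring('<doc>Hello <b>World!</b></doc>')
    # find('*') returns only the first child element;
    # drop_tag() removes that element's tag but keeps its text and children
    h.find('*').drop_tag()
    # encoding='unicode' makes tostring return text rather than bytes
    print(html.tostring(h, encoding='unicode'))
Output:
    <doc>Hello World!</doc>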
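Coming back to the regex approach from the first answer: if you do want to stay with re, one way to keep the DOC tags is a negative lookahead, with a separate pass beforehand that swaps each image tag for its alt text. The following is only a minimal sketch under some assumptions: the sample string, the function name strip_tags and the two pattern names are made up for illustration, alt attributes are assumed to be double-quoted, and no attribute value is assumed to contain a literal '>'. Regexes are fragile on real-world HTML, so treat it as a starting point rather than a robust solution.

```python
import re

# Assumed sample input, following the <DOC> <HTML> ... </HTML> </DOC> layout
# from the question (the content itself is invented for illustration).
sample = '<DOC> <HTML><p>Hello <b>World</b> <img src="a.png" alt="a cat"></p></HTML> </DOC>'

# Pass 1 (hypothetical pattern name): match an <img> tag that carries a
# double-quoted alt attribute and capture the alt text in group 1.
IMG_ALT = re.compile(r'<img\b[^>]*?\salt="([^"]*)"[^>]*>', re.IGNORECASE)

# Pass 2 (hypothetical pattern name): match any tag except <DOC> and </DOC>.
# The negative lookahead (?!/?DOC\b) rejects a match when the tag name is DOC.
NON_DOC_TAG = re.compile(r'<(?!/?DOC\b)[^>]*>', re.IGNORECASE)


def strip_tags(text):
    """Replace <img> tags by their alt text, then drop every non-DOC tag."""
    text = IMG_ALT.sub(r'\1', text)
    return NON_DOC_TAG.sub('', text)


print(strip_tags(sample))
# prints: <DOC> Hello World a cat </DOC>
```

The order of the two passes matters: running the tag-stripping pass first would delete the image tags before their alt text could be extracted. Images without an alt attribute are simply removed by the second pass.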