Remove all HTML lines/code from a string in Python

Time: 2014-07-23 01:28:36

Tags: python email

Given the following string parsed from an email body...

s = """Keep all of this <h1>But remove this including the tags</h1> this is still good, but
    <p>there also might be new lines like this that need to be removed</p>

    <p> and even other lines like this all the way down here with whitespace after being  parsed from email that need to be removed.</p>

    But this is still okay."""

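How can I remove all of the HTML code and lines from the string so that it simply returns "Keep all of this this is still good, but But this is still okay." on a single line? I have looked at bleach and lxml, but they only strip the HTML <> tags and return the content inside them, whereas I don't want any of that content.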

3 answers:

Answer 0 (score: 1)

You can still use lxml: wrap the string in a root element and take only that element's direct text nodes:

import lxml.html

html = '''
    Keep all of this <h1>But remove this including the tags</h1> this is still good, but
    <p>there also might be new lines like this that need to be removed</p>

    <p> and even other lines like this all the way down here with whitespace after being  parsed from email that need to be removed.</p>

    But this is still okay.
'''

root = lxml.html.fromstring('<div>' + html + '</div>')
text = ' '.join(t.strip() for t in root.xpath('text()') if t.strip())

This seems to work fine:

>>> text
'Keep all of this this is still good, but But this is still okay.'

Answer 1 (score: 0)

A simple solution that needs no external packages:

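import re
# Strip "<tag>content</tag>" spans until a pass changes nothing,
# so a leftover '<' cannot cause an infinite loop.
prev = None
while prev != s:
    prev, s = s, re.sub(r'<.+?>.+?<.+?>', '', s)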

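It is not very efficient, because it scans the target string several times, but it should work. Note that the string must contain absolutely no stray < or > characters, and that each tagged span must sit on a single line, since "." does not match newlines by default.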
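
For what it's worth, the repeated passes can likely be avoided with one compiled pattern. The following is only a sketch under a few assumptions: every element has an explicit closing tag, elements of the same name are not nested inside each other, and no stray < or > appears in the text. TAGGED_SPAN and drop_tagged_spans are names invented for this example:

```python
import re

# <tag ...> ... </tag>, where the closing name must equal the opening one (\1).
# re.DOTALL lets "." cross newlines; re.IGNORECASE tolerates <P>...</p>.
TAGGED_SPAN = re.compile(
    r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)


def drop_tagged_spans(text):
    """Remove each element together with its content, then tidy whitespace."""
    without_spans = TAGGED_SPAN.sub(" ", text)
    # split()/join collapses newlines and repeated spaces into single spaces.
    return " ".join(without_spans.split())
```

Calling drop_tagged_spans(s) on the string from the question should give the single-line result asked for. For arbitrary real-world email HTML, a proper parser is still the safer choice.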

Answer 2 (score: 0)

How about this one?

import re
s = "..."  # Your string here

# Replace each tagged span, plus the whitespace around it, with one space.
print(re.sub(r'\s*<.+?>.+?<.+?>\s*', ' ', s))

Edit: I just made some modifications for @BoppreH, although it leaves some extra spaces.