Question

Given the following string parsed from an email body...

s = """Keep all of this <h1>But remove this including the tags</h1> this is still good, but
    <p>there also might be new lines like this that need to be removed</p>

    <p> and even other lines like this all the way down here with whitespace after being  parsed from email that need to be removed.</p>

    But this is still okay."""

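How do I remove all of the HTML code and lines from the string so that it simply returns "Keep all of this this is still good, but But this is still okay." on one line? I've looked at bleach and lxml, but they only strip the HTML <> tags and return what's inside them, whereas I don't want any of it.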

Answer 1

You can still use lxml to get all the text nodes of the root element:

import lxml.html

html = '''
    Keep all of this <h1>But remove this including the tags</h1> this is still good, but
    <p>there also might be new lines like this that need to be removed</p>

    <p> and even other lines like this all the way down here with whitespace after being  parsed from email that need to be removed.</p>

    But this is still okay.
'''

root = lxml.html.fromstring('<div>' + html + '</div>')
text = ' '.join(t.strip() for t in root.xpath('text()') if t.strip())

This seems to work:

>>> text
'Keep all of this this is still good, but But this is still okay.'
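
If lxml is not available, the same idea (keep only the text that sits outside every element) can be sketched with the standard library's html.parser. This is a swapped-in technique, not part of the answer above. The names TopLevelText and top_level_text are hypothetical, and the sketch assumes every opening tag has a matching closing tag (an unclosed void tag such as <br> would throw the depth counter off).

```python
from html.parser import HTMLParser


class TopLevelText(HTMLParser):
    """Collect only text that is not inside any tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of currently open tags
        self.words = []   # words seen while no tag is open

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        # Keep text only at depth 0; split() also drops newlines and indentation.
        if self.depth == 0:
            self.words.extend(data.split())


def top_level_text(html):
    parser = TopLevelText()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end
    return ' '.join(parser.words)
```

Calling top_level_text(html) on the string above should give the same one-line result as the lxml version.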

Answer 2

A simple solution that needs no external packages:

import re
# Repeat until no '<' is left; each pass removes <tag>...</tag> pairs that sit on one line.
while '<' in s:
    s = re.sub(r'<.+?>.+?<.+?>', '', s)

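It is not very efficient, because it walks over the target string several times, but it should work. Note that the string must contain absolutely no stray < or > characters.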
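
The loop never terminates if a < is left over that the pattern cannot match, for example a lone < in the text, or a tag pair split across lines (by default . does not match a newline). A more defensive sketch, under the assumption that elements are not nested, does the substitution in a single pass with re.DOTALL and a backreference so the closing tag has to match the opening one. The names ELEMENT and strip_elements are hypothetical.

```python
import re

# <tag ...> ... </tag>, where the closing name must equal the opening name.
# re.DOTALL lets the element body span several lines.
ELEMENT = re.compile(r'<(\w+)[^>]*>.*?</\1\s*>', re.DOTALL)


def strip_elements(s):
    """Remove each non-nested element together with its contents."""
    return ELEMENT.sub('', s)
```

Because there is no loop, a stray < or > is simply left in place instead of hanging the program. Newlines and indentation outside the tags are untouched, so the result still has to be collapsed onto one line separately.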

Answer 3

How about this one?

import re
s = "..."  # your string here

# Replace each <tag>...</tag> pair, plus the whitespace around it, with a single space.
print(re.sub(r'[\s\n]*<.+?>.+?<.+?>[\s\n]*', ' ', s))

Edit: this is just a small modification of @BoppreH's answer, though it leaves some extra whitespace.
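
The extra whitespace can be collapsed afterwards by splitting on runs of whitespace and rejoining with single spaces. A minimal sketch; the sample string and the names cleaned and one_line are only illustrative:

```python
import re

s = "<p>drop me</p>Keep this <h1>drop</h1> and this\n   <p>drop</p>\n"

# Same substitution as in this answer: it leaves a space at each end here.
cleaned = re.sub(r'[\s\n]*<.+?>.+?<.+?>[\s\n]*', ' ', s)

# split() with no arguments discards leading, trailing and repeated whitespace.
one_line = ' '.join(cleaned.split())
print(one_line)
```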

Remove all HTML lines/code from a string in Python
