Given the following string parsed from an email body...
s = """Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay."""
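How can I remove all of the HTML code and line breaks from the string so that it simply returns "Keep all of this this is still good but this is still okay." on one line? I've looked at bleach and lxml, but they only strip the HTML tags (<>) and return what was inside them, whereas I don't want any of the enclosed text.
Answer 0 (score: 1)
You can still use lxml to get all of the root element's text nodes: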
import lxml.html
html = '''
Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay.
'''
root = lxml.html.fromstring('<div>' + html + '</div>')
text = ' '.join(t.strip() for t in root.xpath('text()') if t.strip())
Seems to work fine:
>>> text
'Keep all of this this is still good, but But this is still okay.'
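If lxml is not available, the same "keep only top-level text" idea can be sketched with the standard library's html.parser instead. This is a different technique from the XPath query above, not a drop-in equivalent: it tracks nesting depth by hand and assumes every non-void tag is properly closed. The names TopLevelText, top_level_text and VOID_TAGS are made up for this illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags never get a closing tag, so they must not change depth.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'meta', 'link'}


class TopLevelText(HTMLParser):
    """Collects text that is not nested inside any element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of currently open elements
        self.chunks = []   # stripped text found at depth 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits outside every element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def top_level_text(html):
    parser = TopLevelText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(parser.chunks)


html = '''
Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay.
'''
text = top_level_text(html)
```

An unclosed tag would leave the depth counter above zero and silently drop everything after it, so this is only suitable for reasonably well-formed input.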
Answer 1 (score: 0)
A simple solution that needs no external packages:
import re
while '<' in s:
    s = re.sub(r'<.+?>.+?<.+?>', '', s)
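It's not efficient, because it traverses the target string multiple times, but it should work. Note that the string must contain absolutely no stray < or > characters outside of tags: a leftover < that the pattern cannot match would keep the loop running forever.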
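The repeated passes are not strictly required, since re.sub already replaces every non-overlapping match in a single call. Below is a minimal single-pass sketch. It is a variation on the pattern above rather than the answer's own code, and it assumes tags are paired and never nested inside a tag of the same name. PAIRED_TAG and strip_paired_tags are hypothetical names.

```python
import re

# <name ...> ... </name>; the backreference \1 ties the closing tag to the
# opening one, and re.DOTALL lets the content span several lines.
PAIRED_TAG = re.compile(r'<(\w+)[^>]*>.*?</\1\s*>', re.DOTALL)


def strip_paired_tags(s):
    """Remove each paired tag together with everything between the pair."""
    return PAIRED_TAG.sub('', s)


sample = """Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay."""
cleaned = strip_paired_tags(sample)
```

Because there is no loop, a stray < in the text is simply left in place instead of causing the call to hang. The line breaks are still there at this point and have to be collapsed separately.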
Answer 2 (score: 0)
How about this one?
import re
s = "..."  # Your string here
# \s already covers \n; each tag pair plus surrounding whitespace becomes one space
print(re.sub(r'[\s\n]*<.+?>.+?<.+?>[\s\n]*', ' ', s))
Edit: just made some modifications for @BoppreH, though it leaves extra spaces.
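The extra spaces can be collapsed afterwards by splitting on whitespace and re-joining with single spaces. A small sketch using the string from the question. The variable names stripped and one_line are only for illustration.

```python
import re

s = """Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay."""

# Replace each tag pair (and the whitespace around it) with a single space.
stripped = re.sub(r'\s*<.+?>.+?<.+?>\s*', ' ', s)

# str.split() with no arguments splits on any run of whitespace, so
# re-joining with ' ' leaves exactly one space between words.
one_line = ' '.join(stripped.split())
print(one_line)
```

This also removes any leading or trailing whitespace, which is usually what you want for a one-line result.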