Question

Given a string, how can I remove all the tags from it? For example:

string = hello<tag1>there</tag1> I <tag2> want to </tag2> strip <tag3>all </tag3>these tags
>>>> hello there I want to strip all these tags

Answer 1

The text attribute is the most straightforward option, but it just copies the text nodes verbatim, so you get

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""hello<tag1>there</tag1> I <tag2> want to </tag2> strip <tag3>all </tag3>these tags""")
>>> soup.text
u'hellothere I  want to  strip all these tags'

You can squeeze all the whitespace with

>>> ' '.join(soup.text.split())
u'hellothere I want to strip all these tags'

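Now, the missing space between 'hello' and 'there' is a tricky problem: if <tag1> were <b>, a user agent would render hello<b>there</b> without any intervening space. One would need to parse the CSS to know which elements are meant to be inline and which are not.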
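As a rough illustration of the inline-versus-block distinction, here is a minimal sketch that uses only the standard library's html.parser. It assumes a hand-picked set of inline tags (INLINE_TAGS is hypothetical and ignores CSS entirely), and the names TextExtractor and visible_text are made up for this example. Tags in the set contribute nothing; every other tag boundary contributes a space.

```python
from html.parser import HTMLParser

# Hypothetical, hand-picked set of tags treated as inline. A real page can
# change this with CSS, so this is only an approximation.
INLINE_TAGS = {"a", "b", "em", "i", "span", "strong"}


class TextExtractor(HTMLParser):
    """Collects text, adding a space at each non-inline tag boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _boundary(self, tag):
        # Inline tags do not separate words; anything else does.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())


print(visible_text("hello<b>there</b>"))  # inline tag: words stay glued
print(visible_text("hello<p>there</p>"))  # non-inline tag: words separated
```

With this sketch the unknown tags from the question (tag1, tag2, tag3) are not in the inline set, so each of them acts as a word separator.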

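However, if we allow every non-text node (and end tag) to be replaced by a space, then a crude approach is to find all the text nodes with soup.findChildren, split each of them separately, merge the resulting lists with itertools.chain, and then join the words with a single space as the separator: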

>>> from itertools import chain
>>> words = chain(*(i.split() for i in soup.findChildren(text=True)))
>>> ' '.join(words)
u'hello there I want to strip all these tags'
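The same result can probably be reached more compactly with get_text, which accepts a separator string to place between text nodes. The following is a minimal sketch, not part of the original answer: the function name strip_tags is made up here, and passing "html.parser" explicitly is an assumption made to avoid the parser-selection warning in recent Beautiful Soup versions.

```python
from bs4 import BeautifulSoup


def strip_tags(markup):
    # "html.parser" is the parser backed by the standard library.
    soup = BeautifulSoup(markup, "html.parser")
    # Put a space between every pair of text nodes, then collapse
    # all runs of whitespace into single spaces.
    return " ".join(soup.get_text(" ").split())


print(strip_tags(
    "hello<tag1>there</tag1> I <tag2> want to </tag2> "
    "strip <tag3>all </tag3>these tags"
))
```

Like the chain-based version, this treats every tag as a word boundary, so hello<b>there</b> would also come out as two words.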
