Question

Currently I have code that does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

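The problem is that I don't want to throw away the contents inside the invalid tags. How do I get rid of a tag but keep its contents when calling soup.renderContents()?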

Answer 1

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So you could do something like this:

from BeautifulSoup import BeautifulSoup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags: 
    for match in soup.findAll(tag):
        match.replaceWithChildren()
print soup

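It looks like it behaves the way you want, and the code is fairly straightforward (although it makes one pass through the DOM per tag name, which could easily be optimized).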
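For readers on BeautifulSoup 4, the same idea can be sketched in a single pass. This is only an illustration under a few assumptions: it uses bs4's unwrap() (the bs4 counterpart of replaceWithChildren()), the standard-library "html.parser" backend, and a helper name, unwrap_tags, that is made up for this example.

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, invalid_tags):
    """Remove the named tags but keep their contents (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names, so one search collects every match.
    for match in soup.find_all(invalid_tags):
        match.unwrap()  # replaces the tag with its own children
    return str(soup)


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
result = unwrap_tags(html, ["b", "i", "u"])
print(result)  # <p>Good, bad, and ugly</p>
```

Passing the whole list to find_all avoids searching the tree once per tag name.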

Answer 2

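The strategy I used is to replace a tag with its contents if they are of type NavigableString; if they aren't, recurse into them and replace their contents with NavigableString, and so on. Try this: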

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

Answer 3

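Although others have already mentioned this in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.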

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"

Answer 4

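I have a simpler solution, but I don't know whether it has a drawback.

UPDATE: there is a drawback, see Jesse Dhillon's comment. Also, another solution would be to use Mozilla's Bleach instead of BeautifulSoup.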

from BeautifulSoup import BeautifulSoup

VALID_TAGS = ['div', 'p']

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print soup.renderContents()

This will also print <div><p>Hello there my friend!</p></div> as desired.

Answer 5

You can use soup.text

.text removes all tags and concatenates all the text.
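A short sketch of what that means in practice, assuming BeautifulSoup 4 with the "html.parser" backend. Note that every tag is dropped, including the ones the question wanted to keep; the variable names here are illustrative only.

```python
from bs4 import BeautifulSoup

value = "<div><p>Hello <b>there</b> my friend!</p></div>"
soup = BeautifulSoup(value, "html.parser")

# get_text() is the method behind the .text property.
plain = soup.get_text()
print(plain)  # Hello there my friend!

# Text from neighbouring tags is joined with no separator by default;
# passing a separator (and strip=True) keeps the words apart.
two = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")
glued = two.get_text()
spaced = two.get_text(" ", strip=True)
print(glued)   # onetwo
print(spaced)  # one two
```

So this approach fits when plain text is the goal, not when some tags must survive.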

Answer 6

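You'll presumably have to move the tag's children to be children of the tag's parent before you remove the tag. Is that what you mean?

If so, then, while inserting the contents in the right place is tricky, something like this should work: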

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        for i, x in enumerate(tag.parent.contents):
          if x == tag: break
        else:
          print "Can't find", tag, "in", tag.parent
          continue
        for r in reversed(tag.contents):
          tag.parent.insert(i, r)
        tag.extract()
print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div> as desired.
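The same "move the children up, then drop the tag" idea can be sketched for BeautifulSoup 4, where insert_before() takes care of the positioning. This is a sketch under assumptions, not a port the original author provided: hoist_children is a made-up helper name and "html.parser" is an arbitrary parser choice.

```python
from bs4 import BeautifulSoup

VALID_TAGS = ("div", "p")


def hoist_children(tag):
    """Move tag's children in front of it, then remove the emptied tag."""
    # Iterate over a copy: tag.contents shrinks as each child is re-parented.
    for child in list(tag.contents):
        tag.insert_before(child)  # detaches child and places it before tag
    tag.extract()


value = "<div><p>Hello <b>there</b> my friend!</p></div>"
soup = BeautifulSoup(value, "html.parser")

for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        hoist_children(tag)

result = str(soup)
print(result)  # <div><p>Hello there my friend!</p></div>
```

Because each child is inserted immediately before the tag in turn, the original order of the children is preserved.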

Answer 7

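None of the proposed answers seemed to work with BeautifulSoup for me. Here's a version that works with BeautifulSoup 3.2.1, and that also inserts a space when joining content from different tags instead of running the words together.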

import re
from BeautifulSoup import BeautifulSoup

def strip_tags(html, whitelist=()):
    """
    Strip all HTML tags except for a list of whitelisted tags.
    """
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name not in whitelist:
            tag.append(' ')
            tag.replaceWithChildren()

    result = unicode(soup)

    # Clean up any repeated spaces and spaces like this: '<a>test </a> '
    result = re.sub(' +', ' ', result)
    result = re.sub(r' (<[^>]*> )', r'\1', result)
    return result.strip()

Example:

strip_tags('<h2><a><span>test</span></a> testing</h2><p>again</p>', ['a'])
# result: u'<a>test</a> testing again'
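For comparison, here is a rough sketch of the same function for BeautifulSoup 4 on Python 3. It assumes that unwrap() stands in for replaceWithChildren() and that the "html.parser" backend is used so no <html>/<body> wrapper is added; the regular expressions are carried over unchanged.

```python
import re

from bs4 import BeautifulSoup


def strip_tags(html, whitelist=()):
    """Strip all HTML tags except those named in whitelist."""
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.append(" ")  # keep words from adjacent tags apart
            tag.unwrap()     # drop the tag, keep its children

    result = str(soup)

    # Collapse repeated spaces, then fix spaces like this: '<a>test </a> '
    result = re.sub(" +", " ", result)
    result = re.sub(r" (<[^>]*> )", r"\1", result)
    return result.strip()


cleaned = strip_tags(
    "<h2><a><span>test</span></a> testing</h2><p>again</p>", ["a"]
)
print(cleaned)  # <a>test</a> testing again
```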

Answer 8

Use unwrap.

unwrap removes a single occurrence of the tag (one of possibly many) while keeping its contents.

Example:

>>> soup = BeautifulSoup('Hi. This is a <nobr> nobr </nobr>')
>>> soup
<html><body><p>Hi. This is a <nobr> nobr </nobr></p></body></html>
>>> soup.nobr.unwrap()
<nobr></nobr>
>>> soup
<html><body><p>Hi. This is a nobr </p></body></html>

Answer 9

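This is a simpler solution, with no hassle or boilerplate code for filtering out tags while keeping their content. Say you want to remove any child tags within a parent tag and keep only the contents/text; then you can simply do: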

for p_tags in div_tags.find_all("p"):
    print(p_tags.get_text())

That's it: you get clean text, free of any br, i, or b tags inside the parent tags.

Answer 10

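This is an old question, but here are some better ways of doing it. First of all, BeautifulSoup 3.* is no longer being developed, so you should use BeautifulSoup 4.* instead, the so-called bs4.

Also, lxml has just the function you need: the Cleaner class has a remove_tags attribute, which you can set to the tags to be removed while their content is pulled up into the parent tag.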

Answer 11

Here is a Python 3 friendly version of this function:

from bs4 import BeautifulSoup, NavigableString
invalidTags = ['br','b','font']
def stripTags(html, invalid_tags):
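    # html.parser adds no <html>/<body> wrapper, so the recursive calls
    # below return bare fragments instead of whole documents.
    soup = BeautifulSoup(html, "html.parser")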
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = stripTags(str(c), invalid_tags)
                s += str(c)
            tag.replaceWith(s)
    return soup

Remove a tag using BeautifulSoup but keep its contents
