Question

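Since I want to remove duplicated placeholders in an HTML website, I use the .next_sibling operator of BeautifulSoup. As long as the duplicates are on the same line, this works fine (see data). But sometimes there is an empty line between them, so I want .next_sibling to ignore them (have a look at data2).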

This is the code:

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data)
string = 'method-removed-here'
for p in soup.find_all("p"):
    while isinstance(p.next_sibling, Tag) and p.next_sibling.name== 'p' and p.text==string:
        p.next_sibling.decompose()
print(soup)

The output for data is as expected:

<html><head></head><body><p>method-removed-here</p></body></html>

The output for data2 (this needs to be fixed):

<html><head></head><body><p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
</body></html>

I could not find useful information for this in the BeautifulSoup4 documentation, and .next_element is not what I am looking for either.

Answer 1

I could solve this problem with a workaround. The problem is described in the google-group for BeautifulSoup, and they suggest using a preprocessor for HTML files:

 import re

 def bs_preprocess(html):
     """Remove distracting whitespace and newline characters."""
     pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
     html = re.sub(pat, '', html)        # remove leading and trailing whitespace
     html = re.sub(r'\n', ' ', html)     # convert newlines to spaces
                                         # this preserves newline delimiters
     html = re.sub(r'\s+<', '<', html)   # remove whitespace before opening tags
     html = re.sub(r'>\s+', '>', html)   # remove whitespace after closing tags
     return html

That is not the best solution, but it is one.
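One way to see what the preprocessor changes is to compare the top-level nodes before and after it runs. The sketch below repeats bs_preprocess so that it runs on its own; the use of html.parser and the shortened data2 are assumptions made for illustration. Keep in mind that the function strips whitespace everywhere, including places where it may matter, such as inside a pre element.

```python
import re
from bs4 import BeautifulSoup

def bs_preprocess(html):
    """Remove whitespace around lines and between tags (same steps as above)."""
    pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
    html = re.sub(pat, '', html)
    html = re.sub(r'\n', ' ', html)
    html = re.sub(r'\s+<', '<', html)
    html = re.sub(r'>\s+', '>', html)
    return html

# Shortened stand-in for data2: paragraphs separated by empty lines.
data2 = "<p>method-removed-here</p>\n\n<p>method-removed-here</p>\n\n<p>method-removed-here</p>\n"

raw_soup = BeautifulSoup(data2, "html.parser")
clean_soup = BeautifulSoup(bs_preprocess(data2), "html.parser")

# Class names of the top-level nodes, with and without preprocessing.
kinds_raw = [type(node).__name__ for node in raw_soup.contents]
kinds_clean = [type(node).__name__ for node in clean_soup.contents]
print(kinds_raw)
print(kinds_clean)
```

After preprocessing, every top-level node is a Tag, so the next_sibling loop from the question sees only p elements.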

Answer 2

Also not a great solution, but this worked for me:

def get_sibling(element):
    sibling = element.next_sibling
    if sibling == "\n":
        return get_sibling(sibling)
    else:
        return sibling
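The comparison with "\n" only skips a sibling that is exactly one newline. If the text between tags can be other whitespace (spaces or tabs, for example), a variant that tests for whitespace-only strings may be more robust. This is a sketch and not part of the original answer; next_non_blank_sibling is a hypothetical name, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup, NavigableString

def next_non_blank_sibling(element):
    """Return the next sibling that is not a whitespace-only string, or None."""
    sibling = element.next_sibling
    # Walk forward while the sibling is a string made up only of whitespace.
    while isinstance(sibling, NavigableString) and not sibling.strip():
        sibling = sibling.next_sibling
    return sibling

# Hypothetical markup: spaces and a tab (not a newline) separate the tags.
soup = BeautifulSoup("<p>a</p> \t <p>b</p>", "html.parser")
first, second = soup.find_all("p")

print(repr(first.next_sibling))
print(next_non_blank_sibling(first))
```

Non-blank text and comments are still returned, so this only ignores whitespace, not every non-tag node.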

Answer 3

Improving a bit on neurosnap's answer by making it general:

def next_elem(element, func):
    new_elem = getattr(element, func)
    if new_elem == "\n":
        return next_elem(new_elem, func)
    else:
        return new_elem

Now you can use it with any navigation attribute, passed by name, for instance:

next_elem(element, 'previous_sibling')
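As a small usage sketch, the helper can be exercised in both directions on two paragraphs separated by a newline. The markup and the html.parser choice are assumptions for illustration; next_elem is repeated so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

def next_elem(element, func):
    """Follow the attribute named by func, skipping single-newline strings."""
    new_elem = getattr(element, func)
    if new_elem == "\n":
        return next_elem(new_elem, func)
    return new_elem

soup = BeautifulSoup("<p>first</p>\n<p>second</p>", "html.parser")
first, second = soup.find_all("p")

print(next_elem(first, "next_sibling"))
print(next_elem(second, "previous_sibling"))
```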

Answer 4

Use find_next_sibling() instead of next_sibling. The same goes for find_previous_sibling() instead of previous_sibling.

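Reason: next_sibling does not necessarily return the next HTML tag but the next "soup element". Usually this is just a newline, but it can be more. find_next_sibling(), on the other hand, returns the next HTML tag, ignoring whitespace and other crud between the tags.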
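The difference can be seen on a tiny document. This is a minimal sketch; the markup (a newline and a comment between two paragraphs) and the html.parser choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup: a newline and a comment sit between the two paragraphs.
html = "<p>one</p>\n<!-- note --><p>two</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p")

raw = first.next_sibling         # the whitespace string, not a tag
tag = first.find_next_sibling()  # skips strings and comments

print(type(raw).__name__, repr(raw))
print(tag)
```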

I restructured your code a bit to make this demonstration. I hope it is semantically the same.

Code with next_sibling, demonstrating the same behaviour you described (works for data, but not for data2):

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.next_sibling
        if isinstance(ns, Tag) and ns.name== 'p' and p.text==string:
            ns.decompose()
        else:
            break
print(soup)

Code with find_next_sibling(), which works for both data and data2:

soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.find_next_sibling()
        if isinstance(ns, Tag) and ns.name== 'p' and p.text==string:
            ns.decompose()
        else:
            break
print(soup)
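A possible variation walks the paragraphs one at a time instead of looping over a precomputed find_all() list, so it never goes back to a paragraph that was already decomposed. It is a sketch under two assumptions that differ from the code above: it removes the next paragraph only when both paragraphs contain the placeholder, and the function name drop_repeated_placeholders is hypothetical.

```python
from bs4 import BeautifulSoup

def drop_repeated_placeholders(soup, placeholder):
    """Remove <p> tags that repeat the placeholder right after another one."""
    p = soup.find("p")
    while p is not None:
        nxt = p.find_next_sibling()  # next tag, ignoring whitespace strings
        if (nxt is not None and nxt.name == "p"
                and p.get_text() == placeholder
                and nxt.get_text() == placeholder):
            nxt.decompose()          # stay on p and look at its new neighbour
        else:
            p = p.find_next("p")     # move on to the next paragraph
    return soup

# Shortened stand-in for data2, plus one paragraph that must survive.
data2 = ("<p>method-removed-here</p>\n\n<p>method-removed-here</p>\n\n"
         "<p>method-removed-here</p>\n\n<p>keep me</p>\n")
soup = drop_repeated_placeholders(BeautifulSoup(data2, "html.parser"),
                                  "method-removed-here")
print(soup)
```

The whitespace strings between the removed paragraphs stay in the tree, so the printed output still contains the empty lines.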

Other parts of BeautifulSoup show the same behaviour (returning all soup elements, including whitespace): BeautifulSoup .children or .content without whitespace between tags
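For .children, one common way to get only the tags is to filter by type, or to ask find_all() for direct children. This is a minimal sketch; the div markup and the html.parser choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: newlines separate the child paragraphs.
soup = BeautifulSoup("<div>\n<p>a</p>\n<p>b</p>\n</div>", "html.parser")
div = soup.find("div")

all_children = list(div.children)  # tags and whitespace strings
tag_children = [c for c in div.children if isinstance(c, Tag)]
direct_tags = div.find_all(True, recursive=False)  # direct child tags only

print(len(all_children), len(tag_children))
print([t.get_text() for t in direct_tags])
```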

How to ignore empty lines while using .next_sibling in BeautifulSoup4 in Python
