Question

I have an HTML page, part of whose tree looks like this (see the code snippet below for the actual HTML):

                       <body>
                       |    |
                       |    |
     <div id="Kentucky">    <div id="NewOrleans">
             |                      |
             |                      |
          Bourbon                Bourbon

Why does BeautifulSoup say that the "left" Bourbon is a child of both "Kentucky" (correct) and "NewOrleans" (incorrect)?

And vice versa: why does it say that the "right" Bourbon is also a child of "Kentucky" (incorrect)?

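It is not uncommon for a page to contain different HTML elements that all have the same text (for example in the header and the footer). But now, after I run find_all() for some text pattern, I can't trust BeautifulSoup to tell me, via header.children or footer.children, whether a matched text element is really a child of either of them.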

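(It is like a company where both the Engineering and Marketing departments claim that a certain employee belongs to them, just because her name is "Sarah". There can be several Sarahs in the company; the first_name attribute is only one of the object's many attributes, and it should not determine identity on its own.)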

Can this be avoided, or is there another way to find the correct children of an element?

Note that the MRO of the NavigableString class (after the class itself) starts with 'str':

<class 'str'>, <class 'bs4.element.PageElement'>, <class 'object'>

I guess this suggests the cause of the problem: BeautifulSoup uses string comparison to determine equality (or identity) between elements.

Regardless of whether this is really the cause, is there an alternative, or a fix/patch?

Thanks!

Code:

import re
from bs4 import BeautifulSoup

TEST_HTML = """<!doctype html>
<head><title>A title</title></head>
<html>
   <body>
      <div id="Kentucky">Bourbon</div>
      <div id="NewOrleans">Bourbon</div>
   </body>
</html>
"""

def test():
    # name the parser explicitly so bs4 does not warn about guessing one
    soup = BeautifulSoup(TEST_HTML, 'html.parser')

    # search for "Bourbon"
    re_pattern = re.compile('bourbon', re.IGNORECASE)
    text_matches = soup.find_all(text=re_pattern)

    # print verbose debug output...
    for text_match in text_matches:
        print('id: {} - class: {} - text: {} - parent attrs: {}'.\
              format(id(text_match),
                     text_match.__class__.__name__,
                     text_match.string,
                     text_match.parent.attrs))
    # id: 140609176408136 - class: NavigableString - text: Bourbon - parent attrs: {'id': 'Kentucky'}
    # id: 140609176408376 - class: NavigableString - text: Bourbon - parent attrs: {'id': 'NewOrleans'}


    kentucky_match = text_matches[0]
    kentucky_parent = kentucky_match.parent

    new_orleans_match = text_matches[1]
    new_orleans_parent = new_orleans_match.parent

    # confirm -> all ok...
    print(kentucky_parent.attrs)      # {'id': 'Kentucky'}
    print(new_orleans_parent.attrs)   # {'id': 'NewOrleans'}

    # get a list of all the children for both kentucky and new orleans
    # (this tree traversal is all ok)
    ky_children = [child for child in kentucky_parent.children]
    no_children = [child for child in new_orleans_parent.children]

    # confirm -> all ok...
    print([id(child) for child in ky_children])   # [140609176408136]
    print([id(child) for child in no_children])   # [140609176408376]


    # now, here's the problem!!!
    print(kentucky_match in no_children)      # True  -> wrong!!!!!!!
    print(kentucky_match in ky_children)      # True

    print(new_orleans_match in no_children)   # True
    print(new_orleans_match in ky_children)   # True  -> wrong!!!!!!!

Answer 1

This happens because kentucky_match and new_orleans_match are both instances of the NavigableString class, which is a subclass of Python's regular Unicode string type.

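ky_children and no_children are both lists of what are essentially strings; in your case each one is just [u'Bourbon']. u'Bourbon' in [u'Bourbon'] always evaluates to True. When the in check runs, list membership is decided with ==, so the text of the strings is compared, not the identity of the NavigableString instances.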

In other words, your in check is looking for a string in a list of strings.
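
The same behavior can be reproduced without BeautifulSoup. The sketch below uses a hypothetical str subclass, FakeNavigableString, as a stand-in for NavigableString (it is an assumption made for illustration and is not part of bs4). It shows that two distinct objects with the same text compare equal, so in cannot tell them apart:

```python
class FakeNavigableString(str):
    """Hypothetical stand-in for bs4's NavigableString: a str that remembers a parent."""

    def __new__(cls, text, parent_id):
        obj = super().__new__(cls, text)
        obj.parent_id = parent_id  # illustrative attribute, not a bs4 name
        return obj


kentucky = FakeNavigableString("Bourbon", "Kentucky")
new_orleans = FakeNavigableString("Bourbon", "NewOrleans")

no_children = [new_orleans]

equal_by_value = kentucky == new_orleans   # inherited str.__eq__ compares text only
same_object = kentucky is new_orleans      # identity: these are two separate objects
found_by_in = kentucky in no_children      # list membership tries identity, then ==

print(equal_by_value, same_object, found_by_in)  # True False True
```

The extra parent_id attribute plays no part in the comparison, which mirrors how the .parent of a real NavigableString is ignored by ==.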

As a workaround, you can base the in check on id():

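# compare object identities instead of string contents
ky_children = [id(child) for child in kentucky_parent.children]
no_children = [id(child) for child in new_orleans_parent.children]
print(id(kentucky_match) in no_children)      # False
print(id(kentucky_match) in ky_children)      # True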
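
An equivalent check can be written with the is operator, which avoids building separate lists of ids. This is only a sketch: contains_same_object is a hypothetical helper name, and the Text class is a made-up str subclass used so the example runs without bs4 installed:

```python
def contains_same_object(items, target):
    """Return True only if `target` itself (not just an equal value) is in `items`."""
    return any(item is target for item in items)


class Text(str):
    """Hypothetical str subclass, used only to keep this sketch self-contained."""


left = Text("Bourbon")
right = Text("Bourbon")
ky_children = [left]
no_children = [right]

print(left in no_children)                       # True  (value comparison)
print(contains_same_object(no_children, left))   # False (identity comparison)
print(contains_same_object(ky_children, left))   # True
```

With BeautifulSoup the helper would presumably be called as contains_same_object(kentucky_parent.children, kentucky_match). For direct children, comparing kentucky_match.parent is kentucky_parent should express the same test in a single step.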

BeautifulSoup incorrectly checks child membership for NavigableString elements?
