Question

I'm trying to parse a large HTML document with the Python Beautiful Soup 4 library.

The page contains one very large table, structured like this:

<table summary='foo'>
    <tbody>
        <tr> 
            A bunch of data 
        </tr>
        <tr>
            More data 
        </tr>
        .
        .
        .
        100s of <tr> tags later
    </tbody>
</table>

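I have a function that evaluates whether a given tag in soup.descendants is the kind I'm looking for. This is necessary because the page is large (BeautifulSoup tells me the document contains around 4000 tags). It looks like this: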

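def isrow(tag):
    # Match only <tr> tags whose grandparent is a <table> with a summary attribute.
    if tag.name != 'tr':
        return False
    table = tag.parent.parent  # <tr> -> <tbody> -> <table>
    return (table is not None and table.name == 'table'
            and table.has_attr('summary'))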

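My problem is that when I iterate over soup.descendants, the function returns True only for the first 77 rows of the table, even though I know the <tr> tags continue for hundreds of rows.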

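Is this a problem with my function, or is there something I don't understand about how BeautifulSoup generates its collection of descendants? I suspect it might be a Python or bs4 memory issue, but I don't know how to troubleshoot it.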

Answer 1

This is more of an educated guess, but I'll give it a shot.

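The way BeautifulSoup parses HTML depends heavily on the underlying parser. Parsers differ in how they repair invalid markup, so the same document can produce different trees. If you don't specify a parser explicitly, BeautifulSoup chooses one automatically based on an internal ranking: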

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
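To see which parser was picked on a given machine, something like the following may help. This is a minimal sketch: the `html` string is a made-up sample, and it assumes the chosen tree builder exposes its name through `soup.builder.NAME`, which is worth checking against your installed bs4 version.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the real page.
html = "<table summary='foo'><tbody><tr><td>A</td></tr></tbody></table>"

# No parser named: Beautiful Soup picks the best one it finds installed
# (recent versions may also warn about the implicit choice).
soup = BeautifulSoup(html)

# Assumption: the tree builder reports its name via `builder.NAME`,
# for example "lxml", "html5lib" or "html.parser".
print(soup.builder.NAME)
```

If the printed name differs between two machines, the parsed trees may differ as well.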

In your case, I'd try switching parsers and see what results you get:

soup = BeautifulSoup(data, "lxml")  # needs lxml to be installed
soup = BeautifulSoup(data, "html5lib")  # needs html5lib to be installed
soup = BeautifulSoup(data, "html.parser")  # uses built-in html.parser
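To compare the parsers side by side, one option is to count the matching rows under each of them. The sketch below is only an illustration: `count_rows` and `sample` are hypothetical names, `isrow` is a condensed variant of the function from the question, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def isrow(tag):
    """True for <tr> tags whose grandparent is a <table> with a summary."""
    if tag.name != "tr":
        return False
    table = tag.parent.parent  # <tr> -> <tbody> -> <table>
    return (table is not None and table.name == "table"
            and table.has_attr("summary"))


def count_rows(data):
    """Return {parser name: number of matching rows} for each installed parser."""
    counts = {}
    for parser in ("lxml", "html5lib", "html.parser"):
        try:
            soup = BeautifulSoup(data, parser)
        except FeatureNotFound:
            continue  # this parser is not installed, skip it
        # find_all accepts a function and calls it on every tag in the tree
        counts[parser] = len(soup.find_all(isrow))
    return counts


# Hypothetical well-formed sample with 300 rows; replace with the real page.
sample = ("<table summary='foo'><tbody>"
          + "<tr><td>x</td></tr>" * 300
          + "</tbody></table>")
print(count_rows(sample))
```

On well-formed input like this sample the counts should agree. If they diverge on the real page, that points to markup that the parsers repair differently, for instance a stray tag around row 77.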
