Question

I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K), BeautifulSoup truncates the HTML content.

I use the following code to get the set of "div" elements:

findSet = SoupStrainer('div')
set = BeautifulSoup(htmlSource, parseOnlyThese=findSet)
for it in set:
    print it

At a certain point, the output looks like:

correct string, correct string, incomplete/truncated string ("So, I")

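However, htmlSource does contain the full string "So, I am bored", along with many others. I should also mention that when I prettify() the tree, I can see that the HTML source has been truncated.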

Do you have any idea how to fix this?

Thanks!

Answer 1

Try using lxml.html. It is a faster, better HTML parser, and it handles broken HTML better than the latest BeautifulSoup. It works on your example page and parses the entire page.

import lxml.html

doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print len(doc.findall('//div'))

The code above returns 131 divs.
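
If you want to cross-check a div count like this without relying on either BeautifulSoup or lxml, one option is to count the opening div tags with the standard library tokenizer. This is only a rough sketch, written for Python 3 (the snippets in this thread are Python 2), and the names DivCounter and count_divs are made up for illustration. It counts start tags as they appear in the source and does not build a tree, so on badly broken markup the number may differ slightly from what a repairing parser reports.

```python
from html.parser import HTMLParser


class DivCounter(HTMLParser):
    """Hypothetical helper: counts <div> start tags seen by the tokenizer."""

    def __init__(self):
        super().__init__()
        self.div_count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag == "div":
            self.div_count += 1


def count_divs(html_source):
    """Return the number of <div> start tags found in html_source."""
    counter = DivCounter()
    counter.feed(html_source)
    counter.close()
    return counter.div_count


# Small inline sample; in practice you would pass the page's htmlSource.
sample = "<div>one</div><DIV class='x'>two<div>nested</div></DIV><p>no div</p>"
print(count_divs(sample))
```

If this count is much higher than the number of divs your parser returns, that suggests the parser is dropping part of the page rather than the download being incomplete.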

Answer 2

I found a solution to this problem that still uses BeautifulSoup, at beautifulsoup-where-are-you-putting-my-html, which I think is easier than switching to lxml.

The only thing you need to do is install:

pip install html5lib

and pass it as an argument to BeautifulSoup:

soup = BeautifulSoup(html, 'html5lib')