Question

I'm using BeautifulSoup, and I have run into either a mistake of my own or a bug. In my example, I scrape a sub-page of the New York Times:

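# Python 2 code: urllib2 and the print statement are not available in Python 3.
import urllib2
from bs4 import BeautifulSoup
website = "http://www.nytimes.com/pages/politics/index.html"
# No parser is named here, so BeautifulSoup picks one on its own.
data = BeautifulSoup(urllib2.urlopen(website).read())
print data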

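When I run the code, I get back the head tag and its contents, but nothing from inside the body tag. If I change the URL to http://www.nytimes.com, BS returns the full page source. What is going on here, and why don't I get the body tag when I crawl http://www.nytimes.com/pages/politics/index.html?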

Answer 1

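This is not a bug in BeautifulSoup. The problem is that bs4 is using Python's built-in HTMLParser here, which is not very forgiving of malformed HTML. The W3C Markup Validation Service shows that the page's HTML is indeed malformed: it contains several unclosed, stray, and misplaced tags, which cause HTMLParser, and in turn BeautifulSoup, to stop parsing abruptly.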

This issue is explained in the following bug filed against BeautifulSoup:

BS4 stops parsing after malformed tag
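
Since the cause is the parser rather than BeautifulSoup itself, one thing worth trying is to name a more lenient parser explicitly. The sketch below is an illustration, not something taken from the answer: it assumes Python 3, assumes that the third-party parsers html5lib and/or lxml may be installed, and uses a hypothetical helper make_soup and a made-up malformed HTML string instead of the live page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parsers to try, in order. "html5lib" and "lxml" are separate installs
# (an assumption of this sketch); "html.parser" ships with Python.
PARSERS = ("html5lib", "lxml", "html.parser")


def make_soup(markup, parsers=PARSERS):
    """Hypothetical helper: parse markup with the first parser that is installed.

    Returns a (soup, parser_name) pair so the caller can see which parser ran.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the named parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Made-up malformed markup: an unclosed <p> and a stray </span>.
broken = (
    "<html><head><title>t</title></head>"
    "<body><p>unclosed<div>stray</span></body></html>"
)

soup, used = make_soup(broken)
print(used)
print(soup.body is not None)
```

Printing the parser name makes it easy to check which parser actually ran; if only html.parser is available, the result depends on how tolerant that Python version's HTMLParser is.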

BeautifulSoup only returns the contents of the head tag
