Question

What could cause BeautifulSoup to return None for soup.body when soup.title returns the expected result?

Link to the page

print(soup.prettify())

prints the exact HTML of the page.

Answer 1

This is because of differences between BeautifulSoup parsers:

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.emploi.ma/offre-emploi-maroc/commerciaux-en-emission-appels-1019077'
>>> soup = BeautifulSoup(urlopen(url), "html5lib")
>>> print soup.body
None

>>> soup = BeautifulSoup(urlopen(url), "html.parser")
>>> print soup.body
<body class="not-front not-logged-in page-node node-type-offre no-sidebars candidate-context full-node layout-main-last sidebars-split font-size-12 grid-type-960 grid-width-16 role-other" id="pid-node-1019077">
<div class="page" id="page">
... 

>>> soup = BeautifulSoup(urlopen(url), "lxml")
>>> print soup.body
<body class="not-front not-logged-in page-node node-type-offre no-sidebars candidate-context full-node layout-main-last sidebars-split font-size-12 grid-type-960 grid-width-16 role-other" id="pid-node-1019077">
<div class="page" id="page">
...

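As you can see, html5lib failed to get the body from this particular HTML. And, according to the documentation, if you do not name a parser and lxml is not installed, html5lib is picked as the default when it is installed: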

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
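Since the default depends on what is installed, it can help to check which parsers are available and to name one explicitly. The following is only a minimal sketch in Python 3 (the session above is Python 2). The names PARSER_PREFERENCE, installed_parsers and body_found_by_parser are made up for illustration and are not part of Beautiful Soup; the preference order is copied from the documentation quote above, not read from the library.

```python
import importlib.util

# Preference order as described in the Beautiful Soup documentation quote:
# lxml first, then html5lib, then Python's built-in html.parser.
# (Hypothetical constant, not a Beautiful Soup name.)
PARSER_PREFERENCE = ("lxml", "html5lib", "html.parser")


def installed_parsers(preference=PARSER_PREFERENCE):
    """Return the names in `preference` whose modules can be found here."""
    available = []
    for name in preference:
        # find_spec returns None when a top-level module is not installed.
        # "html.parser" ships with Python, so it is always found.
        if importlib.util.find_spec(name) is not None:
            available.append(name)
    return available


def body_found_by_parser(markup):
    """Parse `markup` once per installed parser and report whether
    soup.body is present (True) or None (False) for each of them."""
    # Imported here so the helper above works without Beautiful Soup.
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    return {
        name: BeautifulSoup(markup, name).body is not None
        for name in installed_parsers()
    }
```

Calling installed_parsers() shows which parser an unqualified BeautifulSoup(markup) would likely pick on this machine (the first entry), and body_found_by_parser(html) makes it easy to spot a parser that drops the body for a given page. Passing the parser name explicitly, as in BeautifulSoup(markup, "html.parser"), avoids depending on what happens to be installed.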
