BeautifulSoup soup.body returns None

Time: 2014-07-29 13:05:11

Tags: python python-2.7 html-parsing beautifulsoup

What could cause BeautifulSoup to return None for soup.body when soup.title returns the expected result?

This is the link to the page I am parsing: http://goo.gl/6T3RKV

print(soup.prettify()) outputs the exact HTML of the page.

1 answer:

Answer 0: (score: 1)

This happens because of differences between BeautifulSoup parsers:

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.emploi.ma/offre-emploi-maroc/commerciaux-en-emission-appels-1019077'
>>> soup = BeautifulSoup(urlopen(url), "html5lib")
>>> print soup.body
None

>>> soup = BeautifulSoup(urlopen(url), "html.parser")
>>> print soup.body
<body class="not-front not-logged-in page-node node-type-offre no-sidebars candidate-context full-node layout-main-last sidebars-split font-size-12 grid-type-960 grid-width-16 role-other" id="pid-node-1019077">
<div class="page" id="page">
... 

>>> soup = BeautifulSoup(urlopen(url), "lxml")
>>> print soup.body
<body class="not-front not-logged-in page-node node-type-offre no-sidebars candidate-context full-node layout-main-last sidebars-split font-size-12 grid-type-960 grid-width-16 role-other" id="pid-node-1019077">
<div class="page" id="page">
...

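As you can see, html5lib fails to get the body from this particular HTML, while html.parser and lxml both find it. And, according to the documentation, when you do not specify a parser, html5lib is chosen by default if it is installed and lxml is not: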

  

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
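
To check which parsers are available on a given machine, and whether each one finds a body in the same markup, something along these lines may help. It is only a rough sketch: `PARSER_RANKING`, `installed_parsers` and `body_found_by_parser` are hypothetical names made up for illustration, and testing importability with `importlib` only approximates how Beautiful Soup itself picks a default. It is written for Python 3, unlike the Python 2 session above.

```python
from importlib.util import find_spec

# Ranking taken from the documentation quote above: lxml, then html5lib,
# then Python's built-in parser. Each entry pairs the name passed to
# BeautifulSoup with the module that has to be importable.
PARSER_RANKING = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),  # standard library, always available
]


def installed_parsers():
    """Return the parser names, in ranking order, whose module can be found."""
    return [name for name, module in PARSER_RANKING if find_spec(module) is not None]


def body_found_by_parser(html):
    """Parse html once per installed parser; map parser name -> 'body was found'."""
    # bs4 is a third-party package, imported here so the helper above
    # still works on a machine where it is not installed.
    from bs4 import BeautifulSoup

    return {
        name: BeautifulSoup(html, name).body is not None
        for name in installed_parsers()
    }
```

Calling `installed_parsers()` shows the order in which a default would presumably be picked, and `body_found_by_parser(html)` makes it easy to spot a parser that returns None for `soup.body`. Once a parser that works for the page is known, passing its name explicitly, e.g. `BeautifulSoup(html, "lxml")`, avoids depending on whatever happens to be installed.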