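What can cause BeautifulSoup to return None for soup.body, even though soup.title returns the expected result?
This is the link to the page I am parsing: http://goo.gl/6T3RKV
print(soup.prettify()) outputs the exact HTML of the page.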
Answer 0 (score: 1)
This happens because of differences between the parsers BeautifulSoup can use:
>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.emploi.ma/offre-emploi-maroc/commerciaux-en-emission-appels-1019077'
>>> soup = BeautifulSoup(urlopen(url), "html5lib")
>>> print soup.body
None
>>> soup = BeautifulSoup(urlopen(url), "html.parser")
>>> print soup.body
<body class="not-front not-logged-in page-node node-type-offre no-sidebars candidate-context full-node layout-main-last sidebars-split font-size-12 grid-type-960 grid-width-16 role-other" id="pid-node-1019077">
<div class="page" id="page">
...
>>> soup = BeautifulSoup(urlopen(url), "lxml")
>>> print soup.body
<body class="not-front not-logged-in page-node node-type-offre no-sidebars candidate-context full-node layout-main-last sidebars-split font-size-12 grid-type-960 grid-width-16 role-other" id="pid-node-1019077">
<div class="page" id="page">
...
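As you can see, html5lib fails to extract the body from this particular HTML, while html.parser and lxml both find it. And, according to the documentation, if you do not specify a parser and lxml is not installed, html5lib is chosen as the default: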
"If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser."
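
To check which parser an environment would most likely fall back to, the ranking quoted above can be approximated with a small helper. This is only a sketch: the names best_available_parser, PARSER_PREFERENCE and the is_installed hook are hypothetical, and the code mirrors the documented order rather than Beautiful Soup's internal selection logic.

```python
from importlib.util import find_spec

# Preference order taken from the quoted documentation:
# lxml first, then html5lib, then Python's built-in parser.
# Each pair is (module to look for, parser name to pass to BeautifulSoup).
PARSER_PREFERENCE = [("lxml", "lxml"), ("html5lib", "html5lib")]


def best_available_parser(is_installed=None):
    """Return the parser name this environment would most likely default to.

    `is_installed` is a hypothetical hook (module name -> bool) so the
    choice can be checked without installing or removing packages.
    """
    if is_installed is None:
        def is_installed(module):
            # find_spec returns None when a top-level module cannot be found.
            return find_spec(module) is not None
    for module, parser_name in PARSER_PREFERENCE:
        if is_installed(module):
            return parser_name
    # Python's built-in parser needs no extra package.
    return "html.parser"


print(best_available_parser())
```

If this prints html5lib, that would be consistent with soup.body coming back as None for this page. Passing the parser name explicitly, as in BeautifulSoup(urlopen(url), "lxml") from the session above, avoids depending on whatever happens to be installed.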