lxml HTML parsing returns an empty result, while BeautifulSoup parses the page correctly

Time: 2017-07-03 09:32:59

Tags: python html python-2.7 beautifulsoup lxml

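I understand that lxml is traditionally said to be stricter than BeautifulSoup, but I don't understand the following:

(Basically, I request a web page in Chinese and expect to select a few spans. Similar pages work fine, but for some links lxml simply fails to parse.)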

In [1]: headers = {'User-Agent': ''}

In [2]: url = 'http://basic.10jqka.com.cn/600219/company.html'

In [3]: headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0'}

In [6]: import lxml.html

In [7]: res = requests.get(url, headers=headers)

In [8]: tree = lxml.html.fromstring(res.content)
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-8-b512dc78ed68> in <module>()
----> 1 tree = lxml.html.fromstring(res.content)

/home/jgu/repos/.venv36/lib/python3.6/site-packages/lxml/html/__init__.py in fromstring(html, base_url, parser, **kw)
    874     else:
    875         is_full_html = _looks_like_full_html_unicode(html)
--> 876     doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
    877     if is_full_html:
    878         return doc

/home/jgu/repos/.venv36/lib/python3.6/site-packages/lxml/html/__init__.py in document_fromstring(html, parser, ensure_head_body, **kw)
    763     if value is None:
    764         raise etree.ParserError(
--> 765             'Document is empty')
    766     if ensure_head_body and value.find('head') is None:
    767         value.insert(0, Element('head'))

ParserError: Document is empty

In [12]: from bs4 import BeautifulSoup

In [13]: soup = BeautifulSoup(res.content, 'html.parser')

In [14]: soup.title
Out[14]: <title>南山铝业(600219) 公司资料_F10_同花顺金融服务网</title>

In [15]: sel_query = (
    ...:     '#detail > div.bd > table > tbody > tr:nth-of-type(1) > '
    ...:     'td:nth-of-type(2) > span'
    ...: )

In [16]: soup.select(sel_query)
Out[16]: [<span>山东南山铝业股份有限公司</span>]

In [17]: soup.select(sel_query)[0].text
Out[17]: '山东南山铝业股份有限公司'

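As I said, links like http://basic.10jqka.com.cn/600000/company.html do work.

So I can fall back to bs4 when the parse result is empty, but I would like to know why lxml cannot build a reasonable DOM tree from the source. Thanks.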
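
The fallback mentioned above could look roughly like the sketch below. The helper name `extract_title` and the inline sample strings are hypothetical and only for illustration; the sketch assumes that lxml signals the failure with `etree.ParserError` (as in the traceback) or `etree.XMLSyntaxError`.

```python
import lxml.html
from lxml import etree
from bs4 import BeautifulSoup


def extract_title(html):
    """Return (parser_name, title), trying lxml first and bs4 second.

    `extract_title` is a hypothetical helper, not part of either library.
    """
    try:
        tree = lxml.html.fromstring(html)
        return "lxml", tree.findtext(".//title")
    except (etree.ParserError, etree.XMLSyntaxError):
        # lxml could not build a document; fall back to the more
        # lenient pure-Python parser shipped with the standard library.
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.string if soup.title else None
        return "bs4", title


# Inline samples, so the sketch runs without network access.
good_page = "<html><head><title>ok</title></head><body></body></html>"
empty_page = ""  # triggers "Document is empty" in lxml

result_good = extract_title(good_page)
result_empty = extract_title(empty_page)
print(result_good)
print(result_empty)
```

This only works around the symptom; it does not explain why lxml rejects the page.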

1 Answer:

Answer 0 (score: 0)

The fromstring function from lxml.html expects a string-type variable, and response.content returns bytes. Using response.text would be correct.

As for BeautifulSoup, its constructor accepts a string or a file-like object.
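
A minimal sketch of this suggestion: decode the bytes to `str` before handing them to `lxml.html.fromstring`. With requests, `res.text` is the already-decoded body, so `lxml.html.fromstring(res.text)` would be the equivalent call. The helper `parse_html_text`, the default encoding, and the inline sample page are assumptions made for illustration; the real page may use a different encoding.

```python
import lxml.html


def parse_html_text(raw_bytes, encoding="utf-8"):
    """Decode raw bytes to str, then parse the text with lxml.

    `parse_html_text` is a hypothetical helper; `encoding` is an
    assumption and should match the encoding of the actual page.
    """
    text = raw_bytes.decode(encoding, errors="replace")
    return lxml.html.fromstring(text)


# Inline stand-in for res.content, so the sketch runs offline.
sample_bytes = (
    "<html><head><title>公司资料</title></head>"
    "<body><span>山东南山铝业股份有限公司</span></body></html>"
).encode("utf-8")

tree = parse_html_text(sample_bytes)
title = tree.findtext(".//title")
span_text = tree.findtext(".//span")
print(title, span_text)
```

Passing an explicit encoding keeps the decoding step visible, which makes it easier to notice when a page is served in something other than the assumed encoding.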