Question

I am parsing the following page with lxml's etree: http://www.amazon.de/product-reviews/B004K1K172

The variable content holds the entire page content.

Code:

myparser = etree.HTMLParser(encoding="utf-16") #As characters are beyond utf-8
tree = etree.HTML(content,parser = myparser)
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")

This returns an empty list.

But when I change the code to:

myparser = etree.HTMLParser(encoding="utf-8") #Neglecting some reviews having ascii character above utf-8
tree = etree.HTML(content,parser = myparser)
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")

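Now the same XPath returns the correct data, but most of the reviews are rejected. Is this a problem with lxml's XPath support or with my XPath expression?

How can I parse the above page using utf-16 encoding?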

Answer 1

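Based on nymk's suggestion:
Parse the page with ISO-8859-15 encoding. To do that, change the following line in the code.

myparser = etree.HTMLParser(encoding="ISO-8859-15")
However, the SQL side must also be changed so that it accepts an encoding other than utf-8.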
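
A likely reason the utf-16 parser finds nothing is that the page bytes are not UTF-16 at all, so the markup is misread before XPath ever runs. The sketch below illustrates this with the standard library only. The sample string is made up, and ISO-8859-15 is assumed to be the page encoding, as suggested above. It also shows that once the bytes are decoded correctly, the text can be re-encoded as UTF-8, which may avoid changing the database encoding.

```python
# Hypothetical sample: a review snippet as the server might send it,
# encoded in ISO-8859-15 (assumed encoding, per the suggestion above).
original = "<div>Größe passt, Preis 20 €</div>"
raw = original.encode("iso-8859-15")

# Decoding with the matching codec recovers the text.
text = raw.decode("iso-8859-15")

# Decoding the same bytes as UTF-16 pairs them up into unrelated
# characters, so the tags themselves disappear.
garbled = raw.decode("utf-16", errors="replace")

# The decoded text is an ordinary str and can be re-encoded as UTF-8,
# for example before it is written to a UTF-8 database column.
utf8_bytes = text.encode("utf-8")

print(text)
print("<div>" in garbled)
```

Whether re-encoding to UTF-8 fits depends on how the database connection is configured, which the question does not show.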

Answer 2

Get the character encoding automatically from the HTTP headers:

import cgi
import urllib2

from lxml import html

response = urllib2.urlopen("http://www.amazon.de/product-reviews/B004K1K172")

# extract encoding from Content-Type 
_, params = cgi.parse_header(response.headers.get('Content-Type', ''))
html_text = response.read().decode(params['charset'])

root = html.fromstring(html_text)
reviews = root.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
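
The snippet above targets Python 2 (urllib2), and params['charset'] raises a KeyError when the server sends no charset. As a rough Python 3 sketch of the same idea, the header can be parsed with email.message.Message from the standard library. The helper name charset_from_content_type and the fallback value are assumptions made for illustration, not part of the original answer.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-15"):
    """Return the charset named in a Content-Type header value.

    `default` is a hypothetical fallback used when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default


# Made-up header values and body bytes for illustration.
declared = charset_from_content_type("text/html; charset=ISO-8859-15")
fallback = charset_from_content_type("text/html")
body = "Größe".encode("iso-8859-15")

print(declared, fallback, body.decode(declared))
```

In Python 3, the headers object of a urllib.request.urlopen response is itself a Message subclass, so response.headers.get_content_charset() should give the same information directly.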

Parsing with lxml XPath fails when using utf-16
