UnicodeDecodeError when extracting comments from a web page using lxml and xpath

Asked: 2016-12-13 15:59:02

Tags: python xpath lxml

Hello, I am trying to extract the reviews from a web page using lxml and xpath. Here is my code:

import requests
from lxml import html

pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
# Parse the raw response bytes into an element tree
tr_pg = html.fromstring(pg.content)

cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
for cm in cm_pg:
    print cm

I get this error:

Traceback (most recent call last):
  File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 22, in <module>
    process_page('/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream')
  File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 10, in process_page
    cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
  File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57884)
  File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:166905)
  File "src/lxml/xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:165893)
  File "src/lxml/extensions.pxi", line 623, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:160088)
  File "src/lxml/extensions.pxi", line 657, in lxml.etree._createNodeSetResult (src/lxml/lxml.etree.c:160529)
  File "src/lxml/extensions.pxi", line 678, in lxml.etree._unpackNodeSetEntry (src/lxml/lxml.etree.c:160740)
  File "src/lxml/extensions.pxi", line 804, in lxml.etree._buildElementStringResult (src/lxml/lxml.etree.c:162214)
  File "src/lxml/apihelpers.pxi", line 1417, in lxml.etree.funicode (src/lxml/lxml.etree.c:29944)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 615: invalid continuation byte

I know there is an invalid character in the reviews. How can I solve this?

2 Answers:

Answer 0 (score: 0)

Can you ask requests to try decoding it for you? Use response.text (a string) instead of response.content (bytes).

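The encoding of the source may not be UTF-8, which the XPath library may be assuming. response.encoding is requests' best guess. Sometimes the web server or page is not configured to state explicitly which encoding it uses, and then all you can do is guess.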

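It does not help that the encoding can be specified in the HTTP headers and/or in a <meta> tag. Or that sites can lie. Or that they can mix encodings. Note that your target website can't even validate because of encoding errors, and even then it is easy to get wrong.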
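
A minimal sketch of that guessing, assuming Python 3 and only the standard library. The helper name decode_body and the fallback order (declared encoding, then UTF-8, then cp1252) are assumptions made for illustration, not something requests or lxml does for you. cp1252 is chosen only because byte 0xe0 is "à" in that encoding, which fits the traceback in the question.

```python
def decode_body(raw, declared=None):
    """Decode raw page bytes, trying the declared encoding first.

    `declared` stands in for whatever the server claims (for example
    the value of response.encoding); it may be None or simply wrong.
    """
    candidates = [declared, "utf-8", "cp1252"]
    for name in candidates:
        if not name:
            continue
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Hypothetical sample bytes: a cp1252 "à" inside otherwise ASCII text.
sample = b"Voil\xe0! You will now have an airbrushed look."
text = decode_body(sample, declared="utf-8")
print(text)
```

The decoded string should then be usable with html.fromstring in place of pg.content, so lxml no longer has to guess the encoding itself.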

Answer 1 (score: 0)

The page is incorrectly encoded.
For example:

Voil�! You will now have an airbrushed look.[...](� la Cover Girl!)

You can avoid these errors by decoding manually:

>>> pg.content.decode('utf8', errors='ignore')
u'Voil! You will now have an airbrushed look.[...]( la Cover Girl!)'
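
Note that errors='ignore' silently drops the offending bytes, which is why the "à" is missing from the output above. A small comparison of the alternatives, written for Python 3. The sample bytes are made up to mimic the review text, and cp1252 is only an assumption about the page's real encoding.

```python
# Hypothetical bytes: two cp1252-encoded "à" characters in ASCII text.
raw = b"Voil\xe0! (\xe0 la Cover Girl!)"

dropped = raw.decode("utf-8", errors="ignore")   # bad bytes vanish
marked = raw.decode("utf-8", errors="replace")   # bad bytes become U+FFFD
guessed = raw.decode("cp1252")                   # bytes read as cp1252 instead

print(repr(dropped))
print(repr(marked))
print(repr(guessed))
```

If the guessed encoding is right, the last form keeps the original characters; the first two only hide the problem.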