Question

I am using Python 2.7 and lxml. My code is as follows:

import urllib
from lxml import html

def get_value(el):
    return get_text(el, 'value') or el.text_content()

response = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Frisco/DavidMcDavidHondaofFrisco/fullsales-504210667.html').read()
dom = html.fromstring(response)

try:
    description = get_value(dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='sales-review-paragraph loose-spacing']")[0])
except IndexError, e:
    description = ''

The code crashes inside the try block with this error:

UnicodeDecodeError at /
'utf8' codec can't decode byte 0x92 in position 85: invalid start byte

The string that could not be encoded/decoded is: ouldn t

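I have tried many techniques, including .encode('utf8'), but none of them solved the problem. I have two questions:

How to solve this problem
When the problematic code is inside a try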

Answer 1

The page is being served with charset=ISO-8859-1. Decode from that to unicode.

[ Snapshot of details from a browser. Credit @Old Panda]
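A minimal sketch of what decoding from the served charset looks like. The byte string below is a made-up stand-in for the page body (not the real response), chosen only because it contains the 0x92 byte from the traceback. The cp1252 line is an assumption on my part: in ISO-8859-1 the byte 0x92 is a C1 control character, while in Windows-1252 it is a right single quotation mark, so cp1252 may give the more readable result if that is what the page really contains.

```python
# Hypothetical sample standing in for the downloaded page body;
# 0x92 is the byte reported in the traceback.
raw = b"I couldn\x92t be happier"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x92 has the bit pattern of a UTF-8 continuation byte,
    # so it cannot begin a sequence: "invalid start byte".
    utf8_ok = False

# ISO-8859-1 maps every byte to a code point, so this cannot fail.
latin1_text = raw.decode("iso-8859-1")

# Assumption: the page is really Windows-1252, where 0x92 is U+2019.
cp1252_text = raw.decode("cp1252")
```

The decoded unicode string, rather than the raw bytes, would then be what gets passed to html.fromstring.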

Answer 2

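Your except clause only handles exceptions of type IndexError. The problem is a UnicodeDecodeError, which is not an IndexError, so the exception is not handled by that except clause.

It is also not clear what 'get_value' does, and that may be where the actual problem arises.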
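To illustrate the point about the except clause, here is a small sketch. The functions risky, narrow and broad are hypothetical names invented for this example; risky only stands in for the failing expression in the question by decoding a 0x92 byte as UTF-8.

```python
def risky():
    # Stand-in for the failing expression: raises UnicodeDecodeError.
    return b"couldn\x92t".decode("utf-8")

def narrow():
    # Mirrors the question: only IndexError is caught.
    try:
        return risky()
    except IndexError:
        return ""

def broad():
    # Catches both exception types by listing them in a tuple.
    try:
        return risky()
    except (IndexError, UnicodeDecodeError):
        return ""

try:
    narrow()
    escaped = False
except UnicodeDecodeError:
    # The error passed straight through the IndexError handler.
    escaped = True

description = broad()
```

UnicodeDecodeError derives from ValueError, not from IndexError, which is why the narrow handler never sees it.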

Answer 3

- Skip the bad characters, or decode correctly to unicode.
- You only catch IndexError, not UnicodeDecodeError.

Answer 4

Decode the response to unicode, handling errors properly (ignoring them), before parsing it with html.fromstring.
Catch UnicodeDecodeError, or all errors.
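The "ignore on errors" idea can be sketched with the error-handler argument of decode. The sample bytes are again made up for illustration. The "replace" handler is shown only for comparison and is not part of the answer.

```python
# Hypothetical sample containing the offending 0x92 byte.
raw = b"I couldn\x92t be happier"

# "ignore" silently drops bytes that are not valid UTF-8.
ignored = raw.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD so the loss stays visible.
replaced = raw.decode("utf-8", "replace")
```

Ignoring errors loses the apostrophe entirely, so this is a fallback for when the real encoding is unknown; decoding with the charset the page declares keeps the character.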

Python error: 'utf8' codec can't decode byte 0x92 in position 85: invalid start byte
