Question

doc = open("1.html").read().strip()
doc = doc.decode("utf-8","ignore")

This example works fine. I get the correct unicode string doc.

doc = open("1.html").read().strip()
if u"charset=utf" in doc or u"charset=\"utf" in doc:
    doc = doc.decode("utf-8","ignore")

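This one fails with "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 289: ordinal not in range(128)". Can anyone explain this? Can the string doc be changed by a string lookup? I forgot to mention that 1.html contains Chinese words.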

Answer 1

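The problem is that you are comparing a byte string read from the file against the unicode literals u"charset=utf" and u"charset=\"utf". To compare them, Python has to convert the byte string to unicode first, before your explicit decode call ever runs, and it does that with the default ASCII codec. That implicit conversion fails on the first non-ASCII byte, here 0xe7 at position 289. The lookup itself does not change doc.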
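The implicit step can be reproduced by hand. The snippet below is only a sketch: the sample bytes are a made-up stand-in for the contents of 1.html, and the explicit decode("ascii") call approximates what Python 2 attempts behind the scenes before the `in` test.

```python
# Hypothetical sample standing in for the bytes read from 1.html.
raw = u"<meta charset=utf-8> \u4e2d\u6587".encode("utf-8")

error = None
try:
    # Roughly the conversion Python 2 performs implicitly when a byte
    # string is compared with a unicode string.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# Prints an "'ascii' codec can't decode byte ..." message.
print(error)
```

Because the decode is spelled out explicitly, the snippet should behave the same way on Python 2 and Python 3.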

The solution is to always compare byte strings with byte strings:

if "charset=utf" in doc or "charset=\"utf" in doc:
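Putting the check and the decode together might look like the following sketch. The helper name decode_if_utf8 and the sample markup are assumptions made for illustration, not part of the original code. The b"..." prefix marks the literals as byte strings explicitly; in Python 2 it is equivalent to the plain "..." literals above.

```python
def decode_if_utf8(raw):
    """Decode raw bytes as UTF-8 only if the markup declares a UTF charset.

    Hypothetical helper: both operands of `in` are byte strings, so no
    implicit ASCII decoding is triggered.
    """
    if b"charset=utf" in raw or b'charset="utf' in raw:
        return raw.decode("utf-8", "ignore")
    # No UTF charset declared: hand the bytes back untouched.
    return raw


# Made-up sample standing in for open("1.html").read().strip().
sample = u'<meta charset="utf-8"><p>\u4e2d\u6587</p>'.encode("utf-8")
doc = decode_if_utf8(sample)
```

With a real file, reading it in binary mode (open("1.html", "rb")) keeps the input as bytes, so the same helper should apply unchanged.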