Question

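I am reading the page source of a web page and then parsing a value out of that source. That is where I run into a problem with special characters.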

My Python controller file uses # -*- coding: utf-8 -*-, but the web page source I am reading uses charset=iso-8859-1.

So when I read the page content without specifying any encoding, it throws the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 133: invalid start byte

When I use string.decode("iso-8859-1").encode("utf-8"), the data is parsed without any error, but the value is displayed as 'F\u00fcnke' instead of 'Fünke'.

Please let me know how I can solve this issue. I would greatly appreciate any suggestions.

Answer 1

Encoding is a PITA in Python 3 for sure (and in some cases in 2 as well). Try checking these links, they might help you:

Python - Encoding string - Swedish Letters
Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)

http://docs.python.org/2/library/codecs.html

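It would also help to see the code for "So when I read the page content without specifying any encoding". My best guess is that your console doesn't use utf-8 (for instance, on Windows). Your # -*- coding: utf-8 -*- only tells Python which encoding the source file itself is written in, not the encoding of the data the code is going to parse or analyze. For instance, I write: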

# -*- coding: iso-8859-1 -*-
import time
# Här skriver jag ut tiden (Translation: Here, I print out the time)
print(time.strftime('%H:%M:%S'))
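The coding line there only matters for the Swedish comment in the source file. For the page data itself, here is a minimal Python 3 sketch of what I think is going on. The bytes in raw are a made-up stand-in for your fetched page, and the guess that your 'F\u00fcnke' comes from an ASCII-only serialization such as JSON is my assumption, since you haven't shown that part of the code:

```python
import json

# Hypothetical stand-in for the fetched page bytes: "Fünke" in ISO-8859-1,
# where the u-umlaut is the single byte 0xfc.
raw = b"F\xfcnke"

# Decoding with the wrong codec reproduces the reported error.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

# Decoding with the charset the page declares gives the real text.
text = raw.decode("iso-8859-1")

# Assumption: the escaped form appears because the string is later
# serialized as ASCII-only text, which is what json.dumps does by default.
escaped = json.dumps(text)
print(escaped)

# Parsing the escaped form gives back the same string, so no data is lost.
restored = json.loads(escaped)
```

If that is what happens in your case, the value is already correct and \u00fc is just an escaped spelling of ü. Whether it renders as ü then depends on how you output it and on what encoding your console or response uses.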
