Question

I have read a lot about UTF-8 encoding in Python 3 by now, but it still doesn't work and I can't find my mistake.

My code looks like this:

def main():
    # Read the whole file as UTF-8 text and print its length in characters.
    with open("test.txt", "r", encoding="utf-8") as test_file:
        text = test_file.read()
    print(len(text))


if __name__ == "__main__":
    main()

My test.txt file looks like this:

ö

I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte

Answer 1

Your file is not UTF-8 encoded. The byte 0xF6 is the encoding of ö in both Latin 1 and CP-1252, and it is not a valid start byte in UTF-8:

>>> b'\xf6'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
>>> b'\xf6'.decode('latin1')
'ö'
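To find out which encoding a file was probably saved in, it can help to look at the raw bytes and try a few candidates. The following is only a sketch: the helper name try_encodings and the list of candidate encodings are my own assumptions, not anything prescribed by Python.

```python
def try_encodings(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return a dict mapping each encoding name to the decoded text,
    or to None if the bytes cannot be decoded with that encoding."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The single byte from the error message; for a real file you would
# read the bytes with open("test.txt", "rb").read() instead.
raw = b"\xf6"
results = try_encodings(raw)
print(results)
```

A successful decode only tells you that the bytes are valid in that encoding, not that it is the encoding the author intended, so treat the result as a hint.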

You need to re-save the file as UTF-8, using whatever tool you used to create it.
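If the editor cannot do that, the conversion can also be done in Python. This is a minimal sketch that assumes the file is currently CP-1252; the function name reencode_to_utf8 and the default source encoding are assumptions for illustration. It overwrites the file in place, so keep a backup.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite the file at `path` as UTF-8, assuming it is currently
    stored in `source_encoding` (an assumption you must verify)."""
    file_path = Path(path)
    # Decode the raw bytes with the encoding the file was saved in.
    text = file_path.read_bytes().decode(source_encoding)
    # Write the same text back, this time encoded as UTF-8.
    file_path.write_bytes(text.encode("utf-8"))
    return text
```

After calling reencode_to_utf8("test.txt"), the original open("test.txt", encoding="utf-8") call should read the file without a UnicodeDecodeError, provided the source encoding guess was right.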

If open('test.txt').read() works, then the file was decoded using the default system encoding. See the open() function documentation:

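encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.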

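That does not mean you are reading the file with the correct encoding; it only means the default encoding did not break (it did not encounter bytes for which it has no character mapping). It could still be mapping those bytes to the wrong characters.

I urge you to read up on Unicode and Python: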
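A small illustration of that last point: decoding UTF-8 bytes as Latin 1 never raises an error, because every byte value maps to some Latin 1 character, but the resulting text is wrong. The variable names here are only for illustration.

```python
# "ö" encoded as UTF-8 is the two bytes 0xC3 0xB6.
utf8_bytes = "\u00f6".encode("utf-8")

# Latin 1 decodes each byte separately, so no error is raised,
# but the result is two unrelated characters.
wrong = utf8_bytes.decode("latin1")

# Decoding with the encoding that was actually used gives one character.
right = utf8_bytes.decode("utf-8")

print(repr(wrong), len(wrong))  # 'Ã¶' 2
print(repr(right), len(right))  # 'ö' 1
```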

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Python Unicode HOWTO
Pragmatic Unicode

Python 3 UTF-8 encoding really does not work
