Question

I'd like to know how to read non-ASCII characters from a file without them getting "corrupted".

Here is how to reproduce the problem:

print(open("somefile.txt").read())

somefile.txt (saved as "Unicode")

čđža

What I get is this:

ÿþ ~a

How can I get the file's original contents?

Answer 1

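You are opening the file as cp1252 (open() falls back to the platform's default encoding when you don't pass one), but you should be opening it as utf-16.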

(The ÿþ is the UTF-16LE byte order mark, bytes FF FE, misinterpreted as Windows-1252.)

>>> open('foo.txt', encoding='utf-16').read()
'čđža'
>>> open('foo.txt', encoding='cp1252').read()
'ÿþ\n\x01\x11\x01~\x01a\x00'
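
If you want to try this without the original file, here is a minimal sketch that recreates it. It assumes the file was saved as UTF-16LE with a byte order mark, which is what the ÿþ suggests; the temporary directory and the variable names are only for illustration.

```python
import codecs
import os
import tempfile

text = "čđža"

# Assumption: the editor wrote a UTF-16LE BOM (FF FE) followed by
# the text encoded as UTF-16LE.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Correct: the "utf-16" codec reads the BOM and picks the byte order.
    with open(path, encoding="utf-16") as f:
        decoded = f.read()

    # Wrong: every byte, including the BOM, becomes its own character.
    with open(path, encoding="cp1252") as f:
        garbled = f.read()

print(repr(decoded))
print(repr(garbled))
```

The first byte of č in UTF-16LE is 0x0D, a carriage return, and text mode translates it to \n. That is why the garbled string contains a newline.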

On Unix systems, you can use the file command to see what kind of file you have:

~$ file foo.txt
foo.txt: Little-endian UTF-16 Unicode text, with no line terminators

In Python, the third-party chardet library is useful for this:

>>> import chardet
>>> chardet.detect(open('foo.txt', 'rb').read())
{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
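
If installing chardet is not an option, a byte order mark can also be checked by hand using the BOM constants in the standard codecs module. This is only a rough sketch: sniff_bom is a hypothetical helper, not part of any library, and it only helps with files that actually start with a BOM. The "utf-8" fallback is an assumption you may want to change.

```python
import codecs

# Longer BOMs go first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, or `default` if none is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


# Example with the same bytes the question's file would contain.
sample = codecs.BOM_UTF16_LE + "čđža".encode("utf-16-le")
encoding = sniff_bom(sample)
print(encoding, repr(sample.decode(encoding)))
```

The returned names are codecs that consume the BOM themselves, so the result can be passed straight to open(..., encoding=...) or bytes.decode().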