Question

This is what I currently use to open the various files that users have:

# check the encoding quickly
with open(file, 'rb') as fp:
    start_data = fp.read(4)
    if start_data.startswith(b'\x00\x00\xfe\xff'):
        encoding = 'utf-32'
    elif start_data.startswith(b'\xff\xfe\x00\x00'):
        encoding = 'utf-32'
    elif start_data.startswith(b'\xfe\xff'):
        encoding = 'utf-16'
    elif start_data.startswith(b'\xff\xfe'):
        encoding = 'utf-16'
    else:
        encoding = 'utf-8'

# open the file with that encoding
with open(file, 'r', encoding=encoding) as fp:
    do_something()

Is there a better way than the above to correctly open a file in an unknown UTF encoding?

Answer 1

If you know it is UTF, you can use chardet as follows:

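from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()

with open(file, 'rb') as fp:
    # the first 1000 bytes are enough to see a BOM and sample the text
    detector.feed(fp.read(1000))
detector.close()

# result['encoding'] can be None (e.g. for an empty file), so guard before lower()
raw = (detector.result['encoding'] or '').lower()
encoding = 'utf-32' if ('utf-32' in raw) else 'utf-16' if ('utf-16' in raw) else 'utf-8'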

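Note: magic and some of the other libraries mentioned in the question "Determine the encoding of text in Python" did not work for me. Also note that a file that is actually UTF-8 will often be reported as ascii, typically because the sampled bytes are all plain ASCII. That is harmless here: ASCII is a subset of UTF-8, so falling back to 'utf-8' still decodes the file correctly.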
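
For comparison, here is a minimal standard-library sketch of the BOM check from the question, using the BOM constants in the codecs module instead of hand-written byte strings. The function name sniff_utf_encoding and its default parameter are hypothetical, chosen only for illustration. It assumes a file without a BOM is UTF-8, and it returns 'utf-8-sig' for a UTF-8 BOM so the BOM is stripped on read (the question's code does not handle that case).

```python
import codecs

# BOMs paired with the codec name that consumes them. The UTF-32 entries must
# come first because the UTF-32-LE BOM starts with the UTF-16-LE BOM bytes.
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
)


def sniff_utf_encoding(path, default='utf-8'):
    """Return a codec name based on the file's BOM, or `default` if there is none."""
    with open(path, 'rb') as fp:
        head = fp.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # Assumption: no BOM means UTF-8 (which also covers plain ASCII files).
    return default
```

It could then be used as open(path, 'r', encoding=sniff_utf_encoding(path)). This only looks at the BOM, so a UTF-16 or UTF-32 file written without one would still be misread as UTF-8; that is the case where a content-based detector such as chardet may help.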

Best way to detect (UTF) encoding in an unknown file
