Best way to detect the (UTF) encoding of an unknown file

Time: 2018-12-19 20:15:30

Tags: python csv unicode byte-order-mark

This is what I currently use to open the various files a user may have:

# check the encoding quickly
with open(file, 'rb') as fp:
    start_data = fp.read(4)
    if start_data.startswith(b'\x00\x00\xfe\xff'):
        encoding = 'utf-32'
    elif start_data.startswith(b'\xff\xfe\x00\x00'):
        encoding = 'utf-32'
    elif start_data.startswith(b'\xfe\xff'):
        encoding = 'utf-16'
    elif start_data.startswith(b'\xff\xfe'):
        encoding = 'utf-16'
    else:
        encoding = 'utf-8'

# open the file with that encoding
with open(file, 'r', encoding=encoding) as fp:
    do_something()

Is there a better way than the above to correctly open an unknown UTF file?

1 Answer:

Answer 0: (score: 0)

If you know the file is some form of UTF, you can use chardet as follows:

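from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()

with open(file, 'rb') as fp:
    # feed only the first 1000 bytes; enough to see a BOM
    detector.feed(fp.read(1000))
detector.close()

# result['encoding'] can be None when chardet cannot decide
raw = (detector.result['encoding'] or '').lower()
# anything that is not UTF-32 or UTF-16 (including 'ascii') falls back to UTF-8
encoding = 'utf-32' if 'utf-32' in raw else 'utf-16' if 'utf-16' in raw else 'utf-8'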

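Note: magic and some of the other libraries mentioned in the question "Determine the encoding of text in Python" did not work for me. Also note that a file that is actually UTF-8 will often be reported as ascii.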
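
Because ASCII is a strict subset of UTF-8, treating an "ascii" result as UTF-8 is safe. If the files are known to be UTF and only the BOM matters, a standard-library-only check may be enough. The following is a minimal sketch, not a definitive implementation: the function name sniff_utf_encoding and the _BOMS table are hypothetical, and it assumes that a file without a BOM is UTF-8. It relies on the BOM constants in the codecs module and returns 'utf-8-sig' for a UTF-8 BOM so that the BOM is stripped on read.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
)


def sniff_utf_encoding(path, default='utf-8'):
    """Return a codec name based on the file's BOM (hypothetical helper).

    Assumption: a file with no BOM is UTF-8 (which also covers plain ASCII).
    """
    with open(path, 'rb') as fp:
        head = fp.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The result can be passed straight to open(path, 'r', encoding=sniff_utf_encoding(path)); the 'utf-16' and 'utf-32' codecs read the BOM themselves to pick the byte order.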