This is what I currently use to open the various files a user might have:
# check the encoding quickly
with open(file, 'rb') as fp:
    start_data = fp.read(4)
if start_data.startswith(b'\x00\x00\xfe\xff'):
    encoding = 'utf-32'
elif start_data.startswith(b'\xff\xfe\x00\x00'):
    encoding = 'utf-32'
elif start_data.startswith(b'\xfe\xff'):
    encoding = 'utf-16'
elif start_data.startswith(b'\xff\xfe'):
    encoding = 'utf-16'
else:
    encoding = 'utf-8'
# open the file with that encoding
with open(file, 'r', encoding=encoding) as fp:
    do_something()
Is there a better way than the above to correctly open a UTF file whose exact encoding is unknown?
Answer 0 (score: 0)
If you know the file is some form of UTF, you can use chardet to do something like this:
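from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open(file, 'rb') as fp:
    # feed only the first 1000 bytes to keep detection fast
    detector.feed(fp.read(1000))
detector.close()
# result['encoding'] can be None when chardet cannot decide
raw = (detector.result['encoding'] or '').lower()
# anything that is not UTF-32 or UTF-16 (including 'ascii') falls back to UTF-8
encoding = 'utf-32' if ('utf-32' in raw) else 'utf-16' if ('utf-16' in raw) else 'utf-8'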
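Note: trying magic or some of the other libraries mentioned in the question "Determine the encoding of text in Python" did not work. Also note that a file that is in utf-8 will often be reported as ascii, which is why the code above falls back to utf-8 for anything that is not utf-16 or utf-32.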
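If you would rather avoid a third-party dependency, the BOM check from the question can be written more compactly with the constants in the standard `codecs` module. The following is only a sketch under some assumptions: the helper name `sniff_utf_encoding` and its `default` parameter are made up for illustration, and a file without a BOM is simply assumed to be UTF-8. It also maps a UTF-8 BOM to `utf-8-sig`, so the BOM is not returned as the first character of the text.

```python
import codecs

# Checked in this order on purpose: the UTF-32 LE BOM begins with the
# UTF-16 LE BOM, so the UTF-32 entries have to come first.
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
)


def sniff_utf_encoding(path, default='utf-8'):
    """Return a codec name based on the file's BOM, or `default` if there is none.

    Hypothetical helper: a file without a BOM is assumed to be `default`.
    """
    with open(path, 'rb') as fp:
        head = fp.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

You would then call `open(file, 'r', encoding=sniff_utf_encoding(file))` as before. The generic `utf-16` and `utf-32` codecs read the BOM themselves to pick the byte order, so there is no need to tell big-endian and little-endian apart. One limitation shared with the original check: a UTF-16 LE file whose first character after the BOM is NUL looks exactly like a UTF-32 LE BOM and would be misdetected.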