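I have some text files in various unknown encodings. Right now I have to open each file in binary mode to detect its encoding, and then open it again with that encoding.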
import chardet
from os.path import splitext

# First pass: read the raw bytes only to guess the encoding.
with open(f, 'rb') as bf:
    code = chardet.detect(bf.read())['encoding']
print('{} : {}'.format(f, code))

# Second pass: reopen the same file as text using the detected encoding.
with open(f, 'r', encoding=code) as source:
    texts = extractText(source.readlines())

with open(splitext(f)[0] + '_texts.txt', 'w', encoding='utf-8') as dist:
    dist.write('\n\n'.join('\n'.join(x) for x in texts))
Is there a better way to handle this?
Answer 0 (score: 2)
Instead of reopening and rereading the file, you can simply decode the bytes you have already read:
import chardet

with open(filename, 'rb') as fileobj:
    binary = fileobj.read()
# Detect once, then decode the bytes already in memory.
probable_encoding = chardet.detect(binary)['encoding']
text = binary.decode(probable_encoding)
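One caveat with decoding the detected result directly: chardet.detect() can report an encoding of None (for example on empty input), and a guess can simply be wrong, in which case decode() raises. Below is a minimal sketch of one way to guard against that. The helper name decode_detected and the choice of UTF-8 as the fallback are my own assumptions, not part of chardet, and the helper takes the detection dict as an argument so that it runs without chardet installed.

```python
def decode_detected(binary, detection, fallback="utf-8"):
    """Decode bytes using a chardet-style result dict, with a fallback.

    `detection` is assumed to look like the return value of
    chardet.detect(): a dict whose 'encoding' entry may be None.
    `decode_detected` and `fallback` are hypothetical names for this sketch.
    """
    encoding = detection.get("encoding") or fallback
    try:
        return binary.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong, or it names a codec Python does not know:
        # decode with the fallback and replace undecodable bytes
        # instead of raising.
        return binary.decode(fallback, errors="replace")


# Hand-written detection result, standing in for chardet.detect(sample).
sample = "héllo".encode("utf-8")
text = decode_detected(sample, {"encoding": None})
```

In the snippet above you would call it as decode_detected(binary, chardet.detect(binary)). If the rest of your code expects a list of lines, as readlines() gave you, text.splitlines(keepends=True) is a close substitute, though it splits on a few more line boundary characters than text-mode reading does.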