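I know questions about UTF-8 encoding and decoding have been asked many times, but I could not find an answer to mine.
I have a CSV file encoded in windows-1252 and I want to create a UTF-8 version of it. This is the script: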
import os
import sys
import inspect
import codecs
import chardet
from bs4 import UnicodeDammit
#Declare the variables
defaultencoding = 'utf-8'
filename = '19-01-2017+06-00-00.csv'
#open the file and get the content
file_obj = open(filename,"r")
content = file_obj.read()
file_obj.close()
#Check the initial encoding using both unicodeDammit and chardet
dammit = UnicodeDammit(content)
#print it
print(dammit.original_encoding)
print(chardet.detect(content)['encoding'])
#Decode from windows-1252, then re-encode as UTF-8
content_decoded = content.decode('windows-1252')
content_encoded = content_decoded.encode(defaultencoding)
#Write the result in a temporary file
file_obj = open('tmp.txt',"w")
try:
    file_obj.write(content_encoded)
finally:
    file_obj.close()
#Read the result decoded file
file_obj = open('tmp.txt', "r")
content = file_obj.read()
file_obj.close()
#Check if it is really in UTF8 using both unicodeDammit and chardet
dammit = UnicodeDammit(content)
print(dammit.original_encoding)
print(chardet.detect(content)['encoding'])
Output:
windows-1252
windows-1252
windows-1252
windows-1252
Expected output:
windows-1252
windows-1252
utf-8
utf-8
I used both chardet and UnicodeDammit because I found that chardet does not always guess the encoding correctly.
Why can't I get the file encoded in UTF-8?
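For reference, one way to make the final check deterministic is to stop guessing and attempt a strict UTF-8 decode of the raw bytes. The sketch below assumes the source file really is windows-1252. The helper names `convert_to_utf8` and `is_valid_utf8` are hypothetical and are not part of chardet or bs4. Both files are opened in binary mode so that no newline translation interferes with the bytes.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding="windows-1252"):
    """Re-encode src_path (assumed to be src_encoding) into dst_path as UTF-8."""
    # Read raw bytes; binary mode avoids platform newline translation.
    with io.open(src_path, "rb") as src:
        raw = src.read()
    # Bytes -> text using the assumed source encoding.
    text = raw.decode(src_encoding)
    # Text -> UTF-8 bytes, written back in binary mode.
    with io.open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8 without errors."""
    with io.open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False
```

With the file from the script, this could be called as `convert_to_utf8('19-01-2017+06-00-00.csv', 'tmp.txt')` followed by `is_valid_utf8('tmp.txt')`. A successful strict decode only shows that the bytes are well-formed UTF-8. A detector may still report windows-1252, because almost any byte sequence is also valid in that encoding, so the two checks answer different questions.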