Question

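When I use CountVectorizer in sklearn, it requires the input files to be encoded as Unicode, but my data files are ANSI-encoded.

I tried changing the encoding to Unicode with Notepad++ and then reading the file with readlines, but it could not read all the lines, only the last one. After that, I tried reading the lines from each data file and writing them to a new file as Unicode, but that failed.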

import codecs
import os

def merge_file():
    root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
    resname='resule_final.txt'
    if os.path.exists(resname):
        os.remove(resname)
    result = codecs.open(resname,'w','utf-8')
    num = 1
    for back_name in os.listdir(r'd:\\workspace\\minibatchk-means\\data\\20_newsgroups'):
        current_dir = root_dir + str(back_name)
        for filename in os.listdir(current_dir):
            print num ,":" ,str(filename)
            num = num+1
            path=current_dir + "\\" +str(filename)
            source=open(path,'r')
            line = source.readline()
            line = line.strip('\n')
            line = line.strip('\r')

            while line !="":
                line = unicode(line,"gbk")
                line = line.replace('\n',' ')
                line = line.replace('\r',' ')
                result.write(line + ' ')
                line = source.readline()
            else:
                print 'End file :'+ str(filename)
                result.write('\n')
                source.close()
    print 'End All.'
    result.close()

The error message is: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 0-1: illegal multibyte sequence

Answer 1

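Oh, I found a way. First, use chardet to detect the encoding of the string. Second, use codecs to read from or write to a file in a specific encoding. Here is the code.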

import chardet
import codecs
import os

root_dir = "d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
num = 1
failed = []  # files whose encoding chardet could not detect
for back_name in os.listdir(root_dir):
    current_dir = os.path.join(root_dir, back_name)
    for filename in os.listdir(current_dir):
        print num, ":", filename
        num += 1
        path = os.path.join(current_dir, filename)
        # Read raw bytes; binary mode avoids newline translation on Windows
        with open(path, 'rb') as source:
            content = source.read()
        source_encoding = chardet.detect(content)['encoding']
        if source_encoding is None:
            print '??', filename
            failed.append(filename)
        elif source_encoding.lower() != 'utf-8':
            # Decode with the detected encoding, dropping undecodable bytes,
            # then overwrite the file in place as UTF-8
            text = content.decode(source_encoding, 'ignore')
            with codecs.open(path, 'w', encoding='utf-8') as target:
                target.write(text)
print failed
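
For readers on Python 3, the same convert-in-place idea can be sketched with the standard library alone. This is only a sketch under stated assumptions: instead of chardet, it guesses the encoding by trial decoding against a candidate list (utf-8 first, then cp1252, which is what "ANSI" usually means on a Western-locale Windows machine). The names decode_bytes, convert_file_to_utf8, and CANDIDATE_ENCODINGS are hypothetical helpers, not part of any library. Trial decoding is a heuristic and cannot tell similar single-byte encodings apart, so chardet.detect may still be the better guess step for mixed data.

```python
# Assumed candidate list, strictest first; adjust it for your own data.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def decode_bytes(raw, candidates=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this fallback never fails.
    return raw.decode("latin-1"), "latin-1"


def convert_file_to_utf8(path, candidates=CANDIDATE_ENCODINGS):
    """Rewrite the file at path as UTF-8 and return the encoding it was read as."""
    # Read raw bytes so no decoding or newline translation happens implicitly.
    with open(path, "rb") as source:
        raw = source.read()
    text, enc = decode_bytes(raw, candidates)
    if enc != "utf-8":
        # newline="" writes line endings exactly as they appear in the text.
        with open(path, "w", encoding="utf-8", newline="") as target:
            target.write(text)
    return enc
```

Calling convert_file_to_utf8(path) for each file found with os.listdir would mirror the loop above. Unlike the 'ignore' error handler, this version never silently drops bytes, at the cost of possibly picking a wrong but decodable encoding.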

Thanks for all your help.

How to convert ANSI encoding to Unicode