Unicode error: 'ascii' codec can't encode character

Asked: 2016-03-05 19:34:37

Tags: python python-2.7 csv unicode encoding

I am trying to import a CSV file to train my classifier, but I keep getting this error:

Traceback (most recent call last):
  File "updateClassif.py", line 17, in <module>
    myClassif = NaiveBayesClassifier(fp, format="csv")
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 191, in __init__
    super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 123, in __init__
    self.train_set = self._read_data(train_set, format)
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 143, in _read_data
    return format_class(dataset, **self.format_kwargs).to_iterable()
  File "C:\Python27\lib\site-packages\textblob\formats.py", line 68, in __init__
    self.data = [row for row in reader]
  File "C:\Python27\lib\site-packages\textblob\unicodecsv\__init__.py", line 106, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 55: ordinal not in range(128)

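The CSV file contains 1,600,000 lines of tweets, so I assume some of the tweets contain special characters. I tried re-saving the file with OpenOffice, as was recommended, but the result is the same. I also tried opening it with the latin-1 encoding, again with the same result. Here is my code: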

with codecs.open('tr.csv', 'r', encoding='latin-1') as fp:
    myClassif = NaiveBayesClassifier(fp, format="csv")

This is the code from the library I am using:

    def __init__(self, csvfile, fieldnames=None, restkey=None, restval=None,
                 dialect='excel', encoding='utf-8', errors='strict', *args,
                 **kwds):
        if fieldnames is not None:
            fieldnames = _stringify_list(fieldnames, encoding)
        csv.DictReader.__init__(self, csvfile, fieldnames, restkey, restval, dialect, *args, **kwds)
        self.reader = UnicodeReader(csvfile, dialect, encoding=encoding,
                                    errors=errors, *args, **kwds)
        if fieldnames is None and not hasattr(csv.DictReader, 'fieldnames'):
            # Python 2.5 fieldnames workaround. (http://bugs.python.org/issue3436)
            reader = UnicodeReader(csvfile, dialect, encoding=encoding, *args, **kwds)
            self.fieldnames = _stringify_list(reader.next(), reader.encoding)
        self.unicode_fieldnames = [_unicodify(f, encoding) for f in
                                   self.fieldnames]
        self.unicode_restkey = _unicodify(restkey, encoding)

    def next(self):
        row = csv.DictReader.next(self)
        result = dict((uni_key, row[str_key]) for (str_key, uni_key) in
                      izip(self.fieldnames, self.unicode_fieldnames))
        rest = row.get(self.restkey)

2 Answers:

Answer 0 (score: 0)

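Note that the traceback says EncodeError, not DecodeError. It looks like NaiveBayesClassifier is expecting ASCII. Either make it accept Unicode or, if that is acceptable for your application, replace the non-ASCII characters with '?' or something similar.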
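The replacement option can be sketched as follows. This is only an illustration, not code from the answer or from textblob, and the helper name `to_ascii` is hypothetical. It relies on the `'replace'` error handler of `encode`, which substitutes `?` for every character that ASCII cannot represent.

```python
def to_ascii(text):
    """Return text with every non-ASCII character replaced by '?'."""
    # encode(..., 'replace') swaps each unencodable character for '?';
    # decoding the result gives back a text string that is pure ASCII.
    return text.encode('ascii', 'replace').decode('ascii')

# u'\xe6' is the character named in the traceback.
print(to_ascii(u'caf\xe6 time'))
```

This discards information, so it is only suitable if the classifier does not need the original characters.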

Answer 1 (score: 0)

In Python 2, the csv module does not support unicode. So you have to pass in some kind of iterator object (such as a file) that yields only byte strings.

This means your code should look like this:

with open('tr.csv', 'rb') as fp:
    myClassif = NaiveBayesClassifier(fp, format="csv")

Note, however, that the CSV file must be encoded as UTF-8. If it is not, you will need to convert it to UTF-8 first for the code above to work.
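One way to do that conversion is sketched below. It assumes the source file is Latin-1, the encoding tried in the question; the function name `convert_to_utf8` and the output file name `tr_utf8.csv` are hypothetical. `io.open` is used because it behaves the same way in Python 2.7 and Python 3.

```python
import io

def convert_to_utf8(src_path, dst_path, src_encoding='latin-1'):
    """Re-encode a text file to UTF-8, streaming it line by line."""
    # newline='' turns off newline translation, so line endings
    # (including any inside quoted CSV fields) are copied unchanged.
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as src:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
            for line in src:
                dst.write(line)

# Example call (file names are assumptions):
# convert_to_utf8('tr.csv', 'tr_utf8.csv')
```

The converted file can then be opened in 'rb' mode as in the code above. Because Latin-1 assigns a character to every possible byte, decoding with it never raises an error; if the file is actually in some other encoding, the output will be silently wrong instead of failing.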