UnicodeEncodeError when using DictWriter with UTF-8

Asked: 2012-08-03 17:24:43

Tags: unicode utf-8 python-2.7 export-to-csv

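I'm trying to write a dictionary containing UTF-8 strings to a CSV file. I followed the instructions here. However, despite carefully encoding and decoding these UTF-8 strings, I get a UnicodeEncodeError involving the 'ascii' codec.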

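I have a list of dictionaries whose values are strings and integers related to changes to Wikipedia articles. The list below corresponds to this change, for example: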

edgelist = [{'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'}, 
{'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]

The problem is edgelist[1]['editorName']. It is of type 'str', and edgelist[1]['editorName'].decode('utf-8') returns u'Eep\xb2'.

The code I'm trying is:

import codecs
import csv

_ENCODING = 'utf-8'
def dictToCSV(edgelist,output_file):
    with codecs.open(output_file,'wb',encoding=_ENCODING) as f:
        w = csv.DictWriter(f,sorted(edgelist[0].keys()))
        w.writeheader()
        for d in edgelist:
            for k,v in d.items():
                if type(v) == int:
                    d[k]=str(v).encode(_ENCODING)
            w.writerow({k:v.decode(_ENCODING) for k,v in d.items()})

This returns:

dictToCSV(edgelist,'test2.csv')
File "csv_to_charts.py", line 129, in dictToCSV
w.writerow({k:v.decode(_ENCODING,'ignore') for k,v in d.items()})
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb2' in position 3: ordinal not in range(128)

Other permutations, such as swapping decode for encode or applying neither in the final problematic line, also return errors:

  1. w.writerow({k:v.encode(_ENCODING) for k,v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
  2. w.writerow({k:v for k,v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
  3. Following this, I changed with codecs.open(output_file,'wb',encoding=_ENCODING) as f: to with open(output_file,'wb') as f:, but I still get the same error.
  4. If I exclude the list element or the key containing this problematic string, the script works fine.

3 answers:

Answer 0: (score: 3)

I just wrote the code below, and it wrote the CSV successfully.

from django.utils.encoding import smart_str
import csv

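def dictToCSV(edgelist, output_file):
    # Binary mode: the Python 2 csv module writes byte strings.
    with open(output_file, 'wb') as f:
        w = csv.DictWriter(f, fieldnames=sorted(edgelist[0].keys()))
        w.writeheader()
        for d in edgelist:
            # On Python 2, smart_str converts each value to a UTF-8 byte string.
            w.writerow(dict((k, smart_str(v)) for k, v in d.items()))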

If you don't use Django, copy the Django code and customize it as needed.
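
For readers without Django, here is a minimal sketch of what such a customized copy might look like. The name to_utf8_bytes is hypothetical and this is not Django's implementation; it only assumes that text should become UTF-8 bytes, that byte strings pass through unchanged, and that other values (such as ints) are converted to text first.

```python
def to_utf8_bytes(value, encoding='utf-8'):
    """Hypothetical stand-in for smart_str: always return a byte string."""
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    if isinstance(value, bytes):
        # Already a byte string (str on Python 2): leave it untouched.
        return value
    if not isinstance(value, text_type):
        # Ints and other objects are turned into text first.
        value = text_type(value)
    return value.encode(encoding)


row = {'editorName': u'Eep\xb2', 'bytesAdded': 107, 'revID': b'121862749'}
encoded_row = dict((k, to_utf8_bytes(v)) for k, v in row.items())
```

On Python 2, a row prepared this way contains only byte strings, which is what csv.DictWriter expects there.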

Answer 1: (score: 0)

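A strict interpretation of the ASCII encoding allows only ordinals 0-127. By definition, any value outside that range is not ASCII. Since \xc2 and \xb2 have ordinals above 127, they cannot be interpreted as ASCII.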
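
As a small illustration of that point, the sketch below inspects the byte string from the question. The variable names are only for illustration; the b'' prefix is used so the snippet behaves the same on Python 2.7 and Python 3.

```python
raw = b'Eep\xc2\xb2'  # the editorName bytes from the question

# bytearray yields integer ordinals on both Python 2 and Python 3.
ordinals = list(bytearray(raw))
non_ascii = [b for b in ordinals if b > 127]

try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    # Bytes above 127 cannot be decoded with the ascii codec.
    ascii_ok = False

# The same bytes decode cleanly when treated as UTF-8.
decoded = raw.decode('utf-8')
```

Here non_ascii holds the two bytes 0xc2 and 0xb2, which together form the UTF-8 encoding of the single character u'\xb2'.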

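I'm not a Python user. The RFC for CSV mentions ASCII as common usage but defines an optional 'charset' parameter for the MIME type; I wonder whether the writer you are using also has an 'encoding' setting?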

Answer 2: (score: 0)

Your strings are already UTF-8, and DictWriter does not work with codecs.open. Following that example:

# coding: utf-8
import csv

edgelist = [
    {'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
    {'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]

with open('out.csv','wb') as f:
    f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
    w = csv.DictWriter(f,sorted(edgelist[0].keys()))
    w.writeheader()
    for d in edgelist:
        w.writerow(d)

Output:

articleName,bytesAdded,editorName,revID
Barack Obama,183,Schonbrunn,121844749
Barack Obama,107,Eep²,121862749

Note that you can use 'editorName': 'Eep²' directly instead of 'editorName': 'Eep\xc2\xb2'. If the source file is saved as UTF-8, the byte string will be UTF-8 encoded, as declared by # coding: utf-8.
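
To make the byte-level claims above concrete, here is a short check. It is a sketch with illustrative variable names, written to run on both Python 2.7 and Python 3: it confirms the UTF-8 bytes for 'Eep²', the bytes of the optional BOM, and that the ascii codec rejects the character at position 3, as in the question's traceback.

```python
# -*- coding: utf-8 -*-
text = u'Eep\xb2'  # the same text as u'Eep²'

# UTF-8 encodes the superscript two as the two bytes \xc2\xb2.
utf8_bytes = text.encode('utf-8')

# The BOM that the answer writes at the start of the file.
bom = u'\ufeff'.encode('utf-8')

try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    # Keep the exception so its position can be inspected.
    ascii_error = exc
```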