Python - Unable to parse a UTF-8 CSV file

Time: 2013-11-24 09:55:35

Tags: python csv unicode utf-8

I am trying to parse a CSV file with the csv module, but it does not handle UTF-8 encoded data.

So I tried the helper functions suggested in the documentation:

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

However, if I try to use it like this:

with open(u'spam1.csv', 'rb') as csvfile:
    spamreader = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        print row

I get this error:

yield line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 15: ordinal not in range(128)

Yet if I open the same file in LibreOffice, it opens correctly as a UTF-8 encoded CSV file.

1 Answer:

Answer 0: (score: 3)

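That code is written for unicode values; that is, you need to decode your data to unicode before passing it to the replacement reader. A file opened in 'rb' mode yields byte strings, and in Python 2 calling .encode('utf-8') on a byte string first decodes it implicitly with the ASCII codec, which fails on the first non-ASCII byte (0xc2 here).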
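To make that implicit step visible, here is a minimal sketch that performs the ASCII decode explicitly. The sample line and the helper name explicit_ascii_decode are made up for illustration and are not from the question; the snippet is written so that it runs on both Python 2 and Python 3.

```python
# A byte string such as a file opened in 'rb' mode would yield.
# 0xc2 0xa0 is the UTF-8 encoding of U+00A0 (no-break space).
# The sample content is invented for this illustration.
raw_line = b'spam,eggs\xc2\xa0ham\n'


def explicit_ascii_decode(data):
    """Hypothetical helper: decode bytes as ASCII, returning the error if it fails.

    This is the step Python 2 performs implicitly when .encode() is
    called on a byte string.
    """
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as exc:
        return exc


# Decoding as ASCII fails on the 0xc2 byte ...
result = explicit_ascii_decode(raw_line)

# ... while decoding with the file's real encoding succeeds.
decoded = raw_line.decode('utf-8')
```

Here result holds the same kind of UnicodeDecodeError shown in the question, while decoded is a proper Unicode string that could safely be passed to unicode_csv_reader.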

Use io.open() to read the data as Unicode:

import io

with io.open(u'spam1.csv', 'r', encoding='utf8') as csvfile:
    spamreader = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        print row

This basically encodes the unicode values to UTF-8 temporarily so that the csv module can process them.

Since your data is already encoded as UTF-8, you can also get away with:

with open(u'spam1.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        row = [unicode(cell, 'utf-8') for cell in row]

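This decodes the cells of each row directly from UTF-8, skipping the round trip of first decoding to Unicode, re-encoding to UTF-8 bytes, and then decoding again.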
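The per-cell decode can also be factored into a small helper. This is only a sketch: decode_row is a hypothetical name, the sample row is invented, and cell.decode('utf-8') is used in place of unicode(cell, 'utf-8') (the two are equivalent for Python 2 byte strings) so that the snippet also runs on Python 3.

```python
def decode_row(row, encoding='utf-8'):
    """Hypothetical helper: decode every byte-string cell of a parsed row."""
    return [cell.decode(encoding) for cell in row]


# Invented sample row, shaped like what csv.reader yields in Python 2
# for a file opened in 'rb' mode: a list of UTF-8 encoded byte strings.
sample_row = [b'caf\xc3\xa9', b'spam']
decoded_row = decode_row(sample_row)
```

Inside the loop above, row = decode_row(row) would then replace the list comprehension.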