Question

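I have been trying for 2 days to find the error in the following code, which converts an xls file to a CSV file. My problem is that some characters in the output CSV are not correctly encoded (é, à, etc.). I have read a lot of posts on SOF but I can't find a solution. I know the problem comes from the csv module, which only handles ASCII or UTF-8, but I don't know how to deal with it. I also tried the replacement module unicodecsv, without success. I know there are some Unicode examples in the csv documentation, but I don't know the right way to use them.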

I am sure my xls is encoded as utf_16_le (that is what workbook.encoding reports).

Here is the code, which I found on SOF. I have tried many modifications without success. Can someone tell me which part of the code has to change?

#!/usr/bin/env python
# -*- coding: utf8 -*-
import xlrd
from os import sys
import csv 


def csv_from_excel(excel_file):

    workbook = xlrd.open_workbook(excel_file)
    print workbook.biff_version, workbook.codepage, workbook.encoding
    # test read of an accented character
    rs = workbook.sheet_by_index(0)
    print rs.cell_value(1,0)

    all_worksheets = workbook.sheet_names()
    for worksheet_name in all_worksheets:
        worksheet = workbook.sheet_by_name(worksheet_name)
        your_csv_file = open(''.join([worksheet_name,'.csv']), 'wb')

        class ExcelFr(csv.excel):
            # Field separator
            delimiter = ";"

        csv.register_dialect('excel-fr', ExcelFr())

        wr = csv.writer(your_csv_file,'excel-fr', quoting=csv.QUOTE_ALL)

        for rownum in xrange(worksheet.nrows):
            wr.writerow([unicode(entry).encode("utf-8") for entry in worksheet.row_values(rownum)])

        your_csv_file.close()

#if __name__ == "__main__":
#    csv_from_excel(sys.argv[1])

csv_from_excel("source-2014-02-12.xls")

EDIT: new code, which converts only the first sheet (I don't need the other sheets).

#!/usr/bin/env python
# -*- coding: utf8 -*-
import xlrd
import unicodecsv
import codecs

def csv_from_excel(excel_file):

    wb = xlrd.open_workbook(excel_file)
    print wb.biff_version, wb.codepage, wb.encoding
    sh = wb.sheet_by_name('Feuil1')
    print sh.row_values(8)
    #your_csv_file = open('your_csv_file.csv', 'wb')
    your_csv_file = codecs.open('your_csv_file.csv','wb')

    class ExcelFr(unicodecsv.excel):
        # Field separator
        delimiter = ";"

    unicodecsv.register_dialect('excel-fr', ExcelFr())

    wr = unicodecsv.writer(your_csv_file,'excel-fr',encoding='utf-8', quoting=unicodecsv.QUOTE_ALL)

    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))
        #wr.writerow([unicode(entry).encode("utf-8") for entry in sh.row_values(rownum)])

    your_csv_file.close()

csv_from_excel("source-2014-02-13.xls")


reader = unicodecsv.reader("your_csv_file.csv")
print reader.encoding

Output:

80 1200 utf_16_le
[u'Chaise de massage ergonomique pliante', u'Facile \xe0 monter et ajustable \xe0 tout gabarit et pour tout traitement du haut du corps comme la t\xeate, le dos, les \xe9paules et les bras. Le soutien pour la t\xeate est amovible et ajustable comme l\u2019assise et l\u2019accoudoir. Le massage sur chaise est une mani\xe8re tr\xe8s efficace de stimuler la circulation du sang, de l\u2019\xe9nergie et permet au corps de retrouver un certain \xe9quilibre. A noter que la chaise peut \xe9galement \xeatre utilis\xe9e comme chaise de tatouage. ', u'Fauteuil de massage blanc, pliant et facile \xe0 transporter ..... etc etc
UTF-8

As you can see, I get things like '\xe0' or '\u2019'.

I still don't understand this whole encoding mess!

Answer 1

In your case, this is what is wrong:

your_csv_file = open(''.join([worksheet_name,'.csv']), 'wb')

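The standard Python open() function, called this way, opens a binary file, so you have to make sure yourself that the data is encoded correctly. You should import the codecs module and use: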

your_csv_file = codecs.open(''.join([worksheet_name,'.csv']), 'w', 'utf-8')

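I agree with you that unicode(entry).encode("utf-8") should have the same effect.

If my suggestion does not work, then you need to tell us why you think that "some characters are not correctly encoded".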
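For readers on Python 3, where the csv module works on text rather than bytes, the same idea can be sketched as follows. This is only an illustration under stated assumptions, not the asker's script: the helper name write_rows_as_csv, the sample rows and the temporary output path are made up for the example, and the rows stand in for what worksheet.row_values() would return.

```python
import csv
import os
import tempfile


def write_rows_as_csv(rows, path, delimiter=";"):
    """Write rows (lists of values) to path as UTF-8, quoting every field."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter, quoting=csv.QUOTE_ALL)
        for row in rows:
            writer.writerow(row)


# Hypothetical sample data standing in for worksheet.row_values(rownum).
sample_rows = [
    ["Chaise de massage", "Facile \xe0 monter"],
    ["l\u2019assise", "tr\xe8s efficace"],
]

csv_path = os.path.join(tempfile.mkdtemp(), "your_csv_file.csv")
write_rows_as_csv(sample_rows, csv_path)

# Read the file back as UTF-8 to check that the accented characters survived.
with open(csv_path, encoding="utf-8", newline="") as handle:
    round_trip = list(csv.reader(handle, delimiter=";"))

print(round_trip == sample_rows)
```

The encoding is given once, when the file is opened, so no per-cell encode() call is needed. A spreadsheet program that opens the result may still have to be told that the file is UTF-8.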

Answer 2

It looks like you simply don't understand what you are seeing.

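Open IDLE and type

print u" mani\xe8re tr\xe8s"

\x## is just the hex escape for a character that has no ASCII representation.

print u"l\u2019assise et l\u2019accoudoir"

will demonstrate that \u#### is likewise just a Unicode character that has no ASCII representation.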
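The difference between the escaped form and the actual characters can also be checked in a script. The sketch below assumes Python 3, where ascii() produces the kind of escaped representation that Python 2 shows when a list of unicode strings is printed; the variable names are chosen only for illustration.

```python
text = "mani\xe8re tr\xe8s, l\u2019assise"

# ascii() escapes every non-ASCII character, much like the list output
# in the question; the string itself is not changed by this.
escaped = ascii(text)
print(escaped)

# The escape \xe8 and the literal character are the same thing.
print("\xe8" == "è")

# Encoding to UTF-8 turns each accented character into several bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

On a terminal that supports UTF-8, print(text) would display the accented characters themselves rather than the escapes.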

Unicode to CSV from xls
