Python unicode decoding doesn't work for a CSV exported from Outlook

Date: 2014-01-06 03:03:00

Tags: python encoding

Hello, I exported my Outlook contacts to a CSV file and loaded it in the Python shell.

The list contains many European names, for example

tmp = 'Fern\xc3\x9fndez'
tmp.encode("latin-1")

raises the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

tmp.decode('latin-1')

gives me

u'Fern\xc3\x9fndez'

How can I read the text as Fernandez? (I'm not too worried about the accents, but I'd be happy to keep them.)

1 answer:

Answer 0: (score: 1)

You must be using Python 2.x. Here is one way to print the characters (the result depends on which encoding you decode with):

>>> tmp = 'Fern\xc3\x9fndez'
>>> print tmp.decode('utf-8')  # print formats the string for stdout
Fernßndez
>>> print tmp.decode('latin1')
FernÃndez

Are you sure the characters are right? Is the data really UTF-8? Another way:

>>> print unicode(tmp, 'latin1')
FernÃndez

>>> print unicode(tmp, 'utf-8')
Fernßndez
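
For readers on Python 3, the same comparison can be sketched as below. This is an illustration rather than part of the original answer: it assumes the value is available as a bytes object. In Python 3 bytes have no encode method at all, so the implicit ASCII decode that produced the asker's UnicodeDecodeError in Python 2 cannot happen.

```python
# Python 3 sketch: the sample value from the question as a bytes literal.
raw = b'Fern\xc3\x9fndez'

as_utf8 = raw.decode('utf-8')      # 0xC3 0x9F is read as one character, U+00DF
as_latin1 = raw.decode('latin-1')  # every byte becomes its own character

print(as_utf8)
print(repr(as_latin1))  # repr() makes the invisible U+009F control character visible
```

The latin-1 result is one character longer than the UTF-8 result, which is why it prints with a stray "Ã" in the transcripts above.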

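Interesting. So none of these options works for you? By the way, I ran the string through a number of other encodings to see whether any of them produced characters closer to what I expected. Unfortunately, none of them looks right: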

>>> for encoding in ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8']:
    try:
        print encoding + ': ' + tmp.decode(encoding)
    except (UnicodeError, LookupError):  # codec cannot handle these bytes, or is not available
        pass

cp037: ãÁÊ>C¤>ÀÁ:
cp437: Fern├ƒndez
cp500: ãÁÊ>C¤>ÀÁ:
cp737: Fern├θndez
cp775: Fern├¤ndez
cp850: Fern├ƒndez
cp852: Fern├čndez
cp855: Fern├Ъndez
cp857: Fern├şndez
cp860: Fern├Óndez
cp861: Fern├ƒndez
cp862: Fern├ƒndez
cp863: Fern├ƒndez
cp865: Fern├ƒndez
cp866: Fern├Яndez
cp869: Fern├ίndez
cp875: ΖΧΈ>Cμ>ΦΧ:
cp932: Fernテ殤dez
cp949: Fern횩ndez
cp1006: Fernﺣndez
cp1026: ãÁÊ>C¤>ÀÁ:
cp1140: ãÁÊ>C€>ÀÁ:
cp1250: FernĂźndez
cp1251: FernГџndez
cp1252: FernÃŸndez
cp1254: FernÃŸndez
cp1256: Fernأںndez
cp1258: FernĂŸndez
gbk: Fern脽ndez
gb18030: Fern脽ndez
latin_1: FernÃndez
iso8859_2: FernĂndez
iso8859_4: FernÃndez
iso8859_5: FernУndez
iso8859_6: Fernأndez
iso8859_7: FernΓndez
iso8859_9: FernÃndez
iso8859_10: FernÃndez
iso8859_13: FernĆndez
iso8859_14: FernÃndez
iso8859_15: FernÃndez
koi8_r: Fernц÷ndez
koi8_u: Fernц÷ndez
mac_cyrillic: Fern√Яndez
mac_greek: FernΟündez
mac_iceland: Fern√ündez
mac_latin2: Fern√ündez
mac_roman: Fern√ündez
mac_turkish: Fern√ündez
ptcp154: FernГҹndez
shift_jis: Fernテ殤dez
shift_jis_2004: Fernテ殤dez
shift_jisx0213: Fernテ殤dez
utf_16: 敆湲鿃摮穥
utf_16_be: 䙥牮쎟湤敺
utf_16_le: 敆湲鿃摮穥
utf_8: Fernßndez
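
No single decoding in the list yields a plausible name. One possible explanation, which is an assumption and not something the asker confirmed, is that the text was mis-decoded once before it reached Python: cp437 maps byte 0xE1 to "ß", while cp1252 maps the same byte to "á". If the name was originally saved as cp1252, read as cp437, and then written out as UTF-8, the steps can be reversed. The sketch below uses Python 3; repair_name and strip_accents are hypothetical helper names, and the two codec defaults are guesses.

```python
import unicodedata


def repair_name(raw, wrong_codec='cp437', likely_codec='cp1252'):
    """Undo one assumed round of mis-decoding (hypothetical helper)."""
    text = raw.decode('utf-8')           # the string as it currently displays
    original = text.encode(wrong_codec)  # back to the bytes that were misread
    return original.decode(likely_codec)


def strip_accents(text):
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


name = repair_name(b'Fern\xc3\x9fndez')
print(name)                 # Fernández
print(strip_accents(name))  # Fernandez
```

The encode step raises UnicodeEncodeError for any character that wrong_codec cannot represent, so a name that fails here was probably not damaged this way and is better left untouched.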