Question

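I always get confused when dealing with Unicode and UTF-8 characters and encodings in Python. There is probably a simple explanation for what I describe below, but so far I have not been able to wrap my head around it.

Suppose I have a very simple .csv file containing non-ASCII characters:

tildes.csv: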

Año,Valor
2001,Café
2002,León

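I want to read that file with a csv.DictReader object and store its keys/values as unicode strings, handled properly (unescaped) in a Python dict. I have seen Tornado and Django handle unicode key/value sets correctly, so I said to myself: "Yes, I can do that too!" ... but no ... it looks like I can't.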

import csv

with open('tildes.csv', 'r') as csv_f:
    reader = csv.DictReader(csv_f)
    for dct in reader:
        print "dct (original): %s" % dct
        for k, v in dct.items():
            print '%s: %s' % (unicode(k, 'utf-8'), unicode(v, 'utf-8'))
        utf_dct = dict((unicode(k, 'utf-8'), unicode(v, 'utf-8')) \
                  for k, v in dct.items())
        print utf_dct

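So, I thought: OK, I read a dict from the file (its keys are Año and Valor), and they will be loaded as ASCII with escaped characters, but afterwards I can encode them to unicode values and use them as keys... Wrong!

This is what I see when I run the code above: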

dct (original): {'A\xc3\xb1o': '2001', 'Valor': 'Caf\xc3\xa9'}
Año: 2001
Valor: Café
{u'A\xf1o': u'2001', u'Valor': u'Caf\xe9'}
dct (original): {'A\xc3\xb1o': '2002', 'Valor': 'Le\xc3\xb3n'}
Año: 2002
Valor: León
{u'A\xf1o': u'2002', u'Valor': u'Le\xf3n'}

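So the first line shows the dict "as is" (escaped). Fine, nothing odd here. Then I print all the keys/values converted to unicode, and the characters are displayed the way I want. Also fine. But when I build a dict (the utf_dct variable) using the exact same expressions I used to convert the strings for printing, and then print that dict, the values come out escaped again.

Edit 1:

Actually, I don't think I even need a csv file to show what I mean. I just tried this in my console: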

>>> print "Año"
Año                      # Yeey!! There's hope!
>>> print {"Año": 2001}
{'A\xc3\xb1o': 2001}     # 2 chars --> Ascii, I think I get this part 
>>> print {u"Año": 2001}
{u'A\xf1o': 2001}        # What happened here? 
                         # Why am I seeing the 0x00F1 UTF-8 code 
                         # from the Latin-1 Supplement (wiki:
                         # http://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
                         # instead of an ñ?

Why can't I print a dict that shows up as {u'Año': 2001}? My terminal clearly accepts the character. What is going on here?

Answer 1

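When you print a string by itself, it is printed "nicely" using its str() representation. When you print a dict, its contents are printed using their repr() representation, which in Python 2 always escapes non-ASCII characters. In both cases the contents of the string are the same; Python just displays them differently. It is the same reason why no quotes are printed around Año in the first case, while quotes are printed around 'A\xc3\xb1o' in the second. They are simply two different display formats.

Here is a simpler example that may help illustrate the situation: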

>>> import unicodedata
>>> unicodedata.name(u'\u00f1') # U+00F1 is the Unicode code point for this character
'LATIN SMALL LETTER N WITH TILDE'
>>> print u'\u00f1' # printing the string itself shows a displayable character
ñ
>>> print repr(u'\u00f1') # repr() gives an escaped representation
u'\xf1'
>>> # str(u'\u00f1') raises UnicodeEncodeError under Python 2's default ASCII codec, so encode explicitly
>>> print repr(u'\u00f1'.encode('utf-8')) # repr() of the UTF-8 byte string shows its two bytes -- this is what happens when showing a dict
'\xc3\xb1'
>>> len(u'\u00f1'.encode('utf-8')) # the byte string is two bytes long (UTF-8 encoded)
2
>>> len(repr(u'\u00f1')) # the repr() is 7 characters long (`u`, `'`, `\`, `x`, `f`, `1`, `'`)
7
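
The same split between str() and repr() applies to any object, not only to text. As a minimal sketch (written for Python 3, with a hypothetical Label class that is not part of the question), printing an object directly uses its __str__, while a dict that holds it uses __repr__ for each element:

```python
class Label(object):
    """Hypothetical class, used only to illustrate str() versus repr()."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # Human-friendly form, used by print(obj) and str(obj).
        return self.text

    def __repr__(self):
        # Unambiguous form, used when the object sits inside a container.
        return "Label(%r)" % self.text


label = Label("ano")
direct = str(label)                 # what print(label) would show
inside_dict = str({"key": label})   # what print({"key": label}) would show

print(direct)
print(inside_dict)
```

Even though str() is called on the dict, the dict formats its contents with repr(), which is why the quoted, escaped form appears in the question's output.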

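There is a related bug report suggesting that this behavior be changed so that repr does not escape non-ASCII characters. According to that bug report, the change was made in Python 3, so the tools you saw doing this may be running on Python 3.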
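
For comparison, here is a rough sketch of how the question's file might be read under Python 3 (an assumption, since the question's code is Python 2). It recreates tildes.csv in a temporary directory, reads it with csv.DictReader, and compares repr() with the built-in ascii(), which produces an escaped form close to what Python 2's repr() printed:

```python
import csv
import os
import tempfile

# Recreate tildes.csv from the question in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "tildes.csv")
with open(path, "w", encoding="utf-8", newline="") as csv_f:
    csv_f.write("A\u00f1o,Valor\r\n2001,Caf\u00e9\r\n2002,Le\u00f3n\r\n")

# In Python 3 the file is decoded while reading, so keys and values are already str.
with open(path, "r", encoding="utf-8", newline="") as csv_f:
    rows = [dict(row) for row in csv.DictReader(csv_f)]

readable = repr(rows[0])   # non-ASCII characters are kept as they are
escaped = ascii(rows[0])   # escaped form, close to the Python 2 repr() output

print(escaped)
```

On a UTF-8 terminal, print(readable) would show the accented characters directly, without any manual decoding step.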

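Individual tools can also display whatever they like. A tool does not have to just call str(someDict) and display the result; if it wants, it can call str on the dict's contents "by hand" and build its own displayable version from them.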
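
One possible way to do this by hand, sketched for Python 3 with a hypothetical helper named render_dict (the name and the output format are illustrative, not taken from Tornado, Django, or any other tool):

```python
def render_dict(mapping):
    """Build a display string from str() of each key and value, not repr()."""
    parts = ["%s: %s" % (key, value) for key, value in mapping.items()]
    return "{" + ", ".join(parts) + "}"


row = {"A\u00f1o": 2001, "Valor": "Caf\u00e9"}
rendered = render_dict(row)   # accented characters stay unescaped
```

Printing rendered on a UTF-8 terminal shows the accented characters as typed. The trade-off is that the result is meant for people only: without quotes and escapes it can no longer be pasted back into Python as a literal.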

UTF-8 "inconsistency" when storing unicode keys/values from a CSV file into a dict
