Question

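I don't know whether I'm misunderstanding UTF-8 or Python, but I can't work out how Python writes Unicode characters to a file. I'm on a Mac running OS X, in case that makes a difference.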

Say I have the following unicode string:

foo=u'\x93Stuff in smartquotes\x94\n'

Here \x93 and \x94 are meant to be those awful smart quotes.

Then I write it to a file:

with open('file.txt','w') as file: file.write(foo.encode('utf8'))

When I open file.txt in a text editor such as TextWrangler, or in a web browser, it appears to have been written as

\xc2\x93Stuff in smartquotes\xc2\x94\n

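The text editor correctly recognizes the file as UTF-8 encoded, but it renders \xc2\x93 as garbage. If I go in and manually strip out the \xc2 part, I get what I expect: TextWrangler and Firefox render the characters as smart quotes.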

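This is exactly what I get when I read the file back into Python without decoding it as 'utf8'. However, when I read it with read().decode('utf8'), I get back what I originally put in, without the \xc2 bits.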

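This is driving me crazy, because I'm trying to parse a bunch of HTML files into text, and the incorrect rendering of these unicode characters breaks a lot of things.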

I also tried it in python3 using the regular read/write methods, and it behaves the same way.

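Edit: Regarding manually deleting the \xc2, it turns out the result rendered correctly only because the browser and the text editor were defaulting to a Latin encoding.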

Also, as a follow-up, Firefox renders the text as

☐Stuff in smartquotes☐

where the boxes are empty unicode values, while Chrome renders the text as

Stuff in smartquotes

Answer 1

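The problem is that u'\x93' and u'\x94' are not the Unicode code points for smart quotes. Bytes 0x93 and 0x94 are the smart quotes in the Windows-1252 encoding, which is different from latin1. In latin1 those values are not printable characters, and the corresponding Unicode code points U+0093 and U+0094 are C1 control characters, which is why they have no name: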

>>> import unicodedata as ud
>>> ud.name(u'\x93')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\x94')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\u201c')
'LEFT DOUBLE QUOTATION MARK'
>>> ud.name(u'\u201d')
'RIGHT DOUBLE QUOTATION MARK'
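
To see where the \xc2 bytes in the question come from, it may help to compare how UTF-8 encodes the two candidates. The following is only a minimal Python 3 sketch; the variable names are made up for illustration.

```python
import unicodedata

c1_control = '\x93'    # U+0093, a C1 control character, not a quote
left_quote = '\u201c'  # U+201C LEFT DOUBLE QUOTATION MARK

# U+0093 lies in the two-byte UTF-8 range, hence the leading 0xC2 byte.
c1_bytes = c1_control.encode('utf-8')
# U+201C lies in the three-byte UTF-8 range.
quote_bytes = left_quote.encode('utf-8')

# In Windows-1252, the single byte 0x93 is the left smart quote.
decoded_cp1252 = b'\x93'.decode('windows-1252')

print(c1_bytes)                             # b'\xc2\x93'
print(quote_bytes)                          # b'\xe2\x80\x9c'
print(decoded_cp1252 == left_quote)         # True
print(unicodedata.category(c1_control))     # Cc (control character)
```

So the file in the question is valid UTF-8; it just contains the encoded control character U+0093 rather than an encoded quotation mark.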

So you should use one of the following:

foo = u'\u201cStuff in smartquotes\u201d'
foo = u'\N{LEFT DOUBLE QUOTATION MARK}Stuff in smartquotes\N{RIGHT DOUBLE QUOTATION MARK}'

Or, in a UTF-8 source file:

#coding:utf8
foo = u'“Stuff in smartquotes”'
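
With the correct code points in place, writing the file should produce the three-byte UTF-8 sequences that editors and browsers render as smart quotes. Here is a rough Python 3 sketch of the round trip. The temporary directory is just a stand-in for wherever file.txt really lives, and passing `encoding='utf-8'` explicitly is a suggestion rather than something the question's code did.

```python
import os
import tempfile

foo = '\u201cStuff in smartquotes\u201d\n'

# A temporary directory stands in for the real location of file.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'file.txt')

    # State the encoding explicitly instead of relying on the platform default.
    # newline='' disables newline translation so the bytes are predictable.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(foo)

    # Read the raw bytes back to inspect what actually landed on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Read the text back with the same encoding to confirm the round trip.
    with open(path, 'r', encoding='utf-8', newline='') as f:
        round_tripped = f.read()

print(raw)                   # b'\xe2\x80\x9cStuff in smartquotes\xe2\x80\x9d\n'
print(round_tripped == foo)  # True
```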

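Edit: If one of your Unicode strings contains the wrong code points, here is a way to fix it. The first 256 Unicode code points map 1:1 to the latin1 encoding, so latin1 can be used to encode a mis-decoded Unicode string straight back into a byte string, which can then be decoded correctly: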

>>> foo = u'\x93Stuff in smartquotes\x94'
>>> foo
u'\x93Stuff in smartquotes\x94'
>>> foo = foo.encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”

If you have the UTF-8 encoded version of the wrong Unicode characters:

>>> foo = '\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo = foo.decode('utf8').encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”

And in the worst case, if you have the following Unicode string:

>>> foo = u'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1') # back to a UTF-8 encoded byte string.
'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1').decode('utf8') # Undo the UTF-8, but Unicode is still wrong.
u'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1') # back to a byte string.
'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252') # Now decode correctly.
u'\u201cStuff in smartquotes\u201d'
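
The transcripts above are Python 2. For Python 3, the same three repairs might be wrapped up roughly as follows. The function names are hypothetical helpers invented for this sketch, not part of any library, and they assume the input really is one of the three damaged forms shown above.

```python
def repair_cp1252(text):
    """Re-interpret code points below U+0100 as Windows-1252 bytes.

    Hypothetical helper: assumes `text` came from Windows-1252 bytes that
    were wrongly decoded as latin1, so every code point is below U+0100.
    """
    return text.encode('latin1').decode('windows-1252')


def repair_utf8_bytes(data):
    """Fix UTF-8 bytes that encode the wrong (C1 control) code points."""
    return repair_cp1252(data.decode('utf-8'))


def repair_double_encoded(text):
    """Fix the worst case: UTF-8 bytes that were themselves decoded as latin1."""
    return repair_cp1252(text.encode('latin1').decode('utf-8'))


expected = '\u201cStuff in smartquotes\u201d'
print(repair_cp1252('\x93Stuff in smartquotes\x94') == expected)
print(repair_utf8_bytes(b'\xc2\x93Stuff in smartquotes\xc2\x94') == expected)
print(repair_double_encoded('\xc2\x93Stuff in smartquotes\xc2\x94') == expected)
```

Note that these helpers raise UnicodeEncodeError if the string contains anything above U+00FF, and UnicodeDecodeError for the few byte values that Python's Windows-1252 codec leaves undefined (such as 0x81), so they are best applied only to text known to be damaged in this specific way.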
