Question

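Using Python 2.7, I am reading unicode and writing utf-16-le. Most characters are handled correctly, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write it correctly: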

import codecs
with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)     
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))

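However, either of the following changes makes the code work:

Replacing the character with unichr(33033) or unichr(33035), both of which are written correctly.
Using the 'utf-8' encoding (without a BOM, byte-order mark).

How can I recognize the characters that will not be written correctly, and how can I write a 'utf-16-le' encoded file with a BOM that either prints these characters or substitutes a replacement?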

Answer 1

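You are opening the file in text mode, which means that line-break characters/bytes will be translated to the local convention. Unfortunately, the character you are trying to write contains the byte 0A: u'\u810a' encodes to the bytes 0A 81 in utf-16-le. That byte is interpreted as a line feed and is not written to the file correctly. On Windows, for example, text mode inserts a 0D byte before it.

Open the file in binary mode: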
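For the "how do I recognize these characters" part of the question, one option is to test whether a character's encoded form contains the byte 0A. The following is only a minimal sketch, written to run on both Python 2.7 and Python 3; `contains_newline_byte` is a hypothetical helper name, not part of any library.

```python
def contains_newline_byte(ch, encoding='utf-16-le'):
    """Return True if the encoded form of ch contains the byte 0x0A."""
    return b'\n' in ch.encode(encoding)

# U+810A encodes to the bytes 0A 81, so it is affected; its neighbours
# U+8109 (09 81) and U+810B (0B 81) are not.
candidates = (u'\u8109', u'\u810a', u'\u810b')
affected = [ch for ch in candidates if contains_newline_byte(ch)]
print(repr(affected))
```

Any code point whose high or low byte is 0A (U+0Axx or U+xx0A) is caught by this check. Once the file is opened in binary mode the check is no longer needed, because no byte is translated.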

open('temp.txt','wb')
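Applied to the code from the question, the fix might look like the sketch below. It writes to a temporary directory (an assumption made so the example does not overwrite anything) and reads the bytes back to confirm that 0A survives unchanged. `write_utf16le_with_bom` is a hypothetical helper name.

```python
import codecs
import os
import tempfile

def write_utf16le_with_bom(path, text):
    # 'wb' disables newline translation, so the 0x0A byte is written as-is.
    with open(path, 'wb') as f:
        f.write(codecs.BOM_UTF16_LE)
        f.write(text.encode('utf-16-le'))

path = os.path.join(tempfile.mkdtemp(), 'temp.txt')
write_utf16le_with_bom(path, u'\u810a')

# Read the raw bytes back: BOM (FF FE) followed by 0A 81.
with open(path, 'rb') as f:
    raw = f.read()
print(repr(raw))
```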

Answer 2

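@Joni's answer identifies the root of the problem, but if you use codecs.open instead, the file is always opened in binary mode, even if you do not specify it. Using the utf16 codec also writes the BOM automatically, in native byte order: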

import codecs
with codecs.open('temp.txt','w','utf16') as temp:
    temp.write(u'\u810a')

Hex dump of temp.txt (on a little-endian machine):

FF FE 0A 81
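The hex dump can be reproduced without an external tool by reading the file back in binary mode. This is a rough sketch that should run on Python 2.7 and Python 3; the temporary directory is an assumption made to keep the example self-contained. Because the utf16 codec uses native byte order, a big-endian machine would produce FE FF 81 0A instead.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# The utf16 codec writes a BOM on the first write, in native byte order.
with codecs.open(path, 'w', 'utf16') as temp:
    temp.write(u'\u810a')

with open(path, 'rb') as f:
    raw = f.read()

# Format the bytes as a hex dump; bytearray yields ints on Python 2 and 3.
print(' '.join('%02X' % b for b in bytearray(raw)))
```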

Reference: codecs.open

Answer 3

You are already using the codecs library. When working with the file, you should swap open() for codecs.open() so that the encoding is handled transparently.

import codecs
with codecs.open('temp.txt', 'w', encoding='utf-16-le') as temp:
    temp.write(unichr(33033))
    temp.write(unichr(33034))
    temp.write(unichr(33035))

If you still have problems after that, the issue is probably with your viewer rather than with your Python script.
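Note that the 'utf-16-le' codec does not write a BOM by itself, while the question asks for one. One way to get it, sketched below under the assumption that a temporary file is acceptable for the demonstration, is to write u'\ufeff' as the first character, which 'utf-16-le' encodes as FF FE. Reading the file back with the 'utf-16' codec then consumes the BOM and returns the original text.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp.txt')
text = u'\u8109\u810a\u810b'

with codecs.open(path, 'w', encoding='utf-16-le') as temp:
    temp.write(u'\ufeff')  # explicit BOM; encoded as the bytes FF FE
    temp.write(text)

# Raw bytes, to check the BOM and that 0A was not altered.
with open(path, 'rb') as f:
    raw = f.read()

# Decoding with 'utf-16' uses the BOM to pick the byte order and strips it.
with codecs.open(path, 'r', encoding='utf-16') as temp:
    decoded = temp.read()

print(decoded == text)
```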

Writing unicode with Python - what is wrong with this character
