Question

I have been parsing some docx files (UTF-8 encoded XML) that contain special characters (Czech letters). Printing to stdout works fine, but I cannot write the data to a file:

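Traceback (most recent call last):
  File "./test.py", line 360, in <module>
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)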

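Even though I explicitly cast the word variable to the unicode type (type(word) returns unicode) and tried encoding it with .encode('utf-8'), I still get this error.

Here is a sample of the code as it stands now: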

for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

I also tried the following:

for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

And even a combination of the two:

word = unicode(word)
word = word.encode('utf-8')

I was getting a bit desperate, so I even tried encoding the word variable inside ofile.write():

ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')

I would appreciate any hints about what I am doing wrong.

Answer 1

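ofile is a byte stream, and you are writing a character string to it. It therefore tries to handle your mistake by encoding the string to a byte string, which is generally only safe for ASCII characters. Since word contains non-ASCII characters, it fails: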

>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0:
                    ordinal not in range(128)
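The implicit step can also be reproduced by hand. This is a minimal sketch, assuming the write falls back to the ASCII codec as the traceback indicates; the sample word is made up for illustration and the snippet is written to run on both Python 2 and Python 3:

```python
# Hypothetical sample word containing u'\xed', the character from the traceback.
word = u'v\xedno'

try:
    # Roughly what Python 2 attempts implicitly when a unicode string
    # reaches a byte stream: encode it with the ASCII codec.
    word.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# An explicit UTF-8 encode succeeds and produces two bytes for u'\xed'.
utf8_bytes = word.encode('utf-8')
```

The ASCII attempt raises the same UnicodeEncodeError as in the question, while the UTF-8 encode goes through.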

Make ofile a text stream by opening the file with io.open, using a mode like 'wt' and an explicit encoding:

>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L

Alternatively, you can use codecs.open, which has pretty much the same interface, or encode all strings manually with encode.
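Applied to the loop from the question, the io.open approach might look like the sketch below. The word list and the temporary output path are placeholders invented for this example, not taken from the original script:

```python
import io
import os
import tempfile

# Hypothetical stand-in for the question's word_list.
word_list = [u'v\xedno', u'\u010desk\xfd']

# Temporary path used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'out.xml')

# Text mode plus an explicit encoding makes ofile a text stream,
# so unicode strings can be written to it directly.
with io.open(path, 'wt', encoding='utf-8') as ofile:
    for word in word_list:
        ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')

# Reading back with the same encoding returns the original characters.
with io.open(path, 'rt', encoding='utf-8') as ifile:
    lines = ifile.read().splitlines()
```

With this setup, no call to encode is needed anywhere in the loop; the stream does the conversion to UTF-8 bytes on write.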

Answer 2

Phihag's answer is correct. I just want to suggest converting the unicode string to a byte string manually, using an explicit encoding:

ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' +
             word + u'"/>\n').encode('utf-8'))

(Maybe you would like to know how to do it with basic mechanisms instead of advanced wizardry and black magic like io.open.)
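A self-contained version of the manual approach could look like this. It is only a sketch: feat_line is a hypothetical helper name, and the sample word and temporary path are made up for the example:

```python
import os
import tempfile


def feat_line(word):
    """Build one XML line as unicode, then encode the whole line once."""
    line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'
    return line.encode('utf-8')


# Temporary path used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'out.xml')

# The file is opened in binary mode, so it expects byte strings.
with open(path, 'wb') as ofile:
    ofile.write(feat_line(u'v\xedno'))

# Read the raw bytes back to inspect what was written.
with open(path, 'rb') as ifile:
    raw = ifile.read()
```

The point of encoding the complete line, rather than only word, is that unicode and byte strings are never concatenated, so no implicit ASCII conversion gets a chance to run.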

Answer 3

I ran into a similar error when writing to Word documents (.docx), specifically with the euro symbol (€).

x = "€".encode()

Which raised this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

How I resolved it was:

x = "€".decode('utf-8')  # name the codec; a bare .decode() falls back to ASCII in Python 2
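The error comes from the direction of the conversion: in Python 2, "€" is already a byte string, so calling encode on it first triggers an implicit ASCII decode, which fails on byte 0xe2. A minimal sketch of the two directions, assuming the source file is saved as UTF-8 and spelling the bytes out so it runs on both Python 2 and Python 3:

```python
# The euro sign as UTF-8 bytes (what a Python 2 "€" literal holds
# when the source file is saved as UTF-8).
euro_bytes = b'\xe2\x82\xac'

# decode: bytes -> text. The codec is named explicitly here.
euro_text = euro_bytes.decode('utf-8')

# encode: text -> bytes, the reverse direction.
round_trip = euro_text.encode('utf-8')
```

Naming the codec in both calls avoids depending on the interpreter's default encoding.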

I hope this helps!

Answer 4

The best solution I found on Stack Overflow is in this post: How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte". Put the following at the beginning of the code, and the default encoding will be utf8:

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
