Question

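I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix works.

It is related to a problem I had the other week: python check if utf-8 string is uppercase

Basically, the following does not work: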

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with the following:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

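But if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use outFile = open('test.xml', 'w'), it works perfectly.

So what is happening?

- Since I specify encoding='utf-8' in etree.tostring(), is it encoding the file a second time?
- If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
- But isn't the output of etree.tostring() simply written to stdout and then redirected to a file that was created as a utf-8 file?
- Is print>> not working the way I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why doesn't this work? What is going on here? It may be trivial, but I am obviously a bit confused and would like to understand why my solution works.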

Answer 1

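You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.

tostring is encoding to UTF-8 (in the form of a bytestring, str), and that is what you are writing to the file.

Because the file expects Unicode, it first decodes the bytestring to Unicode using the default ASCII encoding, so that it can then encode the Unicode to UTF-8.

Unfortunately, the bytestring isn't ASCII.

EDIT: The best advice for avoiding this kind of problem is to use Unicode internally, decoding on input and encoding on output.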
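
The failing step can be reproduced in isolation. The following is a minimal sketch, not the actual codecs source: it spells out explicitly the ASCII decode that Python 2 performs implicitly when a byte string is handed to a writer that expects Unicode, and it is written so that it also runs on Python 3. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Sketch of what happens when a UTF-8 writer is handed a byte string under
# Python 2. The implicit steps are written out explicitly here.

text = u'\u041c\u041e\u0421\u041a\u0412\u0410'  # first word from the question
data = text.encode('utf-8')  # bytes, like etree.tostring(..., encoding='utf-8')

# Step 1 (implicit in Python 2): the writer needs Unicode, so the byte string
# is decoded with the default ASCII codec. 0xd0 is not ASCII, so this raises.
try:
    data.decode('ascii')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# Step 2 (never reached in the failing case): encode the Unicode to UTF-8.
# Decoding with the correct codec first makes the round trip succeed.
round_trip = data.decode('utf-8').encode('utf-8')

print(failed)
print(message)
```

The error message names the same byte, 0xd0, as the traceback in the question, because that is the first byte of the UTF-8 encoding of the Cyrillic text.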

Answer 2

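Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (although it doesn't support pretty_print). Wrap the root element in an ElementTree and use its write method.

Also, if you declare the encoding of the source file with a # coding line, you can use readable Unicode strings instead of escape codes: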

#!/usr/bin/python
# coding: utf8

import codecs
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')


words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')

Answer 3

In addition to MRAB's answer, here are a few lines of code:

import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')

# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
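
As a rough sketch of the two consistent pairings (bytes into a binary file, which is presumably why plain open() worked for the asker, and Unicode into a file that encodes on write, as in the snippet above), the following uses the standard library's xml.etree instead of lxml so that it is self-contained. The temporary file names are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
from xml.etree import ElementTree as etree

root = etree.Element('root')
item = etree.SubElement(root, 'item')
item.text = u'R\xe9sum\xe9'

# tostring(..., encoding='utf-8') returns an already-encoded byte string.
xml_bytes = etree.tostring(root, encoding='utf-8')

tmp_dir = tempfile.mkdtemp()
bytes_path = os.path.join(tmp_dir, 'bytes.xml')  # hypothetical file name
text_path = os.path.join(tmp_dir, 'text.xml')    # hypothetical file name

# Option 1: bytes go into a binary file, with no codec in between.
with io.open(bytes_path, 'wb') as f:
    f.write(xml_bytes)

# Option 2: Unicode text goes into a file that encodes on write.
# newline='' turns off newline translation so both files match byte for byte.
with io.open(text_path, 'w', encoding='utf-8', newline='') as f:
    f.write(xml_bytes.decode('utf-8'))

# Read both files back as raw bytes to compare them.
with io.open(bytes_path, 'rb') as f:
    first = f.read()
with io.open(text_path, 'rb') as f:
    second = f.read()

print(first == second)
```

Either pairing works. The combination that fails is the mixed one from the question: an already-encoded byte string written to a file object that expects Unicode.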

Why does printing to a utf-8 file fail?