Question

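I'm trying to write out some XML that contains special characters. I run into trouble when I iterate over a list of tags to create several elements named tag.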

# -*- coding: utf-8 -*-
import sys
import xml.etree.ElementTree as xml

reload(sys)
sys.setdefaultencoding('utf-8')

Code snippet:

    check = (video['tags'].split(', '))
    x=len(check)
    y=x-1
    for i in xrange(0,y):
        tagger = xml.SubElement(doc, 'field', name="tag")
        s=check[i]
        tagger.text = s.encode('utf-8')

The problem occurs when I try to write the file:

output = open(file_name,'w+')
tree = xml.ElementTree(add)
tree.write(output)
output.close()

I get the following error:

Traceback (most recent call last):
  File "xml_breakup3.py", line 108, in <module>
    tagger.text = s.encode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: invalid start byte

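When I run the code without this snippet, it writes the XML without any problems. If I set tagger.text to any plain string literal (e.g. '99'), it writes fine. If I make the loop run from 0 to 3, it works. The UnicodeDecodeError only appears when I try to iterate over the entire list.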

When I try:

    check = (video['tags'].split(', '))
    for ta in check:
        tagger = xml.SubElement(doc, 'field', name="tag")
        tagger.text = ta

I get:

    Traceback (most recent call last):
      File "xml_breakup3.py", line 172, in <module>
        tree.write(output)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
        serialize(write, self._root, encoding, qnames, namespaces)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
        _serialize_xml(write, e, encoding, qnames, None)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
        _serialize_xml(write, e, encoding, qnames, None)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 938, in _serialize_xml
        write(_escape_cdata(text, encoding))
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1074, in _escape_cdata
        return text.encode(encoding, "xmlcharrefreplace")
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 0: invalid start byte

Answer 1

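You might want to try removing the str from in front of the part you are encoding. When you use str, you convert what I assume is Unicode into a byte string, and then you try to encode that. If you leave it as Unicode and encode it directly, it should work: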

>>> s = u'\xba'
>>> print s
º
>>> s.encode('utf8')
'\xc2\xba'
>>> str(s).encode('utf8')

Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    str(s).encode('utf8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in position 0: ordinal not in range(128)
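
The same idea applies in the other direction. In Python 2, calling .encode() on a byte string first decodes it implicitly with the default encoding, which is why an encode call can raise UnicodeDecodeError. A rough sketch of a decode-first approach follows: turn the raw tags into text once, assign text to the elements, and let ElementTree encode when it writes. The helper names to_text and build_doc, the add/doc element layout (guessed from the variable names in the question), the sample bytes, and the assumption that the data is meant to be UTF-8 are all illustrative and not taken from the original script. It is written for Python 3.

```python
import io
import xml.etree.ElementTree as ET


def to_text(value, encoding="utf-8"):
    """Return text for a value that may arrive as raw bytes.

    Hypothetical helper: bytes are decoded with the assumed encoding, and
    undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return value


def build_doc(raw_tags):
    """Build <add><doc><field name="tag">...</field>...</doc></add>."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Iterate over the list directly so that no tag is skipped.
    for tag in to_text(raw_tags).split(", "):
        field = ET.SubElement(doc, "field", name="tag")
        field.text = tag  # assign text, not already-encoded bytes
    return add


# Invented sample input: one valid UTF-8 tag and one stray 0x81 byte.
raw = u"caf\u00e9".encode("utf-8") + b", \x81bad"
root = build_doc(raw)

# Let ElementTree encode once, while writing to a binary stream.
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=True)
xml_bytes = buf.getvalue()
```

When writing to a real file, opening it in binary mode (for example open(file_name, 'wb')) plays the role of buf here. If the source data is actually in some other encoding, passing that encoding to the decode step would be a better choice than replacing the bad bytes.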

Unicode error when looping over tags and writing XML
