I'm trying to read a gzip file that contains XML with Unicode characters, but I'm getting an error. The code I'm using is:
import gzip
import xml
path = "index.mjml.gz"
gzFile = gzip.open(path, mode='r')
gzContents = gzFile.read()
gzFile.close()
unicodeContents = gzContents.encode('utf-8')
xmlContent = xml.dom.minidom.parseString(unicodeContents)
# Do stuff with xmlContent
When I run this code, I get the following error (the line starting with xmlContent fails):
/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/minidom.pyc in parseString(string, parser)
1922 if parser is None:
1923 from xml.dom import expatbuilder
-> 1924 return expatbuilder.parseString(string)
1925 else:
1926 from xml.dom import pulldom
/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/expatbuilder.pyc in parseString(string, namespaces)
938 else:
939 builder = ExpatBuilder()
--> 940 return builder.parseString(string)
941
942
/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/expatbuilder.pyc in parseString(self, string)
221 parser = self.getParser()
222 try:
--> 223 parser.Parse(string, True)
224 self._setup_subset(string)
225 except ParseEscape:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1141336: ordinal not in range(128)
I found an earlier answer to a similar question, "Reading utf-8 characters from a gzip file in python", but I still get the error.
Is there a problem with the XML parser?
(I'm using Python 2.7.)
Answer 0 (score: 5)
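You can't pass a unicode string to xml.dom.minidom.parseString in Python 2: the parser tries to encode it as ASCII, which is what raises the UnicodeEncodeError shown in the traceback. It must be an appropriately encoded byte string: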
>>> import xml.dom.minidom as xmldom
>>>
>>> source = u"""\
... <?xml version="1.0" encoding="utf-8"?>
... <root><text>Σὲ γνωρίζω ἀπὸ τὴν κόψη</text></root>
... """
>>> doc = xmldom.parseString(source.encode('utf-8'))
>>> print doc.getElementsByTagName('text')[0].toxml()
<text>Σὲ γνωρίζω ἀπὸ τὴν κόψη</text>
Edit
Just to clarify: the stream read from the gzipped file should be passed directly to the parser, without trying to encode or decode it:
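import gzip
import xml.dom.minidom  # a bare "import xml" does not load the minidom submodule

path = "index.mjml.gz"
# Read the decompressed contents as raw bytes.
gzFile = gzip.open(path, mode='rb')
gzContents = gzFile.read()
gzFile.close()
# Pass the bytes straight to the parser; do not encode or decode them first.
xmlContent = xml.dom.minidom.parseString(gzContents)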
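The parser will read the encoding from the XML declaration at the start of the file (or assume "utf-8" if there is none). It can then use that to decode the contents to unicode.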
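To see the encoding detection in action, here is a minimal sketch that round-trips two small documents through gzip and parses the raw bytes. The file names, the helper parse_gzipped_xml, and the sample documents are made up for illustration and are not part of the original question. One sample declares iso-8859-1, the other has no declaration and is stored as UTF-8.

```python
import gzip
import os
import shutil
import tempfile
import xml.dom.minidom


def parse_gzipped_xml(path):
    """Read the decompressed bytes and hand them to minidom without decoding."""
    with gzip.open(path, 'rb') as gz_file:
        raw_bytes = gz_file.read()
    return xml.dom.minidom.parseString(raw_bytes)


def first_text(doc, tag):
    """Return the text of the first element with the given tag name."""
    return doc.getElementsByTagName(tag)[0].firstChild.data


# Hypothetical sample documents, both containing the character U+00E9.
latin1_bytes = (
    u'<?xml version="1.0" encoding="iso-8859-1"?>'
    u'<root><text>caf\xe9</text></root>'
).encode('iso-8859-1')
utf8_bytes = u'<root><text>caf\xe9</text></root>'.encode('utf-8')

tmp_dir = tempfile.mkdtemp()
try:
    latin1_path = os.path.join(tmp_dir, 'latin1.xml.gz')
    utf8_path = os.path.join(tmp_dir, 'utf8.xml.gz')
    for path, data in ((latin1_path, latin1_bytes), (utf8_path, utf8_bytes)):
        with gzip.open(path, 'wb') as gz_file:
            gz_file.write(data)

    # The parser picks the encoding from the declaration, or falls back to UTF-8.
    latin1_text = first_text(parse_gzipped_xml(latin1_path), 'text')
    utf8_text = first_text(parse_gzipped_xml(utf8_path), 'text')
finally:
    shutil.rmtree(tmp_dir)
```

Both files store the character with different bytes on disk, yet both parses should yield the same unicode text, because the decoding is left entirely to the parser.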