Parsing the Stack Overflow Posts.xml data dump crashes the program with an ASCII encoding error

Asked: 2013-08-20 11:22:36

Tags: python xml encoding elementtree

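I have downloaded the June 2013 Stack Overflow data dump and am now parsing the XML files and storing them in a MySQL database. I am using Python's ElementTree to do this, and it keeps crashing with an encoding error.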

Parsing code snippet:

post = open('a.xml', 'r')
a = post.read()  
tree = xml.parse((a).encode('ascii', 'ignore')) # I also tried .encode('utf-8').strip() it doesn't work

#Get the root node

row = tree.findall("row")

It gives me the following error:

'ascii' codec can't encode character u'\u2019' in position 248: ordinal not in range(128)

I also tried the following, but the problem persists.

.encode('ascii', 'ignore')

Any suggestions for solving this would be greatly appreciated. A link to a clean copy of the data would also help.

Also, my end goal is to convert the data to RDF, so if anyone has the Stack Overflow data dump in RDF format, I would be grateful.

Thanks in advance!

P.S. This is the XML line that triggers the problem and crashes the program:

<row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="&lt;blockquote&gt;&#xD;&#xA;  &lt;p&gt;The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate. &lt;/p&gt;&#xD;&#xA;&lt;/blockquote&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;I obtained this answer from &lt;a href=&quot;http://www.informit.com/guides/content.aspx?g=cplusplus&amp;amp;seqNum=272&quot; rel=&quot;nofollow&quot;&gt;High Resolution Time Measurement and Timers, Part I&lt;/a&gt;&lt;/p&gt;" OwnerUserId="25" LastActivityDate="2008-08-01T14:55:08.477" />

Edit: @Arjan, the solution you mentioned here did not work for me.

1 Answer:

Answer 0 (score: 0)

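You didn't mention which version of Python you are using, and Python 2 and Python 3 handle Unicode differently, so that may be a factor. Since you are running into trouble, I'm guessing you are on 2.x, because version 3 generally handles Unicode more gracefully.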
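To make the error message concrete, here is a minimal sketch of what happens when text containing U+2019 is encoded. The sample string is copied from the Body attribute in the question; the variable names are only illustrative. The snippet is written to run on Python 3 and should also behave the same way on Python 2.7.

```python
# Sample text from the question's Body attribute; \u2019 is the curly apostrophe.
text = u"the system\u2019s timer"

# Encoding to ASCII fails because U+2019 has no ASCII representation.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    failing_index = exc.start  # index of the first character that could not be encoded

# The 'ignore' handler avoids the exception but silently drops the character.
stripped = text.encode("ascii", "ignore")

# UTF-8 can represent every Unicode character, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

Note that 'ignore' only helps when it is applied to the call that actually raises; if the exception comes from a later implicit conversion (for example when printing or writing to the database), the traceback should show which line that is.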

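ElementTree knows how to parse XML files (or strings) that contain Unicode, with no need for str.encode(). Assuming Python 2.7, the code further down parses an XML file containing the line with the Unicode character from your question.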

First, here are the contents of an XML file named "test.xml", created for testing, which contains the problematic line:

<?xml version="1.0"?>
<rows>
    <row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="&lt;blockquote&gt;&#xD;&#xA;  &lt;p&gt;The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate. &lt;/p&gt;&#xD;&#xA;&lt;/blockquote&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;I obtained this answer from &lt;a href=&quot;http://www.informit.com/guides/content.aspx?g=cplusplus&amp;amp;seqNum=272&quot; rel=&quot;nofollow&quot;&gt;High Resolution Time Measurement and Timers, Part I&lt;/a&gt;&lt;/p&gt;" OwnerUserId="25" LastActivityDate="2008-08-01T14:55:08.477" />
</rows>

Code to parse the file above:

>>> import xml.etree.ElementTree as xml
>>> tree = xml.parse('test.xml') # Assuming code lives in same directory as file
>>> # File is now parsed into variable 'tree',
>>> # and we can check the problematic unicode character is in there
>>> body = tree.find('row').attrib['Body']
>>> # We can look at the escaped unicode character...
>>> body[238:256]
u'the system\u2019s timer'
>>> # Or we can view it represented as we would expect to read it
>>> print body[238:256]
the system’s timer
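One further detail worth checking: the snippet in the question reads the file into a string and passes that string to parse(), whereas parse() expects a filename or file object and fromstring() is the call for XML text already in memory. The sketch below shows both routes. The shortened row, the temporary file name "sample.xml", and the variable names are assumptions made for illustration, not part of the real dump.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for one row of Posts.xml.
xml_text = u'<rows><row Id="99" Body="the system\u2019s timer" /></rows>'

# Route 1: XML already in memory. fromstring() accepts UTF-8 encoded bytes.
root = ET.fromstring(xml_text.encode("utf-8"))
body_from_string = root.find("row").attrib["Body"]

# Route 2: XML on disk. parse() takes the path and handles decoding itself.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.xml")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(xml_text)
tree = ET.parse(path)
body_from_file = tree.getroot().find("row").attrib["Body"]

# Remove the temporary file and directory created for this sketch.
os.remove(path)
os.rmdir(tmp_dir)
```

In both routes the attribute value comes back as Unicode text with the apostrophe intact, so no encode() call is needed before parsing.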

If following the test.xml example still produces an error, perhaps you could provide some additional information about your problem.