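I have several large xml.bz2 files that I need to parse. I am interested in getting the content inside the <text></text> elements. These XML files are malformed: they are missing the <mediawiki xml:lang="en"> opening tag at the beginning and the </mediawiki> closing tag at the end (as shown at https://en.wikipedia.org/wiki/Help:Export). When I run the following code: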
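from lxml import etree

# Stream over the file and stop at each closing <text> tag.
context = etree.iterparse("pages1.xml", tag="text")
for event, elem in context:
    print(elem.xpath("description/text()"))
    # Free the element and its already-processed siblings to limit memory use.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context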
I get this error:
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 256, column 3
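I searched and found the post "parsing large xml file with Python - etree.parse error", which suggests wrapping the whole XML in a single tag. However, I am confused about how to add a root tag to an existing document. The structure of my XML files is shown below. I would really appreciate your help. Thanks.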
<page>
  <title>Page title</title>
  <!-- page namespace code -->
  <ns>0</ns>
  <id>2</id>
  <!-- If page is a redirection, element "redirect" contains title of the page redirect to -->
  <redirect title="Redirect page title" />
  <restrictions>edit=sysop:move=sysop</restrictions>
  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>
  <revision>
    <timestamp>2001-01-15T13:10:27Z</timestamp>
    <contributor><ip>10.0.0.2</ip></contributor>
    <comment>new!</comment>
    <text>An earlier [[revision]].</text>
  </revision>
  <revision>
    <!-- deleted revision example -->
    <id>4557485</id>
    <parentid>1243372</parentid>
    <timestamp>2010-06-24T02:40:22Z</timestamp>
    <contributor deleted="deleted" />
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text deleted="deleted" />
    <sha1/>
  </revision>
</page>
<page>
  <title>Talk:Page title</title>
  <revision>
    <timestamp>2001-01-15T14:03:00Z</timestamp>
    <contributor><ip>10.0.0.2</ip></contributor>
    <comment>hey</comment>
    <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
  </revision>
</page>
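For illustration, here is a minimal sketch of one way to "wrap" such a file without rewriting it on disk: feed a synthetic opening tag to an incremental parser, then the file contents in chunks, then the closing tag. It uses the standard library's xml.etree.ElementTree.XMLPullParser rather than lxml's iterparse. The names iter_text, iter_text_from_bz2, and the <root> wrapper element are hypothetical, and the sketch assumes the file contains nothing but sibling <page> elements (no XML declaration, no namespace).

```python
import bz2
import xml.etree.ElementTree as ET


def iter_text(stream, chunk_size=64 * 1024):
    """Yield the content of each <text> element from a binary stream of
    sibling <page> elements that has no enclosing root element."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(b"<root>")  # synthetic opening tag (hypothetical name)
    root = None
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                if root is None:
                    root = elem  # the first element seen is the wrapper
            elif elem.tag == "text":
                # Deleted revisions have an empty <text/>, so text is None.
                yield elem.text or ""
            elif elem.tag == "page":
                # Drop finished pages so memory use stays bounded.
                root.clear()
    parser.feed(b"</root>")  # synthetic closing tag
    parser.close()


def iter_text_from_bz2(path):
    """Same as iter_text, but decompresses a .bz2 file on the fly."""
    with bz2.open(path, "rb") as stream:
        yield from iter_text(stream)
```

With this, something like `for text in iter_text_from_bz2("pages1.xml.bz2"): print(text)` would stream the revisions straight from the compressed file, where the file name is only a guess based on the one in the question. Feeding bytes throughout (rather than mixing str and bytes) keeps the parser's input encoding consistent.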