Question

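I have several large xml.bz2 files that I need to parse.

I am interested in getting the content inside <text></text>. These XML files are malformed: they are missing the opening <mediawiki xml:lang="en"> and the closing </mediawiki> at the beginning and end (as shown at https://en.wikipedia.org/wiki/Help:Export). When I run my code as follows: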

from lxml import etree

context = etree.iterparse("pages1.xml", tag="text")

for event, elem in context:
    print(elem.xpath('description/text()'))
    elem.clear()
    # Drop siblings that were already processed to keep memory use low
    while elem.getprevious() is not None:
        del elem.getparent()[0]

del context

I get the error:

lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 256, column 3

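I searched and found the post "parsing large xml file with Python - etree.parse error", which suggests wrapping the whole XML in a single tag. However, I am confused about how to add a root tag to an existing document. The structure of my XML file is shown below. I would really appreciate your help. Thanks.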

<page>
  <title>Page title</title>
  <!-- page namespace code -->
  <ns>0</ns>
  <id>2</id>
  <!-- If page is a redirection, element "redirect" contains title of the page redirect to -->
  <redirect title="Redirect page title" />
  <restrictions>edit=sysop:move=sysop</restrictions>
  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>
  <revision>
    <timestamp>2001-01-15T13:10:27Z</timestamp>
    <contributor><ip>10.0.0.2</ip></contributor>
    <comment>new!</comment>
    <text>An earlier [[revision]].</text>
  </revision>
  <revision>
    <!-- deleted revision example -->
    <id>4557485</id>
    <parentid>1243372</parentid>
    <timestamp>2010-06-24T02:40:22Z</timestamp>
    <contributor deleted="deleted" />
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text deleted="deleted" />
    <sha1/>
  </revision>
</page>

<page>
  <title>Talk:Page title</title>
  <revision>
    <timestamp>2001-01-15T14:03:00Z</timestamp>
    <contributor><ip>10.0.0.2</ip></contributor>
    <comment>hey</comment>
    <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
  </revision>
</page>
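
For reference, the wrapping idea from the linked post can be sketched without rewriting the file on disk: feed a synthetic opening tag to an incremental parser, then the file contents in chunks, then the closing tag. This is only a minimal sketch under some assumptions: it uses the standard-library xml.etree.ElementTree.XMLPullParser rather than lxml, it assumes the file does not start with an XML declaration and uses no namespaces, and the names iter_text, _drain and chunk_size are hypothetical helpers made up for illustration.

```python
import bz2
import xml.etree.ElementTree as ET


def _drain(parser):
    """Yield the text of each finished <text> element seen so far."""
    for _event, elem in parser.read_events():
        if elem.tag == "text":
            # <text deleted="deleted" /> has no content, so fall back to ""
            yield elem.text or ""
        elif elem.tag == "page":
            # The page is complete; drop its children to free memory
            elem.clear()


def iter_text(path, chunk_size=64 * 1024):
    """Parse a root-less dump by wrapping it in a synthetic root element."""
    parser = ET.XMLPullParser(events=("end",))
    parser.feed(b"<mediawiki>")  # synthetic opening root tag (assumption)
    # Read .bz2 files transparently; path is expected to be a str
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            parser.feed(chunk)
            yield from _drain(parser)
    parser.feed(b"</mediawiki>")  # synthetic closing root tag
    parser.close()
    yield from _drain(parser)
```

Usage would look like `for text in iter_text("pages1.xml.bz2"): print(text)`. Note that the cleared <page> elements stay attached to the synthetic root, so memory still grows slowly on very large dumps; removing them from the root as well would be a further refinement.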

Add a top-level tag to an XML file

0 answers: