Question

I am parsing a large XML file (~2 GB) with lxml. The file contains articles and the authors who published them, like this:

<article>
 <author>Name 1</author>
 <author>Name 2</author>
 <title> title </title>
 <year> 777 </year>
 <ref> some ref </ref>
 <citi>/here/there</citi>
</article>

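What I need to do is get the names of the authors whose 'citi' tag contains certain words, and record how many times that happens.
(Basically, count the number of authors who have done some work related to a keyword, and also count how many times each author used that keyword.)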

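There are two problems:
1. My XML file contains some external entities, such as (<author> Name &Oun </author>), and I want them to be ignored. I read online that lxml does not resolve such entities by default, yet it still processes every entry, and when an error is thrown I simply catch the exception.
2. However, it does not parse the whole file: it stops at some point after the exception is thrown.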

I think this happens because the exception is thrown before the next chunk is read, and I am not sure how to avoid that.

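My current working code looks like this:
(This is only temporary code, so I realize some steps could be done in better ways; if you think it can be improved, please let me know.)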

from collections import Counter
from lxml import etree

authors = Counter()
cache_authors = []

# Method of a class that provides self.file and self.checkCitations
def parseXMLDOC(self):
    flag = True
    try:
        for event, elem in etree.iterparse(self.file):
            # Keep the current article's authors in a cache
            if elem.tag == "author":
                cache_authors.append(elem.text)

            # Check for the keyword
            if elem.tag == "citi" and flag:
                # checkCitations checks whether the keyword exists and,
                # if it does, adds the cached authors to the Counter above
                flag = not self.checkCitations(elem.text)

            # Clean up before parsing the next article
            if elem.tag == "article":
                cache_authors.clear()
                flag = True
                # print(event, elem.tag, elem.text)
                elem.clear()
    except etree.XMLSyntaxError:
        # The try wraps the whole loop, so the first error ends the parse
        print("Unidentified entities encountered")

Answer 1

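When dealing with XML files this large, the old-fashioned SAX approach is preferable to DOM. Basically, you do not want to hold a large parse tree of the document in RAM and navigate it. Instead, you react to individual events such as opening and closing tags. See, for example, the pyexpat module documentation. This approach is more efficient but more tedious: you have to implement a (small) state machine.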
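
The state-machine idea can be sketched roughly as follows, using the standard library's xml.parsers.expat. This is only an illustration, not the asker's code: the function name count_authors_by_keyword and the keyword and chunk_size parameters are made up here, and the sketch assumes the articles sit inside a single root element and that the file is opened in binary mode. It also relies on expat's UseForeignDTD(True), which should make undefined entities such as &Oun; go to SkippedEntityHandler instead of raising an error; that behaviour is worth checking against your own expat build.

```python
import io
from collections import Counter
from xml.parsers import expat


def count_authors_by_keyword(stream, keyword, chunk_size=1 << 20):
    """Count authors of articles whose <citi> text contains keyword.

    stream is a binary file-like object; it is read chunk by chunk,
    so the whole document is never held in memory.
    """
    authors = Counter()
    # The (small) state machine: text of the current element, authors of
    # the current article, and whether the article matched the keyword.
    state = {"text": [], "article_authors": [], "matched": False}

    def start_element(name, attrs):
        if name == "article":
            state["article_authors"] = []
            state["matched"] = False
        state["text"] = []

    def character_data(data):
        # expat may deliver one text node in several pieces
        state["text"].append(data)

    def skipped_entity(name, is_parameter_entity):
        # Undefined entity reference: deliberately ignored
        pass

    def end_element(name):
        text = "".join(state["text"]).strip()
        if name == "author":
            state["article_authors"].append(text)
        elif name == "citi" and keyword in text:
            state["matched"] = True
        elif name == "article" and state["matched"]:
            authors.update(state["article_authors"])
        state["text"] = []

    parser = expat.ParserCreate()
    # Assumption: treating the document as if it had an external DTD
    # turns undefined entities into "skipped" ones instead of errors.
    parser.UseForeignDTD(True)
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = character_data
    parser.SkippedEntityHandler = skipped_entity

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)
    parser.Parse(b"", True)  # signal end of input
    return authors
```

It could be called as count_authors_by_keyword(open(path, "rb"), "keyword"), where path is a placeholder for the real file. Authors are only added to the Counter when the closing article tag is reached, so the order of the author and citi elements inside an article does not matter.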

lxml for a large text file, stops before the end of the file
