Question

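Note: I am on Windows 7, 64-bit, and have just installed Cygwin.

I need to extract a lot of data from many different large (100 MB) XML files. Each XML file contains a series of line sequences like this: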

<taggie>
lotsolines which include some string that I'm searching for.
</taggie>

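I want to extract everything from the opening tag to the closing tag whenever that block contains the search string. (Whether to do this in Python or in Cygwin is a toss-up.)

My plan is to write a script that preprocesses one of these XML files into a table of begin and end tags, creating a reference table of line numbers for each begin/end pair. Something like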

filename, start line (begin tag), end line (end tag)
bogusname.xml, 50025, 100003

Then I do another search to build a list of where my string occurs. It might look like this.

filename, searchstring, line number
bogusname.xml, "foo", 76543

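Then I process the second list against the first list to extract the information (probably into a second huge file, or possibly a set of files; I don't care at the moment).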
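A minimal sketch of the two-table plan described above, assuming the blocks are not nested and that the opening and closing tags can be spotted with plain substring tests. The function names `index_blocks`, `find_hits`, and `blocks_with_hits` are hypothetical, made up for illustration.

```python
import bisect

def index_blocks(lines, tag="taggie"):
    """Pass 1: return 1-based (start_line, end_line) pairs for each <tag> block."""
    open_tag, close_tag = "<" + tag, "</" + tag + ">"
    blocks, start = [], None
    for lineno, line in enumerate(lines, 1):
        if start is None and open_tag in line:
            start = lineno
        if start is not None and close_tag in line:
            blocks.append((start, lineno))
            start = None
    return blocks

def find_hits(lines, needle):
    """Pass 2: return the 1-based line numbers that contain the search string."""
    return [n for n, line in enumerate(lines, 1) if needle in line]

def blocks_with_hits(blocks, hits):
    """Join the two tables: keep blocks whose line range contains a hit."""
    hits = sorted(hits)
    selected = []
    for start, end in blocks:
        i = bisect.bisect_left(hits, start)  # first hit at or after start
        if i < len(hits) and hits[i] <= end:
            selected.append((start, end))
    return selected

sample = [
    "<taggie>",
    "nothing here",
    "</taggie>",
    "<taggie>",
    "a line with foo in it",
    "</taggie>",
]
blocks = index_blocks(sample)
hits = find_hits(sample, "foo")
print(blocks_with_hits(blocks, hits))  # [(4, 6)]
```

For a real file, the same functions could be fed an open file object instead of a list, once per pass, since both passes only iterate over lines.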

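Anyway, it occurs to me as I do this that someone has almost certainly already done this, or something very similar.

So, can anyone point me to code that already does this? Python is preferred, but Unix-style scripts for Cygwin would be convenient. I prefer source code to any executable whose source I can't see.

In the meantime, I'm proceeding on my own. Thanks in advance.

For exact data, I am downloading this file (for example): http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip I unzip it and want to extract the XML documents that contain any of a large number of search strings. It is a single file containing thousands of concatenated XML documents. I want to extract whichever XML documents contain one of the search strings.

I am trying to use BeautifulSoup: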

from __future__ import print_function
from bs4 import BeautifulSoup # To get everything
import urllib2

xml_handle = open("t.xml", "r")
soup = BeautifulSoup(xml_handle)

i = 0
for grant in soup('us-patent-grant'):
    i = i + 1
    print (i)
    print (grant)

print (i)

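When I run this, the final value of i is 9. If it had found all the 'us-patent-grant' tags, I would expect i to be over 6000, which suggests it may not be parsing the whole file.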

Answer 1

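I am currently working on a similar problem in Python. I know this is a few years late, but I will share my experience parsing similarly large files.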

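I found Python's built-in xml.etree.ElementTree to be very effective for this (a C implementation called cElementTree has the same API and is also built in). I tried every method in the documentation, and iterparse() with clear() was the fastest in that library (about 5x faster than anything else I implemented in Python). With this approach you incrementally load the XML into memory and clear it again, processing it as a stream (using generators). That is far better than loading the whole file into memory, which can slow your computer to a crawl.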
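As an illustration of iterparse() with clear(), here is a minimal sketch, assuming Python 3 and a single well-formed XML document (a file of concatenated documents would need to be split first). The helper name `iter_matching` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def iter_matching(source, tag, needles):
    """Yield the serialized XML of each <tag> element whose text contains a needle."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        # Gather all text inside the element, including nested children.
        text = "".join(elem.itertext())
        if any(needle in text for needle in needles):
            yield ET.tostring(elem, encoding="unicode")
        elem.clear()  # drop the element's children and text to free memory

# In-memory stand-in for a large file; a real path or file object also works.
sample = io.BytesIO(
    b"<root>"
    b"<taggie><line>nothing here</line></taggie>"
    b"<taggie><line>some foo text</line></taggie>"
    b"</root>"
)
matches = list(iter_matching(sample, "taggie", ["foo", "bar"]))
print(len(matches))  # 1
```

Because the function is a generator, only one matching element is serialized at a time, so the caller can write each result to disk as it arrives.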

References:

The accepted answer here explains basically the best approach that I could find.

This IBM site talks about lxml which is similar to the xml library but has better XPath support.

The lxml website and the cElementTree website compare the execution speed of the xml and lxml packages.

Answer 2

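(Past answer)

How about using the Python package BeautifulSoup, plus regular expressions? BeautifulSoup is the best-known tool for working with .html and .xml files.

import re
from bs4 import BeautifulSoup

with open("filename.xml") as f:
    xml = f.read()
soup = BeautifulSoup(xml)
find_search = re.compile("[search]+")  # placeholder pattern; replace with your own
# remaining code here...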

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for BeautifulSoup, and https://docs.python.org/2/library/re.html for regular expression syntax.

After reading those pages you should be able to do what you need without much trouble.

======================================================================

The file is too large, so you need some code to split it into separate files. Based on the linked question "Split diary file into multiple files using Python", you can write the code as

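def files():
    # Yield a fresh, numbered output file each time one is requested.
    n = 0
    while True:
        n += 1
        yield open('xml_%d.xml' % n, 'w')

pat = '<?xml'  # each concatenated document starts with an XML declaration
fs = files()
outfile = next(fs)
with open("ipg150106.xml") as infile:
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            # Finish the current document, then start a new file per declaration.
            items = line.split(pat)
            outfile.write(items[0])
            for item in items[1:]:
                outfile.close()
                outfile = next(fs)
                outfile.write(pat + item)
outfile.close()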

This code gave me files up to xml_6527.xml.
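If thousands of small files are not wanted, the same splitting idea can be kept in memory. The following is a minimal sketch, assuming each document begins with an `<?xml` declaration and that plain substring matching is enough for the search strings. The names `iter_documents` and `matching_documents` are hypothetical.

```python
def iter_documents(lines, marker="<?xml"):
    """Yield each concatenated XML document as one string, splitting on the marker."""
    buffer = []
    for line in lines:
        pieces = line.split(marker)
        buffer.append(pieces[0])
        for piece in pieces[1:]:
            document = "".join(buffer)
            if document.strip():  # skip the empty chunk before the first marker
                yield document
            buffer = [marker + piece]
    document = "".join(buffer)
    if document.strip():
        yield document

def matching_documents(lines, needles):
    """Return the documents that contain at least one of the search strings."""
    return [d for d in iter_documents(lines) if any(n in d for n in needles)]

sample = [
    '<?xml version="1.0"?>\n',
    "<doc>alpha</doc>\n",
    '<?xml version="1.0"?>\n',
    "<doc>beta foo</doc>\n",
]
docs = list(iter_documents(sample))
found = matching_documents(sample, ["foo"])
print(len(docs), len(found))  # 2 1
```

With the real data, an open file object (for example the one returned by open("ipg150106.xml")) could be passed in place of the sample list, so only one document is held in memory at a time.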

def files():
    n = 0
    while True:
        n += 1
        yield open('xml_%d.xml' % n, 'w')

if __name__ == '__main__':
    # Split the file into separate files (already done, so commented out).
    # pat = '<?xml'
    # fs = files()
    # outfile = next(fs) 

    # with open("ipg150106.xml") as infile:
    #     for line in infile:
    #         if pat not in line:
    #             outfile.write(line)
    #         else:
    #             items = line.split(pat)
    #             outfile.write(items[0])
    #             for item in items[1:]:
    #                 outfile = next(fs)
    #                 outfile.write(pat + item)

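    # Analyze each split file.
    import os
    from bs4 import BeautifulSoup

    pwd = os.path.dirname(os.path.realpath(__file__))
    # Keep only the .xml files so the script itself is not parsed.
    xml_files = [name for name in os.listdir(pwd)
                 if name.endswith('.xml') and os.path.isfile(os.path.join(pwd, name))]

    for name in xml_files:
        # os.listdir returns names, so open each file before reading it.
        with open(os.path.join(pwd, name)) as f:
            xml = f.read()
        soup = BeautifulSoup(xml)
        # Remaining code here...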

(Sorry for the weird code block :( )
