Question

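I have two large XML files (c. 100MB each) containing many items. I want to output the differences between them.

Each item has an ID, and I need to check whether it is present in both files. If it is, I then need to compare the item's individual values to confirm that it is the same item.

Is a SAX parser the best way to solve this, and how is it used? I used ElementTree and findall, which worked on smaller files, but I can't use that approach for the large files.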

from xml.etree.ElementTree import ElementTree

srcTree = ElementTree()
srcTree.parse(srcFile)
dstTree = ElementTree()
dstTree.parse(dstFile)

# find all the items in both files
srcComponents = srcTree.find('source').find('items')
srcItems = srcComponents.findall('item')
dstComponents = dstTree.find('source').find('items')
dstItems = dstComponents.findall('item')

# parse the source file to find the values of the various fields of each
# item and add the information to the source set
srcSet = set()
for item in srcItems:
  srcId = item.get('id')
  srcList = [srcId]
  details = item.find('values')
  srcVariables = details.findall('value')
  for var in srcVariables:
    srcList.append((var.get('name'), var.text))
  # store one hashable tuple per item (inside the loop, so every item is kept)
  srcSet.add(tuple(srcList))

Answer 1

You can use ElementTree as a pull parser (like SAX): http://effbot.org/zone/element-pull.htm. ElementTree also has an iterparse function: http://effbot.org/zone/element-iterparse.htm. Both approaches let you process large files without loading everything into memory.
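As a rough sketch of the iterparse approach, assuming the same source/items/item/values/value layout as the code in the question (the helper names iter_items and diff_items are made up for illustration, not part of ElementTree):

```python
import xml.etree.ElementTree as ET


def iter_items(source):
    """Yield (id, values) for each <item> without building the whole tree.

    `source` may be a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            # The "end" event fires once the item and its children are complete.
            values = tuple((v.get("name"), v.text) for v in elem.iter("value"))
            yield elem.get("id"), values
            # Drop the item's children and attributes so memory stays low;
            # an empty <item> shell remains attached to its parent.
            elem.clear()


def diff_items(src, dst):
    """Return (ids only in src, ids only in dst, ids whose values differ)."""
    src_items = dict(iter_items(src))
    dst_items = dict(iter_items(dst))
    only_src = sorted(src_items.keys() - dst_items.keys())
    only_dst = sorted(dst_items.keys() - src_items.keys())
    changed = sorted(
        key
        for key in src_items.keys() & dst_items.keys()
        if src_items[key] != dst_items[key]
    )
    return only_src, only_dst, changed
```

This still keeps one tuple per item in memory for both files, which is usually far smaller than the full parsed trees. Something like `only_src, only_dst, changed = diff_items(srcFile, dstFile)` would then give the three lists of IDs.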

SAX would work too (I have processed data larger than 100MB with it), but these days I would use ElementTree for this job.
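For comparison, a minimal SAX sketch using the standard library's xml.sax, again assuming the item/values/value layout from the question (the ItemHandler class and parse_items function are illustrative names):

```python
import xml.sax


class ItemHandler(xml.sax.ContentHandler):
    """Collect {item id: ((name, text), ...)} from a stream of SAX events."""

    def __init__(self):
        super().__init__()
        self.items = {}
        self._item_id = None
        self._values = []
        self._in_value = False
        self._value_name = None
        self._text = []

    def startElement(self, name, attrs):
        if name == "item":
            self._item_id = attrs.get("id")
            self._values = []
        elif name == "value":
            self._in_value = True
            self._value_name = attrs.get("name")
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_value:
            self._text.append(content)

    def endElement(self, name):
        if name == "value":
            self._values.append((self._value_name, "".join(self._text)))
            self._in_value = False
        elif name == "item":
            self.items[self._item_id] = tuple(self._values)


def parse_items(source):
    """Parse a filename or binary file object and return the collected items."""
    handler = ItemHandler()
    xml.sax.parse(source, handler)
    return handler.items
```

Note that the handler has to track its own state (which element it is inside, which text belongs to it), which is the main reason the ElementTree version tends to be shorter. An empty value comes back as an empty string here, whereas ElementTree reports its text as None.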

See also incremental/event-based parsing with lxml (etree compatible): http://lxml.de/tutorial.html#event-driven-parsing

Here is a good article about using iterparse on files > 1GB: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
