Removing child elements with Python / ElementTree

Time: 2015-09-30 09:51:38

Tags: python xml parsing elementtree removechild

I am trying to parse a 300 MB XML file with ElementTree, following advice similar to that in "Can Python xml ElementTree parse a very large xml file?":
from xml.etree import ElementTree as Et

for event, elem in Et.iterparse('C:\...path...\desc2015.xml'):  
    if elem.tag == 'DescriptorRecord':
        for e in elem._children:
            if str(e.tag) in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                e.clear()
                elem.remove(e)
                print 'removed %s' % e

...which gives:

removed <Element 'HistoryNote' at 0x557cc7f0>
removed <Element 'DateCreated' at 0x557fa990>
removed <Element 'HistoryNote' at 0x55809af0>
removed <Element 'DateCreated' at 0x5580f5d0>

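However, this just keeps going: the file does not get any smaller, and when I inspect it the elements are still there. I have tried using only e.clear() or only elem.remove(e), with the same result. Regards

Update

The error output from the code in my first comment on @alexanderlukanin13's answer: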

Traceback (most recent call last):
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1570, in trace_dispatch
Traceback (most recent call last):
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 2278, in <module>
    globals = debugger.run(setup['file'], None, None)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1704, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 234, in <module>
    main()
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 78, in main
    return pydev_runfiles.main(configuration)  # Note: still doesn't return a proper value.
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 835, in main
    PydevTestRunner(configuration).run_tests()
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 762, in run_tests
    file_and_modules_and_module_name = self.find_modules_from_files(files)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 517, in find_modules_from_files
    mod = self.__get_module_from_str(import_str, print_exception, pyfile)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 476, in __get_module_from_str
    buf_err = pydevd_io.StartRedirect(keep_original_redirection=True, std='stderr')
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd_io.py", line 72, in StartRedirect
    import sys
MemoryError

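1 answer:

Answer 0 (score: 1)

The main problem in your script is that you never save the modified XML back to disk. You need to keep a reference to the root element and then call ElementTree.write: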

from xml.etree import ElementTree as Et

context = Et.iterparse('input.xml')
root = None
for event, elem in context:
    if elem.tag == 'DescriptorRecord':
        for e in list(elem):  # Iterate over a copy; don't use _children (private), and getchildren() was removed in Python 3.9
            if e.tag in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                elem.remove(e)  # You need remove(), not clear()
    root = elem

with open('output.xml', 'wb') as out_file:  # avoid shadowing the built-in name "file"
    Et.ElementTree(root).write(out_file, encoding='utf-8', xml_declaration=True)

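Note: here I use a clumsy (and possibly unsafe) way to get the root element: I assume it is always the last element in the iterparse output. If anyone knows a better way, please let us know.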
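
One commonly used alternative for getting the root, sketched here rather than taken from the original answer, is to ask iterparse for 'start' events as well and treat the first element it yields as the root. The helper name strip_unwanted, the constant UNWANTED_TAGS, and the small in-memory sample document are all hypothetical and only serve the illustration; the tag names are the ones from the question.

```python
import io
from xml.etree import ElementTree as ET

# Tag names taken from the question; the constant name itself is hypothetical.
UNWANTED_TAGS = {'DateCreated', 'Year', 'Month', 'TreeNumber',
                 'HistoryNote', 'PreviousIndexing'}


def strip_unwanted(source, record_tag='DescriptorRecord', unwanted=UNWANTED_TAGS):
    """Parse `source` with iterparse and return the root element,
    with the unwanted children removed from every record element."""
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the 'start' of the document's root element.
    _, root = next(context)
    for event, elem in context:
        # Only act on 'end' events, when the record's children are complete.
        if event == 'end' and elem.tag == record_tag:
            for child in list(elem):  # iterate over a copy while removing
                if child.tag in unwanted:
                    elem.remove(child)
    return root


# Tiny made-up document standing in for the real 300 MB file.
sample = (b"<Set><DescriptorRecord><Name>A</Name><Year>2015</Year>"
          b"<HistoryNote>x</HistoryNote></DescriptorRecord></Set>")
root = strip_unwanted(io.BytesIO(sample))
remaining = [child.tag for child in root.find('DescriptorRecord')]
print(remaining)  # ['Name']
```

The returned root can then be passed to ElementTree(root).write(...) exactly as in the answer above. Note that this still keeps the whole tree in memory; it only changes how the root reference is obtained.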