Efficiently parsing large XML files

Time: 2018-11-22 12:13:01

Tags: python lxml

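I am working with XML files larger than 4 GB and would like to know how best to parse them. I am currently running into memory problems and am looking for a way to process a file without loading all of it into memory, possibly in batches.

The current code uses lxml and iterates over the repeating elements. Namespaces are stripped beforehand: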

from lxml import etree, objectify
import pandas as pd

file = 'some_huge_file.xml'
if file.lower().endswith('.xml'):

    # etree.parse builds the complete tree in memory
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(file, parser)
    root = tree.getroot()

    #### strip the namespace prefix from every tag
    for elem in root.iter():
        if not hasattr(elem.tag, 'find'): continue  # (1) skip comments and processing instructions
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i + 1:]
    objectify.deannotate(root, cleanup_namespaces=True)
    ####
    # one row per <subelement>, each value still an unsplit string
    data = [{
        'Element1': tp.findtext('element1'),
        'Element2': tp.findtext('element2'),
        'Element3': tp.findtext('element3'),
    }
        for tp in tree.xpath('//mainelement/subelement')]

    df = pd.DataFrame(data)
    print(df)

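In addition, I need to split the element values, since they are separated by spaces. I only need specific values, though, so I am wondering whether this can be done during parsing instead of splitting the columns on whitespace afterwards?

XML example: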

<mainelement>
    <subelement tc="00:00:00:000" ms="0">
        <element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
        <element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
        <element3>-0.001501 0.008416 0.000098 0.005241 0.005301</element3>
    </subelement>
    <subelement tc="00:00:00:001" ms="1">
        <element1>-0.503814 0.005664 -0.070326 -0.860926 -0.503720</element1>
        <element2>-0.044658 0.046381 0.909291 -0.033390 0.049348</element2>
        <element3>-0.000000 -0.000000 -0.000000 -0.005217 0.007849</element3>
    </subelement>
    <subelement tc="00:00:00:002" ms="2">
        <element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
        <element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
        <element3>-0.010139 0.001048 0.026847 -0.010139 0.001048</element3>
    </subelement>
    <subelement tc="00:00:00:003" ms="3">
        <element1>0.856813 0.494664 0.003921 0.023356 0.868762</element1>
        <element2>-0.030966 0.059466 1.388910 -0.152636 -0.008650</element2>
        <element3>0.001048 0.026847 -0.010139 0.001048 0.035846</element3>
    </subelement>
</mainelement>

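1 answer:

Answer 0: (score: 0)

Based on the link you posted in the comments, I came up with the following to iterate and split more efficiently, and it works well: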

from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'

# One list per output column
time = []
data1_element1_x = []
data1_element1_y = []
data1_element2_x = []
data1_element2_y = []
data2_element1_x = []
data2_element1_y = []
data2_element2_x = []
data2_element2_y = []

if file.lower().endswith('.xml'):
    # iterparse yields each element once its end tag has been read
    for event, elem in etree.iterparse(file):
        if elem.tag == "subelement":
            time.append(elem.get('tc'))
            for child in elem:
                if child.tag == "element1":
                    # split() with no argument tolerates leading or repeated whitespace
                    split_data = child.text.split()
                    data1_element1_x.append(float(split_data[0]))
                    data1_element1_y.append(float(split_data[1]))
                    data2_element1_x.append(float(split_data[2]))
                    data2_element1_y.append(float(split_data[3]))
                elif child.tag == "element2":
                    split_data = child.text.split()
                    data1_element2_x.append(float(split_data[0]))
                    data1_element2_y.append(float(split_data[1]))
                    data2_element2_x.append(float(split_data[2]))
                    data2_element2_y.append(float(split_data[3]))
            # Clear only after the whole <subelement> has been read;
            # clearing children earlier would drop their text
            elem.clear()

df = pd.DataFrame({
    'Time': time,
    'Data1_element1_x': data1_element1_x,
    'Data1_element1_y': data1_element1_y,
    'Data1_element2_x': data1_element2_x,
    'Data1_element2_y': data1_element2_y,
    'Data2_element1_x': data2_element1_x,
    'Data2_element1_y': data2_element1_y,
    'Data2_element2_x': data2_element2_x,
    'Data2_element2_y': data2_element2_y
})

print(df)
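
The question also asks about batch processing and about keeping only specific values. A minimal sketch of one way to do both with `etree.iterparse` is below. The helper names `iter_rows` and `iter_batches`, the `fields` mapping, and the shortened inline sample are assumptions made for illustration, not part of the original code. It also assumes the tags carry no namespace, as in the sample XML. Besides `elem.clear()`, it deletes already-processed siblings from the parent, since cleared elements otherwise stay attached to the root and still accumulate on a multi-gigabyte file.

```python
import io
from lxml import etree


def iter_rows(source, fields, row_tag="subelement"):
    """Yield one dict per <row_tag> element while streaming the file.

    `fields` (hypothetical helper argument) maps a child tag to the
    positions of the whitespace-separated values to keep,
    e.g. {"element1": (0, 1)}.
    """
    for _, elem in etree.iterparse(source, events=("end",), tag=row_tag):
        row = {"tc": elem.get("tc")}
        for child in elem:
            positions = fields.get(child.tag)
            if positions is None:
                continue  # child we do not need, e.g. element3
            values = (child.text or "").split()
            for pos in positions:
                row[f"{child.tag}_{pos}"] = float(values[pos])
        # Free the finished element and the siblings processed before it.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        yield row


def iter_batches(rows, size):
    """Group an iterable of rows into lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Shortened copy of the sample XML; a real run would pass the file path.
SAMPLE = b"""<mainelement>
    <subelement tc="00:00:00:000" ms="0">
        <element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
        <element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
    </subelement>
    <subelement tc="00:00:00:002" ms="2">
        <element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
        <element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
    </subelement>
</mainelement>"""

fields = {"element1": (0, 1), "element2": (0, 1)}
batches = list(iter_batches(iter_rows(io.BytesIO(SAMPLE), fields), size=1))
```

Each batch is a plain list of dicts, so it can be passed to `pd.DataFrame(batch)` and written out (for example appended to a CSV) before the next batch is read, which keeps peak memory roughly proportional to the batch size rather than to the file size.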