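I am working with XML files larger than 4 GB and want to know the best way to parse them. I am currently running into memory problems, so I am looking for an approach that does not load the whole file into memory and, if possible, processes it in batches.
My current code uses lxml and iterates over the repeating elements. Namespaces are stripped beforehand: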
from lxml import etree, objectify
import pandas as pd

file = 'some_huge_file.xml'
if file.lower().endswith('.xml'):
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(file, parser)  # builds the whole tree in memory
    root = tree.getroot()

    # Strip the namespace part ('{uri}') from every tag.
    for elem in root.iter():
        if not hasattr(elem.tag, 'find'):
            continue  # skip comments and processing instructions (tag is not a string)
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i + 1:]
    objectify.deannotate(root, cleanup_namespaces=True)

    # One row per <subelement>; each value is still the raw space-separated string.
    data = [{
        'Element1': tp.findtext('element1'),
        'Element2': tp.findtext('element2'),
        'Element3': tp.findtext('element3'),
    } for tp in tree.xpath('//mainelement/subelement')]
    df = pd.DataFrame(data)
    print(df)
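In addition, I need to split the element values, because each one holds several numbers separated by spaces. I only need specific values, though, so I wonder whether this can be done during parsing instead of splitting the columns on whitespace afterwards.
Example XML: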
<mainelement>
<subelement tc="00:00:00:000" ms="0">
<element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
<element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
<element3>-0.001501 0.008416 0.000098 0.005241 0.005301</element3>
</subelement>
<subelement tc="00:00:00:001" ms="1">
<element1>-0.503814 0.005664 -0.070326 -0.860926 -0.503720</element1>
<element2>-0.044658 0.046381 0.909291 -0.033390 0.049348</element2>
<element3>-0.000000 -0.000000 -0.000000 -0.005217 0.007849</element3>
</subelement>
<subelement tc="00:00:00:002" ms="2">
<element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
<element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
<element3>-0.010139 0.001048 0.026847 -0.010139 0.001048</element3>
</subelement>
<subelement tc="00:00:00:003" ms="3">
<element1>0.856813 0.494664 0.003921 0.023356 0.868762</element1>
<element2>-0.030966 0.059466 1.388910 -0.152636 -0.008650</element2>
<element3>0.001048 0.026847 -0.010139 0.001048 0.035846</element3>
</subelement>
</mainelement>
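Answer 0 (score: 0)
Based on the link you posted in the comments, I came up with the following, which iterates and splits more efficiently and works well: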
from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'
time = []
data1_element1_x = []
data1_element1_y = []
data1_element2_x = []
data1_element2_y = []
data2_element1_x = []
data2_element1_y = []
data2_element2_x = []
data2_element2_y = []

if file.lower().endswith('.xml'):
    # iterparse reports each element at its closing tag by default,
    # so a <subelement> already has all of its children when it arrives here.
    for event, elem in etree.iterparse(file):
        if elem.tag == "subelement":
            time.append(elem.get('tc'))
            for child in elem:
                if child.tag == "element1":
                    # split() without arguments tolerates leading or repeated whitespace
                    split_data = child.text.split()
                    data1_element1_x.append(float(split_data[0]))
                    data1_element1_y.append(float(split_data[1]))
                    data2_element1_x.append(float(split_data[2]))
                    data2_element1_y.append(float(split_data[3]))
                elif child.tag == "element2":
                    split_data = child.text.split()
                    data1_element2_x.append(float(split_data[0]))
                    data1_element2_y.append(float(split_data[1]))
                    data2_element2_x.append(float(split_data[2]))
                    data2_element2_y.append(float(split_data[3]))
            # Clear only after the whole <subelement> has been read; clearing
            # every element would wipe the children's text before it is used.
            elem.clear()

df = pd.DataFrame({
    'Time': time,
    'Data1_element1_x': data1_element1_x,
    'Data1_element1_y': data1_element1_y,
    'Data1_element2_x': data1_element2_x,
    'Data1_element2_y': data1_element2_y,
    'Data2_element1_x': data2_element1_x,
    'Data2_element1_y': data2_element1_y,
    'Data2_element2_x': data2_element2_x,
    'Data2_element2_y': data2_element2_y
})
print(df)
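
The version above still collects every value in Python lists before building the DataFrame, so for the batch processing asked about in the question something like the following might help. This is only a rough sketch: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, the helper names iter_rows and iter_batches and the keep parameter are made up for illustration, and it assumes every <subelement> is a direct child of the root element, as in the sample XML.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, keep=(0, 1, 2, 3)):
    """Yield one dict per <subelement>, keeping only the value positions in `keep`.

    Hypothetical helper; assumes <subelement> nodes sit directly under the root.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "subelement":
            continue
        row = {"tc": elem.get("tc")}
        for child in elem:
            # split() handles leading and repeated whitespace
            values = (child.text or "").split()
            for i in keep:
                row[f"{child.tag}_{i}"] = float(values[i])
        yield row
        # Drop the subelements parsed so far so the tree does not keep growing.
        root.clear()


def iter_batches(rows, size):
    """Group an iterable of rows into lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Small in-memory demo taken from the sample XML in the question.
SAMPLE = b"""<mainelement>
<subelement tc="00:00:00:000" ms="0">
<element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
<element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
</subelement>
<subelement tc="00:00:00:002" ms="2">
<element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
<element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
</subelement>
<subelement tc="00:00:00:003" ms="3">
<element1>0.856813 0.494664 0.003921 0.023356 0.868762</element1>
<element2>-0.030966 0.059466 1.388910 -0.152636 -0.008650</element2>
</subelement>
</mainelement>"""

batches = list(iter_batches(iter_rows(io.BytesIO(SAMPLE)), size=2))
for batch in batches:
    print(len(batch), batch[0]["tc"])
```

With a real file you would pass the file path instead of the BytesIO object and pick a much larger batch size. Each batch is a list of dicts, so it can be turned into a DataFrame with pd.DataFrame(batch) and written out (for example appended to a CSV) before the next batch is read, which keeps only one batch in memory at a time.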