I'm new to Python, and I'm looking for a fast way to parse large XML files (~0.5-1 GB) that follow this template:
<timestep time="2.00">
    <vehicle id="carflow.0" x="-9897.274589" y="-8.250000" speed="49.840822" lane="section1_0" />
    .... (more vehicles)
</timestep>
... (more timesteps)
I would like to parse it into a pandas DataFrame. My code is below (ET is lxml.etree):
import pandas as pd
from lxml import etree as ET

def parseXML(filename):
    df = pd.DataFrame()
    old_time = 0.0
    time = 0.0
    events = ("end", "start")
    tree = ET.iterparse(filename, events=events)
    for event, elem in tree:
        if elem.tag == "timestep" and event == "start":
            # Remember the current timestep so each vehicle row can carry it
            time = float(elem.attrib.get('time'))
        elif elem.tag == "timestep" and event == "end":
            elem.clear()
        elif elem.tag == 'vehicle' and event == "end":
            # "carflow.0" -> 0, "section1_0" -> 0
            id = int(elem.attrib.get('id').split('.')[1])
            x = float(elem.attrib.get('x'))
            y = float(elem.attrib.get('y'))
            speed = float(elem.attrib.get('speed'))
            lane = int(elem.attrib.get('lane').split('_')[1])
            data = pd.DataFrame([time, id, x, y, speed, lane]).T
            elem.clear()
            # Grow the frame by one row per vehicle
            df = pd.concat([df, data])
        # Print progress every 50 time units
        if time % 50 == 0 and time != old_time:
            old_time = time
            print(time)
    df.columns = ['time', 'id', 'x', 'y', 'speed', 'lane']
    return df
Is there a way to improve my code?
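One possible direction, offered as a minimal sketch rather than a definitive answer: collect the parsed values in a plain Python list and build the DataFrame only once at the end, since growing a DataFrame one row at a time copies the accumulated data on every step. The sketch makes several assumptions that are not in the question: it uses the standard-library `xml.etree.ElementTree` as a stand-in for `lxml.etree` (its `iterparse` is called the same way here), the names `parse_rows`, `COLUMNS`, and `SAMPLE` are hypothetical, and the `<root>` wrapper element in the sample data is assumed because the template does not show the document's root.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stand-in for lxml.etree

COLUMNS = ["time", "id", "x", "y", "speed", "lane"]

def parse_rows(source):
    """Return one tuple per <vehicle>, tagged with its timestep's time."""
    rows = []
    time = 0.0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag == "timestep":
            if event == "start":
                # Attributes are already available on the start event
                time = float(elem.get("time"))
            else:
                # Drop the finished timestep's children to limit memory use
                elem.clear()
        elif elem.tag == "vehicle" and event == "end":
            rows.append((
                time,
                int(elem.get("id").split(".")[1]),    # "carflow.0" -> 0
                float(elem.get("x")),
                float(elem.get("y")),
                float(elem.get("speed")),
                int(elem.get("lane").split("_")[1]),  # "section1_0" -> 0
            ))
    return rows

# Hypothetical sample data; the <root> wrapper is an assumption
SAMPLE = b"""<root>
<timestep time="2.00">
<vehicle id="carflow.0" x="-9897.274589" y="-8.250000" speed="49.840822" lane="section1_0" />
<vehicle id="carflow.1" x="-9900.0" y="-4.5" speed="50.0" lane="section1_1" />
</timestep>
<timestep time="3.00">
<vehicle id="carflow.0" x="-9847.5" y="-8.25" speed="49.9" lane="section1_0" />
</timestep>
</root>"""

rows = parse_rows(io.BytesIO(SAMPLE))
```

With pandas available, `pd.DataFrame(rows, columns=COLUMNS)` would then build the frame in a single step. Whether this is fast enough for a 0.5-1 GB file would need to be measured on the real data.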