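Creating a pandas DataFrame from XML data

Time: 2020-04-30 07:50:40

Tags: python xml pandas

I am working with an XML data file that contains tracking data for the players during a football match. Here is a snippet from the top of the XML data file: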

<?xml version="1.0" encoding="utf-8"?>
<Tracking update="2017-01-23T14:41:26">
  <Match id="2019285" dateMatch="2016-09-13T18:45:00" matchNumber="13">
    <Competition id="20159" name="UEFA Champions League 2016/2017" />
    <Stadium id="85265" name="Estádio do SL Benfica" pitchLength="10500" pitchWidth="6800" />
    <Phases>
      <Phase start="2016-09-13T18:45:35.245" end="2016-09-13T19:31:49.09" leftTeamID="50157" />
      <Phase start="2016-09-13T19:47:39.336" end="2016-09-13T20:37:10.591" leftTeamID="50147" />
    </Phases>
    <Frames>
      <Frame utc="2016-09-13T18:45:35.272" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2562" z="0" sampling="0" />
          <Obj type="0" id="105823" x="939" y="113" sampling="0" />
          <Obj type="0" id="250086090" x="1194" y="1425" sampling="0" />
          <Obj type="0" id="250080473" x="37" y="2875" sampling="0" />
          <Obj type="0" id="250054760" x="329" y="833" sampling="0" />
          <Obj type="1" id="98593" x="-978" y="654" sampling="0" />
          <Obj type="0" id="250075765" x="1724" y="392" sampling="0" />
          <Obj type="1" id="53733" x="-4702" y="45" sampling="0" />
          <Obj type="0" id="250101112" x="54" y="1436" sampling="0" />
          <Obj type="1" id="250017920" x="-46" y="-2562" sampling="0" />
          <Obj type="1" id="105588" x="-1449" y="209" sampling="0" />
          <Obj type="1" id="250003757" x="-2395" y="-308" sampling="0" />
          <Obj type="1" id="101473" x="-690" y="-644" sampling="0" />
          <Obj type="0" id="250075775" x="2069" y="-895" sampling="0" />
          <Obj type="1" id="103695" x="-1654" y="-2022" sampling="0" />
          <Obj type="0" id="250073809" x="4712" y="-16" sampling="0" />
          <Obj type="1" id="63733" x="-2393" y="1145" sampling="0" />
          <Obj type="0" id="250015755" x="-42" y="31" sampling="0" />
          <Obj type="0" id="250055905" x="1437" y="-2791" sampling="0" />
          <Obj type="0" id="250042422" x="1169" y="-1250" sampling="0" />
        </Objs>
      </Frame>
      <Frame utc="2016-09-13T18:45:35.319" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2558" z="0" sampling="0" />
          <Obj type="0" id="105823" x="938" y="113" sampling="0" />
          <Obj type="0" id="250086090" x="1198" y="1426" sampling="0" />
          <Obj type="0" id="250080473" x="36" y="2874" sampling="0" />
          <Obj type="0" id="250054760" x="330" y="833" sampling="0" />
          <Obj type="1" id="98593" x="-980" y="654" sampling="0" />
          <Obj type="0" id="250075765" x="1727" y="393" sampling="0" />
          <Obj type="1" id="53733" x="-4712" y="44" sampling="0" />
          <Obj type="0" id="250101112" x="54" y="1435" sampling="0" />
          <Obj type="1" id="250017920" x="-46" y="-2558" sampling="0" />
          <Obj type="1" id="105588" x="-1449" y="209" sampling="0" />
          <Obj type="1" id="250003757" x="-2396" y="-310" sampling="0" />
          <Obj type="1" id="101473" x="-692" y="-645" sampling="0" />
          <Obj type="0" id="250075775" x="2071" y="-896" sampling="0" />
          <Obj type="1" id="103695" x="-1655" y="-2016" sampling="0" />
          <Obj type="0" id="250073809" x="4712" y="-17" sampling="0" />
          <Obj type="1" id="63733" x="-2395" y="1145" sampling="0" />
          <Obj type="0" id="250015755" x="-42" y="29" sampling="0" />
          <Obj type="0" id="250055905" x="1435" y="-2793" sampling="0" />
          <Obj type="0" id="250042422" x="1169" y="-1250" sampling="0" />
        </Objs>
      </Frame>
    </Frames>
  </Match>
</Tracking>

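As I understand it, this is how the file breaks down:

  • The root element is Tracking.
  • Match is a child of Tracking.
  • Competition, Stadium, Phases and Frames are children of Match.
  • Phase is a child of Phases.
  • Frame is a child of Frames.
  • Frames contains many Frame children. In fact, there is a Frame child every 45 milliseconds throughout the entire football match. Each Frame child holds the position of every player, the referee and the ball. The actual file goes on for thousands of lines, but this snippet shows only the first two frames.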
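The breakdown above can be checked with a short sketch that walks the tree and prints each tag with its attribute count. This is only an illustration: SAMPLE is a trimmed stand-in for the real file, and outline and tree_lines are hypothetical names chosen here, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed stand-in for the tracking file (one frame, two objects).
SAMPLE = """<Tracking update="2017-01-23T14:41:26">
  <Match id="2019285">
    <Phases>
      <Phase start="2016-09-13T18:45:35.245" leftTeamID="50157" />
    </Phases>
    <Frames>
      <Frame utc="2016-09-13T18:45:35.272" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2562" z="0" sampling="0" />
          <Obj type="0" id="105823" x="939" y="113" sampling="0" />
        </Objs>
      </Frame>
    </Frames>
  </Match>
</Tracking>"""


def outline(element, depth=0):
    """Return one indented line per element: tag name plus attribute count."""
    lines = ["  " * depth + f"{element.tag} ({len(element.attrib)} attributes)"]
    for child in element:  # iterating an Element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Container elements such as Phases, Frames and Objs carry no attributes of their own; their data sits in their child elements.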

I am trying to run the following code to see all the data in the children of Match:

for x in myroot[0]:
    print(x.tag, x.attrib, x.text)

This is the output:

Competition {'id': '20159', 'name': 'UEFA Champions League 2016/2017'} None
Stadium {'id': '85265', 'name': 'Estádio do SL Benfica', 'pitchLength': '10500', 'pitchWidth': '6800'} None
Phases {} 

Frames {} 

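As you can see, the output shows two empty dictionaries for Phases and Frames. How do I get the data out of those children?

My next challenge is getting this data into a pandas DataFrame. How would I go about that?

I would like the pandas DataFrame to look like this (an example with two frames, but ideally it would cover every frame):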

Expected output

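1 answer:

Answer 0: (score: 1)

I used the xml.etree module to walk through the XML and extract the relevant data. The comments in the code below explain the process; take a look and then try the code. Hopefully it fits your use case.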

import xml.etree.ElementTree as ET
from collections import defaultdict

import pandas as pd

d = defaultdict(list)
# Since you are reading from a file,
# root should be root = ET.parse('filename.xml').getroot()
# Mine is wrapped in a string (data), hence:
root = ET.fromstring(data)
# The required data is in the Frame section
for ent in root.findall('./Match//Frame'):
    # This gets us the timestamp
    Frame = ent.attrib['utc']
    for entry in ent.findall('Objs/Obj'):
        # Append the objects to the relevant timestamp
        d[Frame].append(entry.attrib)

df = (pd.concat((pd.DataFrame(value)  # create a DataFrame of the values
                 .assign(Frame=key)  # add the timestamp as a column
                 .filter(['id', 'Frame', 'x', 'y', 'z'])  # keep only required columns
                 for key, value in d.items()),
                axis=1)  # concatenate along the columns axis
      )

df.head()

          id                    Frame     x      y    z         id                    Frame     x      y    z
0          0  2016-09-13T18:45:35.272   -46  -2562    0          0  2016-09-13T18:45:35.319   -46  -2558    0
1     105823  2016-09-13T18:45:35.272   939    113  NaN     105823  2016-09-13T18:45:35.319   938    113  NaN
2  250086090  2016-09-13T18:45:35.272  1194   1425  NaN  250086090  2016-09-13T18:45:35.319  1198   1426  NaN
3  250080473  2016-09-13T18:45:35.272    37   2875  NaN  250080473  2016-09-13T18:45:35.319    36   2874  NaN
4  250054760  2016-09-13T18:45:35.272   329    833  NaN  250054760  2016-09-13T18:45:35.319   330    833  NaN
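
If one row per object per frame is more convenient than the side-by-side layout above, a long-format variant could look roughly like the sketch below. It rests on a few assumptions: SAMPLE is a trimmed stand-in for the real file, rows and long_df are names chosen for illustration, and converting the columns to datetime and numeric types is an optional extra step that the code above does not perform.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Assumption: a trimmed stand-in for the tracking file (two frames, two objects each).
SAMPLE = """<Tracking>
  <Match id="2019285">
    <Frames>
      <Frame utc="2016-09-13T18:45:35.272" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2562" z="0" sampling="0" />
          <Obj type="0" id="105823" x="939" y="113" sampling="0" />
        </Objs>
      </Frame>
      <Frame utc="2016-09-13T18:45:35.319" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2558" z="0" sampling="0" />
          <Obj type="0" id="105823" x="938" y="113" sampling="0" />
        </Objs>
      </Frame>
    </Frames>
  </Match>
</Tracking>"""

root = ET.fromstring(SAMPLE)

# Build one dictionary per Obj, tagged with the timestamp of its parent Frame.
rows = []
for frame in root.iter("Frame"):
    for obj in frame.iter("Obj"):
        row = dict(obj.attrib)
        row["Frame"] = frame.attrib["utc"]
        rows.append(row)

# Attributes missing on an Obj (z is only set for the ball) become NaN.
long_df = pd.DataFrame(rows, columns=["Frame", "id", "type", "x", "y", "z"])

# XML attributes are always strings, so convert them before doing arithmetic.
long_df["Frame"] = pd.to_datetime(long_df["Frame"])
for col in ["x", "y", "z"]:
    long_df[col] = pd.to_numeric(long_df[col])

print(long_df)
```

With the rows stacked this way, every frame adds rows instead of columns, which tends to be easier to filter and group when the file holds thousands of frames.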