Parsing a deeply nested XML file in Python

Time: 2015-01-05 22:56:13

Tags: python xml

I am looking at an XML file similar to the following:

<pinnacle_line_feed>
  <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
  <lastContest>28962804</lastContest>
  <lastGame>162995589</lastGame>
  <events>
    <event>
      <event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
      <gamenumber>422739932</gamenumber>
      <sporttype>Alpine Skiing</sporttype>
      <league>DH 145</league>
      <IsLive>No</IsLive>
      <participants>
        <participant>
          <participant_name>Kjetil Jansrud (NOR)</participant_name>
          <contestantnum>2001</contestantnum>
          <rotnum>2001</rotnum>
          <visiting_home_draw>Visiting</visiting_home_draw>
        </participant>
        <participant>
          <participant_name>The Field</participant_name>
          <contestantnum>2002</contestantnum>
          <rotnum>2002</rotnum>
          <visiting_home_draw>Home</visiting_home_draw>
        </participant>
      </participants>
      <periods>
        <period>
          <period_number>0</period_number>
          <period_description>Matchups</period_description>
          <periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
          <period_status>I</period_status>
          <period_update>open</period_update>
          <spread_maximum>200</spread_maximum>
          <moneyline_maximum>100</moneyline_maximum>
          <total_maximum>200</total_maximum>
          <moneyline>
            <moneyline_visiting>116</moneyline_visiting>
            <moneyline_home>-136</moneyline_home>
          </moneyline>
        </period>
      </periods>
      <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
    </event>
  </events>
</pinnacle_line_feed>

I parsed the file using the following code:

import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
import xml.etree.ElementTree as ET

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'

tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []

for event in root.iter('event'):
    # Fields that sit directly under <event>
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    # Each <participant>; the variables are overwritten on every pass
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    # Each <period> and its nested <moneyline>
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT').text
            status = period.find('period_status').text
            update = period.find('period_update').text
            spread_max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text

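However, I would like to separate the nodes by event, in something like a 2D table (as in a pandas DataFrame). The full XML file has multiple "event" child nodes, and some events do not share the same nodes as the one above. I am struggling to take each event node and simply create a 2D table with the tags as the column names and the text as the values.

So far I have done the above to gauge how I might put this information into a dictionary, and then put a number of dictionaries into a list from which I can create a DataFrame with pandas. That has not worked out, because every attempt required me to find and replace text to create the dictionaries, and Python did not respond well when I then tried to create a DataFrame. I have also used a simple: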

for elt in tree.iter():
  # elt.text can be None for elements without text, so fall back to ''
  list.append("'%s': '%s'" % (elt.tag, (elt.text or '').strip()))

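which worked well for simply pulling out each tag and its corresponding text, but I have not been able to make anything of it, because every attempt to find and replace text to create dictionaries has gone badly.

Any assistance is greatly appreciated.

Thank you.

1 Answer:

Answer 0: (score: 2)

Here is an easy way to turn the XML into a pandas DataFrame. It uses the awesome requests library (which you can swap for urllib if you prefer), as well as the always helpful xmltodict library available on PyPI. (Note: a reverse library is also available, known as dicttoxml.)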

import json
import pandas
import requests
import xmltodict

web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')

# Make that unwieldy XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)

# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
#   visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))

# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
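One thing to watch with XML-to-dict conversion: a tag that appears once under its parent typically comes back as a single dict, while a repeated tag comes back as a list of dicts. So if the feed ever contained only one event, the `event` entry would probably be a dict rather than a list. The sketch below is a rough stdlib-only approximation of that behavior, not xmltodict's actual implementation. `element_to_dict` and `as_list` are hypothetical helper names, the sample XML is a trimmed version of the question's data, and attributes are ignored.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an Element into nested dicts, lists and strings."""
    children = list(elem)
    if not children:
        # Leaf node: keep the stripped text, or None for an empty element.
        text = (elem.text or "").strip()
        return text if text else None
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tag: collect the values in a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def as_list(value):
    """Wrap a lone item in a list; map None to an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Trimmed sample based on the XML in the question (one event only).
sample = """<pinnacle_line_feed><events><event>
  <gamenumber>422739932</gamenumber>
  <participants>
    <participant><participant_name>Kjetil Jansrud (NOR)</participant_name></participant>
    <participant><participant_name>The Field</participant_name></participant>
  </participants>
  <periods><period><period_number>0</period_number></period></periods>
</event></events></pinnacle_line_feed>"""

feed = element_to_dict(ET.fromstring(sample))
# A single <event> is a dict here, so normalize it before iterating.
events = as_list(feed["events"]["event"])
```

With the answer's code, passing the extracted `event` value through something like `as_list` before building the DataFrame should keep the one-event case from behaving differently. Recent xmltodict releases also accept a `force_list` argument to `parse` for the same purpose, if your installed version provides it.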

To give an idea of what the DataFrame looks like at this point, here are the first few lines of its CSV output:

>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo)  # Output the df to a CSV file
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball

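Note that the participants and periods columns are still native Python dictionaries. You will need to either leave them out of the column list or do some additional munging to flatten them: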

# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo, columns=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])  # older pandas releases called this argument cols
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball
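
For the other option, flattening the nested columns instead of dropping them, one possible approach is to collapse each event dictionary into a single-level dictionary whose keys are dotted paths. This is only a sketch: `flatten` is a hypothetical helper, the dotted naming scheme is an arbitrary choice, and the sample event is abbreviated from the output shown above.

```python
def flatten(value, prefix="", sep="."):
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            path = prefix + sep + key if prefix else key
            flat.update(flatten(item, path, sep))
    elif isinstance(value, list):
        # List items get their position in the path, e.g. participant.0.
        for index, item in enumerate(value):
            path = prefix + sep + str(index) if prefix else str(index)
            flat.update(flatten(item, path, sep))
    else:
        # Scalar (or None): store it under the accumulated path.
        flat[prefix] = value
    return flat


# Abbreviated event, using values from the CSV output above.
event = {
    "gamenumber": "426686588",
    "league": "Argentinian",
    "participants": {"participant": [
        {"participant_name": "Boca Juniors", "visiting_home_draw": "Home"},
        {"participant_name": "Penarol", "visiting_home_draw": "Visiting"},
    ]},
    "periods": {"period": {
        "period_number": "0",
        "moneyline": {"moneyline_visiting": "136", "moneyline_home": "-150"},
    }},
}

row = flatten(event)
```

Building the frame with `pandas.DataFrame([flatten(e) for e in normal_dict])` should then give one row per event, and events that lack a given node simply get a missing value in that column. Newer pandas versions also ship `pandas.json_normalize`, which covers the nested-dict part of this (it does not expand lists into numbered columns in the same way).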