我正在查看类似于以下内容的xml文件:
<pinnacle_line_feed>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
<lastContest>28962804</lastContest>
<lastGame>162995589</lastGame>
<events>
<event>
<event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
<gamenumber>422739932</gamenumber>
<sporttype>Alpine Skiing</sporttype>
<league>DH 145</league>
<IsLive>No</IsLive>
<participants>
<participant>
<participant_name>Kjetil Jansrud (NOR)</participant_name>
<contestantnum>2001</contestantnum>
<rotnum>2001</rotnum>
<visiting_home_draw>Visiting</visiting_home_draw>
</participant>
<participant>
<participant_name>The Field</participant_name>
<contestantnum>2002</contestantnum>
<rotnum>2002</rotnum>
<visiting_home_draw>Home</visiting_home_draw>
</participant>
</participants>
<periods>
<period>
<period_number>0</period_number>
<period_description>Matchups</period_description>
<periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
<period_status>I</period_status>
<period_update>open</period_update>
<spread_maximum>200</spread_maximum>
<moneyline_maximum>100</moneyline_maximum>
<total_maximum>200</total_maximum>
<moneyline>
<moneyline_visiting>116</moneyline_visiting>
<moneyline_home>-136</moneyline_home>
</moneyline>
</period>
</periods>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
</event>
</events>
</pinnacle_line_feed>
我使用以下代码解析了该文件:
pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'
tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []
for event in root.iter('event'):
event_datetimeGMT = event.find('event_datetimeGMT').text
gamenumber = event.find('gamenumber').text
sporttype = event.find('sporttype').text
league = event.find('league').text
IsLive = event.find('IsLive').text
for participants in event.iter('participants'):
for participant in participants.iter('participant'):
p1_name = participant.find('participant_name').text
contestantnum = participant.find('contestantnum').text
rotnum = participant.find('rotnum').text
vhd = participant.find('visiting_home_draw').text
for periods in event.iter('periods'):
for period in periods.iter('period'):
period_number = period.find('period_number').text
desc = period.find('period_description').text
pdatetime = period.find('periodcutoff_datetimeGMT')
status = period.find('period_status').text
update = period.find('period_update').text
max = period.find('spread_maximum').text
mlmax = period.find('moneyline_maximum').text
tot_max = period.find('total_maximum').text
for moneyline in period.iter('moneyline'):
ml_vis = moneyline.find('moneyline_visiting').text
ml_home = moneyline.find('moneyline_home').text
但是,我希望通过类似于2D表的事件(如在pandas数据帧中)来分隔节点。但是,完整的xml文件有多个“event”子节点,一些事件不与上面共享相同的节点。我正在努力获取每个事件节点并简单地创建一个带有标记的2d表以及该标记作为列名称并且文本充当值的值。
到目前为止,我已经完成了上述操作来衡量我如何将这些信息放入字典中,然后将一些字典放入一个列表中,我可以使用pandas从中创建一个数据帧,但这还没有解决因为所有的尝试都要求我找到并替换文本以创建dxcictionaries并且python在尝试随后创建数据帧时没有很好地响应。我也用了一个简单的:
for elt in tree.iter():
list.append("'%s': '%s'") % (elt.tag, elt.text.strip()))
在简单地拔出每个标签和相应的文本时工作得很好,但是我无法做出任何改变,因为任何尝试查找和替换文本以创建词典都是不好的。
非常感谢任何协助。
谢谢。
答案 0 :(得分:2)
这是将XML变为pandas数据帧的简便方法。这利用了令人敬畏的requests
库(如果您愿意,可以切换到urllib,以及pypi中可用的始终有用的xmltodict
库。(注意:也可以使用反向库) ,知道为dicttoxml
)
import json
import pandas
import requests
import xmltodict
web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
# Make that unweidly XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)
# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
# visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))
# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
为了了解这开始的样子,这里是CSV的前几行:
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo) # Output the df to a CSV file
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball
请注意,participants
和periods
列仍然是其原生Python词典。您需要将其从列列表中删除,或者进行一些额外的修改以使它们变平:
# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball