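I'm just getting started with data science and pandas, and I'm trying to populate a pandas DataFrame with data from an XML file. Here is my code: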
import xml.etree.cElementTree as et
import pandas as pd
import sys

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

def main():
    parsed_xml = et.parse("test2.xml")
    dfcols = ['Country', 'Club', 'Founded']
    df_xml = pd.DataFrame(columns=dfcols)

    for node in parsed_xml.getroot():
        Country = node.attrib.get('country')
        Club = node.find('Name')
        Founded = node.find('Founded')

        df_xml = df_xml.append(
            pd.Series([Country, getvalueofnode(Club), getvalueofnode(Founded)], index=dfcols),
            ignore_index=True)

    print(df_xml)

main()
Here is my output:
  Country  Club Founded
0    None  None    None
Here is my XML file:
<?xml version="1.0"?>
<SoccerFeed timestamp="20181123T153249+0000">
  <SoccerDocument Type="SQUADS Latest" competition_code="FR_L1" competition_id="24" competition_name="French Ligue 1" season_id="2016" season_name="Season 2016/2017">
    <Team country="France" country_id="8" country_iso="FR" region_id="17" region_name="Europe" >
      <Founded>1919</Founded>
      <Name>Angers</Name>
      <...>
    <Team country="France" country_id="8" country_iso="FR" region_id="17" region_name="Europe" >
      <Founded>1905</Founded>
      <Name>Bastia</Name>
Why am I not getting a pandas DataFrame with the information I need? Did I miss something in my code? Thanks for your help.
Answer 0 (score: 0)
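In the XML, <Founded> and <Name> are child tags of the <Team> tag, and the country attribute also belongs to the <Team> tag. The loop in the question iterates over the direct children of the root, which are <SoccerDocument> elements rather than <Team> elements, so every lookup returns None. Instead, we should iterate over the <Team> tags anywhere in the XML tree using iter.

Next, we need a way to store the values found on each iteration of the for loop, since these values become the row values of each column. To do that, we can create a dictionary (df_dict) with the three column names as keys and empty lists as values. On each iteration we append to the corresponding list for Country, Club, and Founded. Finally, we create the DataFrame (df) from this dictionary.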
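The nesting issue can be checked with the standard library alone. The sketch below uses a trimmed-down, well-formed stand-in for the question's file; the closing tags and the reduced attribute set are assumptions, since the original listing is elided. It compares iterating over the root directly with calling iter('Team').

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the question's XML (closing tags assumed).
XML = """<SoccerFeed timestamp="20181123T153249+0000">
  <SoccerDocument competition_code="FR_L1">
    <Team country="France">
      <Founded>1919</Founded>
      <Name>Angers</Name>
    </Team>
    <Team country="France">
      <Founded>1905</Founded>
      <Name>Bastia</Name>
    </Team>
  </SoccerDocument>
</SoccerFeed>"""

root = ET.fromstring(XML)

# Iterating over an element yields only its direct children.
direct_children = [child.tag for child in root]

# The single direct child is <SoccerDocument>: it has no 'country'
# attribute and no <Name> child, so both lookups come back as None.
first = root[0]
country_on_child = first.attrib.get('country')
name_on_child = first.find('Name')

# iter('Team') walks the whole subtree and finds every <Team>.
teams = [team.findtext('Name') for team in root.iter('Team')]

print(direct_children)
print(country_on_child, name_on_child)
print(teams)
```

This matches the single row of None values in the question's output: the loop ran once, on <SoccerDocument>.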
Here is the complete script:
# cElementTree was removed in Python 3.9; ElementTree uses the C
# implementation automatically when it is available.
import xml.etree.ElementTree as et
import pandas as pd

def main():
    parsed_xml = et.parse("test.xml")
    # One list per column; each <Team> contributes one value to each list.
    df_dict = {'Country': [], 'Club': [], 'Founded': []}
    root = parsed_xml.getroot()

    # iter('Team') finds <Team> elements at any depth below the root.
    for country in root.iter('Team'):
        Country = country.attrib.get('country')
        Club = country.find('Name').text
        Founded = country.find('Founded').text
        df_dict['Country'].append(Country)
        df_dict['Club'].append(Club)
        df_dict['Founded'].append(Founded)

    print('Dict for dataframe: {}'.format(df_dict))

    df = pd.DataFrame(df_dict)
    print("Dataframe: \n{}".format(df))

main()
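One caveat with the script above: find('Name').text raises AttributeError if a <Team> happens to lack that child. A possible variant, sketched here under assumptions, collects one dict per team and uses findtext, which returns None for a missing child. The helper name team_rows and the inline sample XML are hypothetical and not part of the original answer; the second team omits <Founded> on purpose to show the fallback.

```python
import xml.etree.ElementTree as ET

def team_rows(root):
    """Return one dict per <Team>; missing children become None."""
    rows = []
    for team in root.iter('Team'):
        rows.append({
            'Country': team.get('country'),       # attribute on <Team>
            'Club': team.findtext('Name'),        # text of <Name>, or None
            'Founded': team.findtext('Founded'),  # text of <Founded>, or None
        })
    return rows

# Hypothetical sample data; the second team has no <Founded> element.
sample = ET.fromstring(
    "<SoccerFeed><SoccerDocument>"
    "<Team country='France'><Founded>1919</Founded><Name>Angers</Name></Team>"
    "<Team country='France'><Name>Bastia</Name></Team>"
    "</SoccerDocument></SoccerFeed>"
)

rows = team_rows(sample)
print(rows)
```

Passing the result to pd.DataFrame(rows) should give the same three columns. Building the frame once from a list also avoids DataFrame.append, which the question's code relies on and which was removed in pandas 2.0.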