How to resolve a KeyError when parsing an XML file in Python

Time: 2020-06-30 14:53:00

Tags: python-3.x xml pandas dataframe xml-parsing

I have the following XML file, which I want to convert to a Pandas DataFrame.

row {'Id': '-1', 'Reputation': '1', 'CreationDate': '2009-09-28T00:00:00.000', 'DisplayName': 'Community', 'LastAccessDate': '2010-11-10T17:25:34.627', 'WebsiteUrl': 'http://meta.stackexchange.com/', 'Location': 'on the server farm', 'AboutMe': '<p>Hi, I\'m not really a person.</p>\n\n<p>I\'m a background process that helps keep this site clean!</p>\n\n<p>I do things like</p>\n\n<ul>\n<li>Randomly poke old unanswered questions every hour so they get some attention</li>\n<li>Own community questions and answers so nobody gets unnecessary reputation from them</li>\n<li>Own downvotes on spam/evil posts that get permanently deleted</li>\n<li>Own suggested edits from anonymous users</li>\n<li><a href="http://meta.stackexchange.com/a/92006">Remove abandoned questions</a></li>\n</ul>\n', 'Views': '0', 'UpVotes': '21001', 'DownVotes': '27468', 'AccountId': '-1'}
row {'Id': '1', 'Reputation': '21228', 'CreationDate': '2009-09-28T14:35:46.490', 'DisplayName': 'Anton Geraschenko', 'LastAccessDate': '2020-05-17T06:51:32.333', 'WebsiteUrl': 'http://stacky.net', 'Location': 'Palo Alto, CA, United States', 'AboutMe': '<p>You can get in touch with me at geraschenko@gmail.com.</p>\n', 'Views': '25360', 'UpVotes': '1052', 'DownVotes': '90', 'AccountId': '36500'}

The code below works on a nearly identical XML file, but raises an error when I use it on this one:

Code

import xml.etree.ElementTree as ET
import pandas as pd

users_tree = ET.parse("/content/Users.xml")
users_root = users_tree.getroot()

file_path_users = r"/content/Users.xml"
dict_list_users = []

for _, elem in ET.iterparse(file_path_users, events=("end",)):
    if elem.tag == "row":
        dict_list_users.append({'UserId': elem.attrib['Id'],
                          'Reputation': elem.attrib['Reputation'],
                          'CreationDate': elem.attrib['CreationDate'],
                          'DisplayName': elem.attrib['DisplayName'],
                          'LastAccessDate': elem.attrib['LastAccessDate'],
                          'WebsiteUrl': elem.attrib['WebsiteUrl'],
                          'Location': elem.attrib['Location'],
                          'AboutMe': elem.attrib['AboutMe'],
                          'Views': elem.attrib['Views'],
                          'UpVotes': elem.attrib['UpVotes'],
                          'DownVotes': elem.attrib['DownVotes'],
                          'AccountId': elem.attrib['AccountId']})
elem.clear()

df_users = pd.DataFrame(dict_list_users)

Error

KeyError                                  Traceback (most recent call last)
<ipython-input-18-7af87798bae8> in <module>()
     24                           'DisplayName': elem.attrib['DisplayName'],
     25                           'LastAccessDate': elem.attrib['LastAccessDate'],
---> 26                           'WebsiteUrl': elem.attrib['WebsiteUrl'],
     27                           'Location': elem.attrib['Location'],
     28                           'AboutMe': elem.attrib['AboutMe'],

KeyError: 'WebsiteUrl'

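Note: this error occurs for every attribute after LastAccessDate. That is, even if I remove the WebsiteUrl key, the next attribute raises the same error, and so on.

Please suggest a way to fix this.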

1 answer:

Answer 0 (score: 1)

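The error appears to be caused by attributes missing from one or more <row> tags. Instead of assigning each dictionary key/value explicitly, attribute by attribute, consider retrieving all of the attributes at once. The DataFrame constructor will then fill in NaN for any row that lacks an attribute: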

for _, elem in ET.iterparse(file_path_users, events=("end",)):
    if elem.tag == "row":
        # RETRIEVE ALL ATTRIBUTES (copied, so clear() below cannot empty the stored dict)
        dict_list_users.append(dict(elem.attrib))
        elem.clear()                           # SHOULD BE AT THIS NESTED LEVEL

df_users = pd.DataFrame(dict_list_users)
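
To see the NaN-filling behaviour in isolation, here is a minimal self-contained sketch. The two-row sample XML, the `<users>` root tag, and the names `SAMPLE_XML`, `rows`, and `df_demo` are made up for illustration and are not taken from the real Users.xml; the second row deliberately omits WebsiteUrl and Location.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample data: the second <row> has no WebsiteUrl or Location.
SAMPLE_XML = b"""<users>
  <row Id="-1" Reputation="1" DisplayName="Community"
       WebsiteUrl="http://meta.stackexchange.com/" Location="on the server farm" />
  <row Id="1" Reputation="21228" DisplayName="Anton Geraschenko" />
</users>"""

rows = []
for _, elem in ET.iterparse(io.BytesIO(SAMPLE_XML), events=("end",)):
    if elem.tag == "row":
        rows.append(dict(elem.attrib))  # copy whatever attributes this row has
        elem.clear()                    # release the element once it is stored

df_demo = pd.DataFrame(rows)

# Count missing values per column to see which attributes were absent.
print(df_demo.isna().sum())
```

Running the same `isna().sum()` check on the real DataFrame should show which attributes are missing and in how many rows.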

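If df_users ends up with more columns than you need, keep only the relevant ones with reindex. Note that the attribute in the file is named Id, so rename it to UserId first; otherwise the UserId column would be entirely NaN:

df_users = (df_users.rename(columns={'Id': 'UserId'})
                    .reindex(['UserId', 'Reputation', 'CreationDate', 'DisplayName',
                              'LastAccessDate', 'WebsiteUrl', 'Location', 'AboutMe',
                              'Views', 'UpVotes', 'DownVotes', 'AccountId'],
                             axis='columns'))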
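
If you would rather keep an explicit list of keys, as in the question's code, one alternative (not part of the answer above) is to look each attribute up with `dict.get`, which returns None instead of raising KeyError when the attribute is absent. The sketch below uses a made-up one-row sample; `WANTED`, `row_to_dict`, `sample`, and `df_get` are hypothetical names chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical subset of the attributes to keep.
WANTED = ["Id", "Reputation", "DisplayName", "WebsiteUrl", "Location"]


def row_to_dict(elem, keys=WANTED):
    """Return the chosen attributes of elem; absent ones map to None."""
    return {key: elem.attrib.get(key) for key in keys}


# Hypothetical sample: this row has no WebsiteUrl or Location.
sample = b'<users><row Id="1" Reputation="21228" DisplayName="Anton Geraschenko" /></users>'

records = []
for _, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag == "row":
        records.append(row_to_dict(elem))
        elem.clear()

df_get = pd.DataFrame(records, columns=WANTED)
print(df_get)
```

This fixes the column set up front, so no reindex step is needed afterwards, at the cost of silently ignoring any attribute not listed in `WANTED`.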