Is there a more efficient way to convert a directory of XML files into a single Pandas DataFrame?

Time: 2019-11-13 14:33:56

Tags: python xml pandas file dataframe

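I have a directory full of XML files. At the moment I have some code that reads the data from these XML files into a Pandas DataFrame. I convert each XML file to a dictionary and then flatten it with json_normalize. Is there a more efficient way to do this? Do you have any suggestions?

Here is my code: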

# libraries
import errno
import glob

import pandas as pd
import xmltodict

# directory specification
path = r'C:\Users\Nutzer\Desktop\XML_Files\*.xml'
files = glob.glob(path)


def convert_xml(files):
    """Parse each XML file into a one-row DataFrame and return the list of frames."""
    frame_list = []
    for name in files:
        try:
            with open(name, encoding='utf-8') as f:
                # read, parse and flatten the XML data
                # (pd.json_normalize needs pandas >= 1.0; older versions import it from pandas.io.json)
                frame_list.append(pd.json_normalize(xmltodict.parse(f.read()), sep='_'))
        # skip directories that happen to match the pattern; re-raise any other I/O error
        except IOError as exc:
            if exc.errno != errno.EISDIR:
                raise
    return frame_list


# concatenate the list of frames into one large frame; fields missing from a file become NaN,
# and sort=True orders the columns alphabetically
df = pd.concat(convert_xml(files), ignore_index=True, sort=True)

df

Here is an example of one of my XML files:

<?xml version="1.0" encoding="UTF-8"?>
<Data>
    <Contract_Information>
        <Company>Enterprisa</Company>
        <Time_Stamp>2019-07-18T10:24:51</Time_Stamp>
        <Datei-ID>3785690</Datei-ID>
    </Contract_Information>
    <Calculations Document_ID="2668815">
        <Calculationsoftware>Sonstige</Calculationsoftware>
        <Contractdate>2019-05-31</Contractdate>
        <Documentnumber>23864836</Documentnumber>
        <case>
            <casenumber>XX123456778</casenumber>
        </case>
    </Calculations>
    <Closing_case>false</Closing_case>
    <Additionaldata>
        <customer_ID>354634287</customer_ID>
        <services>3</services>
    </Additionaldata>
    <Messages>
        <Message Code="1" Stufe="Notification">Message</Message>
    </Messages>
</Data>

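Note that I have many files like this. Their structure is similar, but they do not always contain the same number of fields and attributes. That is why I use sort=True and ignore_index=True in the pd.concat line.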
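
A small sketch of what that pd.concat call does when the frames do not share all columns. The two one-row frames are hypothetical (column names follow the flattened sample, the values are made up). As far as I can tell, the NaN filling comes from pd.concat taking the union of the columns, while sort=True only affects the column order.

```python
import pandas as pd

# Two hypothetical one-row frames with different columns (made-up values).
a = pd.DataFrame([{"Data_Closing_case": "false", "Data_Additionaldata_services": "3"}])
b = pd.DataFrame([{"Data_Closing_case": "true"}])

combined = pd.concat([a, b], ignore_index=True, sort=True)

# Columns are the union of both frames, ordered alphabetically because sort=True.
print(list(combined.columns))
# The second row has no value for the column that only `a` contains, so it is NaN there.
print(combined["Data_Additionaldata_services"].isna().tolist())
# ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's own index.
print(list(combined.index))
```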

1 answer:

Answer 0: (score: 0)

This does not extract the data completely yet, but you can use it as a starting point and build on it.

Code:

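import os
import xml.etree.ElementTree as et

# a raw string cannot end in a backslash, so the trailing separator is left off
path_to_xmls = r'C:\Users\Nutzer\Desktop\XML_Files'
xml_files = [pos_xml for pos_xml in os.listdir(path_to_xmls) if pos_xml.endswith('.xml')]

for xml_file in xml_files:
    # os.listdir returns bare file names, so join them with the directory
    xtree = et.parse(os.path.join(path_to_xmls, xml_file))
    xroot = xtree.getroot()

    # walk two levels below the root and print each tag with its text
    for node in xroot:
        for n in node:
            # n.text is None for an element without text
            print(n.tag + ':' + (n.text or ''))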

Output:

Company:Enterprisa
Time_Stamp:2019-07-18T10:24:51
Datei-ID:3785690
Calculationsoftware:Sonstige
Contractdate:2019-05-31
Documentnumber:23864836
case:

customer_ID:354634287
services:3
Message:Message
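
One way to build on this could be to flatten every file into a plain dict and collect those dicts as rows. The sketch below is only an illustration under assumptions: the helper name flatten, the '_' key separator and the '@' marker for attributes are my own choices, and the sample string is a shortened stand-in for the file in the question.

```python
import xml.etree.ElementTree as et


def flatten(element, prefix=""):
    """Return a flat dict of attributes and leaf text below `element`.

    Keys are tag names joined with '_'. Repeated sibling tags overwrite
    each other in this simple sketch.
    """
    row = {}
    name = prefix + element.tag
    # attributes become their own entries, marked with '@'
    for attr, value in element.attrib.items():
        row[name + "_@" + attr] = value
    children = list(element)
    if children:
        # descend into child elements, extending the key prefix
        for child in children:
            row.update(flatten(child, name + "_"))
    else:
        # leaf element: store its text (empty string if there is none)
        row[name] = (element.text or "").strip()
    return row


# A shortened, hypothetical document in the shape of the sample file.
sample = """<Data>
    <Contract_Information>
        <Company>Enterprisa</Company>
    </Contract_Information>
    <Calculations Document_ID="2668815">
        <case><casenumber>XX123456778</casenumber></case>
    </Calculations>
    <Closing_case>false</Closing_case>
</Data>"""

row = flatten(et.fromstring(sample))
for key, value in row.items():
    print(key, "=", value)
```

With one such dict per file collected in a list, say rows, pd.DataFrame(rows) would build the whole frame in one step and leave NaN where a file lacks a field, instead of concatenating many one-row frames. Whether that is actually faster on the real files would need to be measured.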