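I have a directory full of XML files. Right now I have some code that reads the data from these XML files into a pandas DataFrame. I convert each XML file to a dictionary and then pass it through json_normalize. Is there a more efficient way to do this? Do you have any suggestions?
Here is my code: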
#libraries
from pandas.io.json import json_normalize
import pandas as pd
import xmltodict
import os
import glob
import errno
import tkinter
from tkinter import Frame, Button

#directory specifications and variables
path = r'C:\Users\Nutzer\Desktop\XML_Files\*.xml'
files = glob.glob(path)
frame_list = []

def convert_xml(files):
    for name in files:
        #the try clause ensures that non-xml files are passed (skipped)
        try:
            with open(name) as f:
                #reading,parsing and normalization of XML Data
                frame_list.append(json_normalize(xmltodict.parse(f.read()), sep = '_'))
                pass
        #exception is raised in case file is not found or dic is full
        except IOError as exc:
            if exc.errno != errno.EISDIR:
                raise
    return frame_list

#concat list of frame to one large frame; sort = True ensure the insertion of NaN for missing values
df = pd.concat(convert_xml(files), ignore_index=True, sort=True)
df
Here is a sample of my XML files:
<?xml version="1.0" encoding="UTF-8"?>
<Data>
    <Contract_Information>
        <Company>Enterprisa</Company>
        <Time_Stamp>2019-07-18T10:24:51</Time_Stamp>
        <Datei-ID>3785690</Datei-ID>
    </Contract_Information>
    <Calculations Document_ID="2668815">
        <Calculationsoftware>Sonstige</Calculationsoftware>
        <Contractdate>2019-05-31</Contractdate>
        <Documentnumber>23864836</Documentnumber>
        <case>
            <casenumber>XX123456778</casenumber>
        </case>
    </Calculations>
    <Closing_case>false</Closing_case>
    <Additionaldata>
        <customer_ID>354634287</customer_ID>
        <services>3</services>
    </Additionaldata>
    <Messages>
        <Message Code="1" Stufe="Notification">Message</Message>
    </Messages>
</Data>
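Note that I have several of these files. They share a similar structure but do not always contain the same number of fields and attributes, which is why I use sort=True and ignore_index=True in the pd.concat line.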
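To make the effect of those two arguments concrete, here is a small sketch with two made-up single-row frames standing in for two normalized files (the column names and values are illustrative assumptions, not taken from real data). As far as I can tell, the NaN filling comes from the default join='outer' of pd.concat, while sort=True only orders the resulting columns alphabetically and ignore_index=True renumbers the rows.

```python
import pandas as pd

# Two hypothetical frames with partly different columns,
# standing in for two normalized XML files.
frame_a = pd.DataFrame([{"Data_Closing_case": "false",
                         "Data_Additionaldata_services": "3"}])
frame_b = pd.DataFrame([{"Data_Closing_case": "true"}])

# The default join='outer' keeps the union of columns and fills gaps with NaN;
# sort=True orders the columns, ignore_index=True yields a fresh 0..n-1 index.
combined = pd.concat([frame_a, frame_b], ignore_index=True, sort=True)
print(combined)
```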
Answer 0 (score: 0)
This does not extract the data completely yet, but you can use it as a starting point and build on it.
Code:
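import xml.etree.ElementTree as et
import os

# a raw string cannot end with a backslash, so the trailing separator is dropped
path_to_xmls = r'C:\Users\Nutzer\Desktop\XML_Files'
xml_files = [pos_xml for pos_xml in os.listdir(path_to_xmls) if pos_xml.endswith('.xml')]
for xml_file in xml_files:
    # os.listdir returns bare file names, so join them with the directory
    xtree = et.parse(os.path.join(path_to_xmls, xml_file))
    xroot = xtree.getroot()
    for node in xroot:
        for n in node:
            # n.text is None for empty elements; strip layout whitespace
            print(n.tag + ':' + (n.text or '').strip())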
Output:
Company:Enterprisa
Time_Stamp:2019-07-18T10:24:51
Datei-ID:3785690
Calculationsoftware:Sonstige
Contractdate:2019-05-31
Documentnumber:23864836
case:
customer_ID:354634287
services:3
Message:Message
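One possible way to take this further is to flatten each file into a single dictionary and build the DataFrame once from the list of dictionaries, instead of concatenating one frame per file. The sketch below is only an illustration under some assumptions: the helper names flatten and load_folder are made up here, the column naming (tags joined with '_', attributes prefixed with '@') is my own choice and differs slightly from what xmltodict plus json_normalize produces, and repeated sibling tags (for example several Message elements) would overwrite each other and need extra handling.

```python
import glob
import xml.etree.ElementTree as ET

import pandas as pd


def flatten(element, prefix=""):
    """Flatten one element into {column_name: value}, joining tag names with '_'."""
    row = {}
    name = prefix + element.tag
    # Attributes become columns such as 'Data_Calculations_@Document_ID'.
    for key, value in element.attrib.items():
        row[f"{name}_@{key}"] = value
    children = list(element)
    if children:
        # Limitation: repeated sibling tags map to the same key; the last one wins.
        for child in children:
            row.update(flatten(child, name + "_"))
    else:
        # Leaf element: keep its text, treating missing text as an empty string.
        row[name] = (element.text or "").strip()
    return row


def load_folder(pattern):
    """Hypothetical helper: parse every file matching the glob pattern into one frame."""
    rows = [flatten(ET.parse(path).getroot()) for path in glob.glob(pattern)]
    # Columns missing from a file are filled with NaN by the DataFrame constructor.
    return pd.DataFrame(rows)


# Shortened version of the sample file from the question, used as a quick check.
sample = """<Data>
  <Contract_Information>
    <Company>Enterprisa</Company>
  </Contract_Information>
  <Calculations Document_ID="2668815">
    <case>
      <casenumber>XX123456778</casenumber>
    </case>
  </Calculations>
  <Closing_case>false</Closing_case>
  <Messages>
    <Message Code="1" Stufe="Notification">Message</Message>
  </Messages>
</Data>"""

df = pd.DataFrame([flatten(ET.fromstring(sample))])
print(df.T)
```

With real files this would be called as load_folder(r'C:\Users\Nutzer\Desktop\XML_Files\*.xml'). Whether it is actually faster than the xmltodict approach depends on the data, so it is worth timing both on a representative subset.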