I have XML data on my disk with the directory structure below (it continues like this up to folders 12-13). Each subfolder holds 1-100 XML files (assume a fixed 5 files for now), and all files within a subfolder share the same schema. The files in a given folder were split from a single large XML file, so their parsing requirements are identical. I want to create one pandas DataFrame per folder. I have written a parser function and a .txt file that lists the paths of all the files. I used a for loop to build my first DataFrame, but I think it would be better to read the paths from the txt file (or a list), identify the subfolders, build a DataFrame with an appropriate name for each one, and then write the result back into the corresponding folder.
|
+---1
|   |   MetaFileInfo.txt
|   |
|   \---1
|           00001.xml
|           00002.xml
|           00003.xml
|           00004.xml
|
+---10
|   |   MetaFileInfo.txt
|   |
|   \---1
|           00001.xml
|
import xmltodict
import numpy as np
import pandas as pd
from collections import Counter
import os
import glob
import re  # should I use regex to read the filenames? (see the sketch below)
# Get a list of xml filenames with full paths
rootDir = "."
exten = '.xml'
logname = 'myfiles.log'
results = ''
for dirpath, dirnames, files in os.walk(rootDir):
    for name in files:
        if name.lower().endswith(exten):
            results += '%s\n' % os.path.join(dirpath, name)
with open(logname, 'w') as logfile:
    logfile.write(results)
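On the regex question above: I don't think it is needed. The subfolder can be read off each logged path with os.path alone. A minimal sketch (the "1_1"-style label is just my assumption for naming the per-folder DataFrame or output file):

# Sketch: derive a folder label from a logged path such as ".\1\1\00001.xml".
# os.path.normpath drops the leading ".", so the parts are ['1', '1', '00001.xml'].
def folder_label(path):
    parts = os.path.normpath(path).split(os.sep)
    return "_".join(parts[:-1])  # e.g. "1_1"; the naming scheme is an assumption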
# assume only 5 files in each folder
doc = []
for i in range(1, 6):  # note: range(1, 5) would stop at 00004.xml and miss the fifth file
    with open('0000{}.xml'.format(i)) as fd:
        doc.append(xmltodict.parse(fd.read()))
# This converts the parsed dictionaries above to the
# required data frame and works well.
def Dict_toDF(xml_dict):
    logData_list = []
    for xmlval in xml_dict:
        channel_list = xmlval['logs']['log']['logData']['mnemonicList'].split(",")
        temp = [i.split(",") for i in xmlval['logs']['log']['logData']['data']]
        temp.insert(0, xmlval['logs']['log']['logData']['unitList'].split(","))
        logData_list.extend(temp)
    return pd.DataFrame(np.array(logData_list).reshape(len(logData_list), len(channel_list)),
                        columns=channel_list).drop_duplicates()

df = Dict_toDF(doc)
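To make it concrete, this is roughly what I am after, as a sketch only: group the logged paths by their parent folder, run Dict_toDF per group, and write a CSV named after the folder back into it. Writing CSV (rather than some other format) is my assumption.

# Sketch: one DataFrame per subfolder, built from the paths in myfiles.log.
from collections import defaultdict

with open(logname) as logfile:
    paths = [line.strip() for line in logfile if line.strip()]

groups = defaultdict(list)
for path in paths:
    groups[os.path.dirname(path)].append(path)  # key = containing folder

for folder, xml_paths in groups.items():
    docs = []
    for xml_path in xml_paths:
        with open(xml_path) as fd:
            docs.append(xmltodict.parse(fd.read()))
    df = Dict_toDF(docs)
    # name the output after the folder, e.g. ".\1\1" -> "1_1.csv"
    label = "_".join(os.path.normpath(folder).split(os.sep))
    df.to_csv(os.path.join(folder, label + '.csv'), index=False)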
I came across this similar question, except that it doesn't handle multiple files in a subfolder being treated as a single file. It uses os.chdir() to enter each directory and apply some function, which seems like a possible approach.
directories = [os.path.abspath(x[0]) for x in os.walk(directory_to_check)]
directories.remove(os.path.abspath(directory_to_check))  # if you don't want your main directory included
for i in directories:
    os.chdir(i)  # change working directory
    my_function(i)
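For completeness, if I went that route, my_function could look like the sketch below (my_function is the undefined name from the snippet above; everything else is my assumption). Since it builds full paths itself, nothing would depend on the current working directory and the os.chdir(i) call could probably be dropped.

# Sketch: parse all xml files in one directory and return the DataFrame.
def my_function(directory):
    xml_paths = sorted(glob.glob(os.path.join(directory, '*.xml')))
    docs = []
    for xml_path in xml_paths:
        with open(xml_path) as fd:
            docs.append(xmltodict.parse(fd.read()))
    return Dict_toDF(docs)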