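Creating dynamic variables for XML parsing

Asked: 2016-10-11 21:25:06

Tags: python xml python-3.x parsing lxml

I'm very new to this, and I have tried searching, but nothing I found has worked for me.

I have XML data that looks like this: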

<datainfo>
   <data>
       <info State="1" Reason="x" Start="01/01/2016 00:00:00.000" End="01/01/2016 02:00:00.000"></info>
       <info State="1" Reason="y" Start="01/01/2016 02:00:00.000" End="01/01/2016 02:01:00.000">
            <moreinfo Start="01/01/2016 02:00:00.000" End="01/01/2016 02:00:30.000"/>
            <moreinfo Start="01/01/2016 02:00:30.000" End="01/01/2016 02:01:00.000"/>
       </info>
       <info State="2" Start="01/01/2016 02:01:00.000" End="01/01/2016 02:10:00.000"></info>
       ...
   </data>
</datainfo>

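I want to find how much time was spent in each state {1,2,...} for each reason {x,y,...} on a given date, and write the result out as a .csv file to read later in Excel.

The problem I'm running into is that I can't use static variables, because there are hundreds of different states with hundreds of different reasons, and they keep changing.

If I'm being unclear, please let me know. I'm new to this and really appreciate everyone's help.

Edit: here is what I have so far. Hopefully it clarifies what I'm trying to do.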

from datetime import datetime
from lxml import etree as ET

def parseXML(file):
    # ET.parse accepts a path directly, so no file handle is left open
    tree = ET.parse(file)
    info_list = tree.xpath('//info')
    dictionary = {}
    FMT = '%m/%d/%Y %H:%M:%S.%f'

    # iterate over the matched nodes instead of indexing into root
    for info in info_list:
        info_attribs = info.attrib
        end = info_attribs[u'End']
        start = info_attribs[u'Start']
        tdelta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        t_dif = (tdelta.total_seconds()) / 60
        try:
            # note: this overwrites any earlier value for the same key
            dictionary[info_attribs[u'State'] + info_attribs[u'Reason']] = t_dif
        except KeyError:
            # skip nodes that lack a State or Reason attribute
            continue

    return dictionary

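I'm trying to loop over every row, find the state and the reason, and add them to a dictionary. If an entry for that state and reason already exists, I want to add the new time to its current value.

Please let me know if I should provide more information!

Edit #2:

The output I'm looking for is a .csv file structured as follows: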

State - Reason, [Total time spent in State 1 for x reason]

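2 Answers:

Answer 0 (score: 3)

You can use a defaultdict with lists as values to handle repeated keys. You can also use an xpath expression to filter the info nodes so that you only get the ones that have both of the attributes you want, which removes the need for the try/except: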

x = """<datainfo>
   <data>
       <info State="1" Reason="x" Start="01/01/2016 00:00:00.000" End="01/01/2016 02:00:00.000"></info>
       <info State="1" Reason="y" Start="01/01/2016 02:00:00.000" End="01/01/2016 02:01:00.000">
            <moreinfo Start="01/01/2016 02:00:00.000" End="01/01/2016 02:00:30.000"/>
            <moreinfo Start="01/01/2016 02:00:30.000" End="01/01/2016 02:01:00.000"/>
       </info>
       <info State="2" Start="01/01/2016 02:01:00.000" End="01/01/2016 02:10:00.000"></info>
   </data>
</datainfo>"""

from collections import defaultdict
import lxml.etree as et
from datetime import datetime

FMT = '%m/%d/%Y %H:%M:%S.%f'
tree = et.fromstring(x)
d = defaultdict(list)

for node in tree.xpath("//data/info[@Reason and @State]"):
    state = node.attrib["State"]
    reason = node.attrib["Reason"]
    end = node.attrib["End"]
    start = node.attrib[u'Start']
    tdelta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    d[state, reason].append(tdelta.total_seconds() / 60)

print(d)

How you write the csv depends on how you want the data for repeated keys to look. If you want one row per value:

import csv
with open("out.csv", "w", newline="") as f:
    wr = csv.writer(f)
    for (state, reason), v in d.items():
        for val in v:
            # val is a float, so build the row explicitly
            wr.writerow([state, reason, val])

If you actually want the totals:

d = defaultdict(float)

for node in tree.xpath("//data/info[@Reason and @State]"):
    state = node.attrib["State"]
    reason = node.attrib["Reason"]
    end = node.attrib["End"]
    start = node.attrib[u'Start']
    tdelta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    d[state, reason] += (tdelta.total_seconds()) / 60

Then:

import csv
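with open("out.csv", "w", newline="") as f:
    wr = csv.writer(f)
    # unpack the (state, reason) key so each row is: state, reason, total minutes
    wr.writerows([state, reason, total] for (state, reason), total in d.items())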
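
The question also asks for totals on a specific date, in a "State - Reason, total" layout. One way that might look is sketched below. Several things here are assumptions rather than part of the answer: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, the helper names minutes_by_state_reason and write_totals are hypothetical, an interval is attributed to the date on which it starts (intervals crossing midnight are not split), and the third and fourth info rows in SAMPLE are made up to show accumulation and date filtering.

```python
import csv
import io
from collections import defaultdict
from datetime import date, datetime
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

FMT = "%m/%d/%Y %H:%M:%S.%f"

# Sample data: rows 3 and 4 are hypothetical additions for illustration.
SAMPLE = """<datainfo>
   <data>
       <info State="1" Reason="x" Start="01/01/2016 00:00:00.000" End="01/01/2016 02:00:00.000"/>
       <info State="1" Reason="y" Start="01/01/2016 02:00:00.000" End="01/01/2016 02:01:00.000"/>
       <info State="1" Reason="x" Start="01/01/2016 03:00:00.000" End="01/01/2016 03:30:00.000"/>
       <info State="1" Reason="x" Start="01/02/2016 00:00:00.000" End="01/02/2016 01:00:00.000"/>
       <info State="2" Start="01/01/2016 02:01:00.000" End="01/01/2016 02:10:00.000"/>
   </data>
</datainfo>"""


def minutes_by_state_reason(xml_text, day):
    """Sum minutes per (State, Reason) for intervals that start on `day`."""
    totals = defaultdict(float)
    root = ET.fromstring(xml_text)
    for node in root.iter("info"):
        state, reason = node.get("State"), node.get("Reason")
        if state is None or reason is None:
            continue  # skip nodes missing either attribute
        start = datetime.strptime(node.get("Start"), FMT)
        end = datetime.strptime(node.get("End"), FMT)
        if start.date() != day:
            continue  # assumption: attribute the interval to its start date
        totals[state, reason] += (end - start).total_seconds() / 60
    return totals


def write_totals(totals, fileobj):
    """Write one 'State - Reason, minutes' row per key."""
    writer = csv.writer(fileobj)
    for (state, reason), minutes in sorted(totals.items()):
        writer.writerow([f"{state} - {reason}", minutes])


totals = minutes_by_state_reason(SAMPLE, date(2016, 1, 1))
buffer = io.StringIO()  # swap for open("out.csv", "w", newline="") to write a file
write_totals(totals, buffer)
print(buffer.getvalue())
```

Because the keys are built from whatever State and Reason values appear in the data, nothing has to be declared ahead of time when new states or reasons show up.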

Answer 1 (score: 0)

This assumes you have already parsed the XML into an array of arrays:

import csv

# This is assuming you have your xml parsed into an array of arrays  [['state', 'reason'], ['state', 'reason']]
# example of array format
data = [['1', 'x'], ['1', 'y'], ['2', 'z']]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(data)
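
The parsing step that this answer takes for granted could look roughly like the sketch below. It is only an illustration: it uses the standard library's xml.etree.ElementTree rather than lxml, and it assumes that nodes missing either attribute should be left out.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

SAMPLE = """<datainfo>
   <data>
       <info State="1" Reason="x" Start="01/01/2016 00:00:00.000" End="01/01/2016 02:00:00.000"></info>
       <info State="1" Reason="y" Start="01/01/2016 02:00:00.000" End="01/01/2016 02:01:00.000">
            <moreinfo Start="01/01/2016 02:00:00.000" End="01/01/2016 02:00:30.000"/>
            <moreinfo Start="01/01/2016 02:00:30.000" End="01/01/2016 02:01:00.000"/>
       </info>
       <info State="2" Start="01/01/2016 02:01:00.000" End="01/01/2016 02:10:00.000"></info>
   </data>
</datainfo>"""

root = ET.fromstring(SAMPLE)

# Build [state, reason] pairs, skipping nodes that lack either attribute.
data = [
    [node.get("State"), node.get("Reason")]
    for node in root.iter("info")
    if node.get("State") is not None and node.get("Reason") is not None
]
print(data)
```

Note that this layout carries no durations, so the time totals asked for in the question would still need to be computed separately.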