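DATEXII XML file to a DataFrame in Python

Time: 2017-11-16 13:39:43

Tags: python xml dataframe lxml elementtree

For the past few days I have been trying to open and read a certain XML file (DATEXII format), so far without success. It concerns traffic data from the NDW Open Data website (the Dutch database of road and traffic data); hyperlink to the source of the XML file. The head of the tree looks like in this picture and continues like this; see also the snippet below. Together these form only a very small part of the data.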



<?xml version="1.0"?>
<soapenv:Envelope xmlns:_0="http://datex2.eu/schema/2/2_0" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <exchange xmlns="http://datex2.eu/schema/2/2_0">
        <supplierIdentification>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </supplierIdentification>
      </exchange>
      <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <publicationCreator>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </publicationCreator>
        <measurementSiteTableReference targetClass="MeasurementSiteTable" version="955" id="NDW01_MT" />
        <headerInformation>
          <confidentiality>noRestriction</confidentiality>
          <informationStatus>real</informationStatus>
        </headerInformation>
        <siteMeasurements>
          <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00" />
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="2">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="3">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="4">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="5">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="1">
                  <speed>38</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="6">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="0">
                  <speed>-1</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="7">

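Ideally, I would like to load the information in Python as a DataFrame in a Jupyter Notebook, so that I can perform some predictive analysis if the data allows it. I have already tried ElementTree and lxml, like this, inspired by numerous other threads: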

# Standard Packages
import pandas as pd
import numpy as np

# Necessary Packages for XML and setting Working Directory
import os
import xml.etree.ElementTree as ET
import lxml

os.chdir("C:/.../Intensiteiten en snelheden/30-10-2017")

xml_file = open('0600_Trafficspeed.xml').read() # Unzipped the file manually

def xml2df(xml_data):
    root = ET.XML(xml_data) # element tree
    all_records = [] # This is our record list which we will convert into a dataframe
    for i, child in enumerate(root): # Begin looping through our root tree
        record = {} # Place holder for our record
        for subchild in child: # iterate through the subchildren
            record[subchild.tag] = subchild.text # Extract the text, create a new dictionary key, value pair
        all_records.append(record) # Append this record to all_records.
    return pd.DataFrame(all_records) # return records as DataFrame

print(xml2df(xml_file))

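However, this only returns a single entry with the first line, i.e. column name: d2LogicalModel, row: 0, entry: None.

I was able to view the tree structure in Microsoft Edge, but only with difficulty and at the cost of a lot of CPU (Notepad++ with the XMLtools plugin also worked, but crashed on "larger" files, i.e. > 20 MB). Even so, the structure is still hard for me to understand. There are too many levels, and I don't know how to define xml2df() with the right sub-subchildren and so on.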
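
One way to get an overview of the nesting without opening the file in a viewer is to stream through it and count each distinct element path. The sketch below is only an illustration and not part of the original attempt: the inline `SAMPLE` is a cut-down stand-in for the real file, and `local_name` and `count_paths` are hypothetical helper names. It also hints at why the two-level loop above stops at d2LogicalModel: the measurements sit several levels deeper.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Cut-down stand-in for the real file (assumption: same nesting as the excerpt).
SAMPLE = b"""<?xml version="1.0"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2">
      <payloadPublication xmlns="http://datex2.eu/schema/2/2_0" lang="nl">
        <siteMeasurements>
          <measuredValue index="1"><measuredValue><basicData>
            <vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
          </basicData></measuredValue></measuredValue>
          <measuredValue index="2"><measuredValue><basicData>
            <vehicleFlow><vehicleFlowRate>0</vehicleFlowRate></vehicleFlow>
          </basicData></measuredValue></measuredValue>
        </siteMeasurements>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def count_paths(source):
    """Stream through the XML and count how often each element path occurs."""
    counts = Counter()
    stack = []  # local names from the root down to the current element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(local_name(elem.tag))
            counts["/".join(stack)] += 1
        else:
            stack.pop()
            elem.clear()  # release finished subtrees; helps with large files
    return counts


paths = count_paths(io.BytesIO(SAMPLE))
for path, n in paths.items():
    print(n, path)
```

For a real file, a path or an open binary file object can be passed to `count_paths` instead of the `io.BytesIO` wrapper. Paths that occur many times (here the `measuredValue` branches) are the natural candidates for DataFrame rows.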

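My question boils down to this: first, how can I identify the variables/columns that hold the data? An overview of the relevant data I want to import is given here. Second, how do I import it into a DataFrame?

Note: since the DATEXII format is the standard for European traffic data, I hoped their guides would help (see documents), but they don't make sense to me yet. Maybe they will to one of you :)

Any help is much appreciated!

1 answer:

Answer 0 (score: 1)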

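Consider using XSLT to flatten the nested XML input into a simpler structure. XSLT is a special-purpose transformation language designed to convert XML files into other XML, HTML, or even text (CSV/TAB). Specifically, consider the following XSLT, which transforms the original XML into comma-separated values in tabular form, ready to import into pandas with read_csv():

XSLT (save as an .xsl file, a special XML file)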

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                              xmlns:pub="http://datex2.eu/schema/2/2_0"
                              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/soapenv:Envelope">
    <xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
    <xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
    <xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
    <xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="soapenv:Body"/>
  </xsl:template>

  <xsl:template match="soapenv:Body">
    <xsl:apply-templates select="d2LogicalModel"/>
  </xsl:template>

  <xsl:template match="d2LogicalModel">
    <xsl:apply-templates select="pub:payloadPublication"/>
  </xsl:template>

  <xsl:template match="pub:payloadPublication">
    <xsl:apply-templates select="pub:siteMeasurements"/>
  </xsl:template>

  <xsl:template match="pub:siteMeasurements">
    <xsl:apply-templates select="pub:measuredValue"/>
  </xsl:template>

  <xsl:template match="pub:measuredValue">
    <xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
                                 @index,',',
                                 pub:measuredValue/pub:basicData/@xsi:type,',',
                                 descendant::pub:vehicleFlowRate,',',
                                 descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
                                 descendant::pub:speed)"/><xsl:text>&#xa;</xsl:text>    
  </xsl:template>

</xsl:stylesheet>

Python

from io import StringIO
import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL FILES
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT.xsl')

# INITIALIZE AND RUN TRANSFORMATION
transform = et.XSLT(xsl)
# CONVERT RESULT TO STRING 
result = str(transform(doc))

# IMPORT INTO DATAFRAME
df = pd.read_csv(StringIO(result))

Output (parent node values become repeating indicators alongside the differing numeric data)

print(df)

#             publicationTime country nationalIdentifier msmtSiteTableRef_targetClass  msmtSiteTableRef_version msmtSiteTableRef_id msmtSiteRef_targetClass  msmtSiteRef_version     msmtSiteRef_id measurementTimeDefault  measuredValue_index basicData_type  vehicleFlowRate  averageVehicleSpeed_numberOfInputValues  averageVehicleSpeed_value
# 0  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    1    TrafficFlow             60.0                                      NaN                        NaN
# 1  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    2    TrafficFlow              0.0                                      NaN                        NaN
# 2  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    3    TrafficFlow              0.0                                      NaN                        NaN
# 3  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    4    TrafficFlow             60.0                                      NaN                        NaN
# 4  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    5   TrafficSpeed              NaN                                      1.0                       38.0
# 5  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    6   TrafficSpeed              NaN                                      0.0                       -1.0
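
For comparison, the same flattening can be sketched without XSLT, using only xml.etree.ElementTree and a namespace map. This is a different technique from the stylesheet above, shown under assumptions: the inline `SAMPLE` is a shortened copy of the excerpt from the question, the function name `flatten` and the shortened column names are made up for illustration, and only a subset of the columns is extracted. It collects one dict per outer measuredValue, so `pd.DataFrame(rows)` should give a frame with the same row granularity.

```python
import xml.etree.ElementTree as ET

# Prefixes are arbitrary; only the namespace URIs must match the document.
NS = {"pub": "http://datex2.eu/schema/2/2_0"}
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Shortened copy of the excerpt from the question (illustration only).
SAMPLE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <d2LogicalModel xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <payloadPublication xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <siteMeasurements>
          <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00"/>
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1"><measuredValue><basicData xsi:type="TrafficFlow">
            <vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
          </basicData></measuredValue></measuredValue>
          <measuredValue index="5"><measuredValue><basicData xsi:type="TrafficSpeed">
            <averageVehicleSpeed numberOfInputValuesUsed="1"><speed>38</speed></averageVehicleSpeed>
          </basicData></measuredValue></measuredValue>
        </siteMeasurements>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>"""


def flatten(xml_text):
    """Return one dict per outer <measuredValue>, with parent values repeated."""
    root = ET.fromstring(xml_text)
    rows = []
    for pub in root.iter("{%s}payloadPublication" % NS["pub"]):
        pub_time = pub.findtext("pub:publicationTime", namespaces=NS)
        for site in pub.findall("pub:siteMeasurements", NS):
            site_ref = site.find("pub:measurementSiteReference", NS)
            default_time = site.findtext("pub:measurementTimeDefault", namespaces=NS)
            for mv in site.findall("pub:measuredValue", NS):
                basic = mv.find("pub:measuredValue/pub:basicData", NS)
                avg = basic.find("pub:averageVehicleSpeed", NS)
                rows.append({
                    "publicationTime": pub_time,
                    "msmtSiteRef_id": site_ref.get("id"),
                    "measurementTimeDefault": default_time,
                    "measuredValue_index": mv.get("index"),
                    "basicData_type": basic.get(XSI_TYPE),
                    # Missing elements yield None, which pandas shows as NaN/None.
                    "vehicleFlowRate": basic.findtext(
                        "pub:vehicleFlow/pub:vehicleFlowRate", namespaces=NS),
                    "numberOfInputValuesUsed": (
                        avg.get("numberOfInputValuesUsed") if avg is not None else None),
                    "speed": basic.findtext(
                        "pub:averageVehicleSpeed/pub:speed", namespaces=NS),
                })
    return rows


rows = flatten(SAMPLE)
for row in rows:
    print(row)
```

All values come back as strings, so numeric columns would still need converting after building the DataFrame, for example with `pd.to_numeric`. The XSLT route has the advantage that `read_csv()` infers those types on import.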