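For the past few days I have been trying to open and read a particular XML file (DATEX II format), so far without success. It contains traffic flow data from the NDW Open Data website (the Dutch database of road and traffic data), which is where the XML file comes from. The head of the tree looks like the first linked picture and continues as in the second; see also the snippet below. Together these show only a very small part of the data.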
<?xml version="1.0"?>
<soapenv:Envelope xmlns:_0="http://datex2.eu/schema/2/2_0" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <exchange xmlns="http://datex2.eu/schema/2/2_0">
        <supplierIdentification>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </supplierIdentification>
      </exchange>
      <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <publicationCreator>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </publicationCreator>
        <measurementSiteTableReference targetClass="MeasurementSiteTable" version="955" id="NDW01_MT" />
        <headerInformation>
          <confidentiality>noRestriction</confidentiality>
          <informationStatus>real</informationStatus>
        </headerInformation>
        <siteMeasurements>
          <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00" />
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="2">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="3">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="4">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="5">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="1">
                  <speed>38</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="6">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="0">
                  <speed>-1</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="7">
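Ideally I would like to load this information into Python as a DataFrame in a Jupyter Notebook, so that I can run some predictive analysis if the data allows. I have already tried ElementTree and lxml like this, inspired by numerous other threads: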
# Standard packages
import pandas as pd
import numpy as np
# Packages needed for XML and for setting the working directory
import os
import xml.etree.ElementTree as ET
import lxml
os.chdir("C:/.../Intensiteiten en snelheden/30-10-2017")
xml_file = open('0600_Trafficspeed.xml').read()  # Unzipped the file manually
def xml2df(xml_data):
    root = ET.XML(xml_data)  # element tree
    all_records = []  # record list that will be converted into a DataFrame
    for i, child in enumerate(root):  # loop through the children of the root
        record = {}  # placeholder for the current record
        for subchild in child:  # iterate through the subchildren
            record[subchild.tag] = subchild.text  # new key/value pair from tag and text
            all_records.append(record)  # append this record to all_records
    return pd.DataFrame(all_records)  # return the records as a DataFrame
print(xml2df(xml_file))
However, this returns only a single entry: column name d2LogicalModel, row 0, value None.
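I was able to view the tree structure in Microsoft Edge, though with difficulty and heavy CPU use (Notepad++ with the XML Tools plugin also works, but crashes on "larger" files, i.e. > 20 MB). Even so, I still find the structure hard to understand. There are so many layers that I do not know how to define xml2df() with the right children, sub-children, and so on.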
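My question comes down to this. First, how can I identify the variables/columns that hold data? An overview of the relevant data I want to import is linked here. Second, how do I import it into a DataFrame?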
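Note: since DATEX II is the European standard format for traffic data, I hoped its guides would help (see the documents), but they do not make sense to me yet. Maybe they will to one of you :)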
Any help is much appreciated!
Answer 0 (score: 1)
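Consider using XSLT to flatten the nested XML input into a simpler structure. XSLT is a special-purpose transformation language designed to turn XML files into other XML, HTML, or even text (CSV/TAB). The XSLT below transforms the original XML into tabular comma-separated values that can be imported into pandas with read_csv():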
XSLT (save as a .xsl file, which is itself a special XML file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                xmlns:pub="http://datex2.eu/schema/2/2_0"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>
  <!-- Header row, then descend into the SOAP body -->
  <xsl:template match="/soapenv:Envelope">
    <xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
    <xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
    <xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
    <xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="soapenv:Body"/>
  </xsl:template>
  <xsl:template match="soapenv:Body">
    <xsl:apply-templates select="d2LogicalModel"/>
  </xsl:template>
  <xsl:template match="d2LogicalModel">
    <xsl:apply-templates select="pub:payloadPublication"/>
  </xsl:template>
  <xsl:template match="pub:payloadPublication">
    <xsl:apply-templates select="pub:siteMeasurements"/>
  </xsl:template>
  <xsl:template match="pub:siteMeasurements">
    <xsl:apply-templates select="pub:measuredValue"/>
  </xsl:template>
  <!-- One CSV row per outer measuredValue (the one carrying @index) -->
  <xsl:template match="pub:measuredValue">
    <xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
                                 @index,',',
                                 pub:measuredValue/pub:basicData/@xsi:type,',',
                                 descendant::pub:vehicleFlowRate,',',
                                 descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
                                 descendant::pub:speed)"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
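To see why the stylesheet needs the pub and xsi prefixes while d2LogicalModel is matched without one, it can help to print the qualified tag names. The following is only a minimal sketch using the standard library; the SAMPLE string is a trimmed stand-in assembled from the snippet in the question (an assumption, not the real NDW file).

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the NDW file (assumption): only nesting and namespaces matter here.
SAMPLE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>"""

root = ET.fromstring(SAMPLE)

# ElementTree spells qualified names as "{namespace-uri}localname".
# d2LogicalModel prints without braces because it sits in no namespace.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)

# Prefixes in the path are resolved through this mapping; the prefix names
# themselves are arbitrary, only the URIs have to match the document.
NS = {
    "soapenv": "http://schemas.xmlsoap.org/soap/envelope/",
    "pub": "http://datex2.eu/schema/2/2_0",
}
publication = root.find("soapenv:Body/d2LogicalModel/pub:payloadPublication", NS)
pub_time = publication.find("pub:publicationTime", NS)

# Namespaced attributes use the same "{uri}name" spelling.
xsi_type = publication.get("{http://www.w3.org/2001/XMLSchema-instance}type")
print(pub_time.text, xsi_type)
```

The same URIs are what the xmlns:pub and xmlns:xsi declarations in the stylesheet have to match exactly, character for character.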
Python
from io import StringIO
import lxml.etree as et
import pandas as pd
# LOAD XML AND XSL FILES
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT.xsl')
# INITIALIZE AND RUN TRANSFORMATION
transform = et.XSLT(xsl)
# CONVERT RESULT TO STRING
result = str(transform(doc))
# IMPORT INTO DATAFRAME
df = pd.read_csv(StringIO(result))
Output (parent node values repeat on every row alongside the differing numeric data)
print(df)
# publicationTime country nationalIdentifier msmtSiteTableRef_targetClass msmtSiteTableRef_version msmtSiteTableRef_id msmtSiteRef_targetClass msmtSiteRef_version msmtSiteRef_id measurementTimeDefault measuredValue_index basicData_type vehicleFlowRate averageVehicleSpeed_numberOfInputValues averageVehicleSpeed_value
# 0 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 1 TrafficFlow 60.0 NaN NaN
# 1 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 2 TrafficFlow 0.0 NaN NaN
# 2 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 3 TrafficFlow 0.0 NaN NaN
# 3 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 4 TrafficFlow 60.0 NaN NaN
# 4 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 5 TrafficSpeed NaN 1.0 38.0
# 5 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 6 TrafficSpeed NaN 0.0 -1.0
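If you would rather stay in plain Python, the same flattening can be sketched with ElementTree and an explicit namespace mapping instead of XSLT. This is a swapped-in alternative, not the approach above: the function name flatten_measurements, the row keys, and the SAMPLE string (a trimmed stand-in without the SOAP envelope) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

PUB = "http://datex2.eu/schema/2/2_0"
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
NS = {"pub": PUB}

# Trimmed stand-in for the real file (assumption): one site, two measured values.
SAMPLE = """<d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
    <siteMeasurements>
      <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00"/>
      <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
      <measuredValue index="1">
        <measuredValue>
          <basicData xsi:type="TrafficFlow">
            <vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
          </basicData>
        </measuredValue>
      </measuredValue>
      <measuredValue index="6">
        <measuredValue>
          <basicData xsi:type="TrafficSpeed">
            <averageVehicleSpeed numberOfInputValuesUsed="0"><speed>-1</speed></averageVehicleSpeed>
          </basicData>
        </measuredValue>
      </measuredValue>
    </siteMeasurements>
  </payloadPublication>
</d2LogicalModel>"""


def flatten_measurements(xml_text):
    """Hypothetical helper: one dict per outer <measuredValue index="...">."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() finds siteMeasurements at any depth, with or without a SOAP envelope.
    for site in root.iter(f"{{{PUB}}}siteMeasurements"):
        site_ref = site.find("pub:measurementSiteReference", NS)
        time_default = site.findtext("pub:measurementTimeDefault", namespaces=NS)
        # findall() on a bare name returns direct children only, so the nested
        # <measuredValue> of the same name is not visited twice.
        for outer in site.findall("pub:measuredValue", NS):
            basic = outer.find("pub:measuredValue/pub:basicData", NS)
            if basic is None:
                continue  # skip shapes this sketch does not know about
            avg_speed = basic.find("pub:averageVehicleSpeed", NS)
            rows.append({
                "site_id": site_ref.get("id") if site_ref is not None else None,
                "measurementTimeDefault": time_default,
                "index": outer.get("index"),
                "basicData_type": basic.get(XSI_TYPE),
                # findtext() returns None when the element is absent.
                "vehicleFlowRate": basic.findtext(
                    "pub:vehicleFlow/pub:vehicleFlowRate", namespaces=NS),
                "numberOfInputValuesUsed": (
                    avg_speed.get("numberOfInputValuesUsed")
                    if avg_speed is not None else None),
                "speed": basic.findtext(
                    "pub:averageVehicleSpeed/pub:speed", namespaces=NS),
            })
    return rows


rows = flatten_measurements(SAMPLE)
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) gives a frame comparable to the one above. All values arrive as strings, so numeric columns would still need pd.to_numeric, and whether a speed of -1 with zero input values should be treated as missing is something to confirm against the DATEX II documents.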