如何使用包含多个名称空间的spark读取XML文件?

时间:2018-10-11 11:15:58

标签: apache-spark databricks

我正在使用Azure-Databricks中的spark-xml库。但是我无法正确选择读取包含多个名称空间的此类文件。

因此,我正在寻求一些帮助,以在选项或任何其他方法中对此代码进行编码。

这里是剥离的样品。

<msg:TrainTrackingMessage xmlns:msg="be:brail:nmbs-it:esb:msg:traintraffic" xmlns:trtf="be:brail:nmbs-it:esb:traintraffic" xmlns:gene="be:brail:nmbs-it:esb:generalelements">
<gene:Event>
    <gene:EventType>tracking</gene:EventType>
    <gene:EventMessage>TrainTracking</gene:EventMessage>
    <gene:EventTimeStamp>2018-09-27T14:13:15.458439</gene:EventTimeStamp>
</gene:Event>
<gene:Train>
    <gene:TrainKey>
        <gene:CirculationType>1</gene:CirculationType>
        <gene:Discriminator>0</gene:Discriminator>
        <gene:DepartureDate>2018-09-27</gene:DepartureDate>
    </gene:TrainKey>
    <gene:TrainNumberEBP>2E0xaZ12</gene:TrainNumberEBP>
    <gene:TrainDetails>
        <gene:TrainGroup>1</gene:TrainGroup>
    </gene:TrainDetails>
</gene:Train>
<trtf:TrainTracking>
    <gene:ItineraryPoint>
        <gene:PtcarIdentification>592</gene:PtcarIdentification>
        <gene:OrderNumber>150</gene:OrderNumber>
        <gene:ItineraryPointDetails>
            <gene:OperationCode>=</gene:OperationCode>
            <gene:CommercialStop>2</gene:CommercialStop>
        </gene:ItineraryPointDetails>
        <gene:ItineraryPointTimeInfo>
            <gene:ArrivalTime>14:10:47</gene:ArrivalTime>
            <gene:DepartureTime>14:10:54</gene:DepartureTime>
        </gene:ItineraryPointTimeInfo>
        <gene:ItineraryTechnicalInfo>
            <gene:EngineType>21</gene:EngineType>
            <gene:TractionCode>E</gene:TractionCode>
            <gene:TractionOperator/>
        </gene:ItineraryTechnicalInfo>
    </gene:ItineraryPoint>
    <trtf:GPSPosition>
        <trtf:GPSAltitude>51</trtf:GPSAltitude>
    </trtf:GPSPosition>
    <trtf:Libelle>E2412</trtf:Libelle>
    <trtf:TrackingPointInfo>
        <trtf:TrackingType>2</trtf:TrackingType>
        <trtf:TrackingOrigin>0</trtf:TrackingOrigin>
    </trtf:TrackingPointInfo>
    <trtf:TrackingTimeInfo>
        <trtf:Delay>1639</trtf:Delay>
    </trtf:TrackingTimeInfo>
</trtf:TrainTracking>

1 个答案:

答案 0 :(得分:1)

如果人们想找一些熟悉的东西,那就可以了。

import xml.etree.ElementTree as ET
xmlfiles = dbutils.fs.ls(storage_mount_name)

##Get attribute names (for now I took all leafs of the xml structure)
firstfile = xmlfiles[0].path.replace('dbfs:','/dbfs')
root = ET.parse(firstfile).getroot()
attributes = [node.tag for node in root.iter() if len(node)==0]
clean_attribute_names = [re.sub(r'\{.*\}', '', a) for a in attributes]

#Create Dataframe and save it as csv
df = pd.DataFrame(columns=clean_attribute_names, index=xmlfiles)
for xf in xmlfiles:
    afile = xf.path.replace('dbfs:','/dbfs')
    root = ET.parse(afile).getroot()
    df.loc[afile] = [node.text for node in root.iter() if node.tag in attributes]