Question

I have an XML file containing these tags.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DataFlows>
    <DataFlow id="ABC">
            <Flow name="flow4" type="Ingest">
                <Ingest dataSourceName="type1" tableName="table1">
                    <DataSet>
                        <DataSetRef>value1-${d1}-${t1}</DataSetRef>
                        <DataStore>ingest</DataStore>
                    </DataSet>
                    <Mode>Overwrite</Mode>
                </Ingest>
            </Flow>    
        </DataFlow>
        <DataFlow id="MHH" dependsOn="ABC">
            <Flow name="flow5" type="Reconcile">
                <Reconciliation>
                    <Source>QW</Source>
                    <Target>EF</Target>
                    <ComparisonKey>
                        <Column>dealNumber</Column>
                    </ComparisonKey>
                    <ReconcileColumns mode="required">
                        <Column>bookId</Column>
                    </ReconcileColumns>
                </Reconciliation>
            </Flow>
            <Flow name="output" type="Export" format="Native">
                <Table publishToSQLServer="true">
                    <DataSet>
                        <DataSetRef>value4_${cob}_${ts}</DataSetRef>
                        <DataStore>recon</DataStore>
                        <Date>${run_date}</Date>
                    </DataSet>
                    <Mode>Overwrite</Mode>
                </Table>
            </Flow>
        </DataFlow>
</DataFlows>

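I want to process this XML in Python using the minimal DOM implementation (xml.dom.minidom). I only need to extract the information in the DataSet tag that goes with a Flow of type "Reconcile".

For example:

If my Flow type is "Reconcile", I need to move on to the next Flow tag, named "output", and extract the values of the DataSetRef, DataStore and Date tags.

So far I have tried the code below, but I get blank values for every field.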

#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom

# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("Store.xml")
collection = DOMTree.documentElement
#if collection.hasAttribute("DataFlows"):
#   print "Root element : %s" % collection.getAttribute("DataFlows")
pretty = DOMTree.toprettyxml()
print "Collection: %s" % collection
dataflows = DOMTree.getElementsByTagName("DataFlow")

# Print detail of each dataflow.
for dataflow in dataflows:
    print "*****dataflow*****"
    if dataflow.hasAttribute("dependsOn"):
        print "Depends On is present"
        flows = DOMTree.getElementsByTagName("Flow")
        print "flows"
        for flow in flows:
            print "******flow******"
            if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile":
                flowByReconcileType = flow.getAttribute("type")
                TagValue = flow.getElementsByTagName("DataSet")
                print "Tag Value is %s" % TagValue
                print "flow type is: %s" % flowByReconcileType

From there, I need to pass the three values extracted above to a Unix shell script that processes some directories. Any help would be appreciated.

Answer 1

First, check that your XML is well-formed. You are missing a root tag, and you have mistyped double quotes, for example <Flow name=“flow4" type="Ingest">
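A quick way to test well-formedness is to let minidom try to parse the text and catch the parser error. This is only a minimal sketch: the helper name is_well_formed and the two sample strings are made up for illustration, and the "bad" sample imitates the missing-root case with two top-level elements.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if minidom can parse xml_text, False otherwise."""
    try:
        parseString(xml_text)
        return True
    except ExpatError as err:
        # expat reports the line and column of the first problem it hits
        print("XML is not well-formed: %s" % err)
        return False


# Hypothetical samples: one document with a single root element,
# and one with two top-level elements (that is, no root tag).
good = '<DataFlows><DataFlow id="ABC"/></DataFlows>'
bad = '<DataFlow id="ABC"/><DataFlow id="MHH"/>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

For a file on disk, the same try/except can wrap xml.dom.minidom.parse("Store.xml") instead of parseString.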

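In your code, you are grabbing the dataflows correctly.

You don't need to query DOMTree again to get the flows. Querying DOMTree returns every Flow in the document, whereas querying the dataflow returns only the flows that belong to it: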

flows = dataflow.getElementsByTagName("Flow")
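To see the difference between the two queries, here is a small sketch on a trimmed-down copy of the XML from the question (the inline string and the variable names are only for illustration). getElementsByTagName on the document returns every Flow, while the same call on a DataFlow element returns only its descendants.

```python
from xml.dom.minidom import parseString

# Trimmed-down version of the question's XML, kept inline so the
# example is self-contained.
XML = """<DataFlows>
  <DataFlow id="ABC">
    <Flow name="flow4" type="Ingest"/>
  </DataFlow>
  <DataFlow id="MHH" dependsOn="ABC">
    <Flow name="flow5" type="Reconcile"/>
    <Flow name="output" type="Export"/>
  </DataFlow>
</DataFlows>"""

dom = parseString(XML)

# Document-wide query: every Flow, regardless of its parent DataFlow.
all_flows = dom.getElementsByTagName("Flow")

# Scoped query: only the Flow elements under each DataFlow.
scoped = {}
for dataflow in dom.getElementsByTagName("DataFlow"):
    own_flows = dataflow.getElementsByTagName("Flow")
    scoped[dataflow.getAttribute("id")] = [
        flow.getAttribute("name") for flow in own_flows
    ]

print(len(all_flows))
print(scoped)
```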

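Your condition if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile": looks fine, so to get the next flow item you can do something like the following, always checking that the index stays within the list.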

for index, flow in enumerate(flows):
    if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile":
        if index + 1 < len(flows):
            your_flow = flows[index + 1]
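Putting the pieces together, a possible end-to-end sketch looks like this. The Reconcile flow itself has no DataSet, so the lookup has to run on the following flow, and the text of an element is read from its first child node rather than by printing the element. The helper names child_text and datasets_after_reconcile are made up for this example, and the XML is an inline, shortened copy of the question's file.

```python
from xml.dom.minidom import parseString

# Shortened inline copy of the second DataFlow from the question.
XML = """<DataFlows>
  <DataFlow id="MHH" dependsOn="ABC">
    <Flow name="flow5" type="Reconcile">
      <Reconciliation><Source>QW</Source><Target>EF</Target></Reconciliation>
    </Flow>
    <Flow name="output" type="Export" format="Native">
      <Table publishToSQLServer="true">
        <DataSet>
          <DataSetRef>value4_${cob}_${ts}</DataSetRef>
          <DataStore>recon</DataStore>
          <Date>${run_date}</Date>
        </DataSet>
      </Table>
    </Flow>
  </DataFlow>
</DataFlows>"""


def child_text(parent, tag):
    """Text of the first <tag> descendant of parent, or "" if absent/empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data.strip()


def datasets_after_reconcile(dom):
    """Collect (DataSetRef, DataStore, Date) from the flow after each Reconcile flow."""
    results = []
    for dataflow in dom.getElementsByTagName("DataFlow"):
        flows = dataflow.getElementsByTagName("Flow")
        for index, flow in enumerate(flows):
            is_reconcile = flow.getAttribute("type") == "Reconcile"
            if is_reconcile and index + 1 < len(flows):
                next_flow = flows[index + 1]
                for dataset in next_flow.getElementsByTagName("DataSet"):
                    results.append((
                        child_text(dataset, "DataSetRef"),
                        child_text(dataset, "DataStore"),
                        child_text(dataset, "Date"),
                    ))
    return results


values = datasets_after_reconcile(parseString(XML))
print(values)
```

Each tuple can then be handed to the shell script as positional arguments, for example with subprocess.call(["./process_dirs.sh", ref, store, date]), where process_dirs.sh is a placeholder for your own script. Passing the arguments as a list avoids the shell expanding the ${...} placeholders.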
