How do I parse XML to extract fields for documentation?

Asked: 2017-03-23 10:06:57

Tags: xml excel bash ksh

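I'm trying to do some research on how to easily produce documentation from Mappings that have already been built in Informatica Powercenter, and the current approach is hard for me because of the number of different options. The approach followed here is to open every box in the mapping as many times as needed, copy the information into a Word document, format it, and repeat that a couple of thousand times a week.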

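My idea for a solution is fairly low-tech: export the mapping to XML, then use a script (or a program; I have tried Excel a few times, unsuccessfully) to parse that XML into something easier to copy and paste, which would improve my life.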

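The XML looks like this (reduced to as few lines as possible to serve as an example; it may not be 100% valid, although the original XML obviously is. ValueAssigned is a placeholder I put wherever something has a value, instead of the actual string each time):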

Type 1 Document:

   <!DOCTYPE POWERMART SYSTEM "ValueAssigned">
<POWERMART CREATION_DATE="ValueAssigned" REPOSITORY_VERSION="ValueAssigned">
<REPOSITORY NAME="ValueAssigned" VERSION="ValueAssigned" CODEPAGE="ValueAssigned" DATABASETYPE="ValueAssigned">
<FOLDER NAME="ValueAssigned" GROUP="" OWNER="ValueAssigned" SHARED="ValueAssigned" DESCRIPTION="ValueAssigned" PERMISSIONS="ValueAssigned" UUID="ValueAssigned">
    <CONFIG DESCRIPTION ="ValueAssigned" ISDEFAULT ="YES" NAME ="ValueAssigned" VERSIONNUMBER ="ValueAssigned">
        <ATTRIBUTE NAME ="Field1" VALUE =""/>
        <ATTRIBUTE NAME ="Field2" VALUE ="NO"/>
    <WORKFLOW DESCRIPTION ="" ISENABLED ="ValueAssigned" ISRUNNABLESERVICE ="ValueAssigned" ISSERVICE ="ValueAssigned" ISVALID ="ValueAssigned" NAME ="ValueAssigned" REUSABLE_SCHEDULER ="ValueAssigned" SCHEDULERNAME ="ValueAssigned" SERVERNAME ="ValueAssigned" SERVER_DOMAINNAME ="ValueAssigned" SUSPEND_ON_ERROR ="ValueAssigned" TASKS_MUST_RUN_ON_SERVER ="ValueAssigned" VERSIONNUMBER ="ValueAssigned">
        <SCHEDULER DESCRIPTION ="" NAME ="SchedulerName" REUSABLE ="ValueAssigned" VERSIONNUMBER ="ValueAssigned">
            <SCHEDULEINFO SCHEDULETYPE ="ONDEMAND"/>
        </SCHEDULER>
        <TASK DESCRIPTION ="ValueAssigned" NAME ="Start" REUSABLE ="NO" TYPE ="Start" VERSIONNUMBER ="1"/>
        <SESSION DESCRIPTION ="ValueAssigned" ISVALID ="ValueAssigned" MAPPINGNAME ="ValueAssigned" NAME ="ValueAssigned" REUSABLE ="ValueAssigned" SORTORDER ="ValueAssigned" VERSIONNUMBER ="ValueAssigned">
            <SESSTRANSFORMATIONINST ISREPARTITIONPOINT ="ValueAssigned" PARTITIONTYPE ="ValueAssigned" PIPELINE ="ValueAssigned" SINSTANCENAME ="ValueAssigned" STAGE ="ValueAssigned" TRANSFORMATIONNAME ="ValueAssigned" TRANSFORMATIONTYPE ="Target Definition">
                <ATTRIBUTE NAME ="ValueAssigned" VALUE ="ValueAssigned"/>
                <ATTRIBUTE NAME ="ValueAssigned" VALUE ="ValueAssigned"/>
            </SESSTRANSFORMATIONINST>

So, if we focus on any one of the tags, for example

<CONFIG DESCRIPTION ="Default session configuration object" ISDEFAULT ="YES" NAME ="default_session_config" VERSIONNUMBER ="29">
        <ATTRIBUTE NAME ="Field1" VALUE =""/>
        <ATTRIBUTE NAME ="Field2" VALUE ="NO"/>

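We can see that there is a tag, CONFIG DESCRIPTION, followed by some attribute names. One of the options I'm considering is a bit naive: if I could get this into columns, in Excel or something similar, I would see a row with the root tag, the different categories under it, and the breakdown under those, down to the point where I can say: OK, this is the tag, this is a column with all its values, I copy it into my Word document and call it a day. The XML runs anywhere between 300 and 900 lines, and it is neither easy to read nor easy to work with because of the quotes, the constant tags, and columns that don't line up since the lines are not all the same length (so I can't use column mode)...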

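I'm including another type of document, in case it makes clearer how much the information varies and why I don't just jump straight into writing my own parser: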

  <?xml version="ValueAssigned" encoding="ValueAssigned"?>
<!DOCTYPE POWERMART SYSTEM "ValueAssigned">
<POWERMART CREATION_DATE="ValueAssigned" REPOSITORY_VERSION="ValueAssigned">
<REPOSITORY NAME="ValueAssigned" VERSION="ValueAssigned" CODEPAGE="ValueAssigned" DATABASETYPE="ValueAssigned">
<FOLDER NAME="ValueAssigned" GROUP="ValueAssigned" OWNER="ValueAssigned" SHARED="ValueAssigned" DESCRIPTION="ValueAssigned" PERMISSIONS="ValueAssigned" UUID="ValueAssigned">
    <SOURCE BUSINESSNAME ="ValueAssigned" DATABASETYPE ="ValueAssigned" DBDNAME ="ValueAssigned" DESCRIPTION ="ValueAssigned" NAME ="ValueAssigned" OBJECTVERSION ="ValueAssigned" OWNERNAME ="ValueAssigned" VERSIONNUMBER ="ValueAssigned">
        <SOURCEFIELD BUSINESSNAME ="ValueAssigned" DATATYPE ="ValueAssigned" DESCRIPTION ="ValueAssigned" FIELDNUMBER ="ValueAssigned" FIELDPROPERTY ="ValueAssigned" FIELDTYPE ="ValueAssigned" HIDDEN ="ValueAssigned" KEYTYPE ="ValueAssigned" LENGTH ="ValueAssigned" LEVEL ="ValueAssigned" NAME ="ValueAssigned" NULLABLE ="ValueAssigned" OCCURS ="ValueAssigned" OFFSET ="ValueAssigned" PHYSICALLENGTH ="ValueAssigned" PHYSICALOFFSET ="ValueAssigned" PICTURETEXT ="ValueAssigned" PRECISION ="ValueAssigned" SCALE ="ValueAssigned" USAGE_FLAGS ="ValueAssigned"/>
<FOLDER NAME="ValueAssigned" GROUP="ValueAssigned" OWNER="ValueAssigned" SHARED="ValueAssigned" DESCRIPTION="ValueAssigned" PERMISSIONS="ValueAssigned" UUID="ValueAssigned">
    <SOURCE BUSINESSNAME ="ValueAssigned" CRCVALUE ="ValueAssigned" DATABASETYPE ="ValueAssigned" DBDNAME ="ValueAssigned" DESCRIPTION ="ValueAssigned" IBMCOMP ="ValueAssigned" NAME ="ValueAssigned" OBJECTVERSION ="ValueAssigned" OWNERNAME ="ValueAssigned" VERSIONNUMBER ="ValueAssigned">
        <FLATFILE CODEPAGE ="ValueAssigned" CONSECDELIMITERSASONE ="ValueAssigned" DELIMITED ="ValueAssigned" DELIMITERS ="ValueAssigned" ESCAPE_CHARACTER ="ValueAssigned" KEEPESCAPECHAR ="ValueAssigned" LINESEQUENTIAL ="ValueAssigned" MULTIDELIMITERSASAND ="ValueAssigned" NULLCHARTYPE ="ValueAssigned" NULL_CHARACTER ="ValueAssigned" PADBYTES ="ValueAssigned" QUOTE_CHARACTER ="ValueAssigned" REPEATABLE ="ValueAssigned" ROWDELIMITER ="ValueAssigned" SHIFTSENSITIVEDATA ="ValueAssigned" SKIPROWS ="ValueAssigned" STRIPTRAILINGBLANKS ="ValueAssigned"/>
        <SOURCEFIELD BUSINESSNAME ="ValueAssigned" DESCRIPTION ="ValueAssigned" FIELDNUMBER ="ValueAssigned" FIELDPROPERTY ="ValueAssigned" FIELDTYPE ="ValueAssigned" HIDDEN ="ValueAssigned" LENGTH ="ValueAssigned" LEVEL ="ValueAssigned" NAME ="ValueAssigned" OCCURS ="ValueAssigned" OFFSET ="ValueAssigned" PHYSICALLENGTH ="ValueAssigned" PHYSICALOFFSET ="ValueAssigned">

1 Answer:

Answer 0 (score: 0)

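I have done something similar in the past: converting an XML file into a flat TXT file. One of your problems is that XML is a nested format with list-like structures.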

One thing you can do is flatten it this way:

<CONFIG DESCRIPTION ="Default session configuration object" ISDEFAULT ="YES" NAME ="default_session_config" VERSIONNUMBER ="29">
        <ATTRIBUTE NAME ="Field1" VALUE =""/>
        <ATTRIBUTE NAME ="Field2" VALUE ="NO"/>
</CONFIG>

becomes

CONFIG.DESCRIPTION = "Default session configuration object"
CONFIG.ISDEFAULT = "YES"
CONFIG.NAME = "default_session_config"
CONFIG.VERSIONNUMBER = "29"
CONFIG.ATTRIBUTE[1].NAME = "Field1"
CONFIG.ATTRIBUTE[1].VALUE = ""
CONFIG.ATTRIBUTE[2].NAME = "Field2"
CONFIG.ATTRIBUTE[2].VALUE = "NO"

This is basically an XPath = value format. You can achieve it using Python with an XML library, or using XSLT with xsl templates.
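
As a rough sketch of the Python route, assuming the export is well-formed XML (the trimmed samples in the question are not, because several tags are never closed), the standard-library xml.etree.ElementTree module can produce this kind of output. The names flatten, flatten_text and SAMPLE are made up for illustration, and adding an index only to child tags that repeat is my own choice, not something the export format dictates.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The CONFIG fragment from the answer, closed so that it is well-formed.
SAMPLE = """\
<CONFIG DESCRIPTION ="Default session configuration object" ISDEFAULT ="YES" NAME ="default_session_config" VERSIONNUMBER ="29">
        <ATTRIBUTE NAME ="Field1" VALUE =""/>
        <ATTRIBUTE NAME ="Field2" VALUE ="NO"/>
</CONFIG>
"""


def flatten(elem, path=None):
    """Yield (path, value) pairs for every attribute of elem and its descendants."""
    path = path or elem.tag
    for name, value in elem.attrib.items():
        yield f"{path}.{name}", value

    # Count sibling tags first so that only repeated tags get a [n] index.
    totals = Counter(child.tag for child in elem)
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        child_path = f"{path}.{child.tag}"
        if totals[child.tag] > 1:
            child_path += f"[{seen[child.tag]}]"
        yield from flatten(child, child_path)


def flatten_text(xml_text):
    """Parse an XML string and return a list of 'path = "value"' lines."""
    root = ET.fromstring(xml_text)
    return [f'{path} = "{value}"' for path, value in flatten(root)]


if __name__ == "__main__":
    for line in flatten_text(SAMPLE):
        print(line)
```

For a real export you would parse the file instead of a string, for example with ET.parse("export.xml").getroot() (the file name is hypothetical), and write the lines to a text file. Since each line is one path and one value, splitting on the first " = " should also give two columns that paste into Excel or Word.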