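Retaining HTML formatting in XML nodes when converting to CSV

Date: 2018-04-26 09:25:36

Tags: xml csv xslt

I have some very long XML files that contain HTML formatting inside some of their nodes. For example: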

<note encodinganalog="isadg361 marc500">
    <p>Published, with some changes, as "In Small Townlands" in <title render="italic">Death of a Naturalist</title> (1966).</p>
    <p>See also copy of "In Small Towlands" at reference <ref target="heaney.01.02.48" role="didid" actuate="onrequest">Heaney 1/2/48.</ref></p>
</note>

Or:

<scopecontent encodinganalog="isadg331 marc520">
  <head>Scope and Content</head>
  <p>The Seamus Heaney Collection comprises typescript and manuscript poems, many of which were later pulished in <title render="italic">Death of a Naturalist</title> and<title render="italic"> Door into the Dark</title>. There is also a short story titled<title render="italic"> The Blackberry Gatherers</title> and 8 letters to Philip Hobsbaum, including discussion of Heaney's work and The Group meetings in Belfast.</p>
</scopecontent>

Or:

<altformavail type="isadg342 marc530">
<head>Copies in Other Formats</head>
<p>Many of the poems were published, sometimes with changes, in <bibref><title render="italic">Death of a Naturalist</title> <imprint>(Faber, <date normal="1966">1966)</date></imprint></bibref> and <bibref><title render="italic">Door into the Dark</title> <imprint>(Faber, <date normal="1969">1969).</date></imprint></bibref></p>
</altformavail>

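In all of these examples I want to keep everything mentioned (<head>, <p>, <title>, etc.), but when I use online conversion tools or import the XML into Excel as a data source, the tags inside the nodes become column headers. For example, the tag <title render="italic">Death of a Naturalist</title> inside the <note> node is split so that 'title' and 'render' become columns, with 'Death of a Naturalist' and 'italic' as their values.

An XSLT has already been applied to the XML to make some minor changes: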

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<!-- 1. identity template copies everything as is -->
<xsl:template match="node() | @*">
    <xsl:copy>
        <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
</xsl:template>

<!-- 2. if there is an id attribute, delete it -->
<xsl:template match="unitid/@id" />

<!-- 3. move the value of the id attribute into the element value -->
<xsl:template match="unitid[@id]/text()">
    <xsl:value-of select="../@id | ../../@id" />
</xsl:template>

<!-- 4. create a <physloc> element inside <did> -->
<xsl:template match="did[not(physloc)]">
    <xsl:copy>
        <xsl:apply-templates select="node() | @*" />
        <physloc><xsl:value-of select="unitid" /></physloc>
    </xsl:copy>
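</xsl:template>

</xsl:stylesheet>

Is it possible to rewrite this XSLT so that the formatting of all nodes is kept and these tags survive the conversion to CSV?

Desired end result: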

    title, unitid, date, note, scopecontent, altformavail 
Death of a Naturalist , heaney.01.02.48 , 1966 , '<p>Published, with some changes, as "In Small Townlands" in <title render="italic">Death of a Naturalist</title> (1966).</p>
    <p>See also copy of "In Small Towlands" at reference <ref target="heaney.01.02.48" role="didid" actuate="onrequest">Heaney 1/2/48.</ref></p>' , '<head>Scope and Content</head>
      <p>The Seamus Heaney Collection comprises typescript and manuscript poems, many of which were later pulished in <title render="italic">Death of a Naturalist</title> and<title render="italic"> Door into the Dark</title>. There is also a short story titled<title render="italic"> The Blackberry Gatherers</title> and 8 letters to Philip Hobsbaum, including discussion of Heaney's work and The Group meetings in Belfast.</p>' , '<head>Copies in Other Formats</head>
    <p>Many of the poems were published, sometimes with changes, in <bibref><title render="italic">Death of a Naturalist</title> <imprint>(Faber, <date normal="1966">1966)</date></imprint></bibref> and <bibref><title render="italic">Door into the Dark</title> <imprint>(Faber, <date normal="1969">1969).</date></imprint></bibref></p>' 

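Note: this level of formatting is not limited to the <note>, <scopecontent> and <altformavail> tags; it can occur in any node in the XML.

1 answer:

Answer 0 (score: 1)

The simplest solution is to generate the CSV with XSLT itself, since that gives you more control over the output.

Given the following input: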

<elements>
    <element unitId="heaney.01.02.48">
        <title>Death of a Naturalist</title>
        <date>1966</date>
        <note encodinganalog="isadg361 marc500">
            <p>
                Published, with some changes, as "In Small Townlands" in
                <title render="italic">Death of a Naturalist</title>
                (1966).
            </p>
            <p>
                See also copy of "In Small Towlands" at reference
                <ref target="heaney.01.02.48" role="didid" actuate="onrequest">Heaney
                    1/2/48.</ref>
            </p>
        </note>
        <scopecontent encodinganalog="isadg331 marc520">
            <head>Scope and Content</head>
            <p>
                The Seamus Heaney Collection comprises typescript and manuscript
                poems, many of which were later pulished in
                <title render="italic">Death of a Naturalist</title>
                and
                <title render="italic"> Door into the Dark</title>
                . There is also a short story titled
                <title render="italic"> The Blackberry Gatherers</title>
                and 8 letters to Philip Hobsbaum, including discussion of Heaney's
                work and The Group meetings in Belfast.
            </p>
        </scopecontent>
        <altformavail type="isadg342 marc530">
            <head>Copies in Other Formats</head>
            <p>
                Many of the poems were published, sometimes with changes, in
                <bibref>
                    <title render="italic">Death of a Naturalist</title>
                    <imprint>
                        (Faber,
                        <date normal="1966">1966)</date>
                    </imprint>
                </bibref>
                and
                <bibref>
                    <title render="italic">Door into the Dark</title>
                    <imprint>
                        (Faber,
                        <date normal="1969">1969).</date>
                    </imprint>
                </bibref>
            </p>
        </altformavail>
    </element>
</elements>

You can use the following XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" omit-xml-declaration="yes"/>
    <xsl:template match="/">
        <xsl:text>title,unitid,date,note,scopecontent,altformavail</xsl:text>
        <xsl:for-each select="elements/element">
            <!-- insert new line -->
            <xsl:text>&#xa;</xsl:text>
            <xsl:apply-templates select="./title" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./@unitId" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./date" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./note" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./scopecontent" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./altformavail" />
        </xsl:for-each>

    </xsl:template>
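    <xsl:template match="title|date|note|scopecontent|altformavail">
        <!-- surround the contents with quotes -->
        <xsl:text>'</xsl:text>
        <!-- method="text" writes only text nodes, so the tags are emitted as literal text -->
        <xsl:apply-templates select="node()" mode="markup"/>
        <xsl:text>'</xsl:text>
    </xsl:template>
    <!-- write an element as text: start tag, attributes, content, end tag -->
    <xsl:template match="*" mode="markup">
        <xsl:value-of select="concat('&lt;', name())"/>
        <xsl:for-each select="@*">
            <xsl:value-of select="concat(' ', name(), '=&quot;', ., '&quot;')"/>
        </xsl:for-each>
        <xsl:text>&gt;</xsl:text>
        <xsl:apply-templates select="node()" mode="markup"/>
        <xsl:value-of select="concat('&lt;/', name(), '&gt;')"/>
    </xsl:template>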
</xsl:stylesheet>

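This produces a CSV that keeps the line breaks and runs of whitespace from the source. If we want to normalize the whitespace, we need to walk the original XML ourselves: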

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" omit-xml-declaration="yes" />
    <xsl:template match="/">
        <xsl:text>title,unitid,date,note,scopecontent,altformavail</xsl:text>
        <xsl:for-each select="elements/element">
            <xsl:text>&#xa;</xsl:text>
            <xsl:apply-templates select="./title" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./@unitId" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./date" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./note" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./scopecontent" />
            <xsl:text>,</xsl:text>
            <xsl:apply-templates select="./altformavail" />
        </xsl:for-each>
    </xsl:template>

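    <!-- Match only the direct children of <element>, so a nested <title> or <date> is not quoted again -->
    <xsl:template match="element/title|element/date|element/note|element/scopecontent|element/altformavail">
        <xsl:text>'</xsl:text>
        <xsl:apply-templates /> <!-- We'll define additional processing rules for each node inside -->
        <xsl:text>'</xsl:text>
    </xsl:template>

    <!-- When an element is encountered... -->
    <xsl:template match="*">
        <!-- ...write its start tag as literal text, because method="text" discards real elements... -->
        <xsl:value-of select="concat('&lt;', name())" />
        <!-- ...followed by its attributes... -->
        <xsl:for-each select="@*">
            <xsl:value-of select="concat(' ', name(), '=&quot;', ., '&quot;')" />
        </xsl:for-each>
        <xsl:text>&gt;</xsl:text>
        <!-- ...then apply the rules for element and text nodes to whatever's inside... -->
        <xsl:apply-templates />
        <!-- ...and write the end tag -->
        <xsl:value-of select="concat('&lt;/', name(), '&gt;')" />
    </xsl:template>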

    <!-- When a text node is encountered, print its content with spaces normalized  -->
    <xsl:template match="text()">
        <!-- If the node is empty, don't print anything -->
        <xsl:if test="normalize-space()">
            <xsl:value-of select="concat(normalize-space(), ' ')" />
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
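
For comparison, the same idea (keep the inner tags, collapse the whitespace) can be sketched outside XSLT. The Python snippet below is an illustration only and not part of the original answer. The helper names `normalize_space` and `inner_markup` are hypothetical, the sample XML is a shortened version of the input above, and the spacing of the result will not match the stylesheet's output character for character.

```python
import xml.etree.ElementTree as ET


def normalize_space(text):
    """Trim and collapse runs of whitespace, roughly like XPath normalize-space()."""
    return " ".join((text or "").split())


def inner_markup(element):
    """Hypothetical helper: return an element's content, child tags included, as one string."""
    parts = [element.text or ""]
    for child in element:
        # ET.tostring() serializes the child element together with its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return normalize_space("".join(parts))


# Shortened sample based on the <note> element shown earlier.
xml = """<note encodinganalog="isadg361 marc500">
    <p>
        Published in
        <title render="italic">Death of a Naturalist</title>
        (1966).
    </p>
</note>"""

note = ET.fromstring(xml)
print(inner_markup(note))
```

Serializing the children as strings first and normalizing afterwards keeps the sketch short, at the cost of also collapsing whitespace inside attribute values.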

If you need additional processing of the text content (for example, escaping commas in a CSV column), modify concat(normalize-space(), ' ') to suit your needs.
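
To show what such escaping usually involves, here is a minimal Python sketch. It is not part of the original answer. It assumes the inner markup of a field is already available as a string (the `note` value is shortened sample data, and `to_csv_row` is a hypothetical helper). It relies on the standard `csv` module, which wraps any field containing a comma, a double quote or a line break in double quotes and doubles embedded double quotes.

```python
import csv
import io


def to_csv_row(fields):
    """Hypothetical helper: serialize one row and let the csv module do the quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue()


# Shortened sample field containing both commas and double quotes.
note = '<p>Published, with some changes, in <title render="italic">Death of a Naturalist</title>.</p>'
row = to_csv_row(["Death of a Naturalist", "heaney.01.02.48", "1966", note])
print(row)

# Reading the row back recovers the original markup unchanged.
assert next(csv.reader(io.StringIO(row)))[3] == note
```

Double-quote wrapping with doubled inner quotes is the convention most spreadsheet tools expect, so it tends to survive an import more reliably than wrapping fields in single quotes.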