I have some very long XML files that contain HTML-style formatting inside some of the nodes. For example:
<note encodinganalog="isadg361 marc500">
<p>Published, with some changes, as "In Small Townlands" in <title render="italic">Death of a Naturalist</title> (1966).</p>
<p>See also copy of "In Small Towlands" at reference <ref target="heaney.01.02.48" role="didid" actuate="onrequest">Heaney 1/2/48.</ref></p>
</note>
Or:
<scopecontent encodinganalog="isadg331 marc520">
<head>Scope and Content</head>
<p>The Seamus Heaney Collection comprises typescript and manuscript poems, many of which were later pulished in <title render="italic">Death of a Naturalist</title> and<title render="italic"> Door into the Dark</title>. There is also a short story titled<title render="italic"> The Blackberry Gatherers</title> and 8 letters to Philip Hobsbaum, including discussion of Heaney's work and The Group meetings in Belfast.</p>
</scopecontent>
Or:
<altformavail type="isadg342 marc530">
<head>Copies in Other Formats</head>
<p>Many of the poems were published, sometimes with changes, in <bibref><title render="italic">Death of a Naturalist</title> <imprint>(Faber, <date normal="1966">1966)</date></imprint></bibref> and <bibref><title render="italic">Door into the Dark</title> <imprint>(Faber, <date normal="1969">1969).</date></imprint></bibref></p>
</altformavail>
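In all of these examples I want to keep everything shown (<head>, <p>, <title> and so on). However, when I use an online conversion tool or import the XML into Excel as a data source, the tags inside the nodes become column headers. For example, the tag <title render="italic">Death of a Naturalist</title> inside the <note> node is split up so that 'title' and 'render' become columns, with 'Death of a Naturalist' and 'italic' as their values.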
An XSLT is already applied to the XML to make some minor changes:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- 1. identity template copies everything as is -->
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*" />
</xsl:copy>
</xsl:template>
<!-- 2. if there is an id attribute, delete it -->
<xsl:template match="unitid/@id" />
<!-- 3. move the value of the id attribute into the element value -->
<xsl:template match="unitid[@id]/text()">
<xsl:value-of select="../@id | ../../@id" />
</xsl:template>
<!-- 4. create a <physloc> element inside <did> -->
<xsl:template match="did[not(physloc)]">
<xsl:copy>
<xsl:apply-templates select="node() | @*" />
<physloc><xsl:value-of select="unitid" /></physloc>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
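Is it possible to rewrite this XSLT so that the formatting inside every node is kept, and the tags survive when the result is converted to CSV?
Desired end result: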
title, unitid, date, note, scopecontent, altformavail
Death of a Naturalist , heaney.01.02.48 , 1966 , '<p>Published, with some changes, as "In Small Townlands" in <title render="italic">Death of a Naturalist</title> (1966).</p>
<p>See also copy of "In Small Towlands" at reference <ref target="heaney.01.02.48" role="didid" actuate="onrequest">Heaney 1/2/48.</ref></p>' , '<head>Scope and Content</head>
<p>The Seamus Heaney Collection comprises typescript and manuscript poems, many of which were later pulished in <title render="italic">Death of a Naturalist</title> and<title render="italic"> Door into the Dark</title>. There is also a short story titled<title render="italic"> The Blackberry Gatherers</title> and 8 letters to Philip Hobsbaum, including discussion of Heaney's work and The Group meetings in Belfast.</p>' , '<head>Copies in Other Formats</head>
<p>Many of the poems were published, sometimes with changes, in <bibref><title render="italic">Death of a Naturalist</title> <imprint>(Faber, <date normal="1966">1966)</date></imprint></bibref> and <bibref><title render="italic">Door into the Dark</title> <imprint>(Faber, <date normal="1969">1969).</date></imprint></bibref></p>'
Note: it is not only the <note>, <scopecontent> and <altformavail> tags that contain this kind of formatting; it can occur in any node in the XML.
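Answer 0 (score: 1)
The simplest solution is to generate the CSV with XSLT itself, because that gives you much better control over the output.
Given the following input: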
<elements>
<element unitId="heaney.01.02.48">
<title>Death of a Naturalist</title>
<date>1966</date>
<note encodinganalog="isadg361 marc500">
<p>
Published, with some changes, as "In Small Townlands" in
<title render="italic">Death of a Naturalist</title>
(1966).
</p>
<p>
See also copy of "In Small Towlands" at reference
<ref target="heaney.01.02.48" role="didid" actuate="onrequest">Heaney
1/2/48.</ref>
</p>
</note>
<scopecontent encodinganalog="isadg331 marc520">
<head>Scope and Content</head>
<p>
The Seamus Heaney Collection comprises typescript and manuscript
poems, many of which were later pulished in
<title render="italic">Death of a Naturalist</title>
and
<title render="italic"> Door into the Dark</title>
. There is also a short story titled
<title render="italic"> The Blackberry Gatherers</title>
and 8 letters to Philip Hobsbaum, including discussion of Heaney's
work and The Group meetings in Belfast.
</p>
</scopecontent>
<altformavail type="isadg342 marc530">
<head>Copies in Other Formats</head>
<p>
Many of the poems were published, sometimes with changes, in
<bibref>
<title render="italic">Death of a Naturalist</title>
<imprint>
(Faber,
<date normal="1966">1966)</date>
</imprint>
</bibref>
and
<bibref>
<title render="italic">Door into the Dark</title>
<imprint>
(Faber,
<date normal="1969">1969).</date>
</imprint>
</bibref>
</p>
</altformavail>
</element>
</elements>
You can use the following XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>title,unitid,date,note,scopecontent,altformavail</xsl:text>
<xsl:for-each select="elements/element">
<!-- insert new line -->
<xsl:text>
</xsl:text>
<xsl:apply-templates select="./title" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./@unitId" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./date" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./note" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./scopecontent" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./altformavail" />
</xsl:for-each>
</xsl:template>
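<xsl:template match="title|date|note|scopecontent|altformavail">
<!-- surround the contents with quotes -->
<xsl:text>'</xsl:text>
<xsl:apply-templates mode="markup" />
<xsl:text>'</xsl:text>
</xsl:template>
<!-- method="text" serializes text nodes only, so the tags are written out as literal text -->
<xsl:template match="*" mode="markup">
<xsl:value-of select="concat('&lt;', name())" />
<xsl:for-each select="@*">
<xsl:value-of select="concat(' ', name(), '=&quot;', ., '&quot;')" />
</xsl:for-each>
<xsl:text>&gt;</xsl:text>
<xsl:apply-templates mode="markup" />
<xsl:value-of select="concat('&lt;/', name(), '&gt;')" />
</xsl:template>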
</xsl:stylesheet>
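This produces a CSV in which line breaks and runs of spaces are preserved. If we want to normalize the whitespace, we need to process the original XML more deeply: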
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" omit-xml-declaration="yes" />
<xsl:template match="/">
<xsl:text>title,unitid,date,note,scopecontent,altformavail</xsl:text>
<xsl:for-each select="elements/element">
<xsl:text>
</xsl:text>
<xsl:apply-templates select="./title" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./@unitId" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./date" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./note" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./scopecontent" />
<xsl:text>,</xsl:text>
<xsl:apply-templates select="./altformavail" />
</xsl:for-each>
</xsl:template>
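<!-- Match only the direct children of <element>, so that a nested <title> or <date> falls through to the generic element rule below -->
<xsl:template match="element/title|element/date|element/note|element/scopecontent|element/altformavail">
<xsl:text>'</xsl:text>
<xsl:apply-templates /> <!-- We'll define additional processing rules for each node inside -->
<xsl:text>'</xsl:text>
</xsl:template>
<!-- When an element is encountered... -->
<xsl:template match="*">
<!-- ...write its start tag as literal text, because method="text" discards element nodes... -->
<xsl:value-of select="concat('&lt;', name())" />
<!-- ...and attributes... -->
<xsl:for-each select="@*">
<xsl:value-of select="concat(' ', name(), '=&quot;', ., '&quot;')" />
</xsl:for-each>
<xsl:text>&gt;</xsl:text>
<!-- ...applying the rules for element and text nodes to whatever's inside... -->
<xsl:apply-templates />
<!-- ...and finally write the end tag -->
<xsl:value-of select="concat('&lt;/', name(), '&gt;')" />
</xsl:template>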
<!-- When a text node is encountered, print its content with spaces normalized -->
<xsl:template match="text()">
<!-- If the node is empty, don't print anything -->
<xsl:if test="normalize-space()">
<xsl:value-of select="concat(normalize-space(), ' ')" />
</xsl:if>
</xsl:template>
</xsl:stylesheet>
If you need additional processing of the text content (for example, escaping commas in the CSV columns), modify concat(normalize-space(), ' ') to suit your needs.
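As one possible illustration of that kind of modification, here is a minimal sketch of a text() rule that doubles every single quote, so that a value such as "Heaney's work" cannot terminate the quoted field early. It assumes an XSLT 2.0 processor (replace() does not exist in XSLT 1.0) and assumes that whatever reads the CSV is configured to treat ' as the quote character and '' as an escaped quote; neither assumption comes from the question, so adjust as needed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Sketch only: use this rule in place of the text() rule in the stylesheet above -->
<xsl:template match="text()">
<!-- Skip whitespace-only text nodes, as before -->
<xsl:if test="normalize-space()">
<!-- Normalize whitespace, then turn each ' into '' (assumed escape convention) -->
<xsl:value-of select='concat(replace(normalize-space(), "&apos;", "&apos;&apos;"), " ")' />
</xsl:if>
</xsl:template>
</xsl:stylesheet>
```

With fields quoted this way, commas inside the text should not need separate escaping, provided the consuming tool honours the quote character.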