Processing text with XSLT

Time: 2015-03-24 16:15:41

Tags: xml xslt

Please help me, X-perts! I have an input XML document that contains a <Body> XML tag holding "structured" text. E.g.:

<?xml version="1.0"?>
<d:Doc xmlns:d="urn:foo:bar">
<d:Body>
TITLE: An engaging topic with little to
no op-ed-ness (yes the title text wraps...)
PUBLICATION DATE: 24 March 2014
PUBLISHER: The Internet
AUTHOR: Jane Doe, Guy Smiley, Napoleon Dynamite
TEXT: Bacon ipsum dolor amet ut jerky flank, in 
aliqua kielbasa et meatball officia ea minim 
t-bone quis beef. Commodo pancetta chicken 
meatloaf consequat, eu tempor nisi et brisket 
occaecat aliquip shankle ut pork chop. Reprehenderit 
anim voluptate irure.
</d:Body>
</d:Doc>

...and I need to transform the above into something like this:

<?xml version="1.0"?>
<d:Doc xmlns:d="urn:foo:bar">
<d:Body>
<d:Pre qualifier="TITLE">TITLE: An engaging topic with little to
no op-ed-ness (yes the title text wraps...)</d:Pre>
<d:Pre qualifier="DATE">DATE: 24 March 2014</d:Pre>
<d:Pre qualifier="PUBLISHER">PUBLISHER: The Internet</d:Pre>
<d:Pre qualifier="AUTHOR">AUTHOR: Jane Doe, Guy Smiley, Napoleon Dynamite</d:Pre>
<d:Pre qualifier="TEXT">TEXT: Bacon ipsum dolor amet ut jerky flank, in 
aliqua kielbasa et meatball officia ea minim 
t-bone quis beef. Commodo pancetta chicken 
meatloaf consequat, eu tempor nisi et brisket 
occaecat aliquip shankle ut pork chop. Reprehenderit 
anim voluptate irure.</d:Pre>
</d:Body>
</d:Doc>

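I am trying to do this with an XSLT 2.0 stylesheet. The good news is that the leading tokens (TITLE, DATE, AUTHOR, etc.) are a controlled vocabulary; the bad news is that the text following a token may or may not continue onto one or more subsequent lines. And of course the resulting XML must keep whatever namespaces the original uses.

Any suggestions?

2 Answers:

Answer 0: (score: 3)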

Unfortunately the XSLT 2.0 regular expression language does not support zero-width lookahead, so this is hard to do in one step, but you can do it in two: first mark up the keywords, then expand the Pre elements to cover the text that follows them.

<xsl:template match="d:Body">
  <xsl:copy>
    <xsl:variable name="step1" as="node()*">
      <xsl:analyze-string select="." regex="^(TITLE|DATE|PUBLISHER|AUTHOR|TEXT):" flags="m">
        <xsl:matching-substring>
          <d:Pre qualifier="{regex-group(1)}"><xsl:value-of select="."/></d:Pre>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- XXX -->
    <xsl:for-each-group select="$step1" group-starting-with="d:Pre">
      <xsl:if test="self::d:Pre"><!-- ignore the whitespace before the first Pre -->
        <d:Pre>
          <xsl:copy-of select="@qualifier" />
          <xsl:value-of select="current-group()" separator="" />
        </d:Pre>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>

At the point marked XXX, the step1 variable contains a sequence of alternating text nodes and d:Pre elements, like this:

<d:Pre qualifier="TITLE">TITLE:</d:Pre> An engaging topic with little to
no op-ed-ness (yes the title text wraps...)
<d:Pre qualifier="DATE">DATE:</d:Pre> 24 March 2014
<d:Pre qualifier="PUBLISHER">PUBLISHER:</d:Pre> The Internet
<d:Pre qualifier="AUTHOR">AUTHOR:</d:Pre> Jane Doe, Guy Smiley, Napoleon Dynamite
<d:Pre qualifier="TEXT">TEXT:</d:Pre> Bacon ipsum dolor amet ut jerky flank, in
aliqua kielbasa et meatball officia ea minim
t-bone quis beef. Commodo pancetta chicken
meatloaf consequat, eu tempor nisi et brisket
occaecat aliquip shankle ut pork chop. Reprehenderit
anim voluptate irure.

The for-each-group then creates the final d:Pre elements, each covering everything up to the start of the next d:Pre:

<d:Pre qualifier="TITLE">TITLE: An engaging topic with little to
no op-ed-ness (yes the title text wraps...)
</d:Pre><d:Pre qualifier="DATE">DATE: 24 March 2014
</d:Pre><d:Pre qualifier="PUBLISHER">PUBLISHER: The Internet
</d:Pre><d:Pre qualifier="AUTHOR">AUTHOR: Jane Doe, Guy Smiley, Napoleon Dynamite
</d:Pre><d:Pre qualifier="TEXT">TEXT: Bacon ipsum dolor amet ut jerky flank, in
aliqua kielbasa et meatball officia ea minim
t-bone quis beef. Commodo pancetta chicken
meatloaf consequat, eu tempor nisi et brisket
occaecat aliquip shankle ut pork chop. Reprehenderit
anim voluptate irure.
</d:Pre>

This is almost what you are after, except that the trailing newline after each section ends up inside its d:Pre rather than between the sections.
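If the trailing newline inside each d:Pre matters, one possible adjustment is to join each group into a single string, strip trailing whitespace, and emit the newline between the elements instead. The following is a minimal sketch, not part of the original answer: the identity template and the xsl:text separator are assumptions added so the stylesheet stands on its own, and the keyword list is copied unchanged from the regex in the answer.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="urn:foo:bar">

  <!-- Assumption: an identity template copies everything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="d:Body">
    <xsl:copy>
      <!-- Step 1: wrap each keyword in d:Pre, keep other text as text nodes. -->
      <xsl:variable name="step1" as="node()*">
        <xsl:analyze-string select="."
            regex="^(TITLE|DATE|PUBLISHER|AUTHOR|TEXT):" flags="m">
          <xsl:matching-substring>
            <d:Pre qualifier="{regex-group(1)}"><xsl:value-of select="."/></d:Pre>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:variable>
      <!-- Step 2: group each keyword with the text that follows it. -->
      <xsl:for-each-group select="$step1" group-starting-with="d:Pre">
        <xsl:if test="self::d:Pre">
          <d:Pre>
            <xsl:copy-of select="@qualifier"/>
            <!-- Join the group, then drop trailing whitespace,
                 including the final newline. -->
            <xsl:value-of
                select="replace(string-join(current-group(), ''), '\s+$', '')"/>
          </d:Pre>
          <!-- Assumption: put the newline between sections instead. -->
          <xsl:text>&#10;</xsl:text>
        </xsl:if>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the regex is anchored with ^ and the m flag, a line that begins with PUBLICATION DATE: would not match the DATE alternative; the keyword list would need to be extended to cover any token spelled differently in the real input.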

Answer 1: (score: 1)

Assuming XSLT 3.0 (I know you said XSLT 2.0, but Ian has already given you a nice XSLT 2.0 solution) and Saxon 9.6 PE or EE, you can use

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs"
  xmlns:d="urn:foo:bar">


<xsl:param name="tokens" as="xs:string" select="'TITLE,PUBLICATION DATE,PUBLISHER,AUTHOR,TEXT'"/>
<xsl:param name="regex" as="xs:string" select="concat('^(', string-join(tokenize($tokens, ','), '|'), '):')"/>

<xsl:mode on-no-match="shallow-copy"/>

<xsl:output indent="yes"/>

<xsl:template match="d:Body">
  <xsl:copy>
    <xsl:for-each-group select="tokenize(., '\n')[normalize-space()]" group-starting-with=".[matches(., $regex)]">
      <d:Pre qualifier="{replace(., ':.*', '')}">
        <xsl:value-of select="current-group()" separator="&#10;"/>
      </d:Pre>
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
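The stylesheet above groups a sequence of strings, which XSLT 2.0 does not allow with group-starting-with, since in 2.0 the grouped items must be nodes. A rough XSLT 2.0 adaptation of the same idea might wrap each line in a temporary element first. This is only a sketch under that assumption: the line wrapper element is a hypothetical name introduced here, and the identity template replaces the XSLT 3.0 xsl:mode declaration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    xmlns:d="urn:foo:bar">

  <xsl:param name="tokens" as="xs:string"
      select="'TITLE,PUBLICATION DATE,PUBLISHER,AUTHOR,TEXT'"/>
  <xsl:param name="regex" as="xs:string"
      select="concat('^(', string-join(tokenize($tokens, ','), '|'), '):')"/>

  <xsl:output indent="yes"/>

  <!-- Identity template standing in for xsl:mode on-no-match="shallow-copy". -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="d:Body">
    <xsl:copy>
      <!-- Wrap every non-blank line in a temporary (hypothetical) line element
           so that the lines can be grouped as nodes. -->
      <xsl:variable name="lines" as="element(line)*">
        <xsl:for-each select="tokenize(., '\n')[normalize-space()]">
          <line><xsl:value-of select="."/></line>
        </xsl:for-each>
      </xsl:variable>
      <!-- Start a new group at each line that begins with a known token. -->
      <xsl:for-each-group select="$lines"
          group-starting-with="line[matches(., $regex)]">
        <d:Pre qualifier="{replace(., ':.*', '')}">
          <xsl:value-of select="current-group()" separator="&#10;"/>
        </d:Pre>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

As in the XSLT 3.0 version, any non-blank lines that appear before the first recognized token would still form a group of their own and be wrapped in a d:Pre.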