I have an unstructured XML document (produced by Pandoc when converting docx to DocBook) that I am trying to clean up with XSLT. The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<article>
<articleinfo>
<title></title>
</articleinfo>
<informaltable>
<tgroup cols="2">
<colspec align="left" />
<colspec align="left" />
<thead>
<row>
<entry>
<emphasis role="strong">How did you assist
Customer?</emphasis>
</entry>
<entry>
<emphasis>Lorem ipsum dolor sit amet.</emphasis>
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
<emphasis role="strong">What difference did this make for the
Customer?</emphasis>
</entry>
<entry>
<emphasis>Lorem ipsum dolor sit amet.</emphasis>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>
Staff Member: John Smith
</para>
<informaltable>
<tgroup cols="2">
<colspec align="left" />
<colspec align="left" />
<thead>
<row>
<entry>
<emphasis role="strong">How did you assist
Customer?</emphasis>
</entry>
<entry>
<emphasis>Lorem ipsum dolor sit amet.</emphasis>
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
<emphasis role="strong">What difference did this make for the
Customer?</emphasis>
</entry>
<entry>
<emphasis>Lorem ipsum dolor sit amet.</emphasis>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>
Staff Member: John Smith
</para>
<informaltable>
<tgroup cols="2">
<colspec align="left" />
<colspec align="left" />
<thead>
<row>
<entry>
<emphasis role="strong">How did you assist
Customer?</emphasis>
</entry>
<entry>
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
<emphasis role="strong">What difference did this make for the
Customer?</emphasis>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
<row>
<entry>
</entry>
<entry>
</entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>
Staff Member: _________________________
</para>
</article>
I have successfully transformed it with the following XSLT:
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Date stamp (YYYY-MM-DD) taken from the input file's URI -->
  <xsl:variable name="fileDateStamp">
    <xsl:analyze-string select="base-uri(.)" regex="\s*(\d\d\d\d\-\d\d\-\d\d)\s*">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(1)"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:variable>

  <xsl:template match="/">
    <impactStatements>
      <xsl:apply-templates/>
    </impactStatements>
  </xsl:template>

  <!-- Header row: drop the "How ..." question cell, keep the answer cell -->
  <xsl:template match="informaltable/tgroup/thead/row/entry">
    <xsl:analyze-string select="normalize-space(.)" regex="\s*How(.*)\s*">
      <xsl:matching-substring>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <Assisted>
          <xsl:value-of select="(.)"/>
        </Assisted>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Body rows: drop the "What ..." question cell, keep the answer cell -->
  <xsl:template match="informaltable/tgroup/tbody/row/entry">
    <xsl:analyze-string select="normalize-space(.)" regex="\s*What(.*)\s*">
      <xsl:matching-substring>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <Difference>
          <xsl:value-of select="(.)"/>
        </Difference>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- "Staff Member: Name" paragraphs; the stray backslash before "Staff" has been removed -->
  <xsl:template match="para">
    <xsl:analyze-string select="normalize-space(.)" regex="\s*Staff Member: ([A-Z].*)\s*">
      <xsl:matching-substring>
        <Staff><xsl:value-of select="regex-group(1)"/></Staff>
        <DateCreated><xsl:value-of select="$fileDateStamp"/></DateCreated>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
What I am missing is a way to wrap each 'record' in its own tag. Since <informaltable> and <para> are both children of <article>, my very basic XSLT knowledge completely fails me. I get
<?xml version="1.0" encoding="UTF-8"?>
<impactStatements>
  <Assisted>Lorem ipsum dolor sit amet.</Assisted>
  <Difference>Lorem ipsum dolor sit amet.</Difference>
  <Staff>John Smith</Staff>
  <DateCreated>2014-01-01</DateCreated>
  <Assisted>Lorem ipsum dolor sit amet.</Assisted>
  <Difference>Lorem ipsum dolor sit amet.</Difference>
  <Staff>John Smith</Staff>
  <DateCreated>2014-01-01</DateCreated>
</impactStatements>
but I want:
<?xml version="1.0" encoding="UTF-8"?>
<impactStatements>
  <statement>
    <Assisted>Lorem ipsum dolor sit amet.</Assisted>
    <Difference>Lorem ipsum dolor sit amet.</Difference>
    <Staff>John Smith</Staff>
    <DateCreated>2014-01-01</DateCreated>
  </statement>
  <statement>
    <Assisted>Lorem ipsum dolor sit amet.</Assisted>
    <Difference>Lorem ipsum dolor sit amet.</Difference>
    <Staff>John Smith</Staff>
    <DateCreated>2014-01-01</DateCreated>
  </statement>
</impactStatements>
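This is a one-off job, and I know I could change the XML some other way, but I am sure I am just missing some basic knowledge needed to make the XSLT I already have do what I want. I have tried various approaches and searched Google, to no avail. Everything I have tried breaks the structure of the XML I generate.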
Answer 0 (score: 2)
I would start by adding a template
<xsl:template match="article">
  <xsl:for-each-group select="*" group-starting-with="informaltable">
    <statement>
      <xsl:apply-templates select="current-group()"/>
    </statement>
  </xsl:for-each-group>
</xsl:template>
For your sample (and after adding <xsl:strip-space elements="*"/> to improve readability), I get the output
<impactStatements>
  <statement/>
  <statement>
    <Assisted>Lorem ipsum dolor sit amet.</Assisted>
    <Difference>Lorem ipsum dolor sit amet.</Difference>
    <Staff>John Smith</Staff>
    <DateCreated/>
  </statement>
  <statement>
    <Assisted>Lorem ipsum dolor sit amet.</Assisted>
    <Difference>Lorem ipsum dolor sit amet.</Difference>
    <Staff>John Smith</Staff>
    <DateCreated/>
  </statement>
  <statement/>
</impactStatements>
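I am not sure whether the empty statement elements are caused by missing sample data or whether you want to exclude certain elements from processing. You would need to explain which elements in the input should create a statement in the result.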
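If the empty statement elements are simply unwanted, one possible refinement is to collect each group's fields in a variable first and emit the wrapper only when something was actually produced. This is only a sketch under that assumption; the variable name fields is hypothetical and not part of the question:

```xml
<xsl:template match="article">
  <xsl:for-each-group select="*" group-starting-with="informaltable">
    <!-- Assumption: a group that yields no child elements is not a record.
         "fields" is an illustrative name for the temporary tree. -->
    <xsl:variable name="fields">
      <xsl:apply-templates select="current-group()"/>
    </xsl:variable>
    <xsl:if test="$fields/*">
      <statement>
        <!-- Copy only the generated elements, dropping stray text nodes -->
        <xsl:copy-of select="$fields/*"/>
      </statement>
    </xsl:if>
  </xsl:for-each-group>
</xsl:template>
```

With the sample input this should skip both the leading group (the articleinfo element, which precedes the first informaltable) and the trailing blank form, since neither produces any of the Assisted, Difference or Staff elements.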
Answer 1 (score: 1)
An interesting and well-asked question! Change the template matching / to
<xsl:template match="/article">
  <impactStatements>
    <xsl:for-each select="informaltable">
      <statement>
        <xsl:apply-templates select=". | following-sibling::*[self::para][1]"/>
      </statement>
    </xsl:for-each>
  </impactStatements>
</xsl:template>
The result is:
<?xml version="1.0" encoding="UTF-8"?>
<impactStatements>
  <statement>
    <Assisted>Lorem ipsum dolor sit amet.</Assisted>
    <Difference>Lorem ipsum dolor sit amet.</Difference>
    <Staff>John Smith</Staff>
    <DateCreated/>
  </statement>
  <statement>
    <Assisted>Lorem ipsum dolor sit amet.</Assisted>
    <Difference>Lorem ipsum dolor sit amet.</Difference>
    <Staff>John Smith</Staff>
    <DateCreated/>
  </statement>
  <statement/>
</impactStatements>
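I think this is almost right. There is an empty statement at the end because the input contains 3 informaltable elements and the third one has no answer text. How do you want to handle it?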
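If the blank form should simply be skipped, one option is to filter the for-each with a predicate. The sketch below assumes a table counts as a record only when the second cell of its header row contains text; that criterion is a guess at the intent, not something stated in the question:

```xml
<xsl:template match="/article">
  <impactStatements>
    <!-- Assumption: ignore tables whose header-row answer cell is blank -->
    <xsl:for-each select="informaltable[tgroup/thead/row/entry[2][normalize-space()]]">
      <statement>
        <xsl:apply-templates select=". | following-sibling::*[self::para][1]"/>
      </statement>
    </xsl:for-each>
  </impactStatements>
</xsl:template>
```

The inner predicate [normalize-space()] is false for a cell that holds only whitespace, so the third table in the sample would not be selected and no empty statement would be written for it.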