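XSLT 2.0: Creating child elements from an element's text value using a known semantic hierarchy

Asked: 2013-10-28 20:21:46

Tags: xml xslt xslt-2.0

I'm a bit stuck on this one. The data is provided in the following format (non-essential content snipped):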

<?xml version="1.0" encoding="UTF-8"?>
<Content Type="Statutes">
  <Indexes>
    <!--SNIP-->
    <Index Level="3" HasChildren="0">
      <!--SNIP-->
      <Content>&lt;p&gt; (1)(a)The statutes ... &lt;/p&gt;&lt;p&gt; (b)To ensure public ..: &lt;/p&gt;&lt;p&gt; 
            (I)Shall authorize ...; &lt;/p&gt;&lt;p&gt; (II)May authorize and ...: &lt;/p&gt;&lt;p&gt; (A)Compact disks; 
            &lt;/p&gt;&lt;p&gt; (B)On-line public ...; &lt;/p&gt;&lt;p&gt; (C)Electronic applications for ..; 
            &lt;/p&gt;&lt;p&gt; (D)Electronic books or ... &lt;/p&gt;&lt;p&gt; (E)Other electronic products or formats; 
            &lt;/p&gt;&lt;p&gt; (III)May, pursuant ... &lt;/p&gt;&lt;p&gt; (IV)Recognizes that ... &lt;/p&gt;&lt;p&gt; 
            (2)(a)Any person, ...: &lt;/p&gt;&lt;p&gt; (I)A statement specifying ...; &lt;/p&gt;&lt;p&gt; (II)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (3)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (4)A statement 
            specifying ...; &lt;/p&gt;</Content>
    </Index>
    <!--SNIP-->
  </Indexes>
</Content>

I need to take the text value of the Content element, which contains a semantic hierarchy:

(1)
 +-(a)
    +-(I)
       +-(A)

...and, via an XSLT 2.0 transform, turn it into parent-child element relationships as the final output:

    <?xml version="1.0" encoding="UTF-8"?>
    <law>
       <!--SNIP-->
       <content>
          <section prefix="(1)">
            <section prefix="(a)">The statutes ...</section>
            <section prefix="(b)">To ensure public ..:
              <section prefix="(I)">Shall authorize ...;</section>
              <section prefix="(II)">May authorize and ...:
                <section prefix="(A)">Compact disks;</section>
                <section prefix="(B)">On-line public ...;</section>
                <section prefix="(C)">Electronic applications for ..;</section>
                <section prefix="(D)">Electronic books or ...</section>
                <section prefix="(E)">Other electronic products or formats;</section>
              </section>
              <section prefix="(III)">May, pursuant ...</section>
              <section prefix="(IV)">Recognizes that ...</section>        
            </section>      
          </section>
          <section prefix="(2)">
            <section prefix="(a)">Any person, ...:
              <section prefix="(I)">A statement specifying ...;</section>
              <section prefix="(II)">A statement specifying ...;</section>
            </section>      
          </section>
          <section prefix="(3)">Level 1 node with no children</section>
       </content>
    </law>

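I was able to tokenize the text value of Content on the closing HTML-encoded P tags, but I don't know how to get the dynamically created elements to create child elements conditionally.

My XSLT 2.0 stylesheet: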

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">
        <content>
            <!-- Loop through HTML encoded P tag endings -->
            <xsl:for-each select="tokenize(.,'&lt;/p&gt;')">

                <!-- Set Token to a variable and remove P opening tags -->
                <xsl:variable name="sectionText">
                    <xsl:value-of select="normalize-space(replace(current(),'&lt;p&gt;',''))"/>  
                </xsl:variable>    

                <!-- Output -->
                <xsl:if test="string-length($sectionText)!=0">
                    <section>
                        <!-- Set the section element's prefix attribute (if exists) -->
                        <xsl:analyze-string select="$sectionText" regex="^(\(([\w]+)\)){{1,3}}">
                            <xsl:matching-substring >
                                <xsl:attribute name="prefix" select="." />
                            </xsl:matching-substring>
                        </xsl:analyze-string>

                        <!-- Set the section element's value -->
                        <xsl:value-of select="$sectionText"/>
                    </section>
                </xsl:if>

            </xsl:for-each>
        </content>
    </xsl:template>
</xsl:stylesheet> 

...gets me this far, with no semantic hierarchy in the section elements:

<?xml version="1.0" encoding="UTF-8"?>
<law>
   <structure>
      <content>
         <section prefix="(1)(a)">(1)(a)The statutes ...</section>
         <section prefix="(b)">(b)To ensure public ..:</section>
         <section prefix="(I)">(I)Shall authorize ...;</section>
         <section prefix="(II)">(II)May authorize and ...:</section>
         <section prefix="(A)">(A)Compact disks;</section>
         <section prefix="(B)">(B)On-line public ...;</section>
         <section prefix="(C)">(C)Electronic applications for ..;</section>
         <section prefix="(D)">(D)Electronic books or ...</section>
         <section prefix="(E)">(E)Other electronic products or formats;</section>
         <section prefix="(III)">(III)May, pursuant ...</section>
         <section prefix="(IV)">(IV)Recognizes that ...</section>
         <section prefix="(2)(a)">(2)(a)Any person, ...:</section>
         <section prefix="(I)">(I)A statement specifying ...;</section>
         <section prefix="(II)">(II)A statement specifying ...;</section>
         <section prefix="(3)">(3)Level 1 section with no children ...;</section>
      </content>
   </structure>
</law>

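Since the XSLT 2.0 stylesheet creates the section elements dynamically by tokenizing on the closing P tags, how can I build the parent-child relationships dynamically from the prefix attribute, using the known semantic hierarchy?

Experience with other programming languages points me toward tokenizing the prefix and recursing based on prefix logic, but I am having a hard time finding any information on how to do this with my limited knowledge of XSLT 2.0 (I last used v1.0 almost 10 years ago). I know I could parse this with an external Python script and be done with it, but I am trying to stick to an XSLT 2.0 stylesheet solution for maintainability.

Any help getting me on the right path, and/or a solution, is appreciated.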

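2 answers:

Answer 0 (score: 2)

You have already solved one tricky stage of the problem, which is creating intermediate output with elements like this: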

<section prefix="(1)(a)">text</section>

My next step would be to compute a level number, so that it looks like this:

<section level="1" prefix="(1)(a)">text</section>

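Computing the level number is just a matter of seeing which of several regular expressions the prefix matches: (1) gives level 1, (b) gives level 2, and so on.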
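A minimal sketch of that level computation, assuming the intermediate `section` elements from the question are the input. The function name `ex:level` and its namespace are hypothetical. Levels 3 and 4 (uppercase Roman numerals and uppercase letters) are inferred from the hierarchy shown in the question. Only the first marker of a combined prefix such as `(1)(a)` is inspected, so such a section would still need to be split into two before nesting.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ex"
    exclude-result-prefixes="xs ex">

  <!-- Hypothetical helper: map the first parenthesised marker to a level.
       Roman numerals are tested before single capital letters, so (I), (V)
       and (X) are always read as Roman numerals; that ambiguity is inherent
       in the numbering scheme. Returns 0 when nothing matches. -->
  <xsl:function name="ex:level" as="xs:integer">
    <xsl:param name="prefix" as="xs:string"/>
    <xsl:sequence select="
      if (matches($prefix, '^\([0-9]+\)')) then 1
      else if (matches($prefix, '^\([a-z]+\)')) then 2
      else if (matches($prefix, '^\([IVX]+\)')) then 3
      else if (matches($prefix, '^\([A-Z]\)')) then 4
      else 0"/>
  </xsl:function>

  <!-- Add a level attribute to every section that carries a prefix -->
  <xsl:template match="section[@prefix]">
    <section level="{ex:level(@prefix)}">
      <xsl:copy-of select="@prefix, node()"/>
    </section>
  </xsl:template>

  <!-- Identity template: copy everything else unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Running this over the flat output shown in the question would be a second pass; it could equally be folded into the stylesheet that creates the sections.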

Once you have the level numbers, you can use recursive positional grouping as described in this paper: http://www.saxonica.com/papers/ideadb-1.1/mhk-paper.xml
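One way the recursive grouping might look, as a sketch rather than the paper's exact code. It assumes a flat `content` element whose `section` children each carry a single-marker `prefix` and an integer `level` attribute; the function name `ex:nest` and its namespace are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ex"
    exclude-result-prefixes="xs ex">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Hypothetical helper: start a new group at each section whose level
       equals $level, then nest the rest of the group one level deeper. -->
  <xsl:function name="ex:nest" as="element(section)*">
    <xsl:param name="sections" as="element(section)*"/>
    <xsl:param name="level" as="xs:integer"/>
    <xsl:for-each-group select="$sections"
        group-starting-with="section[xs:integer(@level) eq $level]">
      <!-- The context item is the first section of the current group -->
      <section prefix="{@prefix}">
        <xsl:value-of select="."/>
        <!-- Recurse on the remaining members; an empty sequence ends the recursion -->
        <xsl:sequence select="ex:nest(current-group()[position() gt 1], $level + 1)"/>
      </section>
    </xsl:for-each-group>
  </xsl:function>

  <xsl:template match="content">
    <content>
      <xsl:sequence select="ex:nest(section, 1)"/>
    </content>
  </xsl:template>
</xsl:stylesheet>
```

Each recursive call receives at least one section fewer than its caller, so the recursion terminates even if a level is skipped in the input.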

Answer 1 (score: 1)

I played with this a bit and came up with the following stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:mf="http://example.com/mf"
 xmlns:d="data:,dpc" 
 exclude-result-prefixes="xs d mf">

    <xsl:include href="htmlparse.xml"/>

    <xsl:param name="patterns" as="element(pattern)*" xmlns="">
      <pattern value="^\s*(\([0-9]+\))" group="1" next="1"/>
      <pattern value="^\s*(\([0-9]+\))?\s*(\([a-z]\))" group="2" next="0"/>
      <pattern value="^\s*(\(*(I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII)\))" group="1" next="0"/>
      <pattern value="^\s*(\([A-Z]?\))" group="1" next="0"/>
    </xsl:param>

    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:function name="mf:group" as="element(section)*">
      <xsl:param name="paragraphs" as="element(p)*"/>
      <xsl:param name="patterns" as="element(pattern)*"/>
      <xsl:variable name="pattern1" as="element(pattern)?" select="$patterns[1]"/>
      <xsl:for-each-group select="$paragraphs" group-starting-with="p[matches(., $pattern1/@value)]">
        <xsl:variable name="prefix" as="xs:string?">
          <xsl:analyze-string select="." regex="{$pattern1/@value}">
            <xsl:matching-substring>
              <xsl:sequence select="string(regex-group(xs:integer($pattern1/@group)))"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <section prefix="{$prefix}">
          <xsl:choose>
            <xsl:when test="xs:boolean(xs:integer($pattern1/@next))">
              <xsl:sequence select="mf:group(current-group(), $patterns[position() gt 1])"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:apply-templates select="node()">
                <xsl:with-param name="pattern" as="element(pattern)" select="$pattern1" tunnel="yes"/>
              </xsl:apply-templates>
              <xsl:sequence select="mf:group(current-group() except ., $patterns[position() gt 1])"/>
            </xsl:otherwise>
          </xsl:choose>
        </section>
      </xsl:for-each-group>
    </xsl:function>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">

        <content>
            <xsl:sequence select="mf:group(d:htmlparse(., '', true())/*, $patterns)"/>
        </content>
    </xsl:template>

    <xsl:template match="p/text()[1]">
      <xsl:param name="pattern" as="element(pattern)" tunnel="yes"/>
      <xsl:value-of select="replace(., $pattern/@value, '')"/>
    </xsl:template>
</xsl:stylesheet> 

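It makes use of http://web-xslt.googlecode.com/svn/trunk/htmlparse/htmlparse.xsl (an HTML tag soup parser written in XSLT 2.0) to parse the escaped HTML fragment markup into nodes, which are then grouped by the function mf:group in the stylesheet. The grouping is driven by a sequence of regular expression patterns passed in as a parameter.

When I apply the stylesheet with Saxon 9.5 to the input sample, I get the result: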

<law>
   <structure>
      <content>
         <section prefix="(1)">
            <section prefix="(a)">The statutes ... </section>
            <section prefix="(b)">To ensure public ..: <section prefix="(I)">Shall authorize ...; </section>
               <section prefix="(II)">May authorize and ...: <section prefix="(A)">Compact disks;
            </section>
                  <section prefix="(B)">On-line public ...; </section>
                  <section prefix="(C)">Electronic applications for ..;
            </section>
                  <section prefix="(D)">Electronic books or ... </section>
                  <section prefix="(E)">Other electronic products or formats;
            </section>
               </section>
               <section prefix="(III)">May, pursuant ... </section>
               <section prefix="(IV)">Recognizes that ... </section>
            </section>
         </section>
         <section prefix="(2)">
            <section prefix="(a)">Any person, ...: <section prefix="(I)">A statement specifying ...; </section>
               <section prefix="(II)">A statement
            specifying ...; </section>
            </section>
         </section>
      </content>
   </structure>
</law>

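If there can be more than 13 (XIII) sections, you will need to edit the Roman numeral regular expression pattern in the parameter to list more numerals, as I currently only list numerals up to and including XIII.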
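Instead of enumerating numerals, the Roman numeral entry could be written as a compact expression. The sketch below is an assumption, not part of the original answer: it covers (I) through (XXXIX) and shows only that one entry, with the other three patterns left as they are in the stylesheet above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Sketch: Roman numeral entry covering (I) through (XXXIX).
       The quantifier braces are doubled because attributes of literal
       result elements are attribute value templates, where a single
       brace would start an XPath expression. -->
  <xsl:param name="patterns" as="element(pattern)*" xmlns="">
    <pattern value="^\s*(\((X{{1,3}}(IX|IV|V?I{{0,3}})|IX|IV|V?I{{1,3}}|V)\))" group="1" next="0"/>
  </xsl:param>
</xsl:stylesheet>
```

As with the enumerated version, a single capital letter that is also a Roman numeral, such as (I), (V) or (X), will be treated as a Roman numeral because this pattern is tried before the letter pattern.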

Based on the comments and the edited input sample, I have adjusted the stylesheet slightly:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:mf="http://example.com/mf"
 xmlns:d="data:,dpc" 
 exclude-result-prefixes="xs d mf">

    <xsl:include href="htmlparse.xml"/>

    <xsl:param name="patterns" as="element(pattern)*" xmlns="">
      <pattern value="^\s*(\([0-9]+\))" group="1" next="1"/>
      <pattern value="^\s*(\([0-9]+\))?\s*(\([a-z]\))" group="2" next="0"/>
      <pattern value="^\s*(\(*(I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII)\))" group="1" next="0"/>
      <pattern value="^\s*(\([A-Z]?\))" group="1" next="0"/>
    </xsl:param>

    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:function name="mf:group" as="element(section)*">
      <xsl:param name="paragraphs" as="element(p)*"/>
      <xsl:param name="patterns" as="element(pattern)*"/>
      <xsl:variable name="pattern1" as="element(pattern)?" select="$patterns[1]"/>
      <xsl:for-each-group select="$paragraphs" group-starting-with="p[matches(., $pattern1/@value)]">
        <xsl:variable name="prefix" as="xs:string?">
          <xsl:analyze-string select="." regex="{$pattern1/@value}">
            <xsl:matching-substring>
              <xsl:sequence select="string(regex-group(xs:integer($pattern1/@group)))"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <section prefix="{$prefix}">
          <xsl:choose>
            <xsl:when test="xs:boolean(xs:integer($pattern1/@next)) and matches(., $patterns[2]/@value)">
              <xsl:sequence select="mf:group(current-group(), $patterns[position() gt 1])"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:apply-templates select="node()">
                <xsl:with-param name="pattern" as="element(pattern)" select="$pattern1" tunnel="yes"/>
              </xsl:apply-templates>
              <xsl:sequence select="mf:group(current-group() except ., $patterns[position() gt 1])"/>
            </xsl:otherwise>
          </xsl:choose>
        </section>
      </xsl:for-each-group>
    </xsl:function>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">

        <content>
            <xsl:sequence select="mf:group(d:htmlparse(., '', true())/*, $patterns)"/>
        </content>
    </xsl:template>

    <xsl:template match="p/text()[1]">
      <xsl:param name="pattern" as="element(pattern)" tunnel="yes"/>
      <xsl:value-of select="replace(., $pattern/@value, '')"/>
    </xsl:template>
</xsl:stylesheet> 

Now it transforms

<?xml version="1.0" encoding="UTF-8"?>
<Content Type="Statutes">
  <Indexes>
    <!--SNIP-->
    <Index Level="3" HasChildren="0">
      <!--SNIP-->
      <Content>&lt;p&gt; (1)(a)The statutes ... &lt;/p&gt;&lt;p&gt; (b)To ensure public ..: &lt;/p&gt;&lt;p&gt; 
            (I)Shall authorize ...; &lt;/p&gt;&lt;p&gt; (II)May authorize and ...: &lt;/p&gt;&lt;p&gt; (A)Compact disks; 
            &lt;/p&gt;&lt;p&gt; (B)On-line public ...; &lt;/p&gt;&lt;p&gt; (C)Electronic applications for ..; 
            &lt;/p&gt;&lt;p&gt; (D)Electronic books or ... &lt;/p&gt;&lt;p&gt; (E)Other electronic products or formats; 
            &lt;/p&gt;&lt;p&gt; (III)May, pursuant ... &lt;/p&gt;&lt;p&gt; (IV)Recognizes that ... &lt;/p&gt;&lt;p&gt; 
            (2)(a)Any person, ...: &lt;/p&gt;&lt;p&gt; (I)A statement specifying ...; &lt;/p&gt;&lt;p&gt; (II)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (3)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (4)A statement 
            specifying ...; &lt;/p&gt;</Content>
    </Index>
    <!--SNIP-->
  </Indexes>
</Content>

into

<law>
   <structure>
      <content>
         <section prefix="(1)">
            <section prefix="(a)">The statutes ... </section>
            <section prefix="(b)">To ensure public ..: <section prefix="(I)">Shall authorize ...; </section>
               <section prefix="(II)">May authorize and ...: <section prefix="(A)">Compact disks;
            </section>
                  <section prefix="(B)">On-line public ...; </section>
                  <section prefix="(C)">Electronic applications for ..;
            </section>
                  <section prefix="(D)">Electronic books or ... </section>
                  <section prefix="(E)">Other electronic products or formats;
            </section>
               </section>
               <section prefix="(III)">May, pursuant ... </section>
               <section prefix="(IV)">Recognizes that ... </section>
            </section>
         </section>
         <section prefix="(2)">
            <section prefix="(a)">Any person, ...: <section prefix="(I)">A statement specifying ...; </section>
               <section prefix="(II)">A statement
            specifying ...; </section>
            </section>
         </section>
         <section prefix="(3)">A statement
            specifying ...; </section>
         <section prefix="(4)">A statement
            specifying ...; </section>
      </content>
   </structure>
</law>