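Using XSLT to select an n-word excerpt that includes HTML markup

Time: 2010-01-27 07:59:43

Tags: html xslt parsing

I want to use XSLT to select an excerpt together with its HTML formatting elements. Here is a sample of the XML: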

<PUBLDES>The <IT>European Journal of Cancer (including EJC Supplements),</IT> 
is an international comprehensive oncology journal that publishes original 
research, editorial comments, review articles and news on experimental oncology, 
clinical oncology (medical, paediatric, radiation, surgical), translational 
oncology, and on cancer epidemiology and prevention. The Journal now has online
submission for authors. Please submit manuscripts at 
<SURL>http://ees.elsevier.com/ejc</SURL> and follow the instructions on the 
site.<P/>

The <IT>European Journal of Cancer (including EJC Supplements)</IT> is the 
official Journal of the European Organisation for Research and Treatment 
of Cancer (EORTC), the European CanCer Organisation (ECCO), the European 
Association for Cancer Research (EACR), the the European Society of Breast 
Cancer Specialists (EUSOMA) and the European School of Oncology (ESO). <P/>
Supplements to the <IT>European Journal of Cancer</IT> are published under 
the title <IT>EJC Supplements</IT> (ISSN 1359-6349).  All subscribers to 
<IT>European Journal of Cancer</IT> automatically receive this publication.<P/>
To access the latest tables of contents, abstracts and full-text articles 
from <IT>EJC</IT>, including Articles-in-Press, please visit <URL>
<HREF>http://www.sciencedirect.com/science/journal/09598049</HREF>
<HTXT>ScienceDirect</HTXT>
</URL>.</PUBLDES>

How can I get the first 45 words from this, along with the HTML tags inside them? When I use substring() or concat(), the tags (such as <IT>) are removed.

1 answer:

Answer 0 (score: 4)

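You would probably be better off doing this programmatically rather than in pure XSLT, but if you must use XSLT, here is one approach. It involves several stylesheets, although if you are able to use extension functions you could use node-set() and combine them into one large (and nasty) stylesheet.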
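As a rough sketch of the node-set idea only (not the stylesheets from this answer), the skeleton below shows how two passes can be chained inside a single XSLT 1.0 stylesheet. It assumes the EXSLT function exsl:node-set(), which processors such as libxslt and Xalan support; MSXML offers msxsl:node-set() in its own namespace instead. The mode names pass1 and pass2 are placeholders of mine, and both passes here are plain identity copies standing in for the real templates.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:exsl="http://exslt.org/common"
                exclude-result-prefixes="exsl" version="1.0">
   <xsl:template match="/">
      <!-- Pass 1: capture the intermediate result as a result tree fragment -->
      <xsl:variable name="pass1">
         <xsl:apply-templates select="node()" mode="pass1"/>
      </xsl:variable>
      <!-- Pass 2: convert the fragment to a node-set so it can be processed again -->
      <xsl:apply-templates select="exsl:node-set($pass1)/node()" mode="pass2"/>
   </xsl:template>
   <!-- Placeholder for the first pass (identity copy) -->
   <xsl:template match="@*|node()" mode="pass1">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" mode="pass1"/>
      </xsl:copy>
   </xsl:template>
   <!-- Placeholder for the second pass (identity copy) -->
   <xsl:template match="@*|node()" mode="pass2">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" mode="pass2"/>
      </xsl:copy>
   </xsl:template>
</xsl:stylesheet>
```

To combine real passes this way, each stylesheet's templates would need to be moved into their own mode so that they do not match nodes belonging to another pass.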

The first stylesheet copies the initial XML but "tokenizes" any text it finds, so that each word in the text becomes a separate WORD element.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Copy existing nodes and attributes -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
   <!-- Match text nodes -->
   <xsl:template match="text()">
      <xsl:call-template name="tokenize">
         <!-- Normalize first so that line breaks and tabs also act as word separators -->
         <xsl:with-param name="string" select="normalize-space(.)"/>
      </xsl:call-template>
   </xsl:template>
   <!-- Splits a string into separate elements for each word -->
   <xsl:template name="tokenize">
      <xsl:param name="string"/>
      <xsl:param name="delimiter" select="' '"/>
      <xsl:choose>
         <xsl:when test="$delimiter and contains($string, $delimiter)">
            <xsl:variable name="word" select="normalize-space(substring-before($string, $delimiter))"/>
            <xsl:if test="string-length($word) &gt; 0">
               <WORD>
                  <xsl:value-of select="$word"/>
               </WORD>
            </xsl:if>
            <xsl:call-template name="tokenize">
               <xsl:with-param name="string" select="substring-after($string, $delimiter)"/>
               <xsl:with-param name="delimiter" select="$delimiter"/>
            </xsl:call-template>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="word" select="normalize-space($string)"/>
            <xsl:if test="string-length($word) &gt; 0">
               <WORD>
                  <xsl:value-of select="$word"/>
               </WORD>
            </xsl:if>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
</xsl:stylesheet>

The XSLT template used to "tokenize" a string of text comes from this question:

tokenizing-and-sorting-with-xslt-1-0

(Note that XSLT 2.0 provides a tokenize() function, which would simplify the above.)
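For comparison, here is a minimal sketch of how the same tokenizing step might look with tokenize(), assuming an XSLT 2.0 processor such as Saxon is available. It is not part of the original approach, only an illustration of how much shorter the text() template becomes.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   <!-- Copy existing nodes and attributes -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
   <!-- Wrap every whitespace-separated word of a text node in a WORD element -->
   <xsl:template match="text()">
      <!-- normalize-space() collapses runs of whitespace, so a single space is a safe delimiter -->
      <xsl:for-each select="tokenize(normalize-space(.), ' ')[. != '']">
         <WORD>
            <xsl:value-of select="."/>
         </WORD>
      </xsl:for-each>
   </xsl:template>
</xsl:stylesheet>
```

Whitespace-only text nodes produce no WORD elements, because tokenize() returns an empty sequence for an empty string.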

This gives you XML like this...

<PUBLDES>
   <WORD>The</WORD>
   <IT>
      <WORD>European</WORD>
      <WORD>Journal</WORD>
      <WORD>of</WORD>
      ....

and so on...

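Next, this XML document is traversed with another XSLT document that outputs only the first 45 WORD elements. To do this, I apply a template repeatedly, keeping a running total of the number of WORD elements found so far. When a node is matched, there are three possibilities:

  • A WORD element is matched: output it. If the total has not been reached, continue processing at the next sibling.
  • An element is matched whose number of words below it fits within the remaining total: copy the whole element, then continue processing at the next sibling if the total has not been reached.
  • An element is matched whose number of words below it would exceed the remaining total: copy the current node (but not its children) and continue processing at its first child.

Here is the stylesheet, in all its hideousness: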

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Number of words to keep (45 for the excerpt asked about in the question) -->
   <xsl:variable name="WORDCOUNT">45</xsl:variable>

   <!-- Match root element -->
   <xsl:template match="/">
      <xsl:apply-templates select="descendant::*[1]" mode="word">
         <xsl:with-param name="previousWords">0</xsl:with-param>
      </xsl:apply-templates>
   </xsl:template>

   <!-- Match any node -->
   <xsl:template match="node()" mode="word">
      <xsl:param name="previousWords"/>

      <!-- Number of words below the element (at any depth) -->
      <xsl:variable name="childWords" select="count(descendant::WORD)"/>
      <xsl:choose>
         <!-- Matching a WORD element -->
         <xsl:when test="local-name(.) = 'WORD'">
            <!-- Copy the word -->
            <WORD>
               <xsl:value-of select="."/>
            </WORD>
            <!-- If there are still words to output, continue processing at next sibling -->
            <xsl:if test="$previousWords + 1 &lt; $WORDCOUNT">
               <xsl:apply-templates select="following-sibling::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords + 1"/>
                  </xsl:with-param>
               </xsl:apply-templates>
            </xsl:if>
         </xsl:when>

         <!-- Match a node where the number of words below it is within allowed limit -->
         <xsl:when test="$childWords &lt;= $WORDCOUNT - $previousWords">
            <!-- Copy the element -->
            <xsl:copy>
               <!-- Copy its attributes and child elements, including all their descendants -->
               <xsl:copy-of select="*|@*"/>
            </xsl:copy>
            <!-- If there are still words to output, continue processing at next sibling -->
            <xsl:if test="$previousWords + $childWords &lt; $WORDCOUNT">
               <xsl:apply-templates select="following-sibling::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords + $childWords"/>
                  </xsl:with-param>
               </xsl:apply-templates>
            </xsl:if>
         </xsl:when>

         <!-- Match nodes where the number of words below it would exceed current limit -->
         <xsl:otherwise>
            <!-- Copy the node -->
            <xsl:copy>
               <!-- Continue processing at very first child node -->
               <xsl:apply-templates select="descendant::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords"/>
                  </xsl:with-param>
               </xsl:apply-templates>
            </xsl:copy>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
</xsl:stylesheet>

If you output only the first 4 words, say, this gives you the following output:

<PUBLDES>
   <WORD>The</WORD>
   <IT>
      <WORD>European</WORD>
      <WORD>Journal</WORD>
      <WORD>of</WORD>
   </IT>
</PUBLDES>

Of course, you then need another transformation to remove the WORD elements and leave just the text. That should be fairly straightforward...
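A minimal sketch of that last transformation might look like the following. It assumes that putting a single space after every word is an acceptable way to restore the spacing, since the original whitespace was discarded when the text was tokenized. That means each run of words ends with a trailing space, and punctuation that sat directly after a closing tag gains a space in front of it.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Ignore indentation-only text nodes in the intermediate document -->
   <xsl:strip-space elements="*"/>
   <!-- Copy everything else unchanged -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
   <!-- Replace each WORD element with its text, followed by a single space -->
   <xsl:template match="WORD">
      <xsl:value-of select="."/>
      <xsl:text> </xsl:text>
   </xsl:template>
</xsl:stylesheet>
```

Applied to the 4-word output, this should come out roughly as <PUBLDES>The <IT>European Journal of </IT></PUBLDES>.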

It's all rather nasty, but it's the best I can come up with at the moment!