XSLT - How to create a word list with occurrence counts, sorted by decreasing frequency

Date: 2016-11-06 21:23:17

Tags: xml xslt


Oxford City Council: Health and Environmental Protection Committee (Nuclear Issues and Pollution Control) Sub-Committee.
                <s n="483">
                    <w c5="EX0" hw="there" pos="PRON">There </w>
                    <w c5="VBZ" hw="be" pos="VERB">is </w>
                    <w c5="AT0" hw="a" pos="ART">a </w>
                    <w c5="NN1" hw="school" pos="SUBST">school </w>
                    <w c5="PRP" hw="in" pos="PREP">in </w>
                    <w c5="NP0" hw="ferry" pos="SUBST">Ferry </w>
                    <w c5="NP0" hw="hinksey" pos="SUBST">Hinksey </w>
                    <w c5="NP0" hw="road" pos="SUBST">Road </w>
                    <w c5="VBZ" hw="be" pos="VERB">is</w>
                    <w c5="XX0" hw="not" pos="ADV">n't </w>
                    <w c5="EX0" hw="there" pos="PRON">there</w>
                    <c c5="PUN">, </c>
                    <w c5="AT0" hw="a" pos="ART">a </w>
                    <w c5="AJ0" hw="middle" pos="ADJ">middle </w>
                    <w c5="NN1" hw="school" pos="SUBST">school </w>
                    <w c5="PNP" hw="i" pos="PRON">I </w>
                    <w c5="VVB" hw="think" pos="VERB">think</w>
                    <c c5="PUN">, </c>
                    <w c5="AV0" hw="so" pos="ADV">so </w>
                    <w c5="DT0" hw="that" pos="ADJ">that</w>
                    <w c5="VBZ" hw="be" pos="VERB">'s </w>
                    <w c5="AT0" hw="the" pos="ART">the </w>
                    <w c5="AJ0" hw="only" pos="ADJ">only </w>
                    <w c5="PNI" hw="one" pos="PRON">one </w>
                    <w c5="PNP" hw="i" pos="PRON">I </w>
                    <w c5="VVB" hw="know" pos="VERB">know</w>
                    <c c5="PUN">.</c>
                </s>
                <s n="484">
                    <w c5="AT0" hw="the" pos="ART">The </w>
                    <w c5="NN1" hw="thing" pos="SUBST">thing </w>
                    <w c5="PNP" hw="i" pos="PRON">I</w>
                    <w c5="VM0" hw="would" pos="VERB">'d </w>
                    <w c5="AV0" hw="really" pos="ADV">really </w>
                    <w c5="VVI" hw="like" pos="VERB">like </w>
                    <w c5="VBZ" hw="be" pos="VERB">is </w>
                    <w c5="AT0" hw="a" pos="ART">a </w>
                    <w c5="NN1" hw="glossary" pos="SUBST">glossary </w>
                    <w c5="PRF" hw="of" pos="PREP">of </w>
                    <w c5="NN2" hw="term" pos="SUBST">terms</w>
                    <c c5="PUN">.</c>
                </s>


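  1. Get all text nodes (variable txtNodes). For performance reasons, you can restrict the selection to nodes that contain more than just whitespace characters (normalize-space()).

  2. Extract the individual words and store them in the words variable.

  3. Group these words by their upper-cased content (for-each-group).

  4. For each group, print:

    • the word (as in a dictionary: first letter in upper case, the rest in lower case).
    • the number of occurrences (count(current-group())).
    • The printout is sorted by number of occurrences, descending, and (as a second-order key) alphabetically.
  5. Sample XSLT code is shown below: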

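    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xsl:output method="text"/>
      <xsl:template match="/">
        <!-- Step 1: all text nodes that contain more than whitespace -->
        <xsl:variable name="txtNodes" select="//text()[normalize-space()]" as="xs:string*"/>
        <!-- Step 2: one string per word found in those nodes -->
        <xsl:variable name="words" as="xs:string*">
          <xsl:for-each select="$txtNodes">
            <xsl:analyze-string select="." regex="\w+">
              <xsl:matching-substring>
                <xsl:value-of select="."/>
              </xsl:matching-substring>
            </xsl:analyze-string>
          </xsl:for-each>
        </xsl:variable>
        <xsl:text>Words / # of occurrences:&#xA;</xsl:text>
        <!-- Steps 3 and 4: group case-insensitively, sort by count, then alphabetically -->
        <xsl:for-each-group select="$words" group-by="upper-case(.)">
          <xsl:sort select="count(current-group())" data-type="number" order="descending"/>
          <xsl:sort select="upper-case(.)"/>
          <xsl:value-of select="concat(upper-case(substring(., 1, 1)), lower-case(substring(., 2)))"/>
          <xsl:text> - </xsl:text>
          <xsl:value-of select="count(current-group())"/>
          <!-- One entry per line -->
          <xsl:text>&#xA;</xsl:text>
        </xsl:for-each-group>
      </xsl:template>
    </xsl:stylesheet>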
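
Because the sample is already tokenized into w elements, the words could also be taken from those elements directly instead of being re-split with a regex. With \w+, the apostrophe is not a word character, so contractions such as n't, 's and 'd are counted as n, t, s and d. The following is only a minimal sketch of that variant and not part of the original answer. It assumes that every word sits in a w element in no namespace, as in the sample above, and that punctuation in c elements should be ignored.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:text>Words / # of occurrences:&#xA;</xsl:text>
    <!-- Assumption: one token per w element; c (punctuation) elements are never selected -->
    <xsl:for-each-group select="//w" group-by="lower-case(normalize-space(.))">
      <!-- Primary key: group size, largest first -->
      <xsl:sort select="count(current-group())" data-type="number" order="descending"/>
      <!-- Secondary key: the lower-cased, trimmed word -->
      <xsl:sort select="lower-case(normalize-space(.))"/>
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text> - </xsl:text>
      <xsl:value-of select="count(current-group())"/>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

This sketch prints the words in lower case, contractions included (for example n't). Like the stylesheet above, it needs an XSLT 2.0 processor, because xsl:for-each-group is not available in XSLT 1.0.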