XSLT - How to create a word list with occurrence counts and sort it in order of decreasing frequency

Time: 2016-11-06 21:23:17

Tags: xml xslt

My XML file:

<bncDoc xml:id="KS0">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title> Oxford City Council Health and Environmental Protection Committee meeting. Sample containing about 11223 words speech recorded in public context 
                </title>
                <respStmt>
                    <resp> Data capture and transcription 
                    </resp>
                    <name> Oxford University Press 
                    </name>
                </respStmt>
            </titleStmt>
            <editionStmt>
                <edition>BNC XML Edition, December 2006
                </edition>
            </editionStmt>
            <extent> 11223 tokens; 11688 w-units; 482 s-units 
            </extent>
            <publicationStmt>
                <distributor>Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.
                </distributor>
                <availability> This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions.
                </availability>
                <idno type="bnc">KS0
                </idno>
                <idno type="old"> OCCEnv 
                </idno>
            </publicationStmt>
            <sourceDesc>
                <recordingStmt>
                    <recording n="139401" type="DAT"/>
                </recordingStmt>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <tagsDecl>
                <namespace name="">
                    <tagUsage gi="align" occurs="69"/>
                    <tagUsage gi="c" occurs="1408"/>
                    <tagUsage gi="div" occurs="1"/>
                    <tagUsage gi="event" occurs="3"/>
                    <tagUsage gi="mw" occurs="110"/>
                    <tagUsage gi="pause" occurs="2"/>
                    <tagUsage gi="s" occurs="482"/>
                    <tagUsage gi="u" occurs="192"/>
                    <tagUsage gi="unclear" occurs="65"/>
                    <tagUsage gi="vocal" occurs="7"/>
                    <tagUsage gi="w" occurs="11688"/>
                </namespace>
            </tagsDecl>
        </encodingDesc>
        <profileDesc>
            <creation date="0000">0000-00-00 Origination/creation date not known 
            </creation>
            <particDesc n="C872">
                <person ageGroup="X" xml:id="PS6H7" role="unspecified" sex="f" soc="AB" dialect="NONE" educ="X">
                    <persName>Chair
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6H8" role="unspecified" sex="m" soc="UU" dialect="NONE" educ="X">
                    <persName>g
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6H9" role="unspecified" sex="f" soc="UU" dialect="NONE" educ="X">
                    <persName>chair2
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HA" role="unspecified" sex="m" soc="UU" dialect="NONE" educ="X">
                    <persName>i
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HB" role="unspecified" sex="m" soc="UU" dialect="NONE" educ="X">
                    <persName>h
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HC" role="unspecified" sex="m" soc="UU" dialect="NONE" educ="X">
                    <persName>foe
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HD" role="unspecified" sex="f" soc="UU" dialect="NONE" educ="X">
                    <persName>b
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HE" role="unspecified" sex="m" soc="UU" dialect="NONE" educ="X">
                    <persName>a
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HF" role="unspecified" sex="f" soc="UU" dialect="NONE" educ="X">
                    <persName>ei
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HG" role="unspecified" sex="m" soc="UU" dialect="NONE" educ="X">
                    <persName>bp
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HH" role="unspecified" sex="m" soc="UU" dialect="NONE" educ="X">
                    <persName>c
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HJ" role="unspecified" sex="m" soc="UU" dialect="NONE" educ="X">
                    <persName>d
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HK" role="unspecified" sex="f" soc="UU" dialect="NONE" educ="X">
                    <persName>e
                    </persName>
                </person>
                <person ageGroup="X" xml:id="PS6HL" role="unspecified" sex="u" soc="UU" dialect="NONE" educ="X">
                    <persName>d
                    </persName>
                </person>
            </particDesc>
            <settingDesc>
                <setting n="OCCEnv" who="PS6H7 PS6H8 PS6H9 PS6HA PS6HB PS6HC PS6HD PS6HE PS6HF PS6HG PS6HH PS6HJ PS6HK PS6HL">
                    <placeName>Oxfordshire:  Oxford 
                    </placeName>
                    <activity> Council Committee Meeting 
                    </activity>
                </setting>
            </settingDesc>
            <textClass>
                <catRef targets="SPO ALLTIM3 ALLAVA0 ALLTYP2 SCGDOM3 SPOLOG2 SPOREG1"/>
                <classCode scheme="DLEE">S meeting
                </classCode>
                <keywords>
                    <term> (none) 
                    </term>
                </keywords>
            </textClass>
        </profileDesc>
        <revisionDesc>
            <change date="2006-10-21" who="#OUCS">Tag usage updated for BNC-XML
            </change>
            <change date="2000-12-13" who="#OUCS">Last check for BNC World first release
            </change>
            <change date="2000-09-06" who="#OUCS">Redo tagusage tables
            </change>
            <change date="2000-09-01" who="#OUCS">Check all tagcounts
            </change>
            <change date="2000-06-23" who="#OUCS">Resequenced s-units and added headers
            </change>
            <change date="2000-01-29" who="#OUCS">Revised participant details
            </change>
            <change date="2000-01-21" who="#OUCS">Added date info
            </change>
            <change date="2000-01-09" who="#OUCS">Updated all catrefs
            </change>
            <change date="2000-01-09" who="#OUCS">Updated REC elements to include tape number
            </change>
            <change date="2000-01-08" who="#OUCS">Updated titles
            </change>
            <change date="1999-12-25" who="#OUCS">corrected tagUsage
            </change>
            <change date="1999-09-21" who="#UCREL">POS codes revised for BNC-2; header updated
            </change>
            <change date="1994-11-27" who="#dominic">Initial accession to corpus
            </change>
        </revisionDesc>
    </teiHeader>
    <stext type="OTHERSP">
        <div><!--
Oxford City Council: Health and Environmental Protection Committee (Nuclear Issues and Pollution Control) Sub-Committee.
Wednesday, 18th April 1990, 2.30pm, Town Hall.--><u who="PS6H7">
                <s n="3">
                    <w c5="AV0" hw="well" pos="ADV">Well
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="AJ0" hw="good" pos="ADJ">good 
                    </w>
                    <w c5="NN1" hw="afternoon" pos="SUBST">afternoon
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="PNI" hw="everybody" pos="PRON">everybody
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="PNP" hw="i" pos="PRON">I 
                    </w>
                    <w c5="VVB" hw="think" pos="VERB">think 
                    </w>
                    <w c5="PNP" hw="we" pos="PRON">we
                    </w>
                    <w c5="VHD" hw="have" pos="VERB">'d 
                    </w>
                    <w c5="AV0" hw="well" pos="ADV">better 
                    </w>
                    <w c5="VVI" hw="get" pos="VERB">get 
                    </w>
                    <w c5="VVN" hw="start" pos="VERB">started
                    </w>
                    <c c5="PUN">.
                    </c>
                </s>
                <s n="4">
                    <w c5="PNP" hw="we" pos="PRON">We 
                    </w>
                    <w c5="VVD" hw="look" pos="VERB">looked 
                    </w>
                    <w c5="AV0" hw="so" pos="ADV">so 
                    </w>
                    <w c5="AJ0" hw="thin" pos="ADJ">thin 
                    </w>
                    <w c5="PRP" hw="on" pos="PREP">on 
                    </w>
                    <w c5="AT0" hw="the" pos="ART">the 
                    </w>
                    <w c5="NN1" hw="ground" pos="SUBST">ground
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="PNP" hw="i" pos="PRON">I 
                    </w>
                    <w c5="VVD" hw="think" pos="VERB">thought 
                    </w>
                    <w c5="PNP" hw="we" pos="PRON">we
                    </w>
                    <w c5="VM0" hw="would" pos="VERB">'d 
                    </w>
                    <w c5="VVI" hw="sit" pos="VERB">sit 
                    </w>
                    <w c5="CJC" hw="and" pos="CONJ">and 
                    </w>
                    <w c5="VVI" hw="wait" pos="VERB">wait 
                    </w>
                    <w c5="CJC" hw="and" pos="CONJ">and 
                    </w>
                    <w c5="VVI" hw="see" pos="VERB">see 
                    </w>
                    <w c5="CJS" hw="if" pos="CONJ">if 
                    </w>
                    <w c5="PNI" hw="everyone" pos="PRON">everyone
                    </w>
                    <w c5="VBZ" hw="be" pos="VERB">'s 
                    </w>
                    <w c5="VVG-AJ0" hw="come" pos="VERB">coming
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="CJC" hw="but" pos="CONJ">but 
                    </w>
                    <w c5="UNC" hw="erm" pos="UNC">erm 
                    </w>
                    <w c5="PNP" hw="we" pos="PRON">we
                    </w>
                    <w c5="VM0" hw="will" pos="VERB">'ll 
                    </w>
                    <w c5="VHI" hw="have" pos="VERB">have 
                    </w>
                    <w c5="TO0" hw="to" pos="PREP">to 
                    </w>
                    <w c5="VVI" hw="get" pos="VERB">get 
                    </w>
                    <w c5="VVN" hw="start" pos="VERB">started 
                    </w>
                    <w c5="AV0" hw="anyway" pos="ADV">anyway
                    </w>
                    <c c5="PUN">.
                    </c>
                </s>
                <s n="5">
                    <w c5="PNP" hw="we" pos="PRON">We
                    </w>
                    <w c5="VM0" hw="will" pos="VERB">'ll 
                    </w>
                    <w c5="VVI" hw="welcome" pos="VERB">welcome
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="PNP" hw="we" pos="PRON">we 
                    </w>
                    <w c5="VHB" hw="have" pos="VERB">have 
                    </w>
                    <w c5="CRD" hw="two" pos="ADJ">two 
                    </w>
                    <w c5="NN2" hw="speaker" pos="SUBST">speakers
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="NP0" hw="mr" pos="SUBST">Mr 
                    </w>
                    <w c5="NP0" hw="bob" pos="SUBST">Bob 
                    </w>
                    <w c5="NP0" hw="plumtree" pos="SUBST">Plumtree
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="CJC" hw="and" pos="CONJ">and 
                    </w>
                    <w c5="NP0" hw="ms" pos="SUBST">Ms 
                    </w>
                    <w c5="NP0" hw="erica" pos="SUBST">Erica 
                    </w>
                    <w c5="NP0" hw="ison" pos="SUBST">Ison
                    </w>
                    <c c5="PUN">.
                    </c>
                </s>
                <s n="6">
                    <w c5="PNP" hw="we" pos="PRON">We 
                    </w>
                    <w c5="VVD" hw="ask" pos="VERB">asked 
                    </w>
                    <w c5="PNP" hw="they" pos="PRON">them 
                    </w>
                    <w c5="PRP" hw="to" pos="PREP">to 
                    </w>
                    <w c5="AT0" hw="the" pos="ART">the 
                    </w>
                    <w c5="NN1" hw="meeting" pos="SUBST">meeting 
                    </w>
                    <w c5="CJC" hw="and" pos="CONJ">and 
                    </w>
                    <w c5="PNP" hw="we" pos="PRON">we 
                    </w>
                    <w c5="VVB" hw="look" pos="VERB">look 
                    </w>
                    <w c5="AV0" hw="forward" pos="ADV">forward 
                    </w>
                    <w c5="PRP" hw="to" pos="PREP">to 
                    </w>
                    <w c5="VVG-NN1" hw="listen" pos="VERB">listening 
                    </w>
                    <w c5="PRP" hw="to" pos="PREP">to 
                    </w>
                    <w c5="PNP" hw="you" pos="PRON">you 
                    </w>
                    <w c5="AV0" hw="later" pos="ADV">later 
                    </w>
                    <w c5="AVP" hw="on" pos="ADV">on 
                    </w>
                    <w c5="PRP" hw="in" pos="PREP">in 
                    </w>
                    <w c5="AT0" hw="the" pos="ART">the 
                    </w>
                    <w c5="NN1" hw="agenda" pos="SUBST">agenda
                    </w>
                    <c c5="PUN">.
                    </c>
                </s>
                <s n="7">
                    <w c5="AT0" hw="the" pos="ART">The 
                    </w>
                    <w c5="NN2" hw="minute" pos="SUBST">minutes 
                    </w>
                    <w c5="PRF" hw="of" pos="PREP">of 
                    </w>
                    <w c5="AT0" hw="the" pos="ART">the 
                    </w>
                    <w c5="NN1" hw="meeting" pos="SUBST">meeting 
                    </w>
                    <w c5="VVD-VVN" hw="hold" pos="VERB">held 
                    </w>
                    <w c5="PRP" hw="in" pos="PREP">in 
                    </w>
                    <w c5="NP0" hw="january" pos="SUBST">January
                    </w>
                    <c c5="PUN">.
                    </c>
                </s>
                <s n="8">
                    <w c5="DT0" hw="any" pos="ADJ">Any 
                    </w>
                    <w c5="NN2" hw="correction" pos="SUBST">corrections 
                    </w>
                    <w c5="PRP" hw="to" pos="PREP">to 
                    </w>
                    <w c5="AT0" hw="the" pos="ART">the 
                    </w>
                    <w c5="NN2" hw="minute" pos="SUBST">minutes 
                    </w>
                    <w c5="ORD" hw="first" pos="ADJ">first
                    </w>
                    <c c5="PUN">?
                    </c>
                </s>
                <s n="9">
                    <w c5="NN1-VVB" hw="page" pos="SUBST">Page 
                    </w>
                    <w c5="CRD" hw="1" pos="ADJ">1
                    </w>
                    <c c5="PUN">?
                    </c>
                </s>
                <s n="483">
                    <w c5="EX0" hw="there" pos="PRON">There 
                    </w>
                    <w c5="VBZ" hw="be" pos="VERB">is 
                    </w>
                    <w c5="AT0" hw="a" pos="ART">a 
                    </w>
                    <w c5="NN1" hw="school" pos="SUBST">school 
                    </w>
                    <w c5="PRP" hw="in" pos="PREP">in 
                    </w>
                    <w c5="NP0" hw="ferry" pos="SUBST">Ferry 
                    </w>
                    <w c5="NP0" hw="hinksey" pos="SUBST">Hinksey 
                    </w>
                    <w c5="NP0" hw="road" pos="SUBST">Road 
                    </w>
                    <w c5="VBZ" hw="be" pos="VERB">is
                    </w>
                    <w c5="XX0" hw="not" pos="ADV">n't 
                    </w>
                    <w c5="EX0" hw="there" pos="PRON">there
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="AT0" hw="a" pos="ART">a 
                    </w>
                    <w c5="AJ0" hw="middle" pos="ADJ">middle 
                    </w>
                    <w c5="NN1" hw="school" pos="SUBST">school 
                    </w>
                    <w c5="PNP" hw="i" pos="PRON">I 
                    </w>
                    <w c5="VVB" hw="think" pos="VERB">think
                    </w>
                    <c c5="PUN">, 
                    </c>
                    <w c5="AV0" hw="so" pos="ADV">so 
                    </w>
                    <w c5="DT0" hw="that" pos="ADJ">that
                    </w>
                    <w c5="VBZ" hw="be" pos="VERB">'s 
                    </w>
                    <w c5="AT0" hw="the" pos="ART">the 
                    </w>
                    <w c5="AJ0" hw="only" pos="ADJ">only 
                    </w>
                    <w c5="PNI" hw="one" pos="PRON">one 
                    </w>
                    <w c5="PNP" hw="i" pos="PRON">I 
                    </w>
                    <w c5="VVB" hw="know" pos="VERB">know
                    </w>
                    <c c5="PUN">.
                    </c>
                </s>
                <s n="484">
                    <w c5="AT0" hw="the" pos="ART">The 
                    </w>
                    <w c5="NN1" hw="thing" pos="SUBST">thing 
                    </w>
                    <w c5="PNP" hw="i" pos="PRON">I
                    </w>
                    <w c5="VM0" hw="would" pos="VERB">'d 
                    </w>
                    <w c5="AV0" hw="really" pos="ADV">really 
                    </w>
                    <w c5="VVI" hw="like" pos="VERB">like 
                    </w>
                    <w c5="VBZ" hw="be" pos="VERB">is 
                    </w>
                    <w c5="AT0" hw="a" pos="ART">a 
                    </w>
                    <w c5="NN1" hw="glossary" pos="SUBST">glossary 
                    </w>
                    <w c5="PRF" hw="of" pos="PREP">of 
                    </w>
                    <w c5="NN2" hw="term" pos="SUBST">terms
                    </w>
                    <c c5="PUN">.
                    </c>
                </s>
            </u>
        </div>
    </stext>
</bncDoc>

How can I create a list of the words together with their number of occurrences, sorted in order of decreasing frequency?

1 answer:

Answer 0 (score: 0)

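You can count the words in the input document as follows:

  1. Get all text nodes (variable txtNodes). For performance reasons, you can limit the selection to nodes that contain something other than whitespace characters (normalize-space()).

  2. Extract the individual words and save them in the words variable.

  3. Group these words by the upper-case form of their content (for-each-group), so that the count is case-insensitive.

  4. For each group, print:

    • The word (as in a dictionary: first letter in upper case, the rest in lower case).
    • The number of occurrences (count(current-group())).

  5. Sort the printout by the number of occurrences, in descending order, and (as the second-order key) alphabetically.

Below is the sample XSLT 2.0 code: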

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xsl:output method="text"/>
    
      <xsl:template match="/">
        <xsl:variable name="txtNodes" select="//text()[normalize-space()]" as="xs:string*"/>
        <xsl:variable name="words" as="xs:string*">
          <xsl:for-each select="$txtNodes">
            <xsl:analyze-string select="." regex="\w+">
              <xsl:matching-substring>
                <xsl:value-of select="."/>
              </xsl:matching-substring>
            </xsl:analyze-string>
          </xsl:for-each>
        </xsl:variable>
        <xsl:text>Words / # of occurrences:&#xA;</xsl:text>
        <xsl:for-each-group select="$words" group-by="upper-case(.)">
          <xsl:sort select="count(current-group())" data-type="number" order="descending"/>
          <xsl:sort select="upper-case(.)"/>
          <xsl:value-of select="concat(upper-case(substring(., 1, 1)), lower-case(substring(., 2)))"/>
          <xsl:text> - </xsl:text>
          <xsl:value-of select="count(current-group())"/>
          <xsl:text>&#xA;</xsl:text>
        </xsl:for-each-group>
      </xsl:template>
    </xsl:stylesheet>
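
The stylesheet above reads every text node, so words from the teiHeader are counted too, and the \w+ regex splits tokens such as n't into n and t. If only the transcribed speech matters, a variant could group the w elements directly. The following is a minimal sketch, not part of the original answer. It assumes that every word of interest is a w element inside stext, as in the sample file, and that lower-casing the trimmed token is an acceptable grouping key.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>Word - # of occurrences&#xA;</xsl:text>
    <!-- Assumption: only w elements below stext are words;
         c (punctuation) elements and header text are skipped. -->
    <xsl:for-each-group select="//stext//w"
                        group-by="lower-case(normalize-space(.))">
      <!-- Primary key: group size, largest first. -->
      <xsl:sort select="count(current-group())" data-type="number" order="descending"/>
      <!-- Secondary key: the word itself, alphabetically. -->
      <xsl:sort select="current-grouping-key()"/>
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text> - </xsl:text>
      <xsl:value-of select="count(current-group())"/>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

Like the original, this needs an XSLT 2.0 processor (Saxon, for example), because xsl:for-each-group is not available in XSLT 1.0. Changing the group-by expression to @hw would count headwords (lemmas) instead of surface forms, so that "look" and "looked" fall into the same group.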