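I'm trying to create a word frequency counter in XSLT, and I want it to use stop words. I started from the example in Michael Kay's book, but I can't get the stop words to work.
This code works with any source XML file.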
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <xsl:variable name="stopwords" select="'a about an are as at be by for from how I in is it of on or that the this to was what when where who will with'"/>
    <wordcount>
      <xsl:for-each-group group-by="." select="
          for $w in //text()/tokenize(., '\W+')[not(.=$stopwords)] return $w">
        <word word="{current-grouping-key()}" frequency="{count(current-group())}"/>
      </xsl:for-each-group>
    </wordcount>
  </xsl:template>
</xsl:stylesheet>
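I think not(.=$stopwords) is where my problem lies, but I don't know what to do about it.
I would also appreciate hints on how to load the stop words from an external file.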
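Answer 0 (score: 2)
Your $stopwords variable is currently a single string; you want it to be a sequence of strings. Compared against one long string, .=$stopwords is true only for a token equal to the entire list, so nothing gets filtered out. You can fix this in any of the following ways:
Change its declaration to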
<xsl:variable name="stopwords"
select="('a', 'about', 'an', 'are', 'as', 'at',
'be', 'by', 'for', 'from', 'how',
'I', 'in', 'is', 'it',
'of', 'on', 'or',
'that', 'the', 'this', 'to',
'was', 'what', 'when', 'where',
'who', 'will', 'with')"/>
Or change its declaration to
<xsl:variable name="stopwords"
select="tokenize('a about an are as at
be by for from how I in is it
of on or that the this to was
what when where who will with',
'\s+')"/>
Or read it from an external XML document named (for example) stoplist.xml, of the form
<stop-list>
<p>This is a sample stop list [further description ...]</p>
<w>a</w>
<w>about</w>
...
</stop-list>
and then load it, for example with
<xsl:variable name="stopwords"
select="document('stoplist.xml')//w/string()"/>
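To show how the fix fits into the question's stylesheet, here is a minimal sketch assuming the tokenize() variant with a shortened stop list. The `[. ne '']` filter is an addition that is not in the original code; it drops the empty tokens that tokenize() returns when a text node starts or ends with a non-word character.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- A sequence of strings, one item per stop word (list shortened for the sketch). -->
  <xsl:variable name="stopwords"
      select="tokenize('a about an are as at be by for from', '\s+')"/>

  <xsl:template match="/">
    <wordcount>
      <!-- '. = $stopwords' is a general comparison: it is true when the
           token equals ANY item of the sequence, so not(...) keeps only
           tokens that match none of the stop words. -->
      <xsl:for-each-group group-by="."
          select="//text()/tokenize(., '\W+')[. ne ''][not(. = $stopwords)]">
        <word word="{current-grouping-key()}"
              frequency="{count(current-group())}"/>
      </xsl:for-each-group>
    </wordcount>
  </xsl:template>
</xsl:stylesheet>
```

Note that the comparison is case-sensitive, so "The" is not filtered by a stop word "the" unless both sides are normalized, for example with lower-case().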
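Answer 1 (score: 1)
You are comparing the current word with the entire stop-word list as one string, whereas you should check whether the current word is contained in that list. If $stopwords stays a single space-separated string, you can do that with:
not(contains(concat(' ', $stopwords, ' '), concat(' ', ., ' ')))
The spaces concatenated on both sides are needed to avoid partial matches, e.g. to prevent 'abo' or 'bout' from matching 'about'.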
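In the context of the question's stylesheet, the string-based test could look roughly like the sketch below. The variable name `padded` is a hypothetical helper introduced here, the stop list is shortened, and the `[. ne '']` filter for empty tokens is an addition not present in the original code.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Stop words kept as one space-separated string, as in the question. -->
  <xsl:variable name="stopwords" select="'a about an are as at be by'"/>
  <!-- Hypothetical helper: pad once so every word has a space on both sides. -->
  <xsl:variable name="padded" select="concat(' ', $stopwords, ' ')"/>

  <xsl:template match="/">
    <wordcount>
      <!-- Keep a token only if ' token ' does not occur in the padded list. -->
      <xsl:for-each-group group-by="."
          select="//text()/tokenize(., '\W+')[. ne '']
                  [not(contains($padded, concat(' ', ., ' ')))]">
        <word word="{current-grouping-key()}"
              frequency="{count(current-group())}"/>
      </xsl:for-each-group>
    </wordcount>
  </xsl:template>
</xsl:stylesheet>
```

This relies on the stop words being separated by exactly one space; with a list that may contain line breaks or repeated spaces, wrapping it in normalize-space() before padding would be one way to keep the test reliable.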