Question

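I'm running Apache Solr 6.6.5. When a user searches for "ETCS" (a special technical term), every document containing the word "etc" is returned as a match. I only want to match documents that really contain "ETCS". Solr should never index "etc" at all, because it is such a common word, and the stemmer should never turn "etcs" into "etc" (plural stemming).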

I added "etc" to stopwords.txt:

# Contains words which shouldn't be indexed for fulltext fields, e.g., because
# they're too common. For documentation of the format, see
# http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# (Lines starting with a pound character # are ignored.)
etc

I added "etc" to protwords.txt:

#-----------------------------------------------------------------------
# This file blocks words from being operated on by the stemmer and word delimiter.
&amp;
&lt;
&gt;
&#039;
&quot;
etc

This helps: documents containing a bare "etc" no longer match. But documents containing "etc.", "etc," or similar variants are still matched.

So I could add more variations to protwords.txt:

&amp;
&lt;
&gt;
&#039;
&quot;
etc
etc.
etc..
etc...
etc,

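But that list will always be incomplete. How can I tell the stemmer to treat "etc" as a token of its own, whatever non-word characters surround it?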

My schema.xml: https://gist.github.com/klausi/f59ee47a9b14b915f5bb44bd6cf1c945

Answer 1

1.)

I added "etc" to protwords.txt:

You should add etcs to protwords.txt instead, to protect the term etcs from being stemmed.

2.)

So I could add more variations to protwords.txt:

Add every word variation that you want removed from the index to stopwords.txt, not to protwords.txt.

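3.) Check which field type you are using. Maybe you can make adjustments there.

//Edit: Adding a link to your schema.xml does not help as long as you don't explain which field you are using.

4.) Don't forget to restart Solr and (if necessary) reindex.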
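
To illustrate point 3, here is a minimal sketch of a field type, not the asker's actual configuration. The name text_example and the order of the filters are assumptions made for illustration. It relies on the fact that solr.StandardTokenizerFactory drops surrounding punctuation, so "etc.", "etc," and "etc..." all reach the stop filter as the plain token "etc". A field type built on solr.WhitespaceTokenizerFactory would keep the punctuation attached to the token, which is one possible reason the variants slip past stopwords.txt.

```xml
<!-- Hypothetical field type; "text_example" is an illustrative name. -->
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on word boundaries and discards punctuation: "etc." becomes "etc" -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drops every token listed in stopwords.txt (here: etc) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Marks tokens listed in protwords.txt (here: etcs) so the stemmer skips them -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

With a chain like this, stopwords.txt only needs the single entry etc and protwords.txt only needs etcs. The Analysis screen in the Solr admin UI shows what each stage emits for a sample input such as "etc." or "ETCS", which is a quick way to confirm the behaviour before reindexing.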
