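Elasticsearch standard analyzer stop words

Time: 2015-06-16 09:35:49

Tags: elasticsearch

I am trying to work out what the default stop word list is for the standard analyzer in Elasticsearch. I am running version 1.3.1, and it looks to me as if the English list is being used, because a wildcard query like this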

{
      "wildcard" : {
        "name" : {
          "wildcard" : "*in*"
        }
      }
}

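returns no results (I am sure there are documents whose names contain "in", and they are returned when I use a not_analyzed mapping). However, the 1.0 breaking changes say that the default is now empty, and the Standard Analyzer documentation for the latest version states the same. On the other hand, when I follow the link given there for more details, I end up at the Stop Analyzer documentation, which says the default is still English.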
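To make the reasoning in the question concrete: a wildcard query is matched against the terms stored in the index, not against the original text. The following is a minimal sketch in plain Java, with no Elasticsearch or Lucene dependency, of how a standalone word "in" would become unmatchable if it were dropped at index time. The tokenizer, the one-word stop set, and the names `analyze` and `matchesContains` are simplifications assumed for illustration; they are not Elasticsearch APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WildcardSketch {
    // Hypothetical, simplified analysis: lowercase, split on anything that is
    // not a letter or digit, then drop tokens found in the stop set.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Emulates a "*needle*" wildcard: true if any indexed term contains needle.
    static boolean matchesContains(List<String> terms, String needle) {
        for (String term : terms) {
            if (term.contains(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> withStop = new HashSet<>(Arrays.asList("in"));
        Set<String> noStop = Collections.emptySet();

        // "in" is removed before indexing, so no remaining term contains it.
        System.out.println(matchesContains(analyze("Made in Italy", withStop), "in"));
        // With an empty stop set the term "in" is indexed and matches.
        System.out.println(matchesContains(analyze("Made in Italy", noStop), "in"));
    }
}
```

Under these assumptions, the test in the question only distinguishes the two cases when "in" appears as a separate word: a name such as "Berlin" would still match in both, because the indexed term `berlin` itself contains `in`.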

Any help? Thanks

1 answer:

Answer 0 (score: 2)

This would be the stop word list for the standard analyzer: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers-common/4.9.0/org/apache/lucene/analysis/core/StopAnalyzer.java?av=f#50

static {
  final List<String> stopWords = Arrays.asList(
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  );
  final CharArraySet stopSet = new CharArraySet(Version.LUCENE_CURRENT,
      stopWords, false);
  ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet);
}
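The `false` passed to `CharArraySet` in the excerpt is the `ignoreCase` flag, so the set matches lowercase forms only; in Lucene's StandardAnalyzer the lowercase filter runs before the stop filter. Here is a rough sketch of that effect in plain Java, using a `HashSet` as a stand-in for `CharArraySet`. The class name, the `filter` helper, and the sample tokens are assumptions made for illustration, not Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopFilterSketch {
    // The same 33 words as the quoted Lucene list, held in a case-sensitive set.
    static final Set<String> ENGLISH_STOP_WORDS = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it",
            "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with")));

    // Hypothetical helper: optionally lowercases each token, then drops
    // every token that is present in the stop set.
    static List<String> filter(List<String> tokens, boolean lowercaseFirst) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String t = lowercaseFirst ? token.toLowerCase(Locale.ROOT) : token;
            if (!ENGLISH_STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "cat", "is", "in", "Rome");
        // Without lowercasing, "The" slips past the case-sensitive set.
        System.out.println(filter(tokens, false)); // [The, cat, Rome]
        // Lowercasing first removes "the" as well.
        System.out.println(filter(tokens, true));  // [cat, rome]
    }
}
```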

Elasticsearch's source code for the standard analyzer: https://github.com/elastic/elasticsearch/blob/v1.3.1/src/main/java/org/elasticsearch/index/analysis/StandardAnalyzerProvider.java#L47

points to Lucene's StandardAnalyzer, which in turn references StopAnalyzer's stop word list: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers-common/4.9.0/org/apache/lucene/analysis/standard/StandardAnalyzer.java?av=f#63