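I am trying to work out what the default stop word list is for the standard analyzer in Elasticsearch. I am running version 1.3.1, and it looks to me as if the English list is used, because running a wildcard query like this: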
{
  "wildcard" : {
    "name" : {
      "wildcard" : "*in*"
    }
  }
}
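returns no results (I am sure there are file names containing "in", and they are returned when I use a not_analyzed mapping). However, the 1.0 breaking changes say that the default is now empty, and the Standard Analyzer documentation for the latest version states the same. On the other hand, when I follow the link given there for more details, I end up at the Stop Analyzer documentation, which says the default is still English.
Any help? Thanks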
Answer 0 (score: 2)
// Static initializer that builds the default English stop word set
static {
  final List<String> stopWords = Arrays.asList(
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  );
  final CharArraySet stopSet = new CharArraySet(Version.LUCENE_CURRENT,
      stopWords, false);
  ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet);
}
The Elasticsearch source code for the standard analyzer: https://github.com/elastic/elasticsearch/blob/v1.3.1/src/main/java/org/elasticsearch/index/analysis/StandardAnalyzerProvider.java#L47
points to the Lucene StandardAnalyzer, which in turn references the stop word list of StopAnalyzer (the ENGLISH_STOP_WORDS_SET excerpt above): http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers-common/4.9.0/org/apache/lucene/analysis/standard/StandardAnalyzer.java?av=f#63
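
To get a feel for what that list does to a field value, here is a minimal sketch in plain Java. It is not Lucene's StopFilter and does not use any Lucene classes: the class name StopWordSketch and the helper removeStopWords are made up for illustration, and the tokenization (lowercase, then split on anything that is not a letter or digit) is only a rough assumption about what an analyzer does before stop words are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordSketch {
    // Same words as the ENGLISH_STOP_WORDS_SET excerpt above, copied by hand.
    static final Set<String> ENGLISH_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Hypothetical helper: lowercase the text, split it on non-alphanumeric
    // characters, and keep only the tokens that are not in the stop word set.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (!token.isEmpty() && !ENGLISH_STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The standalone word "in" is dropped.
        System.out.println(removeStopWords("Made in China")); // [made, china]
        // "plugin" merely contains "in", so it is kept as a whole token.
        System.out.println(removeStopWords("plugin in use")); // [plugin, use]
    }
}
```

Under this simplified model the standalone token "in" disappears, while a token such as "plugin" is kept whole. Since a wildcard query is matched against individual indexed terms, "*in*" could then only hit terms of the second kind.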