I want to extract ngrams from a sentence using the Lucene API, but I have run into an odd problem. The JavaDoc lists a class named NGramTokenizer. I have downloaded both the 3.6.1 and 4.0 APIs, yet I see no trace of this class anywhere. For example, when I try the following, I get an error saying the symbol NGramTokenizer cannot be found:
NGramTokenizer myTokenizer;
According to the documentation, NGramTokenizer lives at org.apache.lucene.analysis.NGramTokenizer, but I cannot find it anywhere on my machine. A corrupted download or similar mishap seems unlikely, since the same error occurs with both the 3.6.1 and 4.0 APIs.
Answer 0 (score: 3)
You are using the wrong jar. The class is in
lucene-analyzers-3.6.1.jar
org.apache.lucene.analysis.ngram.NGramTokenizer
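For intuition about what that class produces, here is a plain-Java sketch (no Lucene required) of character n-gram extraction, the technique NGramTokenizer implements. The class name `CharNgrams` and the gram bounds are illustrative assumptions; actual token order and default min/max gram sizes vary between Lucene versions.

```java
import java.util.ArrayList;
import java.util.List;

public class CharNgrams {

    // Emit every character n-gram of text whose length is between
    // minGram and maxGram, inclusive, grouped by gram size.
    static List<String> charNgrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // [a, b, c, ab, bc]
        System.out.println(charNgrams("abc", 1, 2));
    }
}
```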
Answer 1 (score: 0)
Here is a utility method I use regularly, in case it helps anyone. It should work with Lucene 4.10 (I have not tested lower or higher versions):
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private Set<String> generateNgrams(String sentence, int ngramCount) {
    Set<String> ngrams = new HashSet<>();
    // StandardTokenizer splits the sentence into word tokens
    StandardTokenizer source = new StandardTokenizer(new StringReader(sentence));
    TokenStream tokenStream = new StandardFilter(source);
    // For unigrams the token stream is used as-is; otherwise
    // ShingleFilter joins adjacent tokens into word n-grams
    TokenStream sf;
    if (ngramCount == 1) {
        sf = tokenStream;
    } else {
        ShingleFilter shingles = new ShingleFilter(tokenStream);
        shingles.setMaxShingleSize(ngramCount);
        sf = shingles;
    }
    CharTermAttribute charTermAttribute = sf.addAttribute(CharTermAttribute.class);
    try {
        sf.reset();
        while (sf.incrementToken()) {
            ngrams.add(charTermAttribute.toString().toLowerCase());
        }
        sf.end();
        sf.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return ngrams;
}
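To see the shape of output the method above aims for, here is a plain-Java sketch (no Lucene) of the shingling step: unigrams plus space-joined word n-grams up to a maximum size. The class name `WordShingles` and the whitespace split are illustrative assumptions; ShingleFilter's actual tokenization, filler tokens, and emission order differ.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class WordShingles {

    // Collect all lowercased word n-grams of the sentence from
    // size 1 up to maxShingleSize, joined with single spaces.
    static Set<String> shingles(String sentence, int maxShingleSize) {
        String[] words = sentence.toLowerCase().trim().split("\\s+");
        Set<String> out = new LinkedHashSet<>();
        for (int n = 1; n <= maxShingleSize; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // [please, divide, this, please divide, divide this]
        System.out.println(shingles("Please divide this", 2));
    }
}
```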
The Maven dependencies required for Lucene are:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>4.10.3</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
<version>4.10.3</version>
</dependency>