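I'm new to Lucene, and I want to remove stop words from the sentences in a large text file. Each sentence is stored on its own line of the file. My current code is: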
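// sa is an analyzer defined elsewhere (not shown); it supplies the stop-word set.
Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_41, new StringReader("if everyone got spam from me im extremely sorry"));
final StandardFilter standardFilter = new StandardFilter(Version.LUCENE_41, tokenizer);
final StopFilter stopFilter = new StopFilter(Version.LUCENE_41, standardFilter, sa.getStopwordSet());
// Read the term attribute from the last filter in the chain.
final CharTermAttribute charTermAttribute = stopFilter.addAttribute(CharTermAttribute.class);
try {
    stopFilter.reset();
    while (stopFilter.incrementToken()) {
        final String token = charTermAttribute.toString();
        System.out.printf("%s ", token);
    }
    // Finish and release the stream once all tokens are consumed.
    stopFilter.end();
    stopFilter.close();
} catch (Exception ex) {
    // Report failures instead of silently swallowing them.
    ex.printStackTrace();
}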
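However, as you can see, the StringReader holds only a single hard-coded sentence. How can I change this so that the program reads and processes every sentence in my text file?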
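To make the goal concrete, here is a minimal sketch of the line-by-line loop I have in mind. It is plain Java with no Lucene calls, so treat it as an illustration rather than a solution: `removeStopWords` and `filterLines` are hypothetical helpers, whitespace splitting stands in for the tokenizer and filter chain above, the stop-word set is made up, and a StringReader stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordLines {

    // Hypothetical stand-in for the Lucene chain: splits on whitespace and
    // drops every token whose lower-cased form is in stopWords.
    public static String removeStopWords(String line, Set<String> stopWords) {
        StringBuilder kept = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty() || stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        return kept.toString();
    }

    // Reads one sentence per line and filters each line independently.
    public static List<String> filterLines(BufferedReader reader, Set<String> stopWords)
            throws IOException {
        List<String> result = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            result.add(removeStopWords(line, stopWords));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Made-up stop words, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("if", "from", "me"));
        // A StringReader keeps the sketch self-contained; for a real file,
        // a reader opened on that file would take its place.
        String sample = "if everyone got spam from me\nim extremely sorry";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            for (String filtered : filterLines(reader, stopWords)) {
                System.out.println(filtered);
            }
        }
    }
}
```

What I can't work out is how to run the Lucene chain inside the body of that `while` loop, once per line, instead of on one fixed string.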
Thanks in advance!