Question

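I'm new to Lucene, and I want to remove stop words from the sentences in a large text file. Each sentence is stored on its own line of the file. My current code is: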

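    import java.io.StringReader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    // Build the chain: tokenizer -> StandardFilter -> StopFilter.
    Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_41, new StringReader("if everyone got spam from me im extremely sorry"));

    final StandardFilter standardFilter = new StandardFilter(Version.LUCENE_41, tokenizer);
    // `sa` is an analyzer created elsewhere in my program (its definition is not shown here).
    final StopFilter stopFilter = new StopFilter(Version.LUCENE_41, standardFilter, sa.getStopwordSet());

    // Attributes are shared along the chain, so ask the last filter for the term text.
    final CharTermAttribute charTermAttribute = stopFilter.addAttribute(CharTermAttribute.class);

    try {
        stopFilter.reset();

        // Print every token that survives the stop-word filter.
        while (stopFilter.incrementToken()) {
            final String token = charTermAttribute.toString();
            System.out.printf("%s ", token);
        }

        stopFilter.end();
        stopFilter.close();
    } catch (Exception ex) {
        // Report the failure instead of silently swallowing it.
        ex.printStackTrace();
    }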

However, as you can see, the StringReader holds only a single predefined sentence. How can I change this so that the program reads all the sentences from my text file?

Thanks in advance!
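
For illustration, one possible shape for the file-reading part is sketched below. It reads the input line by line with a `BufferedReader` and passes each line to a per-sentence routine. This is only a sketch under assumptions: the class name `StopWordFileSketch`, the helper names, and the tiny stop-word list are hypothetical, and `removeStopWords` is a plain-Java stand-in so that the example runs without Lucene on the classpath. In the actual program, the body of that routine would presumably be the tokenizer/`StopFilter` chain from the code above, built over `new StringReader(line)` instead of the hard-coded sentence.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;
    import java.util.stream.Collectors;

    class StopWordFileSketch {

        // Hypothetical stop-word list for illustration only; with Lucene this
        // would come from the analyzer, e.g. sa.getStopwordSet().
        static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("if", "from", "me", "the", "a", "is"));

        // Plain-Java stand-in for the tokenizer/StopFilter chain: splits on
        // whitespace and drops words found in the stop-word set (case-insensitive).
        static String removeStopWords(String sentence, Set<String> stopWords) {
            StringBuilder kept = new StringBuilder();
            for (String word : sentence.trim().split("\\s+")) {
                if (word.isEmpty() || stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                    continue;
                }
                if (kept.length() > 0) {
                    kept.append(' ');
                }
                kept.append(word);
            }
            return kept.toString();
        }

        // Applies the per-sentence routine to every line the reader supplies.
        static List<String> processLines(BufferedReader reader, Set<String> stopWords) {
            return reader.lines()
                    .map(line -> removeStopWords(line, stopWords))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) throws IOException {
            // args[0] is assumed to be the path of the file with one sentence per line.
            try (BufferedReader reader =
                         Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
                for (String cleaned : processLines(reader, STOP_WORDS)) {
                    System.out.println(cleaned);
                }
            }
        }
    }

Streaming the file one line at a time keeps memory use roughly constant for a large input; collecting the results into a list, as done here for simplicity, could be replaced by writing each cleaned sentence out as soon as it is produced.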

Lucene: removing stop words from a file

0 answers: