Tokenize and remove stop words using Lucene and Java

Asked: 2013-07-12 23:17:30

Tags: java lucene nlp tokenize stop-words

I am trying to tokenize a txt file and remove its stop words using Lucene. I have this:

public String removeStopWords(String string) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}

My main looks like this:

    String file = "..../datatest.txt";

    TestFileReader fr = new TestFileReader();
    fr.imports(file);
    System.out.println(fr.content);

    String text = fr.content;

    Stopwords stopwords = new Stopwords();
    stopwords.removeStopWords(text);
    System.out.println(stopwords.removeStopWords(text));

This gives me an error, but I can't figure out why.

3 Answers:

Answer 0: (score: 9)

I had the same problem. To remove stop words with Lucene, you can either use the default stop set via the method EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop word list.

The following code shows a corrected version of removeStopWords():

public static String removeStopWords(String textFile) throws Exception {
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));

    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);
    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset(); // required before the first call to incrementToken()
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term).append(" ");
    }
    return sb.toString();
}

To use a custom stop word list instead, use the following:

//CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); //this is Lucene set 
final List<String> stop_Words = Arrays.asList("fox", "the");
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
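Conceptually, all StopFilter does is discard any token whose text appears in the stop set. As an illustration only (not the Lucene API), here is a dependency-free sketch of that idea in plain Java; the class and method names are made up for the example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class StopWordSketch {
    // Mimics what a tokenizer + stop filter pipeline does: split the input
    // into lowercase word tokens, then drop every token found in the stop set.
    static String removeStopWords(String text, Set<String> stopWords) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                joiner.add(token);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "i", "the"));
        System.out.println(removeStopWords("I saw the fox", stopWords));
        // prints: saw fox
    }
}
```

Lucene's real pipeline does much more (attribute reuse, offsets, proper Unicode segmentation), but this is the core filtering logic the snippets above wire together.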

Answer 1: (score: 0)

You could try calling tokenStream.reset() before calling tokenStream.incrementToken().

Answer 2: (score: 0)

Lucene has changed, so the suggested answer (posted in 2014) no longer compiles. Here is a slightly altered version of the code linked by @user1050755 that works with Lucene 8.6.3 and Java 8:

final String text = "This is a short test!";
final List<String> stopWords = Arrays.asList("short","test"); //Filters both words
final CharArraySet stopSet = new CharArraySet(stopWords, true);

try {
    ArrayList<String> remaining = new ArrayList<String>();

    Analyzer analyzer = new StandardAnalyzer(stopSet); // Filters stop words in the given "stopSet"
    //Analyzer analyzer = new StandardAnalyzer(); // Only filters punctuation marks out of the box, you have to provide your own stop words!
    //Analyzer analyzer = new EnglishAnalyzer(); // Filters the default English stop words (see link below)
    //Analyzer analyzer = new EnglishAnalyzer(stopSet); // Only uses the given "stopSet" but also runs a stemmer, so the result might not look like what you expected.
    
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text)); // CONTENTS is a String field-name constant defined elsewhere in the linked code
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while(tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
        remaining.add(term.toString());
    }

    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    e.printStackTrace();
}

You can find the default stop words of the EnglishAnalyzer on the official GitHub (here).

Printed results:

  • StandardAnalyzer(stopSet): [this] [is] [a]
  • StandardAnalyzer(): [this] [is] [a] [short] [test]
  • EnglishAnalyzer(): [this] [short] [test]
  • EnglishAnalyzer(stopSet): [thi] [is] [a] (no, that is not a typo, it really outputs thi!)

It is possible to combine the default stop words with your own, but it is best to use a CustomAnalyzer for that (check out this answer).
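If you prefer to stay with StandardAnalyzer rather than a CustomAnalyzer, one way to combine the sets is to take a mutable copy of the default set and add your own words to it. A sketch, assuming Lucene 8.x is on the classpath (CharArraySet.copy(Set) and the StandardAnalyzer(CharArraySet) constructor exist in that API); the custom words here are placeholders:

```java
// Merge the default English stop words with custom ones.
CharArraySet defaults = EnglishAnalyzer.getDefaultStopSet();   // immutable default set
CharArraySet combined = CharArraySet.copy(defaults);           // mutable copy
combined.addAll(Arrays.asList("fox", "lucene"));               // your own additions
Analyzer analyzer = new StandardAnalyzer(combined);            // filters both, no stemming
```

Note that getDefaultStopSet() returns an unmodifiable set, which is why the copy step is needed before adding to it.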