I am trying to use Lucene to tokenize a txt file and remove its stop words. I have this:
public String removeStopWords(String string) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
    StringBuilder sb = new StringBuilder();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}
My main looks like this:
String file = "..../datatest.txt";
TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);
String text = fr.content;
Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));
This gives me an error, but I cannot figure out why.
Answer 0 (score: 9)
I had the same problem. To remove stop words with Lucene you can either use the default stop set via the method EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop word list. The code below shows a corrected version of removeStopWords():
public static String removeStopWords(String textFile) throws Exception {
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    return sb.toString();
}
To use a custom stop word list instead, use the following:
//CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); //this is Lucene set
final List<String> stop_Words = Arrays.asList("fox", "the");
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
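For completeness, here is a sketch of the same method using that custom stopSet in place of the default set (the method name removeCustomStopWords is just an illustrative choice; the Lucene 4.8 API calls are the same ones used in the code above):

public static String removeCustomStopWords(String textFile) throws Exception {
    // Custom stop word list instead of EnglishAnalyzer.getDefaultStopSet()
    final List<String> stop_Words = Arrays.asList("fox", "the");
    final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopSet); // custom set instead of the default one

    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        sb.append(charTermAttribute.toString()).append(" ");
    }
    return sb.toString();
}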
Answer 1 (score: 0)
You could try calling tokenStream.reset() before calling tokenStream.incrementToken().
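Applied to the method from the question, that is one extra line before the loop, roughly like this (sketch only, keeping the rest of the code as posted):

CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset(); // required once before the first call to incrementToken()
while (tokenStream.incrementToken()) {
    if (sb.length() > 0) {
        sb.append(" ");
    }
    sb.append(token.toString());
}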
Answer 2 (score: 0)
Lucene has changed since then, so the suggested answer (posted in 2014) no longer compiles. Here is a slightly modified version of the code linked by @user1050755 that works with Lucene 8.6.3 and Java 8:
final String text = "This is a short test!";
final List<String> stopWords = Arrays.asList("short", "test"); // filters both words
final CharArraySet stopSet = new CharArraySet(stopWords, true);
final String CONTENTS = "contents"; // field name passed to tokenStream(); any string works here

try {
    ArrayList<String> remaining = new ArrayList<String>();
    Analyzer analyzer = new StandardAnalyzer(stopSet); // Filters stop words in the given "stopSet"
    //Analyzer analyzer = new StandardAnalyzer(); // Only filters punctuation marks out of the box, you have to provide your own stop words!
    //Analyzer analyzer = new EnglishAnalyzer(); // Filters the default English stop words (see link below)
    //Analyzer analyzer = new EnglishAnalyzer(stopSet); // Only uses the given "stopSet" but also runs a stemmer, so the result might not look like what you expected.
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
        remaining.add(term.toString());
    }
    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    e.printStackTrace();
}
You can find the EnglishAnalyzer's default stop words on the official GitHub (here).
Printed results:

StandardAnalyzer(stopSet): [this] [is] [a]
StandardAnalyzer(): [this] [is] [a] [short] [test]
EnglishAnalyzer(): [this] [short] [test]
EnglishAnalyzer(stopSet): [thi] [is] [a]
(No, that is not a typo, it really does output thi!) You can combine the default stop words with your own, but for that it is best to use a CustomAnalyzer (check out this answer).
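If you only need the two lists merged programmatically, one simple alternative (a minimal sketch, not the CustomAnalyzer approach from the linked answer) is to copy the default set and add your own words before handing it to the analyzer:

// Sketch: merge Lucene's default English stop words with a custom list (Lucene 8.x API)
CharArraySet combined = CharArraySet.copy(EnglishAnalyzer.getDefaultStopSet()); // the default set is unmodifiable, so copy it
combined.addAll(Arrays.asList("short", "test")); // custom additions
Analyzer analyzer = new StandardAnalyzer(combined); // or new EnglishAnalyzer(combined) if you also want stemming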