I built a custom Lucene Analyzer for my domain, but after getting some unexpected results I decided to grab the analyzer's TokenStream and debug it by hand.
Doing so, I found that it does not seem to be filtering my stopwords, and I cannot understand why.
Here is the text I used as a test (a random Twitter item; sorry if it doesn't make any sense):
Ke #rosicata il #pareggio nel #recupero del #recupero di #vantaggiato del #Livorno a #Catania 1-1 #finale spegne i #sogni degli #etnei di #raggiungere i #playoff per la #promozione in #serieA# catanialivorno square squareformat iphoneography instagramapp uploaded:by=instagram
These are my stopwords, stored in a file that the analyzer loads (a sketch of the loading is shown right after the list):
square
squareformat
iphoneography
instagramapp
uploaded:by=instagram
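The file-loading code itself is not shown here; it is roughly equivalent to this sketch (an assumption on my part: it uses Lucene's WordlistLoader, which reads one entry per line, and "stopwords.txt" is a placeholder file name):

// Sketch of loading the stopword file into the analyzer's CharArraySet.
// WordlistLoader and CharArraySet live in org.apache.lucene.analysis.util
// in Lucene 5.x. "stopwords.txt" stands in for the real file name.
try (Reader reader = Files.newBufferedReader(Paths.get("stopwords.txt"), StandardCharsets.UTF_8)) {
    stopwordsCollection = WordlistLoader.getWordSet(reader);
}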
Finally, here is the output (one line = one token):
rosicata
pareggio
recupero
recupero
vantaggiato
livorno
catania
finale
finale spegne
spegne
sogni
etnei
raggiungere
playoff
promozione
seriea
seriea catanialivorno
seriea catanialivorno square
catanialivorno
catanialivorno square
catanialivorno square squareformat
square
square squareformat
square squareformat iphoneography
squareformat
squareformat iphoneography
squareformat iphoneography instagramapp
iphoneography
iphoneography instagramapp
instagramapp
instagram
As you can see, the last lines contain exactly what I wanted my filter to remove.
Here is the filter code:
@Override
protected TokenStreamComponents createComponents(String string) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(DEFAULT_MAX_TOKEN_LENGTH);
    TokenStream tokenStream = new StandardFilter(src);
    // From StandardAnalyzer
    tokenStream = new LowerCaseFilter(tokenStream);
    // Custom filters
    // Filter emails, URIs and numbers by token type
    Set<String> stopTypes = new HashSet<>();
    stopTypes.add("<URL>");
    stopTypes.add("<NUM>");
    stopTypes.add("<EMAIL>");
    tokenStream = new TypeTokenFilter(tokenStream, stopTypes);
    // Strip non-Latin characters
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("\\P{InBasic_Latin}"), "", true);
    // Blank out tokens containing digits
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("^(?=.*\\d).+$"), "", true);
    // Remove words containing www
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".*(www).*"), "", true);
    // Remove special tags like uploaded:by=instagram
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".+:.+(=.+)?"), "", true);
    // Remove words shorter than 3 characters (or longer than 25)
    tokenStream = new LengthFilter(tokenStream, 3, 25);
    // Stopwords
    tokenStream = new StopFilter(tokenStream, stopwordsCollection);
    // Word n-grams (shingles), up to 3 words
    tokenStream = new ShingleFilter(tokenStream, 3);
    // HACK - ShingleFilter inserts filler tokens like _ for removed words and
    // there is currently no way to disable that, so blank out anything containing _
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".*_.*"), "", true);
    // Blank out a word immediately followed by the same word (duplicate shingles)
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("\\b(\\w+)\\s+\\1\\b"), "", true);
    // Stopwords again, to catch anything left over from the steps above
    tokenStream = new StopFilter(tokenStream, stopwordsCollection);
    // Final trim
    tokenStream = new TrimFilter(tokenStream);
    // Set CharTerm attribute
    tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.addAttribute(FuzzyTermsEnum.LevenshteinAutomataAttribute.class);
    return new TokenStreamComponents(src, tokenStream) {
        @Override
        protected void setReader(final Reader reader) {
            src.setMaxTokenLength(DEFAULT_MAX_TOKEN_LENGTH);
            super.setReader(reader);
        }
    };
}
To double-check, I put a breakpoint right before this method returns: stopwordsCollection is a CharArraySet, and it contains the same words as my file (so they are being loaded correctly).
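For reference, the breakpoint check amounted to something like this (a minimal sketch; note that CharArraySet stores exact character sequences, so an entry saved as "square " with trailing whitespace is a different key than "square"):

// Sanity checks on the loaded set. CharArraySet compares exact character
// sequences, so trailing whitespace in an entry prevents a match.
System.out.println(stopwordsCollection.size());
System.out.println(stopwordsCollection.contains("square"));                 // should print true
System.out.println(stopwordsCollection.contains("uploaded:by=instagram")); // should print true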
My first thought was that the ShingleFilter was messing up the stopword removal, but the output even contains square, which is a single-word stopword.
Can someone help me figure out what is going on?
Edit: for the sake of clarity, here is the code I use to print the tokens.
TokenStream tokenStream = null;
try {
    tokenStream = new MyAnalyzer().tokenStream("text", item.toString());
    tokenStream.reset();
    // Iterate over the stream to process single words
    while (tokenStream.incrementToken()) {
        CharTermAttribute charTerm = tokenStream.getAttribute(CharTermAttribute.class);
        System.out.println(charTerm.toString());
    }
    // Perform end-of-stream operations, e.g. set the final offset
    tokenStream.end();
} catch (IOException ex) {
    // Ignored for this test
} finally {
    try {
        // Close the stream to release resources
        if (tokenStream != null) {
            tokenStream.close();
        }
    } catch (IOException ex) {
        // Ignored for this test
    }
}
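A variant of this loop that also prints each token's type and position increment makes it easier to tell original terms from shingles and to see where tokens were removed (a sketch; TypeAttribute and PositionIncrementAttribute come from org.apache.lucene.analysis.tokenattributes):

// Same loop, but also dumping token type and position increment.
// ShingleFilter tags its output with type "shingle", while terms from
// StandardTokenizer keep types like "<ALPHANUM>".
try (TokenStream ts = new MyAnalyzer().tokenStream("text", item.toString())) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    TypeAttribute type = ts.addAttribute(TypeAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term + "\ttype=" + type.type() + "\tposIncr=" + posIncr.getPositionIncrement());
    }
    ts.end();
}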
Edit 2: it turns out I had stored the stopwords with trailing whitespace. With that fixed, one problem still remains.
The current output is:
rosicata
pareggio
recupero
recupero
vantaggiato
livorno
catania
finale
finale spegne
spegne
sogni
etnei
raggiungere
playoff
promozione
seriea
seriea catanialivorno
catanialivorno
instagram
As you can see, the last word is instagram, which comes from uploaded:by=instagram. Now, uploaded:by=instagram is itself a stopword, and I also still have the regex-based filter that should remove this kind of pattern:
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".+:.+(=.+)?"), "", true);
I even tried moving it up to be the first filter, and I still get instagram.
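For completeness, here is a quick way to see what the bare tokenizer emits for the tag before any filter runs (a diagnostic sketch, not a confirmed explanation: StandardTokenizer applies UAX#29 word-boundary rules, so it may already split the tag at the = or : on its own, in which case neither the regex nor the stopword entry ever sees uploaded:by=instagram as one token):

// Diagnostic sketch: feed the raw tag to a bare StandardTokenizer to see
// what the downstream filters actually receive. If the tokenizer splits
// the tag, the ".+:.+(=.+)?" pattern and the stopword entry
// "uploaded:by=instagram" can never match it as a single token.
Tokenizer tokenizer = new StandardTokenizer();
tokenizer.setReader(new StringReader("uploaded:by=instagram"));
CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();
while (tokenizer.incrementToken()) {
    System.out.println(term.toString());
}
tokenizer.end();
tokenizer.close();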