Lucene's StopFilter is not removing stop words?

Date: 2016-12-10 09:30:49

Tags: java lucene stop-words

I built a custom Lucene analyzer for my domain, but after getting unexpected results I decided to obtain the analyzer's TokenStream and debug it by hand.

By doing so, I found that it does not seem to be filtering my stop words, and I cannot understand why.

This is the text I used as a test (a random Twitter item, sorry if it makes no sense):

Ke #rosicata il #pareggio nel #recupero del #recupero di #vantaggiato del #Livorno a #Catania 1-1 #finale spegne i #sogni degli #etnei di #raggiungere i #playoff per la #promozione in #serieA #catanialivorno square squareformat iphoneography instagramapp uploaded:by=instagram

These are my stop words (stored in a file loaded by the analyzer):

square 
squareformat 
iphoneography 
instagramapp 
uploaded:by=instagram
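For reference, a stop-word file like this is typically read one word per line; the sketch below uses a hypothetical `loadStopWords` helper with plain JDK classes (the real analyzer would wrap the result in Lucene's `CharArraySet`). Trimming each line matters, because stop-word lookup later compares tokens character for character:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class StopWordLoader {

    // Reads one stop word per line, trimming surrounding whitespace so that
    // an entry stored as "square " (trailing space) still matches the token "square".
    static Set<String> loadStopWords(Reader reader) throws IOException {
        Set<String> stopWords = new HashSet<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        }
        return stopWords;
    }

    public static void main(String[] args) throws IOException {
        // Simulated file content, with trailing spaces as in the question's file
        String file = "square \nsquareformat \niphoneography \ninstagramapp \nuploaded:by=instagram\n";
        Set<String> stopWords = loadStopWords(new StringReader(file));
        System.out.println(stopWords.contains("square")); // true, thanks to trim()
        System.out.println(stopWords.size());             // 5
    }
}
```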

Finally, this is the output (one line = one token):

rosicata
pareggio
recupero
recupero
vantaggiato
livorno
catania
finale
finale spegne
spegne
sogni
etnei
raggiungere
playoff
promozione
seriea
seriea catanialivorno
seriea catanialivorno square
catanialivorno
catanialivorno square
catanialivorno square squareformat
square
square squareformat
square squareformat iphoneography
squareformat
squareformat iphoneography
squareformat iphoneography instagramapp
iphoneography
iphoneography instagramapp
instagramapp
instagram

As you can see, the last lines contain exactly the content I wanted my filter to remove.

This is the filter code:

@Override
protected TokenStreamComponents createComponents(String string) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(DEFAULT_MAX_TOKEN_LENGTH);
    TokenStream tokenStream = new StandardFilter(src);
    // From StandardAnalyzer
    tokenStream = new LowerCaseFilter(tokenStream);
    // Custom filters
    // Filter emails, uris and numbers
    Set<String> stopTypes = new HashSet<>();
    stopTypes.add("<URL>");
    stopTypes.add("<NUM>");
    stopTypes.add("<EMAIL>");
    tokenStream = new TypeTokenFilter(tokenStream, stopTypes);
    // Non latin removal
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("\\P{InBasic_Latin}"), "", true);
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("^(?=.*\\d).+$"), "", true);
    // Remove words containing www
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".*(www).*"), "", true);
    // Remove special tags like uploaded:by=instagram
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".+:.+(=.+)?"), "", true);
    // Remove words shorter than 3 characters
    tokenStream = new LengthFilter(tokenStream, 3, 25);
    // Stopwords
    tokenStream = new StopFilter(tokenStream, stopwordsCollection);
    // N-Grams
    tokenStream = new ShingleFilter(tokenStream, 3);
    // HACK - ShingleFilter inserts filler tokens like _ and there's currently no way to disable that, so we replace any token containing _ with an empty string
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".*_.*"), "", true);
    tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("\\b(\\w+)\\s+\\1\\b"), "", true);
    // Stopwords
    tokenStream = new StopFilter(tokenStream, stopwordsCollection);
    // Final trim
    tokenStream = new TrimFilter(tokenStream);
    // Set CharTerm attribute
    tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.addAttribute(FuzzyTermsEnum.LevenshteinAutomataAttribute.class);
    return new TokenStreamComponents(src, tokenStream) {
        @Override
        protected void setReader(final Reader reader) {
            src.setMaxTokenLength(DEFAULT_MAX_TOKEN_LENGTH);
            super.setReader(reader);
        }
    };        
}

To double-check, I placed a breakpoint right before this method returns: stopwordsCollection is a CharArraySet and it contains the same words that are in my file (so they are being loaded correctly).

My first thought was that the ShingleFilter was interfering with stop-word removal, but then I noticed the output also contains square on its own, which is a single-word stop word.

Can someone help me figure this out?

EDIT: For clarity, I am also adding the code I use to print the tokens.

TokenStream tokenStream = null;
try {
    tokenStream = new MyAnalyzer().tokenStream("text", item.toString());
    tokenStream.reset();
    // Iterate over the stream to process single words
    while (tokenStream.incrementToken()) {
        CharTermAttribute charTerm = tokenStream.getAttribute(CharTermAttribute.class);
        System.out.println(charTerm.toString());
    }
    // Perform end-of-stream operations, e.g. set the final offset.
    tokenStream.end();
} catch (IOException ex) {
    // Ignored in this debug snippet
} finally {
    try {
        // Close the stream to release resources
        if (tokenStream != null) {
            tokenStream.close();
        }
    } catch (IOException ex) {
        // Ignore errors on close
    }
}

EDIT 2: It turns out I had stored the stop words with trailing whitespace. That fixed most of it, but one problem remains.
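The trailing-whitespace pitfall can be reproduced without Lucene at all: stop-word lookup (including Lucene's `CharArraySet`, as far as I know) is an exact character-for-character match, so an entry stored as `"square "` never equals the emitted token `"square"`. A minimal stdlib sketch:

```java
import java.util.HashSet;
import java.util.Set;

public class TrailingWhitespaceDemo {
    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>();
        stopWords.add("square ");   // stored with a trailing space, as in the file

        // The analyzer emits the token "square" with no trailing space,
        // so the lookup fails and the "stop word" passes through.
        System.out.println(stopWords.contains("square"));  // false
        System.out.println(stopWords.contains("square ")); // true: exact match only
    }
}
```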

The current output is:

rosicata
pareggio
recupero
recupero
vantaggiato
livorno
catania
finale
finale spegne
spegne
sogni
etnei
raggiungere
playoff
promozione
seriea
seriea catanialivorno
catanialivorno
instagram

As you can see, the last word is instagram, which comes from uploaded:by=instagram. Now, uploaded:by=instagram is also a stop word, and I also have the regex-based filter that should remove this kind of pattern:

tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".+:.+(=.+)?"), "", true);

I even moved it so it runs as the first filter, and I still get instagram.
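One plausible explanation (an assumption, since it depends on how StandardTokenizer splits the input): the output above shows instagram as a standalone token, which suggests the tokenizer already split the tag on `:` and `=` before PatternReplaceFilter ever runs. PatternReplaceFilter is applied to one token at a time, and `.+:.+(=.+)?` requires a colon, which the token instagram does not contain. The per-token matching can be checked with plain `java.util.regex`:

```java
import java.util.regex.Pattern;

public class TagPatternDemo {
    public static void main(String[] args) {
        Pattern tagPattern = Pattern.compile(".+:.+(=.+)?");

        // The pattern matches the raw tag as one string...
        System.out.println(tagPattern.matcher("uploaded:by=instagram").matches()); // true

        // ...but if the tokenizer has already split on ':' and '=', the filter only
        // ever sees the individual pieces, and none of them contain a colon.
        System.out.println(tagPattern.matcher("uploaded").matches());  // false
        System.out.println(tagPattern.matcher("by").matches());        // false
        System.out.println(tagPattern.matcher("instagram").matches()); // false
    }
}
```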

0 Answers:

No answers yet