Question

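I am trying to write a filter for Lucene, similar to StopWordsFilter (so it implements TokenFilter), but I need to remove phrases (sequences of tokens) rather than single words.

The "stop phrases" are themselves represented as sequences of tokens: punctuation is not considered.

I think I need some kind of buffering of the tokens in the token stream, so that when a complete phrase is matched I can discard all the tokens in the buffer.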
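To make the buffering idea concrete, here is a minimal sketch in plain Java, independent of Lucene. The class `StopPhraseBuffer` and its methods are hypothetical names chosen for illustration, and the sketch assumes stop phrases are stored as space-joined token strings. It ignores position increments and offsets, which a real TokenFilter would have to maintain.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: removes stop phrases from a
// forward-only token source by buffering look-ahead tokens.
class StopPhraseBuffer {
    private final Iterator<String> input;   // forward-only source, cannot be rewound
    private final Set<String> stopPhrases;  // assumption: phrases are space-joined tokens
    private final int maxPhraseTokens;      // longest stop phrase, counted in tokens
    private final Deque<String> buffer = new ArrayDeque<>();

    StopPhraseBuffer(Iterator<String> input, Set<String> stopPhrases) {
        this.input = input;
        this.stopPhrases = stopPhrases;
        int max = 1; // keep a window of at least one token, even with no phrases
        for (String phrase : stopPhrases) {
            max = Math.max(max, phrase.split(" ").length);
        }
        this.maxPhraseTokens = max;
    }

    // Returns the next surviving token, or null when the input is exhausted.
    String next() {
        while (true) {
            // Top up the look-ahead window.
            while (buffer.size() < maxPhraseTokens && input.hasNext()) {
                buffer.addLast(input.next());
            }
            if (buffer.isEmpty()) {
                return null;
            }
            int matched = longestMatchAtHead();
            if (matched == 0) {
                return buffer.removeFirst(); // no phrase starts here: emit one token
            }
            for (int i = 0; i < matched; i++) {
                buffer.removeFirst();        // phrase found: drop all of its tokens
            }
        }
    }

    // Length in tokens of the longest stop phrase starting at the buffer head, or 0.
    private int longestMatchAtHead() {
        StringBuilder candidate = new StringBuilder();
        int best = 0;
        int count = 0;
        for (String token : buffer) {
            if (count > 0) {
                candidate.append(' ');
            }
            candidate.append(token);
            count++;
            if (stopPhrases.contains(candidate.toString())) {
                best = count;
            }
        }
        return best;
    }

    // Convenience wrapper: filters a whole token list in one call.
    static List<String> filter(List<String> tokens, Set<String> stopPhrases) {
        StopPhraseBuffer f = new StopPhraseBuffer(tokens.iterator(), stopPhrases);
        List<String> out = new ArrayList<>();
        for (String t = f.next(); t != null; t = f.next()) {
            out.add(t);
        }
        return out;
    }
}
```

The key design choice is that tokens which turn out not to start a phrase are emitted from the buffer rather than by rewinding the input. In a Lucene filter the buffer would presumably hold captured attribute states instead of strings, but that mapping is an assumption and is not shown here.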

What would be the best approach to implementing a "stop phrases" filter, given a stream of words like Lucene's TokenStream?

Answer 1

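In this thread I was given a solution: use Lucene's CachingTokenFilter as a starting point.

That solution was actually the right way to go.

EDIT: I fixed the dead link. Here is a transcript of the thread.

MY QUESTION:

I'm trying to implement a "stop phrases filter" with the new TokenStream API.

I would like to be able to peek N tokens ahead and see whether the current token plus the N subsequent tokens match a "stop phrase" (the set of stop phrases is kept in a HashSet), then discard all of these tokens when they match a stop phrase, or keep them all if they don't match.

For this purpose I would like to use captureState() and then restoreState() to get back to the starting point of the stream.

I tried many combinations of these APIs. My last attempt is in the code below, which doesn't work.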

    static private HashSet<String> m_stop_phrases = new HashSet<String>(); 
    static private int m_max_stop_phrase_length = 0; 
... 
    public final boolean incrementToken() throws IOException { 
        if (!input.incrementToken()) 
            return false; 
        Stack<State> stateStack = new Stack<State>(); 
        StringBuilder match_string_builder = new StringBuilder(); 
        int skippedPositions = 0; 
        boolean is_next_token = true; 
        while (is_next_token && match_string_builder.length() < m_max_stop_phrase_length) { 
            if (match_string_builder.length() > 0) 
                match_string_builder.append(" "); 
            match_string_builder.append(termAtt.term()); 
            skippedPositions += posIncrAtt.getPositionIncrement(); 
            stateStack.push(captureState()); 
            is_next_token = input.incrementToken(); 
            if (m_stop_phrases.contains(match_string_builder.toString())) { 
              // Stop phrase is found: skip the number of tokens 
              // without restoring the state 
              posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions); 
              return is_next_token; 
            } 
        } 
        // No stop phrase found: restore the stream 
        while (!stateStack.empty()) 
            restoreState(stateStack.pop()); 
        return true; 
    }

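In which direction should I look to implement my "stop phrases" filter?

CORRECT ANSWER:

restoreState only restores the token contents, not the complete stream. So you cannot roll back the token stream (this was not possible with the old API either). Because of this, the while loop at the end of your code does not work as you expect. You can use CachingTokenFilter, which can be reset and consumed again, as a source for further work.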
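The distinction can be illustrated with a toy model in plain Java. This is not the Lucene API: `ForwardOnlyStream`, `ReplayableStream` and `RestoreDemo` are hypothetical classes that only mimic the behaviour described above, with a single string standing in for the token attributes. The first class shows that restoring a captured state changes the current token's contents but not the stream position. The second shows the caching idea, where tokens are recorded once and can then be replayed after a reset.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model (not the Lucene API): a forward-only stream whose "state" is
// only the content of the current token.
class ForwardOnlyStream {
    private final Iterator<String> source;
    private String current; // stands in for the token attributes

    ForwardOnlyStream(Iterator<String> source) {
        this.source = source;
    }

    boolean incrementToken() {
        if (!source.hasNext()) {
            return false;
        }
        current = source.next();
        return true;
    }

    String term() { return current; }

    String captureState() { return current; }             // snapshot of the contents

    void restoreState(String state) { current = state; }  // position is untouched
}

// Toy counterpart of a caching wrapper: it records every token on the first
// call so that reset() can replay the stream from the beginning.
class ReplayableStream {
    private final ForwardOnlyStream input;
    private final List<String> cache = new ArrayList<>();
    private boolean filled = false;
    private int index = 0;

    ReplayableStream(ForwardOnlyStream input) {
        this.input = input;
    }

    // Returns the next token, or null at the end of the stream.
    String next() {
        if (!filled) {
            while (input.incrementToken()) {
                cache.add(input.term());
            }
            filled = true;
        }
        return index < cache.size() ? cache.get(index++) : null;
    }

    void reset() { index = 0; }
}

class RestoreDemo {
    // Captures the first token, reads ahead, restores, then reads once more.
    static String tokenAfterRestore(List<String> tokens) {
        ForwardOnlyStream s = new ForwardOnlyStream(tokens.iterator());
        s.incrementToken();
        String saved = s.captureState();
        s.incrementToken();
        s.restoreState(saved); // current token content is the first token again...
        s.incrementToken();    // ...but the stream continues from where it was
        return s.term();
    }
}
```

With the tokens a, b, c, `tokenAfterRestore` returns c rather than b: the restore did not rewind the source. A `ReplayableStream`, by contrast, yields a again after `reset()`, which is the property that makes a cached stream usable as a source for look-ahead matching.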

Answer 2

I think you really have to write your own Analyzer, because whether or not a given sequence of words is a "phrase" depends on cues such as punctuation.
