Breaking tokens into sub-tokens with a Lucene TokenFilter

Date: 2017-06-06 18:51:41

Tags: java lucene

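My program needs to index unstructured documents with Lucene (4.10), and their content can be anything. My custom Analyzer therefore uses ClassicTokenizer to tokenize the documents first.

That does not fully fit my needs, though. For example, I want to be able to search for part of an email address or part of a serial number (it could also be a phone number or anything else containing digits), which may be written 1234.5678.9012 or 1234-5678-9012 depending on who wrote the indexed document.

Because ClassicTokenizer recognizes email addresses and treats these numbers as whole tokens, the resulting index contains the entire email address and the entire serial number. I would also like to break those tokens into pieces so that users can later search for the pieces.

A concrete example: if the input document contains xyz@gmail.com, ClassicTokenizer recognizes it as an email address and emits the single token xyz@gmail.com. A user who searches for xyz finds nothing, whereas a search for xyz@gmail.com returns the expected result.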

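After reading many blog posts and SO questions, I concluded that one solution could be a TokenFilter that splits the email address into its parts (on either side of the @ sign). Please note that I do not want to resort to JFlex and the like to write my own tokenizer.

To handle email addresses I wrote the following code, inspired by the SynonymFilter from Lucene in Action, 2nd Edition: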

public class SymbolSplitterFilter extends TokenFilter {

private final CharTermAttribute termAtt;
private final PositionIncrementAttribute posIncAtt;
private final Stack<String> termStack;
private AttributeSource.State current;

public SymbolSplitterFilter(TokenStream in) {
    super(in);
    termStack = new Stack<>();
    termAtt = addAttribute(CharTermAttribute.class);
    posIncAtt = addAttribute(PositionIncrementAttribute.class);
}

@Override
public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }

    final String currentTerm = termAtt.toString();

    System.err.println("The original word was " + termAtt.toString());
    final int bufferLength = termAtt.length();

    if (bufferLength > 1 && currentTerm.indexOf("@") > 0) { // There must be sth more than just @
        // If this is the first pass we fill in the stack with the terms
        if (termStack.isEmpty()) {
            // We split the token abc@cd.com into abc and cd.com
            termStack.addAll(Arrays.asList(currentTerm.split("@")));
            // Now we have the constituting terms of the email in the stack
            System.err.println("The terms on the stacks are ");
            for (int i = 0; i < termStack.size(); i++) {
                System.err.println(termStack.get(i));
                /** The terms on the stacks are 
                * xyz
                * gmail.com
                */

            }

            // I am not sure it is the right place for this.
             current = captureState();

        } else {
            // This part seems to never be reached!
            // We add the constituents terms as tokens.
            String part = termStack.pop();
            System.err.println("Current part is " + part);
            restoreState(current);
            termAtt.setEmpty().append(part);                 
            posIncAtt.setPositionIncrement(0);
        }
    }

    System.err.println("In the end we have " + termAtt.toString());
    // In the end we have xyz@gmail.com
    return true;

}

}

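Note: I have only just started with email addresses, which is why I show only part of the code. I still have to extend it to handle serial numbers (as explained above).

However, the stack is never processed. In fact, I cannot figure out how the incrementToken method works, or when it processes a given token from the TokenStream, even though I read this SO question.

In the end, this is what I want to achieve: for xyz@gmail.com as input text, I want to produce the following sub-tokens: xyz@gmail.com, xyz, gmail.com

Any help is appreciated.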
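The desired expansion can be pinned down outside Lucene first. The following is only a minimal sketch in plain Java: the class `SubTokenExpander`, its `expand` method, and the rule for serial numbers (digit groups separated by `.` or `-`) are hypothetical names and assumptions made for illustration, not part of the Lucene API or of the code above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists a token followed by the sub-tokens it should yield. */
final class SubTokenExpander {

    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token); // the original token is always kept

        String[] parts;
        if (token.indexOf('@') > 0) {
            // Email-like token: split on each side of '@'.
            parts = token.split("@");
        } else if (token.matches("\\d+(?:[.-]\\d+)+")) {
            // Assumed serial-number shape: digit groups joined by '.' or '-'.
            parts = token.split("[.-]");
        } else {
            return out; // nothing to split
        }
        for (String part : parts) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("xyz@gmail.com"));   // [xyz@gmail.com, xyz, gmail.com]
        System.out.println(expand("1234-5678-9012"));  // [1234-5678-9012, 1234, 5678, 9012]
    }
}
```

Keeping the splitting rule in a small pure function like this makes it easy to unit-test separately from the TokenFilter plumbing.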

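1 answer:

Answer 0 (score: 2)

Your problem is that the input TokenStream is already exhausted by the time the stack has been filled for the first time. The call that fills the stack returns the original token; on the next call, input.incrementToken() returns false, so your method returns false before it ever looks at the stack. You should check whether the stack holds pending parts before advancing the input. Like this: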

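import java.io.IOException;
import java.util.Arrays;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;

public final class SymbolSplitterFilter extends TokenFilter {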

private final CharTermAttribute termAtt;
private final PositionIncrementAttribute posIncAtt;
private final Stack<String> termStack;
private AttributeSource.State current;
private final TypeAttribute typeAtt;

public SymbolSplitterFilter(TokenStream in)
{
    super(in);
    termStack = new Stack<>();
    termAtt = addAttribute(CharTermAttribute.class);
    posIncAtt = addAttribute(PositionIncrementAttribute.class);
    typeAtt = addAttribute(TypeAttribute.class);
}

@Override
public boolean incrementToken() throws IOException
{
    if (!this.termStack.isEmpty()) {
        String part = termStack.pop();
        restoreState(current);
        termAtt.setEmpty().append(part);
        posIncAtt.setPositionIncrement(0);
        return true;
    } else if (!input.incrementToken()) {
        return false;
    } else {
        final String currentTerm = termAtt.toString();
        final int bufferLength = termAtt.length();

        if (bufferLength > 1 && currentTerm.indexOf("@") > 0) { // There must be sth more than just @
            if (termStack.isEmpty()) {
                termStack.addAll(Arrays.asList(currentTerm.split("@")));
                current = captureState();
            }
        }
        return true;

    }

}
}
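The control flow is easier to see without the Lucene attribute machinery. The sketch below is a Lucene-free model, not the filter itself: the hypothetical class `PullSplitter` wraps a plain `Iterator<String>` in place of the input TokenStream, and its `next()` method plays the role of `incrementToken()`, returning `null` where the real method would return `false`. It uses a FIFO queue, so parts come out left to right, unlike the `Stack` above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, Lucene-free model of a pull-based splitting filter. */
final class PullSplitter {
    private final Iterator<String> input;                      // stands in for the wrapped TokenStream
    private final Deque<String> pending = new ArrayDeque<>();  // parts waiting to be emitted

    PullSplitter(Iterator<String> input) {
        this.input = input;
    }

    /** Returns the next token, or null once both the queue and the input are exhausted. */
    String next() {
        // 1. Drain pending parts BEFORE touching the input.
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        // 2. Only then pull from the input; stop if it has nothing left.
        if (!input.hasNext()) {
            return null;
        }
        String token = input.next();
        // 3. Queue the parts for later calls and emit the original token now.
        if (token.length() > 1 && token.indexOf('@') > 0) {
            pending.addAll(Arrays.asList(token.split("@")));
        }
        return token;
    }

    /** Pulls tokens until exhaustion, the way a consumer loops on incrementToken(). */
    static List<String> drain(List<String> tokens) {
        PullSplitter splitter = new PullSplitter(tokens.iterator());
        List<String> out = new ArrayList<>();
        for (String t = splitter.next(); t != null; t = splitter.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // The email is the last input token, yet its parts are still emitted.
        System.out.println(drain(Arrays.asList("hey", "xyz@example.com")));
    }
}
```

If steps 1 and 2 were swapped, the parts of the last token would be lost, which is the symptom described in the question.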

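Note that you will probably also want to correct the offsets and change the order of the tokens. The following test shows the tokens the filter currently produces: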

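import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;
import org.junit.Test;

public class SymbolSplitterFilterTest extends BaseTokenStreamTestCase {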


@Test
public void testSomeMethod() throws IOException
{
    Analyzer analyzer = this.getAnalyzer();
    assertAnalyzesTo(analyzer, "hey xyz@example.com",
        new String[]{"hey", "xyz@example.com", "example.com", "xyz"},
        new int[]{0, 4, 4, 4},
        new int[]{3, 19, 19, 19},
        new String[]{"word", "word", "word", "word"},
        new int[]{1, 1, 0, 0}
        );
}

 private Analyzer getAnalyzer()
{
    return new Analyzer()
    {
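        // Lucene 5+ signature. On 4.10, createComponents also takes a Reader,
        // which is passed to the MockTokenizer constructor.
        @Override
        protected Analyzer.TokenStreamComponents createComponents(String fieldName)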
        {
            Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
            SymbolSplitterFilter testFilter = new SymbolSplitterFilter(tokenizer);
            return new Analyzer.TokenStreamComponents(tokenizer, testFilter);
        }
    };
}

}
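In the test above every sub-token inherits the offsets of the whole address (4 to 19), because restoreState brings back the captured offsets along with everything else. One way to correct them is to compute each part's position inside the original token. The sketch below does that in plain Java; `PartOffsets` and its `offsets` method are hypothetical names, and it assumes, as the filter does, that '@' is the only separator. In a real filter the resulting pairs would presumably be applied through `OffsetAttribute.setOffset(start, end)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: character offsets of the '@'-separated parts of a token. */
final class PartOffsets {

    /**
     * @param token      the original token text, e.g. "xyz@example.com"
     * @param tokenStart start offset of the token in the document
     * @return one {start, end} pair per non-empty part, end exclusive
     */
    static List<int[]> offsets(String token, int tokenStart) {
        List<int[]> out = new ArrayList<>();
        int partStart = 0;
        for (int i = 0; i <= token.length(); i++) {
            boolean atSeparator = i < token.length() && token.charAt(i) == '@';
            if (i == token.length() || atSeparator) {
                if (i > partStart) { // skip empty parts such as a trailing '@'
                    out.add(new int[] {tokenStart + partStart, tokenStart + i});
                }
                partStart = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "hey xyz@example.com": the address starts at offset 4.
        for (int[] span : offsets("xyz@example.com", 4)) {
            System.out.println(span[0] + "-" + span[1]); // 4-7, then 8-19
        }
    }
}
```

With these values the test expectations for start and end offsets would become {0, 4, 4, 8} and {3, 19, 7, 19} if the parts are emitted in the order xyz, example.com.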