How to correctly implement a delegating tokenizer in Lucene 4.x?

Asked: 2014-11-18 08:49:37

Tags: java lucene delegates tokenize

The naive approach suggested by the documentation in its "creating delegates" section does not work, because it leads to a TokenStream contract violation in the delegate:

private static class TokenizerWrapper extends Tokenizer {
  public TokenizerWrapper(Reader _input) {
    super(_input);
    delegate = new WhitespaceTokenizer(_input);
  }

  @Override
  public void reset() throws IOException {
    logger.info("TokenizerWrapper.reset()");
    super.reset();
    delegate.setReader(input);
    delegate.reset();
  }

  @Override
  public final boolean incrementToken() throws IOException {
    logger.info("TokenizerWrapper.incrementToken()");
    return delegate.incrementToken();
  }

  private final WhitespaceTokenizer delegate;
}

gives me the following log:

14:30:12.885 [main] INFO  test.GapTest - TokenizerWrapper.reset()
14:30:12.886 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.897 [main] INFO  test.GapTest - TokenizerWrapper.reset()
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
    at test.GapTest$TestTokenizer.reset(GapTest.java:152)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:599)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
    at test.GapTest.main(GapTest.java:67)

Overriding the close() method as follows:

  @Override
  public void close() throws IOException {
    logger.info("TokenizerWrapper.close()");
    super.close();
    logger.info("TokenizerWrapper.delegate.close()");
    delegate.close();
    // delegate.setReader(input);
  }

did not help, apart from changing the error:

15:36:49.561 [main] INFO  test.GapTest - setting field "text" to "some text"
15:36:49.569 [main] INFO  test.GapTest - Adding created document to the index
15:36:49.605 [main] INFO  test.GapTest - createComponents()
15:36:49.633 [main] INFO  test.GapTest - TokenizerWrapper(_input)
15:36:49.638 [main] INFO  test.GapTest - TokenizerWrapper.reset()
15:36:49.639 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.641 [main] INFO  test.GapTest - TokenizerWrapper.close()
15:36:49.641 [main] INFO  test.GapTest - TokenizerWrapper.delegate.close()
15:36:49.648 [main] INFO  test.GapTest - setting field "text" to "some text 1"
15:36:49.648 [main] INFO  test.GapTest - Adding created document to the index
15:36:49.648 [main] INFO  test.GapTest - TokenizerWrapper.reset()
15:36:49.648 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.649 [main] INFO  test.GapTest - TokenizerWrapper.close()
15:36:49.649 [main] INFO  test.GapTest - TokenizerWrapper.delegate.close()
Exception in thread "main" java.lang.IllegalArgumentException: first position increment must be > 0 (got 0) for field 'address'
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:617)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
    at test.GapTest.main(GapTest.java:72)

  1. The first document ("some text" in the "text" field) was processed successfully,
  2. then processing of the second document ("some text 1") began,
  3. the first token (the word "some"; I checked this in the debugger) was [seemingly] processed successfully,
  4. and then it broke on inconsistent internal state: invertState.posIncrAttribute.getPositionIncrement() inside DefaultIndexingChain.PerField.invert(IndexableField field, boolean first) returned 0, whereas its "normal" behavior is to return 1.
  5. Of course, I could handle this particular error with further wrapping and workarounds, but most likely I am taking the wrong approach to what looks like a simple task. Please advise.

2 answers:

Answer 0 (score: 2)

I have created an abstract class in my project which works around this problem. The key places are, of course, the incrementToken, reset, close and end methods. Feel free to use these bits or the whole thing.

import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import com.google.common.collect.Iterators;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

import static vyre.util.search.LuceneVersion.VERSION_IN_USE;

/**
 * Allows easy manipulation of {@link ClassicTokenizer} by delegating calls to it while hiding all implementation details.
 *
 * @author Mindaugas Žakšauskas
 */
public abstract class ClassicTokenizerDelegate extends Tokenizer {

    private final ClassicTokenizer classicTokenizer;

    private final CharTermAttribute termAtt;

    private final TypeAttribute typeAtt;

    /**
     * Internal buffer of tokens if any of standard tokens was split into many.
     */
    private Iterator<String> pendingTokens = Iterators.emptyIterator();

    protected ClassicTokenizerDelegate(Reader input) {
        super(input);
        this.classicTokenizer = new ClassicTokenizer(VERSION_IN_USE, input);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }

    /**
     * Is called during tokenization for each token produced by {@link ClassicTokenizer}. Subclasses can call {@link #setTerm(String)} to override
     * current token or {@link #setTerms(Iterator)} if current token needs to be split into more than one token.
     *
     * @return true if a next token exists, false otherwise.
     * @see #getTerm()
     * @see #getType()
     * @see #setTerm(String)
     * @see #setTerms(Iterator)
     */
    protected abstract boolean onNextToken();

    /**
     * Subclasses can call this method during execution of {@link #onNextToken()} to retrieve current term.
     *
     * @return current term.
     * @see #getType()
     * @see #setTerm(String)
     * @see #setTerms(Iterator)
     * @see #onNextToken()
     */
    protected String getTerm() {
        return new String(termAtt.buffer(), 0, termAtt.length());
    }

    /**
     * Subclasses can call this method during execution of {@link #onNextToken()} to retrieve type of current term.
     *
     * @return type of current term.
     * @see #getTerm()
     * @see #setTerm(String)
     * @see #setTerms(Iterator)
     * @see #onNextToken()
     */
    protected String getType() {
        return typeAtt.type();
    }

    /**
     * Subclasses can call this method during execution of {@link #onNextToken()} to override current term.
     *
     * @param term the term to override with.
     * @see #getTerm()
     * @see #getType()
     * @see #setTerms(Iterator) setTerms(Iterator) - if you want to override current term with more than one term
     * @see #onNextToken()
     */
    protected void setTerm(String term) {
        termAtt.copyBuffer(term.toCharArray(), 0, term.length());
    }

    /**
     * Subclasses can call this method during execution of {@link #onNextToken()} to override current term with more than one term.
     *
     * @param terms the terms to override with.
     * @see #getTerm()
     * @see #getType()
     * @see #setTerm(String)
     * @see #onNextToken()
     */
    protected void setTerms(Iterator<String> terms) {
        setTerm(terms.next());
        pendingTokens = terms;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (pendingTokens.hasNext()) {
            setTerm(pendingTokens.next());
            return true;
        }

        clearAttributes();
        if (!classicTokenizer.incrementToken()) {
            return false;
        }

        typeAtt.setType(classicTokenizer.getAttribute(TypeAttribute.class).type());        // copy type attribute from classic tokenizer attribute

        CharTermAttribute stTermAtt = classicTokenizer.getAttribute(CharTermAttribute.class);
        setTerm(new String(stTermAtt.buffer(), 0, stTermAtt.length()));

        return onNextToken();
    }

    @Override
    public void close() throws IOException {
        super.close();
        if (input != null) {
            input.close();
        }
        classicTokenizer.close();
    }

    @Override
    public void end() throws IOException {
        super.end();
        classicTokenizer.end();
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        this.classicTokenizer.setReader(input);        // important! input has to be carried over to delegate because of poor design of Lucene
        classicTokenizer.reset();
    }
}
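
For illustration, here is a minimal usage sketch of this abstract class; the subclass name and the hyphen-splitting rule are hypothetical, not part of the original answer:

import java.io.Reader;
import java.util.Arrays;

/**
 * Hypothetical example subclass: splits each hyphenated token
 * (e.g. "foo-bar") into its parts and passes all other tokens through.
 */
public class HyphenSplittingTokenizer extends ClassicTokenizerDelegate {

    public HyphenSplittingTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean onNextToken() {
        String term = getTerm();
        if (term.indexOf('-') >= 0) {
            // replace the current token with its hyphen-separated parts;
            // setTerms() emits the first part now and buffers the rest
            setTerms(Arrays.asList(term.split("-")).iterator());
        }
        return true; // keep the (possibly rewritten) token
    }
}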

Answer 1 (score: 0)

I think it is worth stating this explicitly:

TokenizerWrapper and delegate do not share an attribute set. So even though indexing of the first document appeared to succeed, it in fact did not: the indexer received nothing. For the delegation to be meaningful, the attributes of delegate need to be mirrored (fully or partially) into TokenizerWrapper, just as @mindas does in setTerm().
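
A minimal sketch of such mirroring inside the wrapper's incrementToken() might look like this (assuming TokenizerWrapper has registered its own attribute via termAtt = addAttribute(CharTermAttribute.class), as @mindas's class does; the field names are illustrative):

@Override
public final boolean incrementToken() throws IOException {
    clearAttributes();
    if (!delegate.incrementToken()) {
        return false;
    }
    // Copy the delegate's term into the wrapper's own attribute set;
    // the indexer only ever reads the wrapper's attributes.
    CharTermAttribute delegateTerm = delegate.getAttribute(CharTermAttribute.class);
    termAtt.copyBuffer(delegateTerm.buffer(), 0, delegateTerm.length());
    return true;
}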

Perhaps I am wrong and there is some "magic machinery" which allows reusing delegate's attributes as TokenizerWrapper's attributes.
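
For what it is worth, such machinery does exist for filters, just not for tokenizers: TokenFilter hands its input to the TokenStream(AttributeSource) constructor, so a filter and its input share a single attribute set. Simplified from the Lucene 4.x sources:

// simplified from org.apache.lucene.analysis.TokenFilter
public abstract class TokenFilter extends TokenStream {
    protected final TokenStream input;

    protected TokenFilter(TokenStream input) {
        super(input); // TokenStream(AttributeSource): attributes are shared
        this.input = input;
    }
}

In 4.x, Tokenizer only offers Tokenizer(Reader) and Tokenizer(AttributeFactory, Reader), neither of which shares attribute instances, so a tokenizer wrapping another tokenizer has to mirror them by hand.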