A naive approach, suggested in the "creating delegates" section of the documentation, does not work because it leads to a TokenStream contract violation:
private static class TokenizerWrapper extends Tokenizer {
public TokenizerWrapper(Reader _input) {
super(_input);
delegate = new WhitespaceTokenizer(input);
}
@Override
public void reset() throws IOException {
logger.info("TokenizerWrapper.reset()");
super.reset();
delegate.setReader(input); // fails on reuse: the delegate is never closed
delegate.reset();
}
@Override
public final boolean incrementToken() throws IOException {
logger.info("TokenizerWrapper.incrementToken()");
return delegate.incrementToken();
}
private final WhitespaceTokenizer delegate;
}
Running it gives me the following log:
14:30:12.885 [main] INFO test.GapTest - TokenizerWrapper.reset()
14:30:12.886 [main] INFO test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.897 [main] INFO test.GapTest - TokenizerWrapper.reset()
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
at test.GapTest$TestTokenizer.reset(GapTest.java:152)
at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:599)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
at test.GapTest.main(GapTest.java:67)
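For reference, the top frame of that trace is Lucene enforcing its TokenStream lifecycle: setReader() may only be called on a Tokenizer that has been closed. A minimal consumer-side sketch of the documented workflow (the analyzer variable, field name and text are hypothetical):
// Sketch of the standard TokenStream consumption sequence; any consumer
// (including IndexWriter) is expected to drive the stream this way.
try (TokenStream ts = analyzer.tokenStream("text", "some text")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();                    // 1. mandatory before the first incrementToken()
    while (ts.incrementToken()) {  // 2. advance token by token
        System.out.println(term.toString());
    }
    ts.end();                      // 3. record end-of-stream state
}                                  // 4. close() - required before any later setReader()
Since the wrapper never closes its delegate, the delegate.setReader(input) call inside reset() blows up as soon as the tokenizer is reused for the second document.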
Overriding the close() method as follows:
@Override
public void close() throws IOException {
logger.info("TokenizerWrapper.close()");
super.close();
logger.info("TokenizerWrapper.delegate.close()");
delegate.close();
// delegate.setReader(input);
}
does not help, apart from changing the error:
15:36:49.561 [main] INFO test.GapTest - setting field "text" to "some text"
15:36:49.569 [main] INFO test.GapTest - Adding created document to the index
15:36:49.605 [main] INFO test.GapTest - createComponents()
15:36:49.633 [main] INFO test.GapTest - TokenizerWrapper(_input)
15:36:49.638 [main] INFO test.GapTest - TokenizerWrapper.reset()
15:36:49.639 [main] INFO test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.641 [main] INFO test.GapTest - TokenizerWrapper.close()
15:36:49.641 [main] INFO test.GapTest - TokenizerWrapper.delegate.close()
15:36:49.648 [main] INFO test.GapTest - setting field "text" to "some text 1"
15:36:49.648 [main] INFO test.GapTest - Adding created document to the index
15:36:49.648 [main] INFO test.GapTest - TokenizerWrapper.reset()
15:36:49.648 [main] INFO test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.649 [main] INFO test.GapTest - TokenizerWrapper.close()
15:36:49.649 [main] INFO test.GapTest - TokenizerWrapper.delegate.close()
Exception in thread "main" java.lang.IllegalArgumentException: first position increment must be > 0 (got 0) for field 'address'
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:617)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
at test.GapTest.main(GapTest.java:72)
That is,
invertState.posIncrAttribute.getPositionIncrement()
(called from DefaultIndexingChain.PerField.invert(IndexableField field, boolean first)) returns 0, while its "normal" behavior is to return 1. Of course, I could handle this particular error with further wrapping and workarounds, but most likely I am taking the wrong direction when implementing such a seemingly simple task. Please advise.
Answer 0 (score: 2)
I have created an abstract class in my project which works around this problem. The key places, of course, are the incrementToken, reset, close and end methods. Feel free to use these bits or the whole thing.
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;
import com.google.common.collect.Iterators;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import static vyre.util.search.LuceneVersion.VERSION_IN_USE;
/**
* Allows easy manipulation of {@link ClassicTokenizer} by delegating calls to it while hiding all implementation details.
*
* @author Mindaugas Žakšauskas
*/
public abstract class ClassicTokenizerDelegate extends Tokenizer {
private final ClassicTokenizer classicTokenizer;
private final CharTermAttribute termAtt;
private final TypeAttribute typeAtt;
/**
* Internal buffer of tokens if any of standard tokens was split into many.
*/
private Iterator<String> pendingTokens = Iterators.emptyIterator();
protected ClassicTokenizerDelegate(Reader input) {
super(input);
this.classicTokenizer = new ClassicTokenizer(VERSION_IN_USE, input);
termAtt = addAttribute(CharTermAttribute.class);
typeAtt = addAttribute(TypeAttribute.class);
}
/**
* Called during tokenization for each token produced by {@link ClassicTokenizer}. Subclasses can call {@link #setTerm(String)} to override the
* current token, or {@link #setTerms(Iterator)} if the current token needs to be split into more than one token.
*
* @return true if a next token exists, false otherwise.
* @see #getTerm()
* @see #getType()
* @see #setTerm(String)
* @see #setTerms(Iterator)
*/
protected abstract boolean onNextToken();
/**
* Subclasses can call this method during execution of {@link #onNextToken()} to retrieve current term.
*
* @return current term.
* @see #getType()
* @see #setTerm(String)
* @see #setTerms(Iterator)
* @see #onNextToken()
*/
protected String getTerm() {
return new String(termAtt.buffer(), 0, termAtt.length());
}
/**
* Subclasses can call this method during execution of {@link #onNextToken()} to retrieve type of current term.
*
* @return type of current term.
* @see #getTerm()
* @see #setTerm(String)
* @see #setTerms(Iterator)
* @see #onNextToken()
*/
protected String getType() {
return typeAtt.type();
}
/**
* Subclasses can call this method during execution of {@link #onNextToken()} to override current term.
*
* @param term the term to override with.
* @see #getTerm()
* @see #getType()
* @see #setTerms(Iterator) setTerms(Iterator) - if you want to override current term with more than one term
* @see #onNextToken()
*/
protected void setTerm(String term) {
termAtt.copyBuffer(term.toCharArray(), 0, term.length());
}
/**
* Subclasses can call this method during execution of {@link #onNextToken()} to override current term with more than one term.
*
* @param terms the terms to override with.
* @see #getTerm()
* @see #getType()
* @see #setTerm(String)
* @see #onNextToken()
*/
protected void setTerms(Iterator<String> terms) {
setTerm(terms.next());
pendingTokens = terms;
}
@Override
public final boolean incrementToken() throws IOException {
if (pendingTokens.hasNext()) {
setTerm(pendingTokens.next());
return true;
}
clearAttributes();
if (!classicTokenizer.incrementToken()) {
return false;
}
typeAtt.setType(classicTokenizer.getAttribute(TypeAttribute.class).type()); // copy type attribute from classic tokenizer attribute
CharTermAttribute stTermAtt = classicTokenizer.getAttribute(CharTermAttribute.class);
setTerm(new String(stTermAtt.buffer(), 0, stTermAtt.length()));
return onNextToken();
}
@Override
public void close() throws IOException {
super.close();
if (input != null) {
input.close();
}
classicTokenizer.close();
}
@Override
public void end() throws IOException {
super.end();
classicTokenizer.end();
}
@Override
public void reset() throws IOException {
super.reset();
this.classicTokenizer.setReader(input); // important! input has to be carried over to delegate because of poor design of Lucene
classicTokenizer.reset();
}
}
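For illustration, a minimal (hypothetical) subclass might look like this; the class name and the hyphen-splitting rule are made up, and java.io.Reader plus Guava's Iterators are assumed to be imported as in the listing above:
// Hypothetical subclass: splits tokens containing '-' into their parts
// via setTerms(), and passes every other token through untouched.
public class HyphenSplittingTokenizer extends ClassicTokenizerDelegate {

    public HyphenSplittingTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean onNextToken() {
        String[] parts = getTerm().split("-");
        if (parts.length > 1) {
            setTerms(Iterators.forArray(parts)); // replace the current token with its parts
        }
        return true; // keep the (possibly rewritten) token
    }
}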
Answer 1 (score: 0)
I think it is worth stating explicitly: TokenizerWrapper and delegate do not share an attribute set. So even though indexing of the first document appears to work, it in fact does not; the indexer receives nothing. For the delegation to be meaningful, the delegate's attributes need to be mirrored (fully or partially) into TokenizerWrapper, the way @mindas does it in setTerm().
Or perhaps I am wrong, and there is some "magic machinery" that allows reusing delegate's attributes as TokenizerWrapper's attributes?