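I have a problem with Lucene term vector offsets. When I analyze a field with my custom analyzer, the term vector gets invalid offsets, but with the standard analyzer everything is fine. Here is my analyzer code: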
    public class AttachmentNameAnalyzer extends Analyzer {
        private boolean stemmTokens;
        private String name;

        public AttachmentNameAnalyzer(boolean stemmTokens, String name) {
            super();
            this.stemmTokens = stemmTokens;
            this.name = name;
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
            return stream;
        }

        @Override
        public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
            TokenStream stream = (TokenStream) getPreviousTokenStream();
            if (stream == null) {
                stream = new AttachmentNameTokenizer(reader);
                if (stemmTokens)
                    stream = new SnowballFilter(stream, name);
                setPreviousTokenStream(stream);
            } else if (stream instanceof Tokenizer) {
                ((Tokenizer) stream).reset(reader);
            }
            return stream;
        }
    }
What is wrong with this? I need help.
Answer 0 (score: 0)
Which version of Lucene are you using? I am looking at the superclass code in the 3x branch, and the behavior changes from version to version.
You may want to check the code of public final boolean incrementToken(), where offset is computed.
I also see this:
    /**
     * <p>
     * As of Lucene 3.1 the char based API ({@link #isTokenChar(char)} and
     * {@link #normalize(char)}) has been depreciated in favor of a Unicode 4.0
     * compatible int based API to support codepoints instead of UTF-16 code
     * units. Subclasses of {@link CharTokenizer} must not override the char based
     * methods if a {@link Version} >= 3.1 is passed to the constructor.
     * <p>
     * <p>
     * NOTE: This method will be marked <i>abstract</i> in Lucene 4.0.
     * </p>
     */
By the way, the switch statement could also be rewritten.
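To illustrate the distinction the quoted Javadoc draws between UTF-16 code units and code points, here is a minimal standalone sketch. It uses only the JDK, not Lucene; the class and method names (`CodePointSketch`, `countLettersByChar`, `countLettersByCodePoint`) are made up for this example. It only shows why a per-`char` check can misjudge characters outside the Basic Multilingual Plane, which is presumably the motivation for the int based API.

```java
public class CodePointSketch {

    /** Counts letters by inspecting each UTF-16 code unit separately. */
    public static int countLettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            // A surrogate half is not a letter on its own.
            if (Character.isLetter(s.charAt(i))) n++;
        }
        return n;
    }

    /** Counts letters by inspecting whole code points. */
    public static int countLettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) n++;
            // Advance by 1 or 2 chars depending on the code point.
            i += Character.charCount(cp);
        }
        return n;
    }

    public static void main(String[] args) {
        // 'a', U+1D400 (MATHEMATICAL BOLD CAPITAL A, a surrogate pair), 'b'
        String s = "a\uD835\uDC00b";
        System.out.println(countLettersByChar(s));      // prints 2
        System.out.println(countLettersByCodePoint(s)); // prints 3
    }
}
```

A tokenizer that decides token membership per `char` would split the input at the surrogate pair, while a code point based check treats it as a single letter.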
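Answer 1 (score: 0)
The problem was in the analyzer. In the analyzer code I posted earlier, the token stream actually needs to be reset for every new text entry that is tokenized.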
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        TokenStream stream = (TokenStream) getPreviousTokenStream();
        if (stream == null) {
            stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
            setPreviousTokenStream(stream); // ---------------> problem was here
        } else if (stream instanceof Tokenizer) {
            ((Tokenizer) stream).reset(reader);
        }
        return stream;
    }
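Whenever I stored the previous token stream, the next text field, which has to be tokenized on its own, always started at the end offset of the last token stream. That made the term vector offsets of the new stream wrong. It now works fine with this version: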
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        TokenStream stream = (TokenStream) getPreviousTokenStream();
        if (stream == null) {
            stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
        } else if (stream instanceof Tokenizer) {
            ((Tokenizer) stream).reset(reader);
        }
        return stream;
    }
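The effect described in this answer can be reproduced with a toy model that does not depend on Lucene. Everything in the sketch below (`ToyTokenizer`, `offsetBase`, the `clearOffset` flag) is hypothetical and only mimics a reused tokenizer whose offset counter survives from one field to the next. It is not the real `AttachmentNameTokenizer`, whose source is not shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Lucene code: offsets drift when a reused tokenizer keeps old state. */
public class OffsetResetSketch {

    /** Hypothetical stand-in for a tokenizer that is reused across fields. */
    static class ToyTokenizer {
        private String input = "";
        private int pos = 0;        // read position in the current input
        private int offsetBase = 0; // running offset, carried over unless cleared

        /** Swaps in new input; clearOffset=false mimics a reuse that forgets to reset state. */
        void reset(String newInput, boolean clearOffset) {
            input = newInput;
            pos = 0;
            if (clearOffset) offsetBase = 0;
        }

        /** Returns {start, end} offsets for every whitespace-separated token. */
        List<int[]> tokenize() {
            List<int[]> offsets = new ArrayList<>();
            while (pos < input.length()) {
                while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
                int start = pos;
                while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) pos++;
                if (pos > start) offsets.add(new int[] {offsetBase + start, offsetBase + pos});
            }
            offsetBase += input.length(); // remember where this input ended
            return offsets;
        }
    }

    /** Tokenizes two fields with one tokenizer and returns the offsets of the second field. */
    public static List<int[]> secondFieldOffsets(boolean clearOffset) {
        ToyTokenizer tokenizer = new ToyTokenizer();
        tokenizer.reset("report.pdf notes", true); // first field, 16 characters
        tokenizer.tokenize();
        tokenizer.reset("final draft", clearOffset); // second field
        return tokenizer.tokenize();
    }

    public static void main(String[] args) {
        for (int[] o : secondFieldOffsets(false)) System.out.println(o[0] + "-" + o[1]);
        // prints 16-21 and 22-27: offsets continue from the end of the first field
        for (int[] o : secondFieldOffsets(true)) System.out.println(o[0] + "-" + o[1]);
        // prints 0-5 and 6-11: offsets are relative to the second field
    }
}
```

If the real tokenizer keeps a similar counter, clearing it inside its own `reset(Reader)` might let the stream be cached again instead of rebuilt on every call. That is an assumption, since the tokenizer code is not part of the question. Note also that in the posted code the `instanceof Tokenizer` branch is skipped whenever stemming wraps the tokenizer in a `SnowballFilter`.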