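Java Lucene: custom analyzer and tokenizer cause a problem with term vector offsets?

Time: 2011-06-09 13:14:59

Tags: java lucene analyzer

I am facing an issue with Lucene term vector offsets. When I analyze a field with my custom analyzer, it gives invalid offsets for the term vector, but with the standard analyzer it is fine. Here is my analyzer code: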

public class AttachmentNameAnalyzer extends Analyzer {
    private boolean stemmTokens;
    private String name;

    public AttachmentNameAnalyzer(boolean stemmTokens, String name) {
        super();
        this.stemmTokens    = stemmTokens;
        this.name           = name;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
        return stream;
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        TokenStream stream = (TokenStream) getPreviousTokenStream();

        if (stream == null) {
            stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
            setPreviousTokenStream(stream);
        } else if (stream instanceof Tokenizer) {
            ( (Tokenizer) stream ).reset(reader);
        }

        return stream;
    }
}

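What is wrong with this? I need help.

2 answers:

Answer 0 (score: 0)

Which version of Lucene are you using? I am looking at the superclass code in the 3x branch, and the behavior changes with each version.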

You may want to check the code of public final boolean incrementToken(), where offset is calculated.

I also see this:

/**
 * <p>
 * As of Lucene 3.1 the char based API ({@link #isTokenChar(char)} and
 * {@link #normalize(char)}) has been depreciated in favor of a Unicode 4.0
 * compatible int based API to support codepoints instead of UTF-16 code
 * units. Subclasses of {@link CharTokenizer} must not override the char based
 * methods if a {@link Version} >= 3.1 is passed to the constructor.
 * <p>
 * <p>
 * NOTE: This method will be marked <i>abstract</i> in Lucene 4.0.
 * </p>
 */

By the way, you can also rewrite the switch statement.
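To illustrate the Javadoc note quoted above about code points versus UTF-16 code units, here is a minimal sketch that uses only the JDK, not Lucene. The class name CodePointScanDemo and its two helper methods are made up for this illustration. It shows why a char based check can misjudge a character outside the Basic Multilingual Plane, which is presumably the motivation for the int based API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it is not part of Lucene.
class CodePointScanDemo {

    /** Char based scan: visits UTF-16 code units, so a surrogate pair counts as two units. */
    static List<Integer> codeUnits(String text) {
        List<Integer> units = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            units.add((int) text.charAt(i));
        }
        return units;
    }

    /** Int based scan: visits whole code points, advancing by one or two chars each step. */
    static List<Integer> codePoints(String text) {
        List<Integer> points = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            points.add(cp);
            i += Character.charCount(cp);
        }
        return points;
    }

    public static void main(String[] args) {
        // 'a', then U+1D538 (stored as a surrogate pair), then 'b'
        String text = "a\uD835\uDD38b";

        System.out.println(codeUnits(text).size());   // prints 4
        System.out.println(codePoints(text).size());  // prints 3

        // The int based check sees a letter; the char based check sees a lone surrogate.
        System.out.println(Character.isLetter(text.codePointAt(1))); // prints true
        System.out.println(Character.isLetter(text.charAt(1)));      // prints false
    }
}
```

A tokenizer that decides token boundaries one char at a time would treat each half of the surrogate pair as a non-letter here, which splits the text in the wrong place.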

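Answer 1 (score: 0)

The problem was in the analyzer whose code I posted earlier. The token stream actually needs to be reset for every new text entry that is to be tokenized.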

public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    TokenStream stream = (TokenStream) getPreviousTokenStream();

    if (stream == null) {
        stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
        setPreviousTokenStream(stream); // --------------->  problem was here
    } else if (stream instanceof Tokenizer) {
        ( (Tokenizer) stream ).reset(reader);
    }

    return stream;
}

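Whenever I set the previous token stream, the next text field, which has to be tokenized separately, always started from the end offset of the last token stream. That made the term vector offsets wrong for the new stream. Now it works fine with this: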

public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    TokenStream stream = (TokenStream) getPreviousTokenStream();

    // The stream is no longer stored with setPreviousTokenStream(), so unless
    // something else stores one, stream is null here and a fresh tokenizer is
    // built for every field.
    if (stream == null) {
        stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
    } else if (stream instanceof Tokenizer) {
        ( (Tokenizer) stream ).reset(reader);
    }

    return stream;
}
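
The carry-over effect described above can be reproduced without Lucene. The following is only a sketch under assumptions: SketchTokenizer, setInputOnly and the "term[start,end]" output format are invented for illustration, and the real AttachmentNameTokenizer may keep its position state differently. It shows how a reused tokenizer that receives new input without clearing its offset counter reports offsets that continue from the end of the previous text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; none of these types are Lucene classes.
class OffsetCarryOverDemo {

    /** Stand-in for a reusable tokenizer that tracks character offsets. */
    static class SketchTokenizer {
        private Reader input;
        private int offset; // number of chars consumed so far

        SketchTokenizer(Reader input) {
            this.input = input;
        }

        /** Swaps the input but keeps the old offset: the stale state under discussion. */
        void setInputOnly(Reader input) {
            this.input = input;
        }

        /** Swaps the input and clears the position state. */
        void reset(Reader input) {
            this.input = input;
            this.offset = 0;
        }

        /** Returns "term[start,end]" for each run of letters or digits (end is exclusive). */
        List<String> tokens() {
            List<String> out = new ArrayList<>();
            StringBuilder term = new StringBuilder();
            int start = -1;
            try {
                int c;
                while ((c = input.read()) != -1) {
                    if (Character.isLetterOrDigit(c)) {
                        if (term.length() == 0) start = offset;
                        term.append((char) c);
                    } else if (term.length() > 0) {
                        out.add(term + "[" + start + "," + offset + "]");
                        term.setLength(0);
                    }
                    offset++;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (term.length() > 0) out.add(term + "[" + start + "," + offset + "]");
            return out;
        }
    }

    public static void main(String[] args) {
        SketchTokenizer tokenizer = new SketchTokenizer(new StringReader("report.pdf"));
        System.out.println(tokenizer.tokens()); // prints [report[0,6], pdf[7,10]]

        // Reused without clearing state: offsets continue from 10.
        tokenizer.setInputOnly(new StringReader("notes.txt"));
        System.out.println(tokenizer.tokens()); // prints [notes[10,15], txt[16,19]]

        // Reused with a proper reset: offsets start from 0 again.
        tokenizer.reset(new StringReader("notes.txt"));
        System.out.println(tokenizer.tokens()); // prints [notes[0,5], txt[6,9]]
    }
}
```

Building a new tokenizer per field, as in the fixed method, sidesteps the stale state entirely. Resetting the whole chain would be the other option if the stream is meant to be reused.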