Lucene 6.1 custom Tokenizer and Analyzer

Time: 2016-07-30 21:20:07

Tags: java lucene tokenize

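I am looking for help with the Lucene 6.1 API.

I am trying to extend Lucene's Tokenizer and Analyzer, but I don't understand all the guides. In every tutorial, the custom Tokenizer overrides incrementToken and takes a Reader in its constructor, and the custom Analyzer overrides createComponents with a Reader argument. In Lucene 6, however, createComponents has only one String parameter, so how do I pass a Reader to my Analyzer?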

My code:

public class ChemTokenizer extends Tokenizer{
    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
    protected String stringToTokenize;
    protected int position = 0;
    protected List<int[]> chemicals = new ArrayList<>();

    @Override
    public boolean incrementToken() throws IOException {
        // Clear anything that is already saved in this.charTermAttribute
        this.charTermAttribute.setEmpty();

        // Get the position of the next symbol
        int nextIndex = -1;
        Pattern p = Pattern.compile("[^A-zА-я]");
        Matcher m = p.matcher(stringToTokenize.substring(position));
        nextIndex = m.start();
        // Did we lose chemicals?
        for (int[] pair: chemicals) {
            if (pair[0] < nextIndex && pair[1] > nextIndex) {
                //We are in the chemical name
                if (position == pair[0]) {
                    nextIndex = pair[1];
                }
                else {
                    nextIndex = pair[0];
                }
            }
        }
        // Next separator was found
        if (nextIndex != -1) {
            String nextToken = stringToTokenize.substring(position, nextIndex);
            charTermAttribute.append(nextToken);
            position = nextIndex + 1;
            return true;
        }
        // Last part of text
        else if (position < stringToTokenize.length()) {
            String nextToken = stringToTokenize.substring(position);
            charTermAttribute.append(nextToken);
            position = stringToTokenize.length();
            return true;
        }
        else {
            return false;
        }
    }
    public ChemTokenizer(Reader reader,List<String> additionalKeywords) {
        int numChars;
        char[] buffer = new char[1024];
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars =
                    reader.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        stringToTokenize = stringBuilder.toString();
        //Checking for keywords
        // Doesn't work properly if the text contains chemical synonyms
        for (String keyword: additionalKeywords) {
            int[] tmp = new int[2];
            //Start of keyword
            tmp[0] = stringToTokenize.indexOf(keyword);
            tmp[1] = tmp[0] + keyword.length() - 1;
            chemicals.add(tmp);
        }
    }

    /* Reset the stored position for this object when reset() is called.
     */
    @Override
    public void reset() throws IOException {
        super.reset();
        position = 0;
        chemicals = new ArrayList<>();

    }
}
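
Separately from the API question, the separator lookup in incrementToken above looks like it would fail at runtime: Matcher.start() throws IllegalStateException unless find() has succeeded first, and matching against substring(position) yields indices relative to the substring rather than to the whole text. The range A-z also includes the punctuation between Z and a, and А-я leaves out ё. A minimal sketch of one way to do the lookup with plain java.util.regex follows; SeparatorScan and nextSeparator are hypothetical names used only for illustration, and \p{L} is an assumed replacement for the original character class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SeparatorScan {
    // Compiled once instead of on every call; \p{L} matches any Unicode letter,
    // so Latin and Cyrillic letters (including ё) are all treated as token characters.
    private static final Pattern NON_LETTER = Pattern.compile("[^\\p{L}]");

    /**
     * Returns the index (relative to the whole text) of the first non-letter
     * at or after 'from', or -1 if there is none.
     */
    static int nextSeparator(String text, int from) {
        Matcher m = NON_LETTER.matcher(text);
        // find(int) searches from the given index; start() is only valid
        // after a successful find and reports an index into the full string.
        return m.find(from) ? m.start() : -1;
    }
}
```

Because the returned index is absolute, it can be used directly as the end bound in substring(position, nextIndex).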

Code for the Analyzer:

public class ChemAnalyzer extends Analyzer{

    List<String> additionalKeywords;
    public ChemAnalyzer(List<String> ad) {
        additionalKeywords = ad;
    }
    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        Tokenizer tokenizer = new ChemTokenizer(reader,additionalKeywords);
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }

}

The problem is that this code does not work with Lucene 6.

1 Answer:

Answer 0 (score: 0)

This is what I found through a GitHub search. My guess is that you have to create the new tokenizer without a Reader.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new WhitespaceTokenizer());
}
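
To make the guess a bit more concrete, here is a rough sketch of how a custom tokenizer might look under the Lucene 6 API. As far as I understand it, the Tokenizer no longer takes a Reader in its constructor: Lucene hands the Reader over through setReader(...), and the tokenizer reads from the inherited protected field input once reset() has been called. LetterRunTokenizer and LetterRunAnalyzer are hypothetical names, the letter-splitting rule is only a stand-in for the chemical-name logic, and offsets are not set, so treat this as an illustration rather than a finished implementation.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
// Package as of Lucene 6.1 (lucene-analyzers-common), to my knowledge;
// check the location for your version.
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical tokenizer: emits maximal runs of letters as tokens.
// Declared final because Lucene expects TokenStream implementations
// to be final (or to have a final incrementToken).
final class LetterRunTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    // No Reader parameter. Other arguments (for example a keyword list)
    // could still be passed in here.
    LetterRunTokenizer() {
        super();
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        int c;
        // Skip separators; 'input' is the Reader supplied by Lucene.
        while ((c = input.read()) != -1 && !Character.isLetter(c)) {
            // nothing to do, just advance
        }
        if (c == -1) {
            return false; // end of input, no more tokens
        }
        // Collect letters until the next separator or end of input.
        do {
            termAtt.append((char) c);
        } while ((c = input.read()) != -1 && Character.isLetter(c));
        return true;
    }
}

// Hypothetical analyzer: the only argument of createComponents is the field name.
final class LetterRunAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LetterRunTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}
```

Calling analyzer.tokenStream("body", "Some TEXT") and then reset(), incrementToken() in a loop, end() and close() should produce the terms some and text. If the tokenizer needs the whole text up front, as ChemTokenizer does, reading input to a String inside reset() (after super.reset()) rather than in the constructor is probably the closest equivalent.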