Lucene 6 - How to intercept tokenization when writing an index?

Date: 2017-03-05 21:20:46

Tags: java lucene tokenize analyzer

This question says to look at this question... but unfortunately these clever people's solution no longer seems to apply in Lucene 6, since the signature of createComponents is now

TokenStreamComponents createComponents(final String fieldName)...

i.e. a Reader is no longer supplied.

Does anyone know what the technique should be now? Are we meant to make the Reader a field of the Analyzer class?

NB I don't actually want to filter anything: I want to get hold of the token stream so that I can build my own data structures from it (for frequency analysis and sequence matching). So the idea is to use Lucene's Analyzer machinery to produce different models of the corpus. A simple example might be: one model where everything is lowercased, another where casing is left as it is in the corpus.
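If the goal is only to observe the post-filter tokens (rather than to hook into indexing itself), an Analyzer's output can also be consumed directly. A minimal sketch, assuming Lucene 6 on the classpath; the field name "body" and the helper class name are arbitrary choices for illustration:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDump {
    // Runs the given text through the analyzer chain and collects each term.
    public static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                     // mandatory before the first incrementToken()
            while (ts.incrementToken()) {
                out.add(termAtt.toString()); // capture the term after all filters ran
            }
            ts.end();
        }
        return out;
    }
}
```

The returned list can then feed whatever frequency or sequence model is wanted, without touching the index at all.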

PS I also saw this question: but there again a Reader has to be supplied: i.e. I assume the context is tokenizing for query purposes. When writing an index, although Analyzers in earlier versions were evidently handed a Reader somewhere at the point createComponents was called, you don't yet have a Reader (as far as I can see...).

1 answer:

Answer 0: (score: 0)

Got it, using once again the technique from the referenced question... which is basically to "interfere" in some way with the battery of Filters applied in the Analyzer's key method, createComponents.

So here is my doctored version of EnglishAnalyzer:
private int nTerm = 0; // field added by me: per-document term counter

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new EnglishPossessiveFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords);
    if (!stemExclusionSet.isEmpty())
        result = new SetKeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);

    // my modification starts here:
    class ExamineFilter extends FilteringTokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public ExamineFilter(TokenStream in) {
            super(in);
        }

        @Override
        protected boolean accept() throws IOException {
            String term = new String(termAtt.buffer(), 0, termAtt.length());
            printOut(String.format("# term %d |%s|", nTerm, term));

            // do all sorts of things with this term...

            nTerm++;
            return true; // accept every token: we only observe, never filter
        }
    }

    class MyTokenStreamComponents extends TokenStreamComponents {
        MyTokenStreamComponents(Tokenizer source, TokenStream result) {
            super(source, result);
        }

        @Override
        public TokenStream getTokenStream() {
            // reset the term count at the start of each Document
            nTerm = 0;
            return super.getTokenStream();
        }
    }

    result = new ExamineFilter(result);
    return new MyTokenStreamComponents(source, result);
}

As a result, with the input:

    String[] contents = { "Humpty Dumpty sat on a wall,", "Humpty Dumpty had a great fall.", ... 

the output is splendid:

# term 0 |humpti|
# term 1 |dumpti|
# term 2 |sat|
# term 3 |wall|

# term 0 |humpti|
# term 1 |dumpti|
# term 2 |had|
# term 3 |great|
# term 4 |fall|

# term 0 |all|
# term 1 |king|
# term 2 |hors|
# term 3 |all|
# term 4 |king|
# term 5 |men|

...
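For completeness, a sketch of how a doctored Analyzer like this would be wired into an IndexWriter, so that accept() fires for every token at index-writing time. MyEnglishAnalyzer is a hypothetical name standing for the tampered analyzer above; the in-memory directory and the field name "body" are likewise assumptions for illustration:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;

public class IndexWithInterceptor {
    public static void main(String[] args) throws IOException {
        // Hypothetical: the doctored EnglishAnalyzer shown in the answer above.
        Analyzer analyzer = new MyEnglishAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        try (IndexWriter writer = new IndexWriter(new RAMDirectory(), config)) {
            Document doc = new Document();
            doc.add(new TextField("body", "Humpty Dumpty sat on a wall,", Field.Store.NO));
            writer.addDocument(doc); // ExamineFilter.accept() runs here for each token
        }
    }
}
```

Because getTokenStream() resets nTerm, the per-term counter starts from 0 again for each document added.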