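This question says to look at this question... but unfortunately those clever people's solutions no longer seem to work in Lucene 6, because the signature of createComponents is now

    TokenStreamComponents createComponents(final String fieldName)...

i.e. a Reader is no longer supplied.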
Does anyone know what the current technique should be? Are we meant to make the Reader a field of the Analyzer class?
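NB I don't actually want to filter anything; I want to get hold of the stream of tokens so that I can build my own data structures (for frequency analysis and sequence matching). So the idea is to use Lucene's Analyzer technology to produce different models of the corpus. A trivial example might be: one model where everything is lower-cased, and another where casing is left as it is in the corpus.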
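To make the "different models of the corpus" idea concrete, here is a minimal library-free sketch. It does not use Lucene: the `tokenize` method is a hypothetical stand-in for an analysis chain (it just splits on non-letters), and the class name, method names and sample text are all assumptions made for illustration. It builds a lower-cased model and a case-preserving model of the same text, then derives term frequencies and adjacent-pair counts from each.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class CorpusModels {

    /** Hypothetical stand-in for an analysis chain: split on non-letters, optionally lower-case. */
    static List<String> tokenize(String text, boolean lowerCase) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (piece.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            tokens.add(lowerCase ? piece.toLowerCase(Locale.ROOT) : piece);
        }
        return tokens;
    }

    /** Term frequencies, kept in first-seen order. */
    static Map<String, Integer> frequencies(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Counts of adjacent token pairs, a simple basis for sequence matching. */
    static Map<String, Integer> bigrams(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The cat saw the Cat"; // made-up sample text
        System.out.println("lower-cased model: " + frequencies(tokenize(text, true)));
        System.out.println("cased model:       " + frequencies(tokenize(text, false)));
        System.out.println("lower-cased pairs: " + bigrams(tokenize(text, true)));
    }
}
```

With a real analyzer, only `tokenize` would change; the frequency and pair-counting steps work on any ordered list of terms.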
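PS I also saw this question, but once again we have to supply a Reader there, i.e. I assume the context is tokenising for the purpose of querying. When writing an index, although Analyzers in earlier versions were clearly getting a Reader from somewhere when createComponents was called, you don't yet have a Reader (that I know of...).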
Answer 0 (score: 0)
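Got it, again using the technique from the referenced question... which is basically to "interfere" in some way with the battery of Filters applied in the Analyzer's key method, createComponents.

Hence my modified EnglishAnalyzer: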
    private int nTerm = 0; // field added by me

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new EnglishPossessiveFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopwords);
        if (!stemExclusionSet.isEmpty())
            result = new SetKeywordMarkerFilter(result, stemExclusionSet);
        result = new PorterStemFilter(result);

        // my modification starts here:
        class ExamineFilter extends FilteringTokenFilter {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

            public ExamineFilter(TokenStream in) {
                super(in);
            }

            @Override
            protected boolean accept() throws IOException {
                String term = new String(termAtt.buffer(), 0, termAtt.length());
                printOut(String.format("# term %d |%s|", nTerm, term));
                // do all sorts of things with this term...
                nTerm++;
                return true;
            }
        }

        class MyTokenStreamComponents extends TokenStreamComponents {
            MyTokenStreamComponents(Tokenizer source, TokenStream result) {
                super(source, result);
            }

            public TokenStream getTokenStream() {
                // reset term count at start of each Document
                nTerm = 0;
                return super.getTokenStream();
            }
        }

        result = new ExamineFilter(result);
        return new MyTokenStreamComponents(source, result);
        //
    }
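The pattern above (a filter that accepts every token and only observes it, plus a counter that restarts for each document) can be sketched without Lucene as a pass-through wrapper around a plain `Iterator<String>`. This is an analogy under stated assumptions, not Lucene API: `TokenTap`, `TappedIterator` and `examine` are hypothetical names, and the token lists stand in for the output of an analysis chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

final class TokenTap {

    /** Reports (position, token) to an observer and passes every token through unchanged. */
    static final class TappedIterator implements Iterator<String> {
        private final Iterator<String> in;
        private final BiConsumer<Integer, String> observer;
        private int position = 0; // one counter per wrapped stream

        TappedIterator(Iterator<String> in, BiConsumer<Integer, String> observer) {
            this.in = in;
            this.observer = observer;
        }

        @Override
        public boolean hasNext() {
            return in.hasNext();
        }

        @Override
        public String next() {
            String token = in.next();
            observer.accept(position++, token); // observe only; nothing is dropped
            return token;
        }
    }

    /** Drains one document's tokens through the tap and records "position:token" entries. */
    static List<String> examine(List<String> documentTokens) {
        List<String> seen = new ArrayList<>();
        Iterator<String> tapped =
                new TappedIterator(documentTokens.iterator(), (pos, tok) -> seen.add(pos + ":" + tok));
        while (tapped.hasNext()) {
            tapped.next();
        }
        return seen;
    }

    public static void main(String[] args) {
        // Each document gets a fresh wrapper, so numbering restarts at 0.
        System.out.println(examine(Arrays.asList("humpti", "dumpti", "sat", "wall")));
        System.out.println(examine(Arrays.asList("humpti", "dumpti", "had", "great", "fall")));
    }
}
```

One design difference from the code above: the counter here lives in the per-stream wrapper rather than in a field of the enclosing object. A counter stored on an Analyzer instance is shared by every thread using that instance, so it would likely need the same treatment if documents are analysed concurrently.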
And the results, with input:

    String[] contents = { "Humpty Dumpty sat on a wall,", "Humpty Dumpty had a great fall.", ...

are great:
# term 0 |humpti|
# term 1 |dumpti|
# term 2 |sat|
# term 3 |wall|
# term 0 |humpti|
# term 1 |dumpti|
# term 2 |had|
# term 3 |great|
# term 4 |fall|
# term 0 |all|
# term 1 |king|
# term 2 |hors|
# term 3 |all|
# term 4 |king|
# term 5 |men|
...