I am currently developing a custom analyzer for a Mahout clustering project. Because Mahout 0.8 updated Lucene to 4.3, I can no longer generate tokenized document files or SequenceFiles from the book's outdated samples. The following code is my revision of the example from Mahout in Action; however, it throws an IllegalStateException.
public class MyAnalyzer extends Analyzer {

    private final Pattern alphabets = Pattern.compile("[a-z]+");
    Version version = Version.LUCENE_43;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Build the standard tokenizer/filter stack.
        Tokenizer source = new StandardTokenizer(version, reader);
        TokenStream filter = new StandardFilter(version, source);
        filter = new LowerCaseFilter(version, filter);
        filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET);

        // Manually consume the stream, keeping only short, purely
        // alphabetic tokens, and collect them into a buffer.
        CharTermAttribute termAtt = (CharTermAttribute) filter.addAttribute(CharTermAttribute.class);
        StringBuilder buf = new StringBuilder();
        try {
            filter.reset();
            while (filter.incrementToken()) {
                if (termAtt.length() > 10) {
                    continue;
                }
                String word = new String(termAtt.buffer(), 0, termAtt.length());
                Matcher matcher = alphabets.matcher(word);
                if (matcher.matches()) {
                    buf.append(word).append(" ");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Re-tokenize the buffered text and return that tokenizer together
        // with the already-consumed filter.
        source = new WhitespaceTokenizer(version, new StringReader(buf.toString()));
        return new TokenStreamComponents(source, filter);
    }
}
Answer 0 (score: 0)
I'm not quite sure why you're getting an IllegalStateException, but there are a few likely causes. Normally, an analyzer builds its filters on top of its tokenizer. You do that here, but then you create a second tokenizer and return that one instead, so the filter you return has no relation to the tokenizer you return. On top of that, the filter you built has already been consumed by the time it is returned, so you could try calling reset() on it, I suppose.

The main problem, though, is that createComponents is not a good place to implement parsing logic; it is where you set up your Tokenizer and filter stack. Custom filtering logic makes far more sense implemented in a filter of your own, extending TokenStream (or AttributeSource, or something similar), as in the sketch below.
That said, I think what you're looking for already exists in PatternReplaceCharFilter. Note that it is a CharFilter, so it wraps the incoming Reader and rewrites the character stream before tokenization, rather than being chained onto the TokenStream:
    // Strips characters that are not letters or whitespace. (Adjusted from
    // the token-level pattern ".*[^a-z].*", which applied to the raw
    // character stream would delete whole lines rather than single tokens.)
    private final Pattern nonAlpha = Pattern.compile("[^a-zA-Z\\s]+");

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // A CharFilter wraps the Reader, so apply it before the tokenizer.
        Reader filtered = new PatternReplaceCharFilter(nonAlpha, "", reader);
        Tokenizer source = new StandardTokenizer(version, filtered);
        TokenStream filter = new StandardFilter(version, source);
        filter = new LowerCaseFilter(version, filter);
        filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET);
        return new TokenStreamComponents(source, filter);
    }
Or something simpler like this might do the job:
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // LowerCaseTokenizer splits on non-letters and lowercases in one
        // step, so no separate LowerCaseFilter or alphabetic check is needed.
        Tokenizer source = new LowerCaseTokenizer(version, reader);
        TokenStream filter = new StopFilter(version, source, StandardAnalyzer.STOP_WORDS_SET);
        return new TokenStreamComponents(source, filter);
    }
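In either variant, a quick way to sanity-check what the analyzer emits is to pull a TokenStream from it directly. A minimal sketch, assuming the analyzer above is named MyAnalyzer (the field name "text" and the sample sentence are arbitrary):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerCheck {
        public static void main(String[] args) throws IOException {
            MyAnalyzer analyzer = new MyAnalyzer();
            TokenStream stream = analyzer.tokenStream("text",
                    new StringReader("The Quick brown fox jumped over 2 extraordinarily lazy dogs"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();  // required before the first incrementToken()
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
            stream.close();
        }
    }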
}