我正在寻求Lucene 6.1 API的帮助。
我试图扩展Lucene的Tokenizer
和Analyzer
,但我并不了解所有指南。在所有教程中,用户Tokenizer
会覆盖增量。在构造函数中,他们有Reader
个类,而在用户的Analyzer
类中,它们会覆盖createComponents
方法。但是在Lucene中它只有1个String参数,那么如何将Reader添加到我的Analyzer
?
我的代码:
public class ChemTokenizer extends Tokenizer{
protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
protected String stringToTokenize;
protected int position = 0;
protected List<int[]> chemicals = new ArrayList<>();
@Override
public boolean incrementToken() throws IOException {
// Clear anything that is already saved in this.charTermAttribute
this.charTermAttribute.setEmpty();
// Get the position of the next symbol
int nextIndex = -1;
Pattern p = Pattern.compile("[^A-zА-я]");
Matcher m = p.matcher(stringToTokenize.substring(position));
nextIndex = m.start();
// Did we lose chemicals?
for (int[] pair: chemicals) {
if (pair[0] < nextIndex && pair[1] > nextIndex) {
//We are in the chemical name
if (position == pair[0]) {
nextIndex = pair[1];
}
else {
nextIndex = pair[0];
}
}
}
// Next separator was found
if (nextIndex != -1) {
String nextToken = stringToTokenize.substring(position, nextIndex);
charTermAttribute.append(nextToken);
position = nextIndex + 1;
return true;
}
// Last part of text
else if (position < stringToTokenize.length()) {
String nextToken = stringToTokenize.substring(position);
charTermAttribute.append(nextToken);
position = stringToTokenize.length();
return true;
}
else {
return false;
}
}
public ChemTokenizer(Reader reader,List<String> additionalKeywords) {
int numChars;
char[] buffer = new char[1024];
StringBuilder stringBuilder = new StringBuilder();
try {
while ((numChars =
reader.read(buffer, 0, buffer.length)) != -1) {
stringBuilder.append(buffer, 0, numChars);
}
}
catch (IOException e) {
throw new RuntimeException(e);
}
stringToTokenize = stringBuilder.toString();
//Checking for keywords
//Doesnt work properly if text has chemical synonyms
for (String keyword: additionalKeywords) {
int[] tmp = new int[2];
//Start of keyword
tmp[0] = stringToTokenize.indexOf(keyword);
tmp[1] = tmp[0] + keyword.length() - 1;
chemicals.add(tmp);
}
}
/* Reset the stored position for this object when reset() is called.
*/
@Override
public void reset() throws IOException {
super.reset();
position = 0;
chemicals = new ArrayList<>();
}
}
Analyzer
的代码:
public class ChemAnalyzer extends Analyzer{
List<String> additionalKeywords;
public ChemAnalyzer(List<String> ad) {
additionalKeywords = ad;
}
@Override
protected TokenStreamComponents createComponents(String s, Reader reader) {
Tokenizer tokenizer = new ChemTokenizer(reader,additionalKeywords);
TokenStream filter = new LowerCaseFilter(tokenizer);
return new TokenStreamComponents(tokenizer, filter);
}
}
问题是此代码不适用于Lucene 6
答案 0 :(得分:0)
这是我在github search中找到的,猜测你必须创建一个没有读取的新的标记器。
@Override
protected TokenStreamComponents createComponents(String fieldName) {
return new TokenStreamComponents(new WhitespaceTokenizer()); }