I have the following code:
static class TaggerAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("al"), new CharsRef("americanleague"), true);
        builder.add(new CharsRef("al"), new CharsRef("a.l."), true);
        builder.add(new CharsRef("nba"), new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), true);
        SynonymMap mySynonymMap = null;
        try {
            mySynonymMap = builder.build();
        } catch (IOException e) {
            e.printStackTrace();
        }
        Tokenizer source = new ClassicTokenizer(Version.LUCENE_40, reader);
        TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        filter = new SynonymFilter(filter, mySynonymMap, true);
        return new TokenStreamComponents(source, filter);
    }
}
I have been running some tests, and everything worked fine until I came across this case:
String title = "Very short title at a.l. bla bla";
Assert.assertTrue(TagUtil.evaluate(memoryIndex, "americanleague"));
Assert.assertTrue(TagUtil.evaluate(memoryIndex, "al"));
I expected both assertions to pass, but "americanleague" does not match "a.l.", even though "a.l." and "americanleague" are both synonyms of "al".
So, what should I do? I don't want to add every combination to the map. Thanks.
Answer 0 (score: 0)
I believe you have the arguments to builder.add backwards. Try:
builder.add(new CharsRef("americanleague"), new CharsRef("al"), true);
builder.add(new CharsRef("a.l."), new CharsRef("al"), true);
builder.add(new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), new CharsRef("nba"), true);
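SynonymFilter only maps from the first argument (the input) to the second argument (the output), not the other way around. So you have rules that translate "al" into two different synonyms, but no rule that does anything with an input of "a.l." or "americanleague".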
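To make the direction of the rules concrete, here is a minimal sketch in plain Java. It is not Lucene code: ToyRules, questionRules and answerRules are hypothetical names invented for this illustration, and the class only models the behaviour described above (a rule fires when a token equals its input, and the original token is kept alongside the outputs).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SynonymDirectionSketch {

    /** Toy stand-in for a synonym map (an assumption, not the Lucene class). */
    static final class ToyRules {
        private final Map<String, List<String>> rules = new LinkedHashMap<>();

        /** Registers a one-way rule: input -> output. */
        void add(String input, String output) {
            rules.computeIfAbsent(input, k -> new ArrayList<>()).add(output);
        }

        /** Returns the token itself plus the outputs of every rule whose input equals it. */
        List<String> expand(String token) {
            List<String> out = new ArrayList<>();
            out.add(token); // the original token is always kept
            out.addAll(rules.getOrDefault(token, Collections.<String>emptyList()));
            return out;
        }
    }

    /** Rules in the direction used in the question: "al" is the input. */
    static ToyRules questionRules() {
        ToyRules r = new ToyRules();
        r.add("al", "americanleague");
        r.add("al", "a.l.");
        return r;
    }

    /** Rules in the direction suggested in the answer: "al" is the output. */
    static ToyRules answerRules() {
        ToyRules r = new ToyRules();
        r.add("americanleague", "al");
        r.add("a.l.", "al");
        return r;
    }

    public static void main(String[] args) {
        // No rule has "a.l." as its input, so nothing is added.
        System.out.println(questionRules().expand("a.l."));          // [a.l.]
        // With the reversed rules, both variants also produce "al".
        System.out.println(answerRules().expand("a.l."));            // [a.l., al]
        System.out.println(answerRules().expand("americanleague"));  // [americanleague, al]
    }
}
```

With the reversed rules, the title token "a.l." and the term "americanleague" both yield "al", which is where they can meet. This assumes the query term is passed through the same rules as the indexed text; whether TagUtil.evaluate does that is not shown in the question.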