Building Lucene synonyms

Time: 2014-02-27 19:39:54

Tags: lucene synonym

I have the following code:

static class TaggerAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {

        // Builder(dedup = true); each rule is add(input, output, includeOrig).
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("al"), new CharsRef("americanleague"), true);
        builder.add(new CharsRef("al"), new CharsRef("a.l."), true);
        // Multi-word outputs join their words with SynonymMap.WORD_SEPARATOR.
        builder.add(new CharsRef("nba"), new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), true);

        SynonymMap mySynonymMap = null;
        try {
            mySynonymMap = builder.build();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Tokenizer source = new ClassicTokenizer(Version.LUCENE_40, reader);
        TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        filter = new SynonymFilter(filter, mySynonymMap, true);
        return new TokenStreamComponents(source, filter);
    }
}

I have been running some tests, and everything worked fine until I ran into this case:

    String title = "Very short title at a.l. bla bla";

    Assert.assertTrue(TagUtil.evaluate(memoryIndex,"americanleague"));
    Assert.assertTrue(TagUtil.evaluate(memoryIndex,"al"));

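I expected both assertions to pass, but "americanleague" does not match "a.l.", even though "a.l." and "americanleague" are both synonyms of "al".

So, what should I do? I don't want to add every combination to the map. Thanks.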

1 answer:

Answer 0: (score: 0)

I believe you have the arguments to builder.add backwards. Try:

builder.add(new CharsRef("americanleague"), new CharsRef("al"), true);
builder.add(new CharsRef("a.l."), new CharsRef("al"), true);
builder.add(new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), new CharsRef("nba"), true);

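SynonymFilter only maps from the first arg (the input) to the second arg (the output), not the other way around. So you have rules that translate "al" into two different synonyms, but nothing that does anything with an input of "a.l." or "americanleague".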
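
To make the direction of the rules concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class SynonymDirectionDemo and its methods add, expand, questionRules and answerRules are hypothetical names made up for this illustration, and the model assumes single-token rules with the original token always kept (as with includeOrig = true).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one-directional synonym rules. Hypothetical class, not part of Lucene. */
public class SynonymDirectionDemo {

    /** Rules keyed by input token; each input maps to the outputs added for it. */
    private final Map<String, List<String>> rules = new LinkedHashMap<>();

    /** Same argument order as builder.add(input, output, ...): input first, output second. */
    public SynonymDirectionDemo add(String input, String output) {
        rules.computeIfAbsent(input, k -> new ArrayList<>()).add(output);
        return this;
    }

    /** Returns the token itself, followed by the outputs of any rule whose input equals it. */
    public List<String> expand(String token) {
        List<String> result = new ArrayList<>();
        result.add(token); // the original token is always kept in this model
        result.addAll(rules.getOrDefault(token, Collections.<String>emptyList()));
        return result;
    }

    /** Rules in the order used in the question: "al" is the input. */
    public static SynonymDirectionDemo questionRules() {
        return new SynonymDirectionDemo()
                .add("al", "americanleague")
                .add("al", "a.l.");
    }

    /** Rules in the order suggested by the answer: "al" is the output. */
    public static SynonymDirectionDemo answerRules() {
        return new SynonymDirectionDemo()
                .add("americanleague", "al")
                .add("a.l.", "al");
    }

    public static void main(String[] args) {
        System.out.println(questionRules().expand("a.l."));         // [a.l.]
        System.out.println(questionRules().expand("al"));           // [al, americanleague, a.l.]
        System.out.println(answerRules().expand("a.l."));           // [a.l., al]
        System.out.println(answerRules().expand("americanleague")); // [americanleague, al]
    }
}
```

With the question's rules, a title containing "a.l." gains no extra tokens, so neither "al" nor "americanleague" can match it. With the answer's rules, "a.l." and "americanleague" each gain the shared token "al". In this model they only meet through "al", so a search for "americanleague" finds the title only if the query is expanded with the same rules; whether TagUtil.evaluate does that is not shown in the question.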