How to generate compound tokens based on whitespace in Lucene 6?

Asked: 2016-10-18 12:55:38

Tags: java indexing lucene tokenize

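I have a Lucene entry like this:

"increased heartrate"

When I encounter the text "increased heart rate", I want to match this entry in the index. This means I need to tokenize the input as: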

{increased, heart, rate}
{increasedheart, rate}
{increased, heartrate}

How can I do this with Lucene 6+?

Kind regards

1 Answer:

Answer 0 (score: 0)

Here is how I did it; suggestions are welcome:

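// Package names are for Lucene 6.x; adjust them if your version has moved a class.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.HyphenatedWordsFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public class MyAnalyzer extends Analyzer {

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Split on whitespace only, then lowercase each token.
    final Tokenizer src = new WhitespaceTokenizer();
    TokenStream tok = new LowerCaseFilter(src);
    // Rejoin words that were hyphenated across a line break.
    tok = new HyphenatedWordsFilter(tok);
    // getStopFilter is a helper that is not shown in this answer;
    // presumably it wraps the stream in a stop-word filter.
    tok = getStopFilter(tok);
    // Emit each token plus every pair of adjacent tokens (shingles of size 2).
    ShingleFilter filter = new ShingleFilter(tok, 2);
    // An empty separator joins each pair without a space: "heart" + "rate" -> "heartrate".
    filter.setTokenSeparator("");
    tok = filter;

    return new TokenStreamComponents(src, tok);
  }

}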

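Note the ShingleFilter and the call to setTokenSeparator(""). With a maximum shingle size of 2 and an empty separator, each pair of adjacent tokens is emitted as one joined token (for example "heartrate") in addition to the individual tokens.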
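
To make the effect concrete, here is a small plain-Java sketch of the token list that this kind of chain should produce. It does not use Lucene at all: the class name ShingleSketch and the method shingles are made up for illustration, and the logic only imitates lowercasing, whitespace splitting, and size-2 shingles with an empty separator. It ignores the hyphenation and stop-word steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is NOT Lucene's ShingleFilter implementation.
public class ShingleSketch {

  // Returns every lowercased word, each followed by the word joined
  // with its right-hand neighbour (no separator), when there is one.
  static List<String> shingles(String text) {
    List<String> out = new ArrayList<>();
    String trimmed = text.trim();
    if (trimmed.isEmpty()) {
      return out; // nothing to tokenize
    }
    String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
    for (int i = 0; i < words.length; i++) {
      out.add(words[i]);                    // the single word
      if (i + 1 < words.length) {
        out.add(words[i] + words[i + 1]);   // the joined pair
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [increased, increasedheart, heart, heartrate, rate]
    System.out.println(shingles("increased heart rate"));
    // prints [increased, increasedheartrate, heartrate]
    System.out.println(shingles("increased heartrate"));
  }
}
```

Both lists contain "increased" and "heartrate", which is why the text "increased heart rate" can match the entry "increased heartrate" when the same analyzer is used for indexing and for searching. To check what the real analyzer emits, you can iterate over analyzer.tokenStream(...) and read the CharTermAttribute of each token.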