QueryParser with a CustomAnalyzer seems to apply PatternReplaceCharFilter in the wrong order

Time: 2016-04-28 09:26:56

Tags: java lucene query-parser

I am using org.apache.lucene.queryparser.classic.QueryParser in Lucene 6.0.0 to parse queries with a CustomAnalyzer, as shown below:

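import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public static void testFilmAnalyzer() throws IOException, ParseException {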
    CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
            .addCharFilter("patternreplace",
                    "pattern", "(movie|film|picture).*",
                    "replacement", "")
            .withTokenizer("standard")
            .build();

    QueryParser qp = new QueryParser("name", nameAnalyzer);
    qp.setDefaultOperator(QueryParser.Operator.AND);
    String[] strs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};

    for (String str : strs) {
        System.out.println("Analyzing \"" + str + "\":");
        showTokens(str, nameAnalyzer);
        Query q = qp.parse(str);
        System.out.println("Parsed query of \"" + str + "\":");
        System.out.println(q + "\n");
    }
}

private static void showTokens(String text, Analyzer analyzer) throws IOException {
    // try-with-resources closes the stream even if incrementToken() throws
    try (TokenStream stream = analyzer.tokenStream("name", new StringReader(text))) {
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + term.toString() + "]");
        }
        stream.end(); // finish the consume workflow before the stream is closed
    }
    System.out.println();
}

When I call testFilmAnalyzer, I get the following output:

Analyzing "avatar film fiction":
[avatar]
Parsed query of "avatar film fiction":
+name:avatar +name:fiction

Analyzing "avatar-film fiction":
[avatar]
Parsed query of "avatar-film fiction":
+name:avatar +name:fiction

Analyzing "avatar-film-fiction":
[avatar]
Parsed query of "avatar-film-fiction":
name:avatar

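It looks as if the analyzer applies the PatternReplaceCharFilter in the expected order (that is, before tokenization), whereas QueryParser applies it afterwards. Does anyone have an explanation? Isn't this a bug?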

1 answer:

Answer 0 (score: 1)

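No, this is not a bug. CharFilters are always applied before tokenization, both at query time and at index time.

However, whitespace is meaningful in QueryParser syntax, which is entirely independent of analysis. Whitespace separates the clauses of a query, and each clause is analyzed on its own. This is easier to see if you don't rely on the default field: the query avatar-film fiction would then be written as name:avatar-film name:fiction. Each of those two clauses, "avatar-film" and "fiction", is analyzed separately, which produces the results you are seeing. The .* in your pattern can therefore only remove text up to the end of the current clause, never across whitespace into the next one.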
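To make the clause-by-clause behaviour concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: `analyze` is a hypothetical stand-in for the analyzer (the question's regex applied first, then a crude split on non-alphanumeric characters), and `parseLikeQueryParser` is a hypothetical stand-in that only imitates the whitespace splitting the parser performs before analysis. The real QueryParser grammar is considerably richer, so treat this as an illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ClauseAnalysisSketch {
    // Same pattern as the char filter in the question.
    private static final Pattern FILTER = Pattern.compile("(movie|film|picture).*");

    // Stand-in for the analyzer: char filter first, then a crude tokenizer
    // that splits on anything that is not a letter or a digit.
    static List<String> analyze(String text) {
        String filtered = FILTER.matcher(text).replaceAll("");
        List<String> tokens = new ArrayList<>();
        for (String token : filtered.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for the parser: split into clauses on whitespace first,
    // then analyze every clause separately.
    static List<String> parseLikeQueryParser(String query) {
        List<String> terms = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            terms.addAll(analyze(clause));
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] inputs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
        for (String input : inputs) {
            System.out.println(input + " -> whole: " + analyze(input)
                    + ", per clause: " + parseLikeQueryParser(input));
        }
        // avatar film fiction -> whole: [avatar], per clause: [avatar, fiction]
        // avatar-film fiction -> whole: [avatar], per clause: [avatar, fiction]
        // avatar-film-fiction -> whole: [avatar], per clause: [avatar]
    }
}
```

The per-clause results line up with the parsed queries in the question: the filter still runs before tokenization, but it only ever sees one clause at a time.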

Try using phrase queries instead:

String[] strs = {"\"avatar film fiction\"", "\"avatar-film fiction\"", "\"avatar-film-fiction\""};

You should then see the results you were expecting.
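Why quoting helps can be sketched in plain Java as well. The class below is hypothetical and Lucene-free: it assumes a clause is either a double-quoted phrase or a run of non-whitespace characters, and it uses a simplified `analyze` stand-in (the question's regex followed by a split on non-alphanumeric characters). It only approximates how a quoted phrase reaches the analyzer as a single piece of text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseClauseSketch {
    // Same pattern as the char filter in the question.
    private static final Pattern FILTER = Pattern.compile("(movie|film|picture).*");
    // Assumed clause shape: a double-quoted phrase, or a run of non-whitespace.
    private static final Pattern CLAUSE = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Simplified analyzer stand-in: regex char filter, then a crude tokenizer.
    static List<String> analyze(String text) {
        String filtered = FILTER.matcher(text).replaceAll("");
        List<String> tokens = new ArrayList<>();
        for (String token : filtered.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Returns the tokens of each clause; clauses that analyze to nothing are dropped.
    static List<List<String>> parse(String query) {
        List<List<String>> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(query);
        while (m.find()) {
            // Group 1 is the content of a quoted phrase, group 2 a bare term.
            String text = m.group(1) != null ? m.group(1) : m.group(2);
            List<String> tokens = analyze(text);
            if (!tokens.isEmpty()) {
                clauses.add(tokens);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("avatar film fiction"));     // [[avatar], [fiction]]
        System.out.println(parse("\"avatar film fiction\"")); // [[avatar]]
    }
}
```

With the quotes in place the whole phrase is one clause, so the pattern can remove everything from "film" onwards, which is the behaviour the question expected.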