Question

Elasticsearch's Word Delimiter filter has a very useful option, type_table; it lets you turn otherwise special characters into legitimate token characters.

However, it is sparsely documented:

type_table
A custom type mapping table, for example (when configured using type_table_path):
    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \\u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \\u200D => ALPHANUM

From that example, we can see that DIGIT and ALPHANUM are valid options that characters can be mapped to. What other options are there, and what do they do?

Answer 1

I found the answer by digging into the Lucene documentation, which the Elasticsearch documentation essentially quotes from.

This file in the Subversion repository documents WordDelimiterFilterFactory. The Elasticsearch documentation quotes from it heavily, but it contains this extra snippet:

The allowable types for WordDelimiterFilterFactory's custom type mappings are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
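
As a rough illustration of how the listed types might be used, here is a minimal Python sketch that builds index settings with an inline type_table and rejects any type name not in the list above. The filter name `financial_delimiter` and the helper `build_type_table` are hypothetical, and the sketch only prints the settings JSON; it does not talk to a cluster.

```python
import json

# Types listed in the Lucene WordDelimiterFilterFactory documentation quoted above.
ALLOWED_TYPES = {"LOWER", "UPPER", "ALPHA", "DIGIT", "ALPHANUM", "SUBWORD_DELIM"}


def build_type_table(mappings):
    """Turn {char: TYPE} into 'char => TYPE' lines, rejecting unknown types.

    Hypothetical helper, not part of any Elasticsearch client library.
    """
    lines = []
    for char, type_name in mappings.items():
        if type_name not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type {type_name!r} for {char!r}")
        lines.append(f"{char} => {type_name}")
    return lines


# The mappings mirror the documentation example from the question.
# "\\u002C" is the literal text \u002C (an escaped comma), as in that example.
settings = {
    "analysis": {
        "filter": {
            "financial_delimiter": {  # hypothetical filter name
                "type": "word_delimiter",
                "type_table": build_type_table(
                    {"$": "DIGIT", "%": "DIGIT", ".": "DIGIT", "\\u002C": "DIGIT"}
                ),
            }
        }
    }
}

print(json.dumps(settings, indent=2))
```

The printed JSON could then be supplied as the `settings` part of an index creation request. Validating the type names up front is only a convenience, so that a typo shows up before the request is sent.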
