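Elasticsearch's Word Delimiter filter has a very useful option, type_table. It lets you turn characters that would otherwise be treated as special into legal token characters.
However, it is poorly documented: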
type_table
A custom type mapping table, for example (when configured using type_table_path):
# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\\u002C => DIGIT
# in some cases you might not want to split on ZWJ
# this also tests the case where we need a bigger byte[]
# see http://en.wikipedia.org/wiki/Zero-width_joiner
\\u200D => ALPHANUM
From this example, we can see that DIGIT and ALPHANUM are valid types that a character can be mapped to. What other options are there, and what do they do?
Answer 0 (score: 3)
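I found the answer by digging into the Lucene documentation, which the Elasticsearch documentation essentially quotes from.
This file in the Subversion repository is the documentation associated with WordDelimiterFilterFactory. The Elasticsearch documentation borrows heavily from it, but the file contains this additional snippet:
A customized type mapping for WordDelimiterFilterFactory
the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM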
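
For illustration, here is a minimal Python sketch that checks mapping rules against the type names listed above and then builds an index settings body that passes the rules inline through type_table. The names check_rules, money_delimiter, and money_analyzer are hypothetical, and the choice of the whitespace tokenizer is an assumption made only so that the mapped characters reach the filter; treat it as a sketch rather than a recommended configuration.

```python
import json

# Type names taken from the Lucene snippet quoted above.
ALLOWED_TYPES = {"LOWER", "UPPER", "ALPHA", "DIGIT", "ALPHANUM", "SUBWORD_DELIM"}


def check_rules(rules):
    """Return the rules as a list, raising ValueError on a malformed rule
    or on a type name that is not in ALLOWED_TYPES. (Hypothetical helper.)"""
    for rule in rules:
        # Split on the last "=>" so the character part is left untouched.
        char, sep, type_name = rule.rpartition("=>")
        if not sep or not char.strip():
            raise ValueError(f"malformed rule: {rule!r}")
        if type_name.strip() not in ALLOWED_TYPES:
            raise ValueError(f"unknown type in rule: {rule!r}")
    return list(rules)


# "money_delimiter" and "money_analyzer" are made-up names for this sketch;
# "word_delimiter" and "type_table" are the Elasticsearch keys in question.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "money_delimiter": {
                    "type": "word_delimiter",
                    "type_table": check_rules(
                        ["$ => DIGIT", "% => DIGIT", ". => DIGIT"]
                    ),
                }
            },
            "analyzer": {
                "money_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",  # assumed choice for this sketch
                    "filter": ["money_delimiter"],
                }
            },
        }
    }
}

# The printed JSON could be sent as the body of an index creation request.
print(json.dumps(settings, indent=2))
```

A rule such as "$ => NUMBER" would be rejected by check_rules before the settings are ever sent, which is a cheap way to catch a misspelled type name on the client side.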