Tokenizer and Filter: splitting data in Elasticsearch settings

Time: 2015-01-12 10:55:03

Tags: curl elasticsearch

I am using the following settings to split the strings in my index.

{
  "settings": {
    "analysis": {
      "filter": {
        "filter_stop_word": {
          "type": "stop"
        },
        "custom_unique": {
          "type": "unique"
        },
        "custom_shingle": {
          "type": "shingle",
          "token_separator": "",
          "max_shingle_size": "3",
          "filler_token": ""
        },
        "filter_word_delimiter": {
          "type": "word_delimiter"
        }
      },
      "analyzer": {
        "en_us": {
          "filter": [
            "filter_stop_word",
            "filter_word_delimiter",
            "custom_shingle",
            "lowercase",
            "unique"
          ],
          "tokenizer": "standard"
        }
      }
    }
  }
}

Input: "Treeviewcontrol is one of the tools"

If I run the above input through my settings, it produces the following output:

[tree,treeview,treeviewcontrol,view,viewcontrol,Viewcontrolone,controlone,tool]

But the output I need is: tree, view, control, treeview, viewcontrol, one, tool

Tokens must not be joined across whitespace. Can anyone help?
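
To make the requirement concrete, here is a small Python sketch of the intended behavior. It is only an illustration, not an Elasticsearch configuration. It assumes the input is spelled in camel case as "TreeViewControl is one of the tool" (which the token lists above suggest), uses ASCII character classes for the case split, and uses a hypothetical three-word stop list; the names CAMEL, STOP_WORDS and expected_tokens are made up for this example.

```python
import re

# ASCII stand-in for camel-case splitting (assumption for illustration).
CAMEL = re.compile(r"[a-z]+|[A-Z][a-z]+|[A-Z]+|\d+")

# Hypothetical stop list covering only the words in the example sentence.
STOP_WORDS = {"is", "of", "the"}


def expected_tokens(text):
    """Return single parts plus adjacent-pair joins, never crossing whitespace."""
    tokens = []
    for word in text.split():
        if word.lower() in STOP_WORDS:
            continue
        parts = [p.lower() for p in CAMEL.findall(word)]
        tokens.extend(parts)
        # Join neighbouring parts of the same word only.
        tokens.extend(a + b for a, b in zip(parts, parts[1:]))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(tokens))


print(expected_tokens("TreeViewControl is one of the tool"))
# ['tree', 'view', 'control', 'treeview', 'viewcontrol', 'one', 'tool']
```

Because the pairs are built per whitespace-separated word, joins such as "controlone" can never appear, which is the difference from the shingle output listed above.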

1 Answer:

Answer 0 (score: 0)

Using a camel-case token filter (built here on the pattern_capture filter), you can break tokens on case changes:

curl -XPUT localhost:9200/test/  -d '{
   "settings" : {
      "analysis" : {
         "filter" : {
            "camelFilter" : {
               "type" : "pattern_capture",
               "preserve_original" : false,
               "patterns" : [
                  "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
                  "(\\d+)"
               ]
            }
         },
         "analyzer" : {
            "camel" : {
               "tokenizer" : "pattern",
               "filter" : [ "camelFilter", "lowercase" ]
            }
         }
      }
   }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d 'qboxElasticsearchServiceProvider'
{
  "tokens" : [ {
    "token" : "qbox",
    "start_offset" : 0,
    "end_offset" : 32,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 32,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "service",
    "start_offset" : 0,
    "end_offset" : 32,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "provider",
    "start_offset" : 0,
    "end_offset" : 32,
    "type" : "word",
    "position" : 1
  } ]
}
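
To see how the two patterns split the example string, the filter chain can be roughly emulated in Python. This is a sketch under assumptions, not the Lucene implementation: Python's re module has no \p{Ll} or \p{Lu}, so ASCII classes stand in for them, and the helper name camel_tokens is made up for this example.

```python
import re

# ASCII approximations of the Java classes \p{Ll} and \p{Lu}
# (assumption: the input contains only ASCII letters and digits).
PATTERNS = [
    re.compile(r"([a-z]+|[A-Z][a-z]+|[A-Z]+)"),
    re.compile(r"(\d+)"),
]


def camel_tokens(token):
    """Collect every capture-group match, order by position, then lowercase."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(token):
            found.append((match.start(1), match.group(1)))
    found.sort(key=lambda item: item[0])
    return [text.lower() for _, text in found]


print(camel_tokens("qboxElasticsearchServiceProvider"))
# ['qbox', 'elasticsearch', 'service', 'provider']
```

The first pattern picks up lowercase runs, capitalized words and uppercase runs; the second picks up digit runs, so an input such as "abc123Def" would yield abc, 123 and def.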

LINK - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-pattern-capture-tokenfilter.html