How do I use the standard tokenizer together with preserve_original?

Asked: 2018-10-19 11:28:11

Tags: elasticsearch

I created two custom analyzers, shown below, but neither works the way I need. Here is what I want in my inverted index: for example, for the word reb-tn2000xxxl I need reb, tn2000xxxl, and reb-tn2000xxxl to all be indexed.

{  
   "analysis":{  
      "filter":{  
         "my_word_delimiter":{  
            "split_on_numerics":"true",
            "generate_word_parts":"true",
            "preserve_original":"true",
            "generate_number_parts":"true",
            "catenate_all":"true",
            "split_on_case_change":"true",
            "type":"word_delimiter"
         }
      },
      "analyzer":{  
         "my_analyzer":{  
            "filter":[  
               "standard",
               "lowercase",
               "my_word_delimiter"
            ],
            "type":"custom",
            "tokenizer":"whitespace"
         },
         "standard_caseinsensitive":{  
            "filter":[  
               "standard",
               "lowercase"
            ],
            "type":"custom",
            "tokenizer":"keyword"
         },
         "my_delimiter":{  
            "filter":[  
               "lowercase",
               "my_word_delimiter"
            ],
            "type":"custom",
            "tokenizer":"standard"
         }
      }
   }
}
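For reference, settings like these are supplied when the index is created; a minimal sketch in console style (the index name my_index is an assumption, and the "analysis" object is the one shown above):

```
PUT my_index
{
  "settings": {
    "analysis": { ... the "analysis" object above ... }
  }
}
```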

If I use my_analyzer, which uses the whitespace tokenizer, and check it with curl, the result looks like this:

  curl -XGET "index/_analyze?analyzer=my_analyzer&pretty=true" -d "reb-tn2000xxxl"
{
  "tokens" : [ {
    "token" : "reb-tn2000xxxl",
    "start_offset" : 0,
    "end_offset" : 14,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "reb",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "rebtn2000xxxl",
    "start_offset" : 0,
    "end_offset" : 14,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "tn",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2000",
    "start_offset" : 6,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "xxxl",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "word",
    "position" : 3
  } ]
}

So here I am missing the tn2000xxxl split. If I use the standard tokenizer instead of whitespace I do get that split, but the problem is that once I use a custom analyzer built on the standard tokenizer, like my_delimiter, I no longer get the original value in the inverted index. It seems the standard tokenizer and the preserve_original filter do not work together as expected. I read somewhere that this is because the standard tokenizer already splits the original input before the filter is applied, so the original is no longer intact by then. But how can I accomplish this and keep the original value from being split the way the standard tokenizer splits it?

curl -XGET "index/_analyze?analyzer=my_delimiter&pretty=true" -d "reb-tn2000xxxl"
{  
   "tokens":[  
      {  
         "token":"reb",
         "start_offset":0,
         "end_offset":3,
         "type":"<ALPHANUM>",
         "position":0
      },
      {  
         "token":"tn2000xxxl",
         "start_offset":4,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"tn",
         "start_offset":4,
         "end_offset":6,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"tn2000xxxl",
         "start_offset":4,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"2000",
         "start_offset":6,
         "end_offset":10,
         "type":"<ALPHANUM>",
         "position":2
      },
      {  
         "token":"xxxl",
         "start_offset":10,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":3
      }
   ]
}

1 Answer:

Answer 0 (score: 1)

In Elasticsearch you can define multiple fields on a mapping, and the behavior you describe is actually quite common. You can analyze the main text field with the standard analyzer, and keep a keyword sub-field that holds the unanalyzed original. Here is a mapping example from the documentation that uses multi-fields: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

In this example, the "city" field is analyzed with the standard analyzer, while "city.raw" is the unanalyzed keyword version. In other words, "city.raw" holds the original string.
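The same idea can be combined with the custom analyzer from the question; a sketch in console style (the index name my_index and field name product are assumptions, and the "analysis" settings are the ones defined in the question):

```
PUT my_index
{
  "settings": {
    "analysis": { ... the "analysis" settings from the question ... }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "my_delimiter",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
```

With this mapping, searches against product match the tokens produced by my_delimiter (reb, tn, 2000, xxxl, tn2000xxxl, ...), while product.raw matches the exact original string reb-tn2000xxxl, so both forms are available at query time.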