Is there a way to disable fuzzy search on numbers in text in Elasticsearch?

Asked: 2019-01-14 13:21:16

Tags: elasticsearch

I have several strings, for example:

1. 'any text marium malik 127'
2. 'other text marium malik 1.7 other text'
3. 'marium malik 1 7' etc. 
4. 'any other text only'

Mapping:

'terms' => ['type' => 'text', 'analyzer' => 'new_analyzer']

'new_analyzer' => [
    'tokenizer' => 'standard',
    'filter' => [
        'word_delimiter', 'lowercase',
        'shingles_2_3', 'remove_space',
    ],
],

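If I enable fuzziness and set it to "auto", then search for "marium malik 127", the second and third strings are also returned because of the fuzziness, which I do not want. Is there a way to disable fuzziness for numbers?

Full mapping: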

'body' => [
    'settings' => [
        'analysis' => [
            'analyzer' => [
                'extract_number_analyzer' => [
                    'tokenizer' => 'standard',
                    'filter' => ['extract_numbers', 'decimal_digit'],
                ],
                'new_analyzer' => [
                    'tokenizer' => 'standard',
                    'filter' => [
                        'word_delimiter', 'lowercase', 'word_combination', 'length2', 'remove_space',
                    ],
                ],
            ],
            'filter' => [
                'word_combination' => [
                    'type' => 'shingle',
                    'min_shingle_size' => 2,
                    'max_shingle_size' => 3,
                    'output_unigrams' => true,
                ],
                'extract_numbers' => [
                    'type' => 'keep_types',
                    'types' => ['<NUM>'],
                ],
                'remove_space' => [
                    'type' => 'pattern_replace',
                    'pattern' => ' ',
                    'replacement' => '',
                ],
                'length2' => [
                    'type' => 'length',
                    'min' => 3,
                ],
            ],
        ],
    ],
    'mappings' => [
        '_doc' => [
            'properties' => [
                'terms' => [
                    'type' => 'text',
                    'analyzer' => 'new_analyzer',
                    'fields' => [
                        'extracted_number' => [
                            'type' => 'text',
                            'analyzer' => 'extract_number_analyzer',
                        ],
                    ],
                ],
            ],
        ],
    ],
]

1 Answer:

Answer 0 (score: 0)

You can use the keep_types token filter to keep only the numeric tokens in a subfield.

Analyzer example:

PUT /keep_types_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "extract_number_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["extract_numbers", "decimal_digit"]
                }
            },
            "filter" : {
                "extract_numbers" : {
                    "type" : "keep_types",
                    "types" : [ "<NUM>" ]
                }
            }
        }
    }
}
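
To see why this subfield separates the sample strings from the question, here is a rough Python approximation of which tokens extract_number_analyzer would keep. It is only a sketch under stated assumptions: the regex stands in for the standard tokenizer's <NUM> token type (assumed here to be digit runs optionally joined by "." or ","), and extract_numbers is a hypothetical helper, not part of any Elasticsearch client. Running the _analyze API against the real index remains the authoritative check.

```python
import re

# Assumption: a whitespace-delimited token counts as <NUM> when it consists of
# digit runs optionally joined by '.' or ','. This only approximates the
# standard tokenizer; it is not Elasticsearch's implementation.
NUM_TOKEN = re.compile(r"\d+(?:[.,]\d+)*")


def extract_numbers(text):
    """Return the numeric tokens the subfield would roughly index."""
    return [tok for tok in text.split() if NUM_TOKEN.fullmatch(tok)]


samples = [
    "any text marium malik 127",
    "other text marium malik 1.7 other text",
    "marium malik 1 7",
    "any other text only",
]

for sample in samples:
    print(sample, "->", extract_numbers(sample))
```

Under this approximation only the first string yields the token 127, so a non-fuzzy match for 127 on the subfield would exclude the second and third strings.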

Then, in the mapping:

...
{
  "terms": {
    "type": "text",
    "analyzer": "new_analyzer",
    "fields": {
      "extracted_number": {
        "type": "text",
        "analyzer": "extract_number_analyzer"
      }
    }
  }
}
...

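Then, at query time, you can add a clause that matches against the numeric subfield without fuzziness. A document then matches only if its numbers match exactly and its text content matches with fuzziness.

Query example: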

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "terms": {
              "query": "marium malik 127",
              "fuzziness": "AUTO"
            }
          }
        },
        {
          "match": {
            "terms.extracted_number": { // or whatever your subfield name is
              "query": "marium malik 127",
              "zero_terms_query": "all" // still match when the query contains no number
            }
          }
        }
      ]
    }
  }
}
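
If the query is assembled in application code, the request body can be built with a small helper. The following Python sketch only constructs the JSON body shown above. The name build_query and its parameters are hypothetical, and sending the body to the cluster (for example through a client's search call) is deliberately left out.

```python
import json


def build_query(text, field="terms", number_subfield="extracted_number"):
    """Build a bool query: fuzzy match on the text, exact match on numbers.

    The field and subfield names default to the ones used in the mapping
    above; they are assumptions and should be adjusted to the real index.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Fuzzy clause on the full-text field.
                    {"match": {field: {"query": text, "fuzziness": "AUTO"}}},
                    # Non-fuzzy clause on the numbers-only subfield.
                    # zero_terms_query=all keeps the clause from filtering
                    # everything out when the text contains no number.
                    {
                        "match": {
                            f"{field}.{number_subfield}": {
                                "query": text,
                                "zero_terms_query": "all",
                            }
                        }
                    },
                ]
            }
        }
    }


body = build_query("marium malik 127")
print(json.dumps(body, indent=2))
```

The same search text is passed to both clauses on purpose: the subfield's analyzer discards everything except the numbers, so the caller does not need to split the input.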