Elasticsearch multi-field fuzzy search does not return exact matches first

Time: 2013-08-02 18:56:09

Tags: javascript elasticsearch

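I am running a fuzzy Elasticsearch query on the 'text' and 'keywords' fields. I have two documents in Elasticsearch, one whose 'text' is "testPhone 5" and another whose 'text' is "testPhone 4s". When I run a fuzzy query for "testPhone 5", both documents are given exactly the same score. Why does this happen?

Additional info: I index the documents with the 'uax_url_email' tokenizer and the 'lowercase' filter.

Here is the query I am running: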

{
    query : {
        bool: {
            // match one or the other fuzzy query
            should: [
                {
                    fuzzy: {
                        text: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 5,
                        }
                    }
                },
                {
                    fuzzy: {
                        keywords: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 1,
                        }
                    }
                }
            ]
        }
    },
    sort: [ 
        '_score'
    ],
    explain: true
}

The results are as follows:

{ max_score: 0.47213298,
  total: 2,
  hits:
  [ { _index: 'test',
     _shard: 0,
     _id: '51fbf95f82e89ae8c300002c',
     _node: '0Mtfzbe1RDinU71Ordx-Ag',
     _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000027', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002c',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 5',
      keywords: [ [length]: 0 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 { _index: 'test',
   _shard: 4,
   _id: '51fbf95f82e89ae8c300002d',
   _node: '0Mtfzbe1RDinU71Ordx-Ag',
   _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000029', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002d',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 4s',
      keywords: [ 'apple', [length]: 1 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 [length]: 2 ] }

2 Answers:

Answer 0: (score: 2)

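Fuzzy queries are not analyzed, but the field is. So your search for "testphone 5" with a min_similarity of 0.4 matches the analyzed term "testphone" in both documents, and that term is what is used to filter and score the results: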

description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
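The effect described above can be sketched in plain JavaScript. This is only an illustration, not Elasticsearch's implementation: `tokenize` is a crude stand-in for the analyzer (lowercase plus whitespace split), and `MAX_EDITS = 2` is an assumption about how `min_similarity: 0.4` is applied to an 11-character query term. The whole query string is compared, unanalyzed, against each indexed term.

```javascript
// Classic dynamic-programming Levenshtein edit distance between two strings.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumption: rough approximation of the analyzed field (lowercase + split on whitespace).
const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

// Assumption: the fuzzy query tolerates at most 2 edits for this query term.
const MAX_EDITS = 2;

// The query value is NOT tokenized; it is compared as one string to every indexed term.
function matchingTerms(queryValue, fieldText) {
  return tokenize(fieldText).filter(
    (term) => editDistance(queryValue, term) <= MAX_EDITS
  );
}

console.log(matchingTerms('testphone 5', 'testPhone 5'));  // [ 'testphone' ]
console.log(matchingTerms('testphone 5', 'testPhone 4s')); // [ 'testphone' ]
```

Both documents match only through the shared term "testphone" (two edits away from "testphone 5"), which is why their scores are identical. The boost of 3.8888888 in the explain output equals 5 × (1 − 2/9), which appears consistent with the query boost being scaled by that edit distance against the 9-character term.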

See also @imotov's excellent answer: ElasticSearch's Fuzzy Query

You can see how a string gets tokenized by using the _analyze API:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html

For example,

http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5

will return:

{
  "tokens": [
    { "token": "testphone", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 },
    { "token": "5", "start_offset": 10, "end_offset": 11, "type": "<NUM>", "position": 2 }
  ]
}

So even if you had indexed the value "testphone sammsung", a fuzzy query for "testphone samsunk" would not yield anything, whereas a fuzzy query for "samsunk" alone would.
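As a small sketch, the _analyze request URL shown in this answer can be built programmatically. The host and the `prefix_test` index name come from the answer's example; the helper name `analyzeUrl` is hypothetical, and the `field`/`text` query parameters reflect the API as used in this answer, which may differ in other Elasticsearch versions.

```javascript
// Hypothetical helper: builds an _analyze URL for a given index, field and text.
function analyzeUrl(baseUrl, index, field, text) {
  // URLSearchParams form-encodes the values, so the space becomes '+'.
  const params = new URLSearchParams({ field, text });
  return `${baseUrl}/${index}/_analyze?${params.toString()}`;
}

const url = analyzeUrl('http://localhost:9200', 'prefix_test', 'text', 'testphone 5');
console.log(url);
// http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5
```

The resulting URL can be passed to any HTTP client; the tokens in the response are the terms a fuzzy query is actually compared against.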

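You might get better results by not analyzing the field (or by using the keyword analyzer).

If you want to analyze a single field in different ways, you can use the multi_field construct.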

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
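A minimal sketch of such a mapping, written as a JavaScript object, might look like the following. It assumes the multi_field syntax of the Elasticsearch versions current when this answer was written (later versions replaced it with a `fields` option on regular field types). The analyzer name `my_lowercase_analyzer` and the sub-field name `untouched` are placeholders, not taken from the question.

```javascript
// Hypothetical mapping for the 'productgroup' type from the question.
const productgroupMapping = {
  productgroup: {
    properties: {
      text: {
        type: 'multi_field',
        fields: {
          // Default sub-field: analyzed as before (analyzer name is a placeholder).
          text: { type: 'string', analyzer: 'my_lowercase_analyzer' },
          // Extra sub-field: the whole value is kept as a single, unmodified term.
          untouched: { type: 'string', index: 'not_analyzed' }
        }
      }
    }
  }
};

// A fuzzy query can then target the un-analyzed sub-field by its path.
const fuzzyOnWholeValue = {
  fuzzy: { 'text.untouched': { value: 'testPhone 5', min_similarity: 0.4 } }
};

console.log(JSON.stringify(productgroupMapping, null, 2));
console.log(JSON.stringify(fuzzyOnWholeValue, null, 2));
```

Note that a not_analyzed sub-field keeps the original casing, so "testPhone 5" and "testphone 5" would be one edit apart there; a keyword tokenizer combined with a lowercase filter would avoid that.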

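Answer 1: (score: 0)

I ran into this problem myself recently. I can't tell you exactly why it happens, but I can tell you how I worked around it:

I ran two queries on the same field: one with exact matching, and then the very same query on the same field with fuzzy matching enabled and a lower boost.

That ensures my exact matches always rank higher than the fuzzy matches.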
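A rough sketch of that workaround as a query body is shown below. The answer does not say which query type was used for the exact match, so the `match` query here is an assumption, as are the helper name `exactThenFuzzy` and the boost values 10 and 1.

```javascript
// Hypothetical helper: combines a higher-boosted match query with a lower-boosted fuzzy query.
function exactThenFuzzy(field, value) {
  return {
    query: {
      bool: {
        should: [
          // Assumption: a match query stands in for the "exact match" clause.
          { match: { [field]: { query: value, boost: 10 } } },
          // Same field and value, fuzzy matching enabled, lower boost.
          { fuzzy: { [field]: { value: value, min_similarity: 0.4, boost: 1 } } }
        ]
      }
    }
  };
}

const body = exactThenFuzzy('text', 'testphone 5');
console.log(JSON.stringify(body, null, 2));
```

Because the match query is analyzed, a document containing both "testphone" and "5" can satisfy more of it than a document containing only "testphone", which should lift the exact match above the purely fuzzy one.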

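P.S. I think they score equally because, with fuzziness enabled, both documents match, and ES does not care that one of them is an exact match as long as both match. But this is pure speculation on my part, since I am not very familiar with the scoring algorithm.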