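Elasticsearch returns duplicate results

Asked: 2016-05-08 19:10:57

Tags: elasticsearch aggregate

I have the following search/City index, where each element has a name and a bunch of other properties. I run the following aggregation search: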

{
    "size": 0,
    "query": {
        "multi_match": {
            "query": "ana",
            "fields": [ "cityName" ],
            "type": "phrase_prefix"
        }
    },
    "aggs": {
        "res": {
            "terms": {
                "field": "cityName"
            },
            "aggs": {
                "dedup_docs": {
                    "top_hits": {
                        "size": 1
                    }
                }
            }
        }
    }
}

As a result I get 3 buckets with the keys "anaheim", "ana" and "santa". Here is the result:

"buckets": [
    {
      "key": "anaheim",
      "doc_count": 11,
      "dedup_docs": {
        "hits": {
          "total": 11,
          "max_score": 5.8941016,
          "hits": [
            {
              "_index": "search",
              "_type": "City",
              "_id": "310",
              "_score": 5.8941016,
              "_source": {
                "id": 310,
                "country": "USA",
                "stateCode": "CA",
                "stateName": "California",
                "cityName": "Anaheim",
                "postalCode": "92806",
                "latitude": 33.822738,
                "longitude": -117.881633
              }
            }
          ]
        }
      }
    },
    {
      "key": "ana",
      "doc_count": 4,
      "dedup_docs": {
        "hits": {
          "total": 4,
          "max_score": 2.933612,
          "hits": [
            {
              "_index": "search",
              "_type": "City",
              "_id": "154",
              "_score": 2.933612,
              "_source": {
                "id": 154,
                "country": "USA",
                "stateCode": "CA",
                "stateName": "California",
                "cityName": "Santa Ana",
                "postalCode": "92706",
                "latitude": 33.767371,
                "longitude": -117.868255
              }
            }
          ]
        }
      }
    },
    {
      "key": "santa",
      "doc_count": 4,
      "dedup_docs": {
        "hits": {
          "total": 4,
          "max_score": 2.933612,
          "hits": [
            {
              "_index": "search",
              "_type": "City",
              "_id": "154",
              "_score": 2.933612,
              "_source": {
                "id": 154,
                "country": "USA",
                "stateCode": "CA",
                "stateName": "California",
                "cityName": "Santa Ana",
                "postalCode": "92706",
                "latitude": 33.767371,
                "longitude": -117.868255
              }
            }
          ]
        }
      }
    }
]

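The question is: why does the last bucket have the key "santa" even though I searched for "ana", and why does the same city "Santa Ana" (id = 154) appear in 2 different buckets (key "ana" and key "santa")?

2 answers:

Answer 0 (score: 1)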

Update

The duplication is the behavior of the top_hits aggregation.

Check out this good tutorial:

https://www.elastic.co/blog/top-hits-aggregation

  

When the top_hits aggregation is used by itself, it just duplicates what is already in the regular hits of the response.

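Actually, analysis has nothing to do with it, so the explanation below is incorrect.

With the default settings, Elasticsearch splits the input into so-called terms. The default analyzer converts Santa Ana into 2 terms, [santa, ana], so a search for ana will also match Santa Ana. You can learn how Elasticsearch works here: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up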

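Answer 1 (score: 1)

This is mainly because your cityName field is analyzed. When Santa Ana is indexed, two tokens, santa and ana, are produced, and the terms aggregation uses those tokens for bucketing.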
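The effect can be reproduced without a cluster. The following is a minimal Python sketch, not Elasticsearch code: `analyze` is a hypothetical stand-in that only approximates the default analyzer (lowercase, split on non-word characters), and `terms_buckets` is a hypothetical helper that mimics how a terms aggregation counts a document once for every distinct token in the field. The document counts are taken from the result in the question.

```python
import re
from collections import Counter

def analyze(text):
    """Rough approximation of the default analyzer (assumption):
    lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())

def terms_buckets(values):
    """Count each document once per distinct token, the way a terms
    aggregation on an analyzed field would bucket it."""
    counts = Counter()
    for value in values:
        counts.update(set(analyze(value)))
    return dict(counts)

# The documents matched by the query: 11 x Anaheim, 4 x Santa Ana.
matched = ["Anaheim"] * 11 + ["Santa Ana"] * 4
buckets = terms_buckets(matched)
print(buckets)
```

Each Santa Ana document contributes to both the `santa` and the `ana` bucket, which is why the same hit (id 154) shows up under two keys and why both buckets report a doc_count of 4.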

If you want to prevent that, you need to define the cityName field like this:

PUT search
{
    "mappings": {
        "City": {
            "properties": {
                "cityName": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}

You first need to delete the index, recreate it with the mapping above, and then reindex your data. Only then will your bucket names be Anaheim and Santa Ana.
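To see why a not_analyzed field changes the bucket keys, here is a minimal Python sketch (a simulation, not Elasticsearch code; `keyword_buckets` is a hypothetical helper). It assumes only that a not_analyzed field is indexed as a single, unmodified term, so each document contributes exactly one bucket key.

```python
from collections import Counter

def keyword_buckets(values):
    """Bucket documents by the whole field value, as a terms aggregation
    does when the field is indexed as one unmodified term."""
    return dict(Counter(values))

# Same matched documents as in the question: 11 x Anaheim, 4 x Santa Ana.
matched = ["Anaheim"] * 11 + ["Santa Ana"] * 4
buckets = keyword_buckets(matched)
print(buckets)  # {'Anaheim': 11, 'Santa Ana': 4}
```

Note that the whole value, including its case, becomes the term, so matching against such a field is exact as well.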

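Update

If you want cityName to stay analyzed but still get only one bucket per city in the aggregation, you can define a multi-field, where one part is analyzed and the other is not, like this: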

PUT search
{
    "mappings": {
        "City": {
            "properties": {
                "cityName": {
                    "type": "string",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}

This way cityName is still analyzed, but you now also have cityName.raw, which is not analyzed and which you can use in your aggregation, like this:

    "terms": {
        "field": "cityName.raw"
    },
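As a rough illustration of how the two sub-fields divide the work, the Python sketch below simulates the idea in memory. It is not Elasticsearch code: `analyze`, `index_doc` and `search_and_aggregate` are hypothetical helpers, the prefix match only approximates a single-word phrase_prefix query, and the Los Angeles documents are made up to show a non-matching city.

```python
import re
from collections import Counter

def analyze(text):
    """Rough approximation of the default analyzer (assumption):
    lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())

def index_doc(city_name):
    """Store both views of the value, like a multi-field mapping:
    analyzed tokens for searching, the raw string for aggregating."""
    return {"cityName": analyze(city_name), "cityName.raw": city_name}

def search_and_aggregate(docs, prefix):
    """Match on the analyzed tokens, then bucket the hits on the raw value."""
    hits = [d for d in docs
            if any(tok.startswith(prefix) for tok in d["cityName"])]
    return dict(Counter(d["cityName.raw"] for d in hits))

names = ["Anaheim"] * 11 + ["Santa Ana"] * 4 + ["Los Angeles"] * 2
docs = [index_doc(name) for name in names]
buckets = search_and_aggregate(docs, "ana")
print(buckets)  # {'Anaheim': 11, 'Santa Ana': 4}
```

The search still finds Santa Ana through its `ana` token, while the aggregation sees only the whole city name, so each city ends up in a single bucket.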