Elasticsearch: return only DISTINCT matches based on the name field

Time: 2018-07-20 08:35:51

Tags: elasticsearch

Suppose I have the following two documents:

{"name": "foo", "value": 0} 
{"name": "foo", "value": 7} 

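If two or more documents have the same name but are not otherwise identical, I want only the last document added to the index to be returned. In SQL terms: SELECT DISTINCT name FROM test_data

I have tried several things, for example: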

{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
                "field": "name.keyword",
                "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

But it returns:

"aggregations": {
        "duplicateCount": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {
                            "total": 2,
                            "max_score": 1.0,
                            "hits": [
                                {
                                    "_index": "test_data",
                                    "_type": "doc",
                                    "_id": "VYHNtmQB8mCEn5EB8msO",
                                    "_score": 1.0,
                                    "_source": {
                                        "name": "foo",
                                        "value": 7
                                    }
                                },
                                {
                                    "_index": "test_data",
                                    "_type": "doc",
                                    "_id": "VIHNtmQB8mCEn5EB5Wum",
                                    "_score": 1.0,
                                    "_source": {
                                        "name": "foo",
                                        "value": 2
                                    }
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }

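Besides that, when I search for a solution, all I find is "how to get a list of distinct values" or "how to count how many distinct values there are".

If there is no solution on the Elasticsearch side, I am considering looping over the results and checking whether a result with the same name already exists, but that is time-consuming. Any ideas?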
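For reference, the client-side fallback described above could look roughly like the sketch below. It is only an illustration: the function name `last_per_name` and the extra "bar" document are made up for the example, and it assumes the documents are already fetched and arrive in insertion order (oldest first), which Elasticsearch does not guarantee without an explicit sort.

```python
def last_per_name(docs):
    """Keep only the last document seen for each name.

    Assumes `docs` is iterated oldest first, so a later document
    with the same name replaces the earlier one.
    """
    latest = {}
    for doc in docs:
        # A dict lookup per document avoids re-scanning earlier results.
        latest[doc["name"]] = doc
    return list(latest.values())


# The two documents from the question, plus a hypothetical third one.
docs = [
    {"name": "foo", "value": 0},
    {"name": "foo", "value": 7},
    {"name": "bar", "value": 3},
]
print(last_per_name(docs))
```

This is a single pass over the results, but every document still has to be transferred to the client first, which is the part that gets expensive on a large index.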

1 answer:

Answer 0: (score: 2)

You can try the size parameter of the top_hits aggregation. Consider the following query:

POST /my_top_hits/doc/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name.keyword",
        "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

In the aggregations section of the response, you get only one hit:

{
  ...
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "foo",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "my_top_hits",
                  "_type": "doc",
                  "_id": "_AfNuGQBW4b-XxcaDVib",
                  "_score": 1,
                  "_source": {
                    "name": "foo",
                    "value": 0
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

How do I return the last document added to the index?

You can try the sort parameter of the top_hits aggregation. Assuming value is a sequence number (i.e. the larger it is, the newer the document):

POST /my_top_hits/doc/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name.keyword",
        "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 1,
            "sort": [
              {"value": "desc"}
            ]
          }
        }
      }
    }
  }
}

This returns only one document, but a different one from the previous example:

"aggregations": {
  "duplicateCount": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "foo",
        "doc_count": 2,
        "duplicateDocuments": {
          "hits": {
            "total": 2,
            "max_score": null,
            "hits": [
              {
                "_index": "my_top_hits",
                "_type": "doc",
                "_id": "_QfNuGQBW4b-XxcaOFjC",
                "_score": null,
                "_source": {
                  "name": "foo",
                  "value": 7
                },
                "sort": [
                  7
                ]
              }
            ]
          }
        }
      }
    ]
  }
}

If you have no field to sort on, you will have to add one: Elasticsearch has no such feature built in. It used to have a _timestamp field, but that was deprecated long ago.

Do I need to set "min_doc_count" to 1?

Not really: the "min_doc_count" parameter of the terms aggregation defaults to 1.
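To tie the sorted top_hits query to its response, here is a minimal Python sketch that builds the same request body and pulls one document per name out of a response shaped like the one shown above. The helper names `build_latest_per_name_query` and `latest_sources` are hypothetical, the sort field is a parameter so it could be a timestamp field you add yourself, and `sample_response` is a trimmed copy of the response from this answer rather than live output. No Elasticsearch client is involved.

```python
def build_latest_per_name_query(sort_field="value"):
    """Build the terms + top_hits body, keeping one hit per name.

    `sort_field` is assumed to grow with insertion order
    (a sequence number or a timestamp you index yourself).
    """
    return {
        "size": 0,
        "aggs": {
            "duplicateCount": {
                # min_doc_count is omitted because it defaults to 1.
                "terms": {"field": "name.keyword"},
                "aggs": {
                    "duplicateDocuments": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{sort_field: "desc"}],
                        }
                    }
                },
            }
        },
    }


def latest_sources(response):
    """Return the _source of the single top hit in every bucket."""
    buckets = response["aggregations"]["duplicateCount"]["buckets"]
    return [
        bucket["duplicateDocuments"]["hits"]["hits"][0]["_source"]
        for bucket in buckets
    ]


# Trimmed copy of the response shown in this answer.
sample_response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {
                            "total": 2,
                            "hits": [
                                {"_source": {"name": "foo", "value": 7}, "sort": [7]}
                            ],
                        }
                    },
                }
            ]
        }
    }
}

print(latest_sources(sample_response))
```

Keep in mind that the terms aggregation returns only the top 10 buckets unless you set its own size parameter, so an index with many distinct names needs a larger value there.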

Hope that helps!