Question

Suppose I have the following two documents:

{"name": "foo", "value": 0} 
{"name": "foo", "value": 7}

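If two or more documents have the same name but different values, I want only the last document added to the index to be returned, i.e. the equivalent of this in SQL: SELECT DISTINCT name FROM test_data

I have tried several things, for example: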

{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name.keyword",
        "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

But it returns:

"aggregations": {
        "duplicateCount": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {
                            "total": 2,
                            "max_score": 1.0,
                            "hits": [
                                {
                                    "_index": "test_data",
                                    "_type": "doc",
                                    "_id": "VYHNtmQB8mCEn5EB8msO",
                                    "_score": 1.0,
                                    "_source": {
                                        "name": "foo",
                                        "value": 7
                                    }
                                },
                                {
                                    "_index": "test_data",
                                    "_type": "doc",
                                    "_id": "VIHNtmQB8mCEn5EB5Wum",
                                    "_score": 1.0,
                                    "_source": {
                                        "name": "foo",
                                        "value": 2
                                    }
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }

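Beyond that, when I search for a solution I only find answers to "how do I get a list of distinct values" or "how do I count the distinct values".

If there is no solution on the Elasticsearch side, I am considering looping over the results and checking whether a result with the same name already exists, but that is time-consuming. Any ideas?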
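For reference, the client-side loop described above could be sketched roughly as follows. It is a single pass over the hits using a dict keyed by name, so it never re-scans earlier results. The hit shape mirrors the response shown above; `keep_last_per_name` and the sample ids are hypothetical. It also assumes the hits arrive in indexing order, which Elasticsearch does not guarantee unless the search is explicitly sorted.

```python
def keep_last_per_name(hits):
    """Return one hit per name, keeping the hit that appears last in `hits`."""
    latest = {}
    for hit in hits:
        name = hit["_source"]["name"]
        # A later hit overwrites an earlier one that has the same name.
        latest[name] = hit
    return list(latest.values())


# Hypothetical hits, shaped like the entries of "hits" in a search response.
hits = [
    {"_id": "a", "_source": {"name": "foo", "value": 0}},
    {"_id": "b", "_source": {"name": "foo", "value": 7}},
    {"_id": "c", "_source": {"name": "bar", "value": 3}},
]

unique_hits = keep_last_per_name(hits)
for hit in unique_hits:
    print(hit["_id"], hit["_source"])
```

This still pulls every duplicate over the network before discarding it, which is the cost the question is trying to avoid.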

Answer 1

You can try the size parameter of the top_hits aggregation. Consider the following query:

POST /my_top_hits/doc/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name.keyword",
        "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

In the aggregations part of the response, you will get just one hit per bucket:

{
  ...
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "foo",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "my_top_hits",
                  "_type": "doc",
                  "_id": "_AfNuGQBW4b-XxcaDVib",
                  "_score": 1,
                  "_source": {
                    "name": "foo",
                    "value": 0
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

How do I return the last document added to the index?

You can try the sort parameter of the top_hits aggregation. Assuming value is a sequence number (i.e. the larger it is, the more recent the document):

POST /my_top_hits/doc/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name.keyword",
        "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 1,
            "sort": [
              {"value": "desc"}
            ]
          }
        }
      }
    }
  }
}

This still returns only one document, but a different one from the previous example:

"aggregations": {
  "duplicateCount": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "foo",
        "doc_count": 2,
        "duplicateDocuments": {
          "hits": {
            "total": 2,
            "max_score": null,
            "hits": [
              {
                "_index": "my_top_hits",
                "_type": "doc",
                "_id": "_QfNuGQBW4b-XxcaOFjC",
                "_score": null,
                "_source": {
                  "name": "foo",
                  "value": 7
                },
                "sort": [
                  7
                ]
              }
            ]
          }
        }
      }
    ]
  }
}

If there is no field to sort on, you will have to add one: Elasticsearch has no such feature. It used to have a _timestamp field, but that was deprecated long ago.

Do I need to set "min_doc_count" to 1?

Not really: the "min_doc_count" parameter of the terms aggregation has a default value of 1.
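As a rough illustration of how a client might use this, the sketch below builds the terms + top_hits request body as a plain Python dict and pulls one document per name out of a response. The helper names `latest_per_name_query` and `extract_latest` are hypothetical, the sample response is trimmed from the output above, and no Elasticsearch client call is shown, since how the body is sent depends on the client in use. "min_doc_count" is left out because 1 is already the default.

```python
def latest_per_name_query(sort_field="value", name_field="name.keyword"):
    """Build a request body: one bucket per name, newest document per bucket."""
    return {
        "size": 0,  # skip the regular hits, only the aggregation is needed
        "aggs": {
            "duplicateCount": {
                "terms": {"field": name_field},
                "aggs": {
                    "duplicateDocuments": {
                        # size 1 plus a descending sort keeps only the newest doc
                        "top_hits": {"size": 1, "sort": [{sort_field: "desc"}]}
                    }
                },
            }
        },
    }


def extract_latest(response):
    """Return the _source of the single top hit in every bucket."""
    buckets = response["aggregations"]["duplicateCount"]["buckets"]
    return [b["duplicateDocuments"]["hits"]["hits"][0]["_source"] for b in buckets]


# Trimmed copy of the response shown in this answer.
sample_response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {
                            "total": 2,
                            "hits": [
                                {"_source": {"name": "foo", "value": 7}, "sort": [7]}
                            ],
                        }
                    },
                }
            ]
        }
    }
}

query = latest_per_name_query()
latest_docs = extract_latest(sample_response)
print(latest_docs)
```

Note that the terms aggregation returns only the top 10 buckets by default, so with more distinct names its own "size" parameter would need to be raised.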

Hope that helps!

Elasticsearch: return only DISTINCT matches based on the name field
