Suppose I have the following two documents:
{"name": "foo", "value": 0}
{"name": "foo", "value": 7}
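If two or more documents share the same name but differ in their other fields, I only want the last document added to the index to be returned. In SQL terms, something like: SELECT DISTINCT name FROM test_data
I have tried several things, for example: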
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name.keyword",
        "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
But it returns every document in the bucket:
"aggregations": {
  "duplicateCount": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "foo",
        "doc_count": 2,
        "duplicateDocuments": {
          "hits": {
            "total": 2,
            "max_score": 1.0,
            "hits": [
              {
                "_index": "test_data",
                "_type": "doc",
                "_id": "VYHNtmQB8mCEn5EB8msO",
                "_score": 1.0,
                "_source": {
                  "name": "foo",
                  "value": 7
                }
              },
              {
                "_index": "test_data",
                "_type": "doc",
                "_id": "VIHNtmQB8mCEn5EB5Wum",
                "_score": 1.0,
                "_source": {
                  "name": "foo",
                  "value": 2
                }
              }
            ]
          }
        }
      }
    ]
  }
}
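Beyond that, when I search for a solution I only find answers to "how do I get a list of distinct values" or "how do I count the number of distinct values".
If there is no Elasticsearch-side solution, I am considering looping over the results and checking whether a result with the same name already exists, but that is time-consuming. Any ideas?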
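For reference, the client-side fallback described above could be sketched as follows. This is only an illustration: it assumes the documents have already been fetched in indexing order, and the function name `keep_last_per_name` is hypothetical. The loop itself is a single pass with a dictionary lookup per document, so the real cost is fetching every document from the index rather than the comparison.

```python
def keep_last_per_name(docs, key="name"):
    """Return one document per distinct key value, keeping the last one seen."""
    latest = {}
    for doc in docs:
        # A later document overwrites an earlier one that has the same name.
        latest[doc[key]] = doc
    return list(latest.values())


# The two sample documents from the question, in indexing order.
docs = [
    {"name": "foo", "value": 0},
    {"name": "foo", "value": 7},
]

print(keep_last_per_name(docs))
```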
Answer 0 (score: 2)
You can try using the size parameter of the top_hits aggregation. Consider the following query:
POST /my_top_hits/doc/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name.keyword",
        "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
In the top_hits section of the response you will now get only one hit:
{
  ...
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "foo",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "my_top_hits",
                  "_type": "doc",
                  "_id": "_AfNuGQBW4b-XxcaDVib",
                  "_score": 1,
                  "_source": {
                    "name": "foo",
                    "value": 0
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
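To turn a response like the one above into one document per name, something along these lines may help. It is a minimal sketch that works on the response as a plain Python dict, however it was obtained; the function name `docs_from_buckets` is hypothetical, and the aggregation names are the ones used in the query above.

```python
def docs_from_buckets(response, terms_agg="duplicateCount", hits_agg="duplicateDocuments"):
    """Map each bucket key to the _source of the first hit in its top_hits."""
    result = {}
    for bucket in response["aggregations"][terms_agg]["buckets"]:
        hits = bucket[hits_agg]["hits"]["hits"]
        if hits:  # skip a bucket whose top_hits list is empty
            result[bucket["key"]] = hits[0]["_source"]
    return result


# Trimmed copy of the response shown above.
response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {
                            "total": 2,
                            "hits": [
                                {
                                    "_id": "_AfNuGQBW4b-XxcaDVib",
                                    "_source": {"name": "foo", "value": 0},
                                }
                            ],
                        }
                    },
                }
            ]
        }
    }
}

print(docs_from_buckets(response))
```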
To control which document is returned, you can try using the sort parameter of the top_hits aggregation. Assuming value is a sequence number (i.e. the larger it is, the more recent the document):
POST /my_top_hits/doc/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name.keyword",
        "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 1,
            "sort": [
              {"value": "desc"}
            ]
          }
        }
      }
    }
  }
}
This again returns only one document, but a different one from the previous example:
"aggregations": {
  "duplicateCount": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "foo",
        "doc_count": 2,
        "duplicateDocuments": {
          "hits": {
            "total": 2,
            "max_score": null,
            "hits": [
              {
                "_index": "my_top_hits",
                "_type": "doc",
                "_id": "_QfNuGQBW4b-XxcaOFjC",
                "_score": null,
                "_source": {
                  "name": "foo",
                  "value": 7
                },
                "sort": [
                  7
                ]
              }
            ]
          }
        }
      }
    ]
  }
}
If you do not have a field to sort on, you will have to add one: Elasticsearch does not provide this out of the box. It used to have a _timestamp field, but that was deprecated long ago.
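If the same pattern is needed for other fields, the request body can be built programmatically. The sketch below only constructs the dict and sends nothing; the function name `latest_per_group_query` and the `max_groups` argument are hypothetical. One addition that is not in the queries above: the terms aggregation returns a limited number of buckets (10 by default), so its own size is exposed here in case there are more distinct names than that.

```python
import json


def latest_per_group_query(group_field, sort_field, max_groups=10):
    """Build a search body returning the top document per group, newest first."""
    return {
        "size": 0,  # no regular hits, only aggregations
        "aggs": {
            "duplicateCount": {
                # One bucket per distinct value of group_field.
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    "duplicateDocuments": {
                        # Keep a single document per bucket: the one with
                        # the largest sort_field value.
                        "top_hits": {"size": 1, "sort": [{sort_field: "desc"}]}
                    }
                },
            }
        },
    }


body = latest_per_group_query("name.keyword", "value")
print(json.dumps(body, indent=2))
```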
Do you need to set "min_doc_count" to 1? Not really: for the terms aggregation, 1 is already the default value of the "min_doc_count" parameter.
Hope that helps!