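I'm trying to find duplicates in my index by aggregating users on a key built from an [array] field plus another field, using a script.
My question is why the terms aggregation counts a document only once for a given key (smith@gmail.com_SMITH), and whether this behaviour can be changed.
Data: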
POST users/user
{
  "name": "SMITH",
  "emails": [
    "smith@gmail.com"
  ]
}
POST users/user
{
  "name": "SMITH",
  "emails": [
    "mrsmith@gmail.com",
    "smith@gmail.com"
  ]
}
Query:
POST users/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "script": {
          "inline": "doc['emails.keyword'].value + '_' + doc['name.keyword'].value"
        }
      }
    }
  }
}
Result:
"aggregations": {
  "duplicateCount": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "mrsmith@gmail.com_SMITH",
        "doc_count": 1
      },
      {
        "key": "smith@gmail.com_SMITH",
        "doc_count": 1
      }
    ]
  }
}
Answer 0 (score: 0)
It seems you only get the correct terms aggregation counts with "terms" + "field".
If you try this query, you can see the difference between "terms" + "field" and "terms" + "script":
{
  "from": 0,
  "size": 0,
  "_source": true,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "SMITH",
              "operator": "OR",
              "fuzziness": "AUTO",
              "prefix_length": 1,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "boost": 1
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "duplicateCount": {
      "terms": {
        "script": {
          "inline": "doc['emails.keyword'].value + '_' + doc['name.keyword'].value"
        }
      }
    },
    "duplicateCount2": {
      "terms": {
        "field": "emails.keyword"
      }
    }
  }
}
Here are the results. See duplicateCount2:
{
  "took": 53,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "duplicateCount2": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "smith@gmail.com",
          "doc_count": 2
        },
        {
          "key": "mrsmith@gmail.com",
          "doc_count": 1
        }
      ]
    },
    "duplicateCount": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "mrsmith@gmail.com_SMITH",
          "doc_count": 1
        },
        {
          "key": "smith@gmail.com_SMITH",
          "doc_count": 1
        }
      ]
    }
  }
}
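The difference appears to come from how the script reads the field: as far as I know, doc['emails.keyword'].value returns only the first value of a multi-valued field (doc values are kept in sorted order), while a "field" terms aggregation places the document in one bucket per value. Below is a rough pure-Python simulation of that behaviour on the two documents from the question. It does not talk to Elasticsearch, and the helper names (doc_values, terms_by_script_value, terms_by_field) are made up for illustration.

```python
from collections import Counter

# The two documents from the question.
docs = [
    {"name": "SMITH", "emails": ["smith@gmail.com"]},
    {"name": "SMITH", "emails": ["mrsmith@gmail.com", "smith@gmail.com"]},
]


def doc_values(values):
    """Mimic keyword doc values: sorted and de-duplicated (assumption)."""
    return sorted(set(values))


def terms_by_script_value(docs):
    """Simulate the script that uses `.value`: exactly one key per document."""
    counts = Counter()
    for d in docs:
        first_email = doc_values(d["emails"])[0]  # only the first value is seen
        counts[first_email + "_" + d["name"]] += 1
    return counts


def terms_by_field(docs):
    """Simulate "terms" + "field": one bucket entry per value of the array."""
    counts = Counter()
    for d in docs:
        for email in doc_values(d["emails"]):
            counts[email] += 1
    return counts


script_buckets = terms_by_script_value(docs)
field_buckets = terms_by_field(docs)
print(dict(script_buckets))
print(dict(field_buckets))
```

In the second document "mrsmith@gmail.com" sorts before "smith@gmail.com", so the `.value` script never produces a smith@gmail.com key for it, which matches the doc_count of 1 seen in duplicateCount.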
Answer 1 (score: 0)
OK. So I solved it by iterating over the array of terms and building the desired keys manually:
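// Emit one key per email so the document lands in every matching bucket.
def keys = [];
for (p in doc['emails.keyword'].values) {
  // The '_' separator matches the keys shown in the result.
  keys.add(p + '_' + doc['name.keyword'].value);
}
return keys;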
The result is as follows:
"buckets": [
  {
    "key": "smith@gmail.com_SMITH",
    "doc_count": 2
  },
  {
    "key": "mrsmith@gmail.com_SMITH",
    "doc_count": 1
  }
]
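For completeness, here is a minimal sketch of how the script might be placed into the search request body, built as a Python dict so it can be serialized and sent with any HTTP client. The function name build_duplicate_count_request is hypothetical, and the "inline" key simply mirrors the question (other Elasticsearch versions may expect a different key for the script source).

```python
import json

# Painless script from this answer, with the "_" separator in each key.
SCRIPT = (
    "def keys = []; "
    "for (p in doc['emails.keyword'].values) { "
    "keys.add(p + '_' + doc['name.keyword'].value); "
    "} "
    "return keys;"
)


def build_duplicate_count_request(script_source=SCRIPT):
    """Build the body for POST users/_search (hypothetical helper)."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "duplicateCount": {
                "terms": {
                    # "inline" is the script key used in the question.
                    "script": {"inline": script_source}
                }
            }
        },
    }


body = build_duplicate_count_request()
print(json.dumps(body, indent=2))
```

If only real duplicates are of interest, adding "min_doc_count": 2 to the terms aggregation should drop the buckets that contain a single document.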