Question

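In Elasticsearch, I am trying to count the distinct field values in my dataset according to how often each value occurs.

In a sense, I am trying to count how often duplicates occur. How can I do this?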

Suppose I have the following Elasticsearch documents:

{ "myfield": "bob" }
{ "myfield": "bob" }
{ "myfield": "alice" }
{ "myfield": "eve" }
{ "myfield": "mallory" }

Since "alice", "eve", and "mallory" each appear once and "bob" appears twice, I would expect:

number_of_values_that_appear_once: 3
number_of_values_that_appear_twice_or_more: 1

I can get part of the way there with a terms aggregation, by looking at the doc_count of each bucket. The output of a terms aggregation on myfield looks something like:

"buckets": [
  {
    "key": "bob",
    "doc_count": 2
  },
  {
    "key": "alice",
    "doc_count": 1
  },
  ...
]

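From this output, I could add up the number of buckets where doc_count == 1. But this does not scale: I often have thousands of distinct values, so the bucket list would be huge.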
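
For reference, the manual tally described above can be sketched in Python. This is a minimal sketch, assuming the buckets have already been parsed from the terms aggregation response into a list of dicts; the function name `summarize_buckets` is hypothetical, and the sketch still requires fetching every bucket, so it does not address the scaling concern.

```python
def summarize_buckets(buckets):
    """Count buckets whose value occurs once versus twice or more."""
    once = sum(1 for b in buckets if b["doc_count"] == 1)
    twice_or_more = sum(1 for b in buckets if b["doc_count"] >= 2)
    return {
        "number_of_values_that_appear_once": once,
        "number_of_values_that_appear_twice_or_more": twice_or_more,
    }


# Buckets matching the five example documents in the question.
example_buckets = [
    {"key": "bob", "doc_count": 2},
    {"key": "alice", "doc_count": 1},
    {"key": "eve", "doc_count": 1},
    {"key": "mallory", "doc_count": 1},
]
summary = summarize_buckets(example_buckets)
```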

Answer 1

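You can count duplicates with a solution based on scripted_metric. A similar solution is explained in the article "Accurate Distinct Count and Values from Elasticsearch". All you need to do is modify that solution's query to count the number of occurrences of each unique value, rather than counting the unique values themselves.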
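
The article's actual query is not reproduced here. As a rough illustration of the idea only, the sketch below emulates in plain Python the phases a scripted_metric aggregation goes through: each shard counts its own values (map and combine), and the coordinating node merges the per-shard counts before classifying them (reduce). The function names and the split of documents across shards are assumptions made for the example, not Elasticsearch code.

```python
from collections import Counter


def map_and_combine(shard_values):
    """Per-shard phase: count how often each field value occurs on one shard."""
    return Counter(shard_values)


def reduce_shards(shard_counts):
    """Coordinating phase: merge shard counts, then count values by frequency."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)
    once = sum(1 for n in total.values() if n == 1)
    twice_or_more = sum(1 for n in total.values() if n >= 2)
    return {"once": once, "twice_or_more": twice_or_more}


# Hypothetical split of the question's five documents across two shards.
shards = [["bob", "alice", "eve"], ["bob", "mallory"]]
result = reduce_shards(map_and_combine(values) for values in shards)
```

Note that "bob" appears once on each hypothetical shard, so the values can only be classified after the per-shard counts have been merged.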

Answer 2

Aggregations are affected by your query, so if you want to find duplicates, run this query:

{
  "size": 0,
  "query": { "match_all": {} },
  "aggregations": {
    "YOUR_AGGREGATION_NAME": {
      "terms": { "field": "myfield" }
    }
  }
}

ps1: The size key just omits the results/hits (except the total).

ps2: The match_all query matches all documents in the index.
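
The request body from this answer could also be built programmatically, for example in Python. The sketch below is an assumption-laden variant rather than the answer's exact query: it adds the terms aggregation's `min_doc_count` option so that only values occurring at least twice come back, and raises the aggregation's own `size` (the number of buckets returned). The helper name `build_duplicates_query`, the default aggregation name, and the bucket limit of 1000 are all hypothetical.

```python
import json


def build_duplicates_query(field, agg_name="duplicates", max_buckets=1000):
    """Build a search body whose terms aggregation keeps only repeated values."""
    return {
        "size": 0,  # omit the hits themselves; only aggregations are needed
        "query": {"match_all": {}},
        "aggregations": {
            agg_name: {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,   # assumption: drop values seen only once
                    "size": max_buckets,  # assumption: cap on returned buckets
                }
            }
        },
    }


body = build_duplicates_query("myfield")
print(json.dumps(body, indent=2))
```

With a body like this, the number of returned buckets would correspond to number_of_values_that_appear_twice_or_more, as long as the count of repeated values stays below the chosen bucket limit.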