Question

I have some documents that belong to a small number of authors:

[
  { id: 1, author_id: 'mark', content: [...] },
  { id: 2, author_id: 'pierre', content: [...] },
  { id: 3, author_id: 'pierre', content: [...] },
  { id: 4, author_id: 'mark', content: [...] },
  { id: 5, author_id: 'william', content: [...] },
  ...
]

I'd like to retrieve and paginate a distinct selection of the best-matching documents, grouped by author ID:

[
  { id: 1, author_id: 'mark', content: [...], _score: 100 },
  { id: 3, author_id: 'pierre', content: [...], _score: 90 },
  { id: 5, author_id: 'william', content: [...], _score: 80 },
  ...
]

This is what I'm currently doing (pseudocode):

unique_docs = res.results.to_a.uniq{ |doc| doc.author_id }

The problem comes with pagination: how do I select 20 "distinct" documents?

Some people point to term facets, but I'm not actually building a tag cloud.

Thanks

Answer 1

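At present Elasticsearch does not provide a group_by equivalent, so here is my attempt to do it manually. The ES community is working on a direct solution to this problem (probably a plugin); in the meantime, this is a basic attempt that works for my needs.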

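Assumptions

I'm looking for relevant content.
I assume that the first 300 docs are relevant, so I restrict my search to this selection, regardless of whether many or only some of them come from the same few authors.
For my needs I didn't "really" need full pagination; a "show more" button updated through ajax was enough.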

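Drawbacks

Results are not precise:
because we take 300 docs at a time, we don't know how many unique docs will come out (all 300 could be from the same author!). Check whether this fits your average number of docs per author, and consider setting a limit.
You need to run 2 queries, paying the remote-call latency twice:
- the first query asks for the 300 most relevant docs with only two fields: id and author_id
- the second query retrieves the full docs for the paginated ids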

Here's some Ruby pseudocode: https://gist.github.com/saxxi/6495116
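Independently of the linked gist, the selection step between the two queries can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: `distinct_page_ids` is a hypothetical helper name, and `hits` is an in-memory stand-in for the first query's results, assumed to be already sorted by descending score and to carry only `id` and `author_id`.

```ruby
# Hypothetical helper (not from the gist): picks the ids for one page of
# "distinct" results out of the hits returned by the first query.
# Assumes `hits` is already sorted by descending score.
def distinct_page_ids(hits, page:, per_page: 20)
  # Array#uniq with a block keeps the first element seen for each key,
  # so the best-scoring document of every author survives.
  best_per_author = hits.uniq { |hit| hit[:author_id] }
  offset = (page - 1) * per_page
  # Array#[] returns nil when the offset is past the end, hence the fallback.
  (best_per_author[offset, per_page] || []).map { |hit| hit[:id] }
end

# Sample data standing in for the first query (id and author_id only).
hits = [
  { id: 1, author_id: 'mark' },
  { id: 3, author_id: 'pierre' },
  { id: 2, author_id: 'pierre' },
  { id: 5, author_id: 'william' },
  { id: 4, author_id: 'mark' }
]

page_ids = distinct_page_ids(hits, page: 1, per_page: 2)
# The second query would then fetch the full documents for `page_ids`.
```

Because the deduplication happens before slicing, every page except possibly the last holds exactly `per_page` distinct authors, as long as the 300-hit window contains enough of them.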

Answer 2

Now that the 'group_by' issue has been updated, you can use this feature from Elasticsearch 1.3.0 (#6124).

If you run the following query:

{
    "aggs": {
        "user_count": {
            "terms": {
                "field": "author_id",
                "size": 0
            }
        }
    }
}

you will get a result like this:

{
  "took" : 123,
  "timed_out" : false,
  "_shards" : { ... },
  "hits" : { ... },
  "aggregations" : {
    "user_count" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "mark",
        "doc_count" : 87350
      }, {
        "key" : "pierre",
        "doc_count" : 41809
      }, {
        "key" : "william",
        "doc_count" : 24476
      } ]
    }
  }
}
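To turn a response of this shape into a list of distinct authors on the client side, something along these lines might work. It is a sketch only: `response_body` is a trimmed, hard-coded copy of the response (the elided "hits" and "_shards" parts are left out), and the variable names are illustrative.

```ruby
require 'json'

# Trimmed copy of the aggregation response; only the part we read is kept.
response_body = <<~JSON
  {
    "aggregations": {
      "user_count": {
        "buckets": [
          { "key": "mark", "doc_count": 87350 },
          { "key": "pierre", "doc_count": 41809 },
          { "key": "william", "doc_count": 24476 }
        ]
      }
    }
  }
JSON

# Each bucket of the terms aggregation corresponds to one distinct author_id.
buckets = JSON.parse(response_body).dig('aggregations', 'user_count', 'buckets')
distinct_authors = buckets.map { |bucket| bucket['key'] }
docs_per_author = buckets.to_h { |bucket| [bucket['key'], bucket['doc_count']] }
```

Note that the buckets only give the distinct author IDs and their document counts; fetching the best-matching document for each author would still need a further step.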
