Question

I have an ES index with documents like these:

from_1,to_1,timestamp_1
from_1,to_1,timestamp_2
from_1,to_2,timestamp_3
from_2,to_3,timestamp_4
from_1,to_2,timestamp_5
from_2,to_3,timestamp_6
from_1,to_1,timestamp_7
from_2,to_4,timestamp_8

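I need a query that returns a document only if its combination of from and to values differs from that of the previously seen document with the same from value.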

So, with the sample above:

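The document with timestamp_1 should be in the result, because there is no earlier document with the from_1 + to_1 combination.
The document with timestamp_2 must be skipped, because its from + to combination is exactly the same as that of the last seen document with from = from_1.
The document with timestamp_3 should be in the result, because its to field (to_2) differs from the to of the last seen document with the same from (to_1 in the timestamp_2 document).
The document with timestamp_4 should be in the result.
The document with timestamp_5 must not be in the result, because it has the same from + to combination as the last seen document with from_1 (timestamp_3).
The document with timestamp_6 must not be in the result, because it has the same from + to combination as the last seen document with from_2 (timestamp_4).
The document with timestamp_7 should be in the result, because its from + to combination differs from that of the last seen document with from_1 (timestamp_5).
The document with timestamp_8 should be in the result, because its combination is completely new so far.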
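
To make the rule concrete, here is a minimal Python sketch of the expected filtering. It is only an illustration of the desired output, not an Elasticsearch query. The tuple layout, the integer timestamps 1 to 8 (standing in for timestamp_1 to timestamp_8) and the name semi_unique are assumptions made for the example.

```python
# Sample rows from the question as (from, to, timestamp).
# Integer timestamps 1..8 stand in for timestamp_1..timestamp_8 (assumption).
DOCS = [
    ("from_1", "to_1", 1),
    ("from_1", "to_1", 2),
    ("from_1", "to_2", 3),
    ("from_2", "to_3", 4),
    ("from_1", "to_2", 5),
    ("from_2", "to_3", 6),
    ("from_1", "to_1", 7),
    ("from_2", "to_4", 8),
]


def semi_unique(docs):
    """Keep a document only if its `to` differs from the `to` of the
    previously seen document with the same `from`."""
    last_to = {}  # from value -> to value of the most recent document
    kept = []
    for frm, to, ts in sorted(docs, key=lambda d: d[2]):
        if last_to.get(frm) != to:  # first sighting, or the value changed
            kept.append((frm, to, ts))
        last_to[frm] = to
    return kept


kept_timestamps = [ts for _, _, ts in semi_unique(DOCS)]
print(kept_timestamps)  # [1, 3, 4, 7, 8]
```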

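I need to fetch all such "semi-unique" documents from the index, so it would be good if the solution could page through the full result set, whether as a plain search request or, if an aggregation is used, through the aggregation's own paging.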

Is there any way to solve this?

Answer 1

The closest I can come up with is the following (let me know if it doesn't work for your data):

{
  "size": 0,
  "aggs": {
    "from_and_to": {
      "composite" : {
        "size": 5,
        "sources": [
          {
            "from_to_collected":{
              "terms": {
                "script": {
                  "lang": "painless",
                  "source": "doc['from'].value + '_' + doc['to'].value"
                }
              }
            }
          }]
      },
      "aggs": {
        "top_from_and_to_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{"timestamp":{"order":"asc"}}],
            "_source": {"includes": ["_id"]}
          }
        }
      }
    }
  }
}
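
As a rough illustration of what this query computes, the sketch below mimics it in plain Python on the question's sample data: one bucket per from + '_' + to key, each keeping only its earliest document. The integer timestamps and the helper name are assumptions for the example, and no Elasticsearch call is made.

```python
# Sample rows from the question as (from, to, timestamp).
# Integer timestamps 1..8 stand in for timestamp_1..timestamp_8 (assumption).
DOCS = [
    ("from_1", "to_1", 1),
    ("from_1", "to_1", 2),
    ("from_1", "to_2", 3),
    ("from_2", "to_3", 4),
    ("from_1", "to_2", 5),
    ("from_2", "to_3", 6),
    ("from_1", "to_1", 7),
    ("from_2", "to_4", 8),
]


def first_hit_per_combination(docs):
    """Map each from_to key to the timestamp of its earliest document,
    like a top_hits of size 1 sorted by timestamp ascending."""
    first = {}
    for frm, to, ts in sorted(docs, key=lambda d: d[2]):
        key = frm + "_" + to  # mirrors the painless script in the query
        first.setdefault(key, ts)  # keep only the earliest hit per bucket
    return first


buckets = first_hit_per_combination(DOCS)
print(sorted(buckets.values()))  # [1, 3, 4, 8]
```

On this sample the result contains timestamp_1, timestamp_3, timestamp_4 and timestamp_8. The timestamp_7 document is not included, because the from_1 + to_1 bucket is already represented by timestamp_1, so the query approximates the requested behaviour rather than matching it exactly.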

Keep in mind that terms aggregations are probabilistic.

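This lets you page through to the next set of buckets using the from_to_collected key: pass the after_key from the previous response as the composite aggregation's after parameter.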
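
The paging loop could be sketched in Python as follows. This is a minimal sketch under assumptions: search is a hypothetical callable that sends a request body and returns the parsed response, and fake_search is a stand-in used only so the example runs without a cluster. The after parameter and the after_key response field belong to the composite aggregation itself.

```python
import copy


def build_query(page_size=5):
    """Request body from the answer, with a configurable composite size."""
    return {
        "size": 0,
        "aggs": {"from_and_to": {
            "composite": {
                "size": page_size,
                "sources": [{"from_to_collected": {"terms": {"script": {
                    "lang": "painless",
                    "source": "doc['from'].value + '_' + doc['to'].value",
                }}}}],
            },
            "aggs": {"top_from_and_to_hits": {"top_hits": {
                "size": 1,
                "sort": [{"timestamp": {"order": "asc"}}],
            }}},
        }},
    }


def iter_buckets(search, query):
    """Yield every bucket, feeding each page's after_key into the next request.
    `search` is a hypothetical callable: request body -> response dict."""
    after = None
    while True:
        body = copy.deepcopy(query)
        if after is not None:
            body["aggs"]["from_and_to"]["composite"]["after"] = after
        agg = search(body)["aggregations"]["from_and_to"]
        if not agg["buckets"]:
            return  # an empty page means everything has been read
        yield from agg["buckets"]
        after = agg.get("after_key")
        if after is None:
            return


def fake_search(body):
    """Stand-in for a real Elasticsearch call, for demonstration only."""
    all_keys = ["from_1_to_1", "from_1_to_2", "from_2_to_3", "from_2_to_4"]
    comp = body["aggs"]["from_and_to"]["composite"]
    after = comp.get("after", {}).get("from_to_collected", "")
    page = [k for k in all_keys if k > after][: comp["size"]]
    agg = {"buckets": [{"key": {"from_to_collected": k}} for k in page]}
    if page:
        agg["after_key"] = {"from_to_collected": page[-1]}
    return {"aggregations": {"from_and_to": agg}}


keys = [b["key"]["from_to_collected"]
        for b in iter_buckets(fake_search, build_query(page_size=2))]
print(keys)
```

With a real client, search would wrap whatever call submits the body to the index; the loop itself would stay the same.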

Elasticsearch: get documents only when a value changes
