Elasticsearch: aggregations like an SQL subquery

Date: 2016-06-24 13:04:49

Tags: elasticsearch, group-by, report, elasticsearch-aggregation

I'm playing with ES to find out whether it can cover most of my scenarios. Right now I'm stuck on how to achieve some results that are very simple in SQL.

Here is an example.

In Elasticsearch I have an index containing these documents:

{ "Id": 1,  "Fruit": "Banana", "BoughtInStore": "Jungle", "BoughtDate": 20160101,  "BestBeforeDate": 20160102, "BiteBy": "John"}
{ "Id": 2,  "Fruit": "Banana", "BoughtInStore": "Jungle", "BoughtDate": 20160102,  "BestBeforeDate": 20160104, "BiteBy": "Mat"}
{ "Id": 3,  "Fruit": "Banana", "BoughtInStore": "Jungle", "BoughtDate": 20160103,  "BestBeforeDate": 20160105, "BiteBy": "Mark"}
{ "Id": 4,  "Fruit": "Banana", "BoughtInStore": "Jungle", "BoughtDate": 20160104,  "BestBeforeDate": 20160201, "BiteBy": "Simon"}
{ "Id": 5,  "Fruit": "Orange", "BoughtInStore": "Jungle", "BoughtDate": 20160112,  "BestBeforeDate": 20160112, "BiteBy": "John"}
{ "Id": 6,  "Fruit": "Orange", "BoughtInStore": "Jungle", "BoughtDate": 20160114,  "BestBeforeDate": 20160116, "BiteBy": "Mark"}
{ "Id": 7,  "Fruit": "Orange", "BoughtInStore": "Jungle", "BoughtDate": 20160120,  "BestBeforeDate": 20160121, "BiteBy": "Simon"}
{ "Id": 8,  "Fruit": "Kiwi", "BoughtInStore": "Shop", "BoughtDate": 20160121,  "BestBeforeDate": 20160121, "BiteBy": "Mark"}
{ "Id": 9,  "Fruit": "Kiwi", "BoughtInStore": "Jungle", "BoughtDate": 20160121,  "BestBeforeDate": 20160121, "BiteBy": "Simon"}

If I wanted to know how many distinct fruits, bought in the different stores, each person took a bite of within a specific date range, in SQL I would write something like this:

SELECT 
    COUNT(DISTINCT kpi.Fruit) as Fruits, 
    kpi.BoughtInStore,
    kpi.BiteBy 
FROM 
    (
        SELECT f1.Fruit, f1.BoughtInStore, f1.BiteBy
        FROM FruitsTable f1
        WHERE f1.BoughtDate = (
            SELECT MAX(f2.BoughtDate)
            FROM FruitsTable f2
            WHERE f1.Fruit = f2.Fruit
            and f2.BoughtDate between 20160101 and 20160131
            and (f2.BestBeforeDate between 20160101 and 20160131)
        )
    ) kpi   
GROUP BY kpi.BoughtInStore, kpi.BiteBy

The result would be something like this:

{ "Fruits": 1,  "BoughtInStore": "Jungle", "BiteBy": "Mark"}
{ "Fruits": 1,  "BoughtInStore": "Shop", "BiteBy": "Mark"}
{ "Fruits": 2,  "BoughtInStore": "Jungle", "BiteBy": "Simon"}
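To make the intended logic concrete, here is a small Python sketch (not part of the original question) that reproduces the subquery-plus-grouping over the sample rows in memory; for simplicity it also restricts the outer rows to the date window:

```python
from collections import defaultdict

# The sample documents from the question.
docs = [
    {"Id": 1, "Fruit": "Banana", "BoughtInStore": "Jungle", "BoughtDate": 20160101, "BestBeforeDate": 20160102, "BiteBy": "John"},
    {"Id": 2, "Fruit": "Banana", "BoughtInStore": "Jungle", "BoughtDate": 20160102, "BestBeforeDate": 20160104, "BiteBy": "Mat"},
    {"Id": 3, "Fruit": "Banana", "BoughtInStore": "Jungle", "BoughtDate": 20160103, "BestBeforeDate": 20160105, "BiteBy": "Mark"},
    {"Id": 4, "Fruit": "Banana", "BoughtInStore": "Jungle", "BoughtDate": 20160104, "BestBeforeDate": 20160201, "BiteBy": "Simon"},
    {"Id": 5, "Fruit": "Orange", "BoughtInStore": "Jungle", "BoughtDate": 20160112, "BestBeforeDate": 20160112, "BiteBy": "John"},
    {"Id": 6, "Fruit": "Orange", "BoughtInStore": "Jungle", "BoughtDate": 20160114, "BestBeforeDate": 20160116, "BiteBy": "Mark"},
    {"Id": 7, "Fruit": "Orange", "BoughtInStore": "Jungle", "BoughtDate": 20160120, "BestBeforeDate": 20160121, "BiteBy": "Simon"},
    {"Id": 8, "Fruit": "Kiwi", "BoughtInStore": "Shop", "BoughtDate": 20160121, "BestBeforeDate": 20160121, "BiteBy": "Mark"},
    {"Id": 9, "Fruit": "Kiwi", "BoughtInStore": "Jungle", "BoughtDate": 20160121, "BestBeforeDate": 20160121, "BiteBy": "Simon"},
]

def latest_per_fruit(rows, start, end):
    """Inner subquery: for each fruit, keep only the row(s) whose BoughtDate
    equals the maximum BoughtDate inside the window (both dates in range)."""
    in_range = [r for r in rows
                if start <= r["BoughtDate"] <= end
                and start <= r["BestBeforeDate"] <= end]
    max_date = {}
    for r in in_range:
        max_date[r["Fruit"]] = max(max_date.get(r["Fruit"], 0), r["BoughtDate"])
    return [r for r in in_range if r["BoughtDate"] == max_date[r["Fruit"]]]

def group_counts(rows):
    """Outer query: COUNT(DISTINCT Fruit) grouped by (store, person)."""
    groups = defaultdict(set)
    for r in rows:
        groups[(r["BoughtInStore"], r["BiteBy"])].add(r["Fruit"])
    return {k: len(v) for k, v in groups.items()}

print(group_counts(latest_per_fruit(docs, 20160101, 20160131)))
```

Running this reproduces the three groups above: one fruit for (Jungle, Mark), one for (Shop, Mark), and two for (Jungle, Simon).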

Do you know how I could get the same result in Elasticsearch using aggregations?

In short, the problems I'm facing with Elasticsearch are:

  1. How to prepare the data before aggregating (as in this example, the latest row within the range for each Fruit)
  2. How to group the results by multiple fields

Thanks

2 Answers:

Answer 0 (score: 2)

As far as I know, there is no way to reference the result of an aggregation in a filter of the same query. So with a single query you can only solve part of the puzzle:

GET /purchases/fruits/_search
{
  "query": {
    "filtered":{ 
      "filter": {
        "range": {
          "BoughtDate": {
            "gte": "2015-01-01", // assuming you have the right mapping for dates
            "lte": "2016-03-01"
          }
        }
      }
    }
  },
  "sort": { "BoughtDate": { "order": "desc" }},
  "aggs": {
    "byBoughtDate": {
      "terms": {
        "field": "BoughtDate",
        "order" : { "_term" : "desc" }
      },
      "aggs": {
        "distinctCount": {
           "cardinality": {
             "field": "Fruit"
           }
         }
      }
    }
  }
}

So you will have all the documents in the date range, and the buckets of the terms aggregation sorted by term descending, so the maximum date is on top. The client can parse this first bucket (both count and value) and then fetch the documents for that date value. For the distinct fruit count you simply use the nested cardinality aggregation.
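That client-side parsing step could look like the following sketch. The response fragment is hand-built (hypothetical values) to mirror the bucket shape that a terms aggregation with a nested cardinality sub-aggregation returns:

```python
# Hand-built slice of the aggregation response from the query above:
# buckets are sorted by term (BoughtDate) descending, each carrying a
# nested "distinctCount" cardinality value.
response = {
    "aggregations": {
        "byBoughtDate": {
            "buckets": [
                {"key": 20160121, "doc_count": 2, "distinctCount": {"value": 1}},
                {"key": 20160120, "doc_count": 1, "distinctCount": {"value": 1}},
            ]
        }
    }
}

buckets = response["aggregations"]["byBoughtDate"]["buckets"]
latest = buckets[0]  # terms are ordered descending, so the max date is first
print(latest["key"], latest["doc_count"], latest["distinctCount"]["value"])
```

From here the client would issue a second query filtered to `latest["key"]` to fetch the actual documents for that date.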

Yes, the query returns much more information than you need, but that's life :)

Answer 1 (score: 1)

Naturally, there is no direct route from SQL to the Elasticsearch DSL, but there are some pretty common correlations.

For starters, any GROUP BY / HAVING is going to come down to an aggregation. Normal query semantics can generally be covered (and then some) by the Query DSL.

"How can I prepare the data before aggregating (like in this example the latest row in the range for each Fruit)"

So, you're kind of asking for two things here.

"How can I prepare the data before aggregating"

That's the query phase.

"(like in this example the latest row in the range for each Fruit)"

Here you are technically asking it to aggregate to get the answer for this example: this is not a normal query. In your example, you are using MAX to get this, which effectively uses GROUP BY to get it.

"How to group the results by multiple fields"

That depends. Do you want them tiered (generally, yes), or do you want them side by side?

If you want them tiered, then you just use sub-aggregations to get what you want. If you want them side by side, then you generally just use a filters aggregation for the different groupings.

Putting it all back together: you want the latest purchase per fruit, given a specific filtered date range. The date ranges are just normal queries/filters:

{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "BoughtDate": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        },
        {
          "range": {
            "BestBeforeDate": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        }
      ]
    }
  }
}

With this, no document is included in the request unless it falls within the date range on both fields (effectively an AND). Because a filter is used, it is unscored and cacheable.

Now you need to start aggregating to get the rest of the information. To simplify what we're looking at, let's first assume the documents have already been filtered by the filter above. We'll combine the two at the end.

{
  "size": 0,
  "aggs": {
    "group_by_date": {
      "date_histogram": {
        "field": "BoughtDate",
        "interval": "day",
        "min_doc_count": 1
      },
      "aggs": {
        "group_by_store": {
          "terms": {
            "field": "BoughtInStore"
          },
          "aggs": {
            "group_by_person": {
              "terms": {
                "field": "BiteBy"
              }
            }
          }
        }
      }
    }
  }
}

You want "size": 0 at the top level because you don't actually care about the hits; you only want the aggregated results.

Your first aggregation effectively groups by the most recent date. I changed it slightly to make it more realistic (one bucket per day), but it is effectively the same. Given the way you use MAX, we could use a terms aggregation with "size": 1, but a date histogram is truer to what you want once dates (and possibly times!) get involved. I also told it to ignore days in the matched documents that have no data (since the histogram runs from start to end, we don't actually care about the empty days).

If you really only want the last day, then you could use a pipeline aggregation to drop everything except the max bucket, but a realistic use of this type of request would want the full date range.

So we then sub-group by store, which is what you want, and then sub-group by person (BiteBy). This gives you the counts implicitly.

Putting it all back together:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "BoughtDate": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        },
        {
          "range": {
            "BestBeforeDate": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "group_by_date": {
      "date_histogram": {
        "field": "BoughtDate",
        "interval": "day",
        "min_doc_count": 1
      },
      "aggs": {
        "group_by_store": {
          "terms": {
            "field": "BoughtInStore"
          },
          "aggs": {
            "group_by_person": {
              "terms": {
                "field": "BiteBy"
              }
            }
          }
        }
      }
    }
  }
}

Note: this is how I indexed the data.

PUT /grocery/store/_bulk
{"index":{"_id":"1"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-01","BestBeforeDate":"2016-01-02","BiteBy":"John"}
{"index":{"_id":"2"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-02","BestBeforeDate":"2016-01-04","BiteBy":"Mat"}
{"index":{"_id":"3"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-03","BestBeforeDate":"2016-01-05","BiteBy":"Mark"}
{"index":{"_id":"4"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-04","BestBeforeDate":"2016-02-01","BiteBy":"Simon"}
{"index":{"_id":"5"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-12","BestBeforeDate":"2016-01-12","BiteBy":"John"}
{"index":{"_id":"6"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-14","BestBeforeDate":"2016-01-16","BiteBy":"Mark"}
{"index":{"_id":"7"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-20","BestBeforeDate":"2016-01-21","BiteBy":"Simon"}
{"index":{"_id":"8"}}
{"Fruit":"Kiwi","BoughtInStore":"Shop","BoughtDate":"2016-01-21","BestBeforeDate":"2016-01-21","BiteBy":"Mark"}
{"index":{"_id":"9"}}
{"Fruit":"Kiwi","BoughtInStore":"Jungle","BoughtDate":"2016-01-21","BestBeforeDate":"2016-01-21","BiteBy":"Simon"}

It is critical that the string values you want to aggregate on (store and person) are not_analyzed strings (keyword in ES 5.0)! Otherwise they will use what is called fielddata, and that is not a good thing.

In ES 1.x / ES 2.x, the mappings look like this:

PUT /grocery
{
  "settings": {
    "number_of_shards": 1
  }, 
  "mappings": {
    "store": {
      "properties": {
        "Fruit": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BoughtInStore": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BiteBy": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BestBeforeDate": {
          "type": "date"
        },
        "BoughtDate": {
          "type": "date"
        }
      }
    }
  }
}
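For ES 5.x, following the keyword note above, the equivalent mapping would replace each not_analyzed string with the keyword type (a sketch, not from the original answer):

```json
PUT /grocery
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "store": {
      "properties": {
        "Fruit":          { "type": "keyword" },
        "BoughtInStore":  { "type": "keyword" },
        "BiteBy":         { "type": "keyword" },
        "BestBeforeDate": { "type": "date" },
        "BoughtDate":     { "type": "date" }
      }
    }
  }
}
```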

With all of this, you get the answer:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_date": {
      "buckets": [
        {
          "key_as_string": "2016-01-01T00:00:00.000Z",
          "key": 1451606400000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "John",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-02T00:00:00.000Z",
          "key": 1451692800000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Mat",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-03T00:00:00.000Z",
          "key": 1451779200000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Mark",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-12T00:00:00.000Z",
          "key": 1452556800000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "John",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-14T00:00:00.000Z",
          "key": 1452729600000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Mark",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-20T00:00:00.000Z",
          "key": 1453248000000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Simon",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-21T00:00:00.000Z",
          "key": 1453334400000,
          "doc_count": 2,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Simon",
                      "doc_count": 1
                    }
                  ]
                }
              },
              {
                "key": "Shop",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Mark",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
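As a final client-side step (not part of the original answer), the nested buckets in a response like the one above can be flattened into rows comparable to the SQL output:

```python
def flatten(agg):
    """Walk the date -> store -> person buckets and emit flat rows."""
    rows = []
    for day in agg["group_by_date"]["buckets"]:
        for store in day["group_by_store"]["buckets"]:
            for person in store["group_by_person"]["buckets"]:
                rows.append({
                    "date": day["key_as_string"][:10],
                    "store": store["key"],
                    "person": person["key"],
                    "count": person["doc_count"],
                })
    return rows

# Minimal fragment shaped like the last bucket of the "aggregations"
# object in the response above:
aggregations = {
    "group_by_date": {
        "buckets": [
            {
                "key_as_string": "2016-01-21T00:00:00.000Z",
                "doc_count": 2,
                "group_by_store": {
                    "buckets": [
                        {"key": "Jungle", "doc_count": 1,
                         "group_by_person": {"buckets": [{"key": "Simon", "doc_count": 1}]}},
                        {"key": "Shop", "doc_count": 1,
                         "group_by_person": {"buckets": [{"key": "Mark", "doc_count": 1}]}},
                    ]
                },
            }
        ]
    }
}

for row in flatten(aggregations):
    print(row)
```

Each printed row pairs a (date, store, person) combination with its document count, which is the shape the original SQL result had.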