Elasticsearch: time-range aggregation not working as expected

Asked: 2016-11-03 03:09:54

Tags: python elasticsearch aggregate aggregate-functions elasticsearch-aggregation

I'm new to the Elasticsearch world. I'm learning it and trying to check whether it fits my needs.

Right now I'm studying aggregations in Elasticsearch, and I wrote the following Python script to ingest some time-series data into it.

Every 5 seconds I create a new message with:

  1. A timestamp (ISO8601 format)
  2. A counter
  3. A random number between 0 and 100

For each new day I create a new index, using logs_Y-m-d as the index name.

I index every message using the counter as its _id. The counter is reset for each new index (i.e., daily).

    import csv
    import time
    import random
    from datetime import datetime
    from elasticsearch import Elasticsearch
    
    
    class ElasticSearchDB:
        def __init__(self):
            self.es = Elasticsearch()
    
        def run(self):
            print("Started: {}".format(datetime.now().isoformat()))
            print("<Ctrl + c> for exit!")
    
            with open("..\\out\\logs.csv", "w", newline='') as f:
                writer = csv.writer(f)
                counter = 0
                try:
                    while True:
                        i_name = "logs_" + time.strftime("%Y-%m-%d")
                        if not self.es.indices.exists([i_name]):
                            self.es.indices.create(i_name, ignore=400)
                            print("New index created: {}".format(i_name))
                            counter = 0
    
                        message = {"counter": counter, "@timestamp": datetime.now().isoformat(), "value": random.randint(0, 100)}
                        # Write to file
                        writer.writerow(message.values())
                        # Write to elasticsearch index
                        self.es.index(index=i_name, doc_type="logs", id=counter, body=message)
                        # Waste some time
                        time.sleep(5)
                        counter += 1
    
                except KeyboardInterrupt:
                    print("Stopped: {}".format(datetime.now().isoformat()))
    
    
    test_es = ElasticSearchDB()
    test_es.run()
    

I ran this script for 30 minutes. Then, using Sense, I queried Elasticsearch with the following aggregation queries.

Query #1: Fetch everything.

Query #2: Aggregate the logs of the last 1 hour and generate stats for them. This shows the correct result.

Query #3: Aggregate the logs of the last 1 minute and generate stats for them. The aggregated doc count is the same as in the 1-hour aggregation; ideally it should only aggregate about 12-13 logs.

Query #4: Aggregate the logs of the last 15 seconds and generate stats for them. The aggregated doc count is again the same as in the 1-hour aggregation; ideally it should only aggregate about 3-4 logs.

My questions:

1. Why doesn't Elasticsearch understand the 1-minute and 15-second ranges?
2. I understand mappings, but I don't know how to write one, so I didn't write one. Could that be what's causing this problem?
3. Please help!

Query #1: Fetch everything.

      GET /_search
      

Output:

      {
         "took": 3,
         "timed_out": false,
         "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
         },
         "hits": {
            "total": 314,
            "max_score": 1,
            "hits": [
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "19",
                  "_score": 1,
                  "_source": {
                     "counter": 19,
                     "value": 62,
                     "@timestamp": "2016-11-03T07:40:35.981395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "22",
                  "_score": 1,
                  "_source": {
                     "counter": 22,
                     "value": 95,
                     "@timestamp": "2016-11-03T07:40:51.066395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "25",
                  "_score": 1,
                  "_source": {
                     "counter": 25,
                     "value": 18,
                     "@timestamp": "2016-11-03T07:41:06.140395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "26",
                  "_score": 1,
                  "_source": {
                     "counter": 26,
                     "value": 58,
                     "@timestamp": "2016-11-03T07:41:11.164395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "29",
                  "_score": 1,
                  "_source": {
                     "counter": 29,
                     "value": 73,
                     "@timestamp": "2016-11-03T07:41:26.214395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "41",
                  "_score": 1,
                  "_source": {
                     "counter": 41,
                     "value": 59,
                     "@timestamp": "2016-11-03T07:42:26.517395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "14",
                  "_score": 1,
                  "_source": {
                     "counter": 14,
                     "value": 9,
                     "@timestamp": "2016-11-03T07:40:10.857395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "40",
                  "_score": 1,
                  "_source": {
                     "counter": 40,
                     "value": 9,
                     "@timestamp": "2016-11-03T07:42:21.498395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "24",
                  "_score": 1,
                  "_source": {
                     "counter": 24,
                     "value": 41,
                     "@timestamp": "2016-11-03T07:41:01.115395"
                  }
               },
               {
                  "_index": "logs_2016-11-03",
                  "_type": "logs",
                  "_id": "0",
                  "_score": 1,
                  "_source": {
                     "counter": 0,
                     "value": 79,
                     "@timestamp": "2016-11-03T07:39:00.302395"
                  }
               }
            ]
         }
      }
      

Query #2: Get stats for the last 1 hour.

      GET /logs_2016-11-03/logs/_search?search_type=count
      {
          "aggs": {
              "time_range": {
                  "filter": {
                      "range": {
                          "@timestamp": {
                              "from": "now-1h"
                          }
                      }
                  },
                  "aggs": {
                      "just_stats": {
                          "stats": {
                              "field": "value"
                          }
                      }
                  }
              }
          }
      }
      

Output:

      {
         "took": 5,
         "timed_out": false,
         "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
         },
         "hits": {
            "total": 366,
            "max_score": 0,
            "hits": []
         },
         "aggregations": {
            "time_range": {
               "doc_count": 366,
               "just_stats": {
                  "count": 366,
                  "min": 0,
                  "max": 100,
                  "avg": 53.17213114754098,
                  "sum": 19461
               }
            }
         }
      }
      

I get 366 entries, which is correct.

Query #3: Get stats for the last 1 minute.

      GET /logs_2016-11-03/logs/_search?search_type=count
      {
          "aggs": {
              "time_range": {
                  "filter": {
                      "range": {
                          "@timestamp": {
                              "from": "now-1m"
                          }
                      }
                  },
                  "aggs": {
                      "just_stats": {
                          "stats": {
                              "field": "value"
                          }
                      }
                  }
              }
          }
      }
      

Output:

      {
         "took": 15,
         "timed_out": false,
         "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
         },
         "hits": {
            "total": 407,
            "max_score": 0,
            "hits": []
         },
         "aggregations": {
            "time_range": {
               "doc_count": 407,
               "just_stats": {
                  "count": 407,
                  "min": 0,
                  "max": 100,
                  "avg": 53.152334152334156,
                  "sum": 21633
               }
            }
         }
      }
      

This is wrong: there can't be 407 entries from the last 1 minute; it should only be around 12-13 logs.

Query #4: Get stats for the last 15 seconds.

      GET /logs_2016-11-03/logs/_search?search_type=count
      {
          "aggs": {
              "time_range": {
                  "filter": {
                      "range": {
                          "@timestamp": {
                              "from": "now-15s"
                          }
                      }
                  },
                  "aggs": {
                      "just_stats": {
                          "stats": {
                              "field": "value"
                          }
                      }
                  }
              }
          }
      }
      

Output:

      {
         "took": 15,
         "timed_out": false,
         "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
         },
         "hits": {
            "total": 407,
            "max_score": 0,
            "hits": []
         },
         "aggregations": {
            "time_range": {
               "doc_count": 407,
               "just_stats": {
                  "count": 407,
                  "min": 0,
                  "max": 100,
                  "avg": 53.152334152334156,
                  "sum": 21633
               }
            }
         }
      }
      

This is also wrong: there can't be 407 entries from the last 15 seconds; it should only be around 3-4 logs.

1 Answer:

Answer 0 (score: 2)

Your queries are correct, but ES stores dates in UTC, which is why you are getting everything back. Since your timestamps carry no timezone offset, Elasticsearch parses them as UTC; if your local clock runs ahead of UTC (as it presumably does here), every document effectively sits in the future, so any lower bound such as now-1m matches all of them. From the documentation:

> In JSON documents, dates are represented as strings. Elasticsearch uses a set of preconfigured formats to recognize and parse these strings into a long value representing milliseconds-since-the-epoch in UTC.
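
To make the failure mode concrete, here is a minimal sketch (the offsets are illustrative; it assumes a local clock running ahead of UTC, e.g. UTC+05:30):

    from datetime import datetime

    # What the script stores: naive local wall-clock time, no offset information.
    print(datetime.now().isoformat())     # e.g. '2016-11-03T07:40:35.981395'

    # What Elasticsearch evaluates "now" against: the current moment in UTC.
    print(datetime.utcnow().isoformat())  # e.g. '2016-11-03T02:10:35.981395'

    # Elasticsearch parses the stored string as UTC, so every document appears
    # to lie hours in the future; a lower bound such as "now-1m" or "now-15s"
    # therefore matches all of them.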

You can use the pytz module and store your dates in UTC in ES. See this SO question.
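
For example, a minimal sketch of the change to the ingestion script, assuming the pytz package is installed:

    import random
    import pytz
    from datetime import datetime

    counter = 0  # stands in for the script's running counter

    # A timezone-aware UTC timestamp; the +00:00 offset is serialized into the
    # ISO string, so Elasticsearch parses it unambiguously.
    message = {
        "counter": counter,
        "@timestamp": datetime.now(pytz.utc).isoformat(),  # e.g. '2016-11-03T02:10:35.981395+00:00'
        "value": random.randint(0, 100),
    }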

You can also use the time_zone param in the range query. It is also better to aggregate over filtered results rather than fetch all results and then filter everything:

    GET /logs_2016-11-03/logs/_search
    {
      "query": {
        "bool": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": "2016-11-03T07:15:35",   <----- you would need an absolute value
                "time_zone": "-01:00"           <----- time_zone setting
              }
            }
          }
        }
      },
      "aggs": {
        "just_stats": {
          "stats": {
            "field": "value"
          }
        }
      },
      "size": 0
    }

You would have to convert the desired times (now-1m, now-15s) into the format yyyy-MM-dd'T'HH:mm:ss for the time_zone param to apply, because now is not affected by time_zone. So the best option is to convert your dates to UTC and store them that way.
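
For illustration, a hypothetical helper (not part of the original answer) that computes such an absolute bound for "the last N seconds":

    from datetime import datetime, timedelta

    def window_start(seconds):
        # Absolute local time 'seconds' ago, formatted as yyyy-MM-dd'T'HH:mm:ss;
        # use it as the "gte" value together with a "time_zone" offset that
        # matches how the timestamps were originally written.
        return (datetime.now() - timedelta(seconds=seconds)).strftime("%Y-%m-%dT%H:%M:%S")

    print(window_start(60))  # bound for "the last 1 minute"
    print(window_start(15))  # bound for "the last 15 seconds"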