I am new to the Elasticsearch world. I am learning it and trying to check whether it fits my needs.
Right now I am learning about aggregations in Elasticsearch, and I wrote the following Python script to ingest some time-series data into Elasticsearch.
Every 5 seconds I create a new message.
For each new day I create a new index, with logs_Y-m-d as the index name.
I index every message using a counter as the message _id. The counter is reset for each new index (i.e. every day).
import csv
import time
import random
from datetime import datetime
from elasticsearch import Elasticsearch
class ElasticSearchDB:
    def __init__(self):
        self.es = Elasticsearch()

    def run(self):
        print("Started: {}".format(datetime.now().isoformat()))
        print("<Ctrl + c> for exit!")
        with open("..\\out\\logs.csv", "w", newline='') as f:
            writer = csv.writer(f)
            counter = 0
            try:
                while True:
                    i_name = "logs_" + time.strftime("%Y-%m-%d")
                    if not self.es.indices.exists([i_name]):
                        self.es.indices.create(i_name, ignore=400)
                        print("New index created: {}".format(i_name))
                        counter = 0
                    message = {"counter": counter, "@timestamp": datetime.now().isoformat(), "value": random.randint(0, 100)}
                    # Write to file
                    writer.writerow(message.values())
                    # Write to elasticsearch index
                    self.es.index(index=i_name, doc_type="logs", id=counter, body=message)
                    # Waste some time
                    time.sleep(5)
                    counter += 1
            except KeyboardInterrupt:
                print("Stopped: {}".format(datetime.now().isoformat()))

test_es = ElasticSearchDB()
test_es.run()
I ran this script for 30 minutes. Next, using Sense, I queried Elasticsearch with the following aggregation queries.
Query #1: Fetch everything.
Query #2: Aggregate the logs from the last 1 hour and generate stats for them. This shows correct results.
Query #3: Aggregate the logs from the last 1 minute and generate stats for them. The number of docs aggregated is the same as in the 1-hour aggregation; ideally, it should have aggregated only 12-13 logs.
Query #4: Aggregate the logs from the last 15 seconds and generate stats for them. The number of docs aggregated is the same as in the 1-hour aggregation; ideally, it should have aggregated only 3-4 logs.
Please help!
Query #1: Fetch everything.
GET /_search
Output:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 314,
"max_score": 1,
"hits": [
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "19",
"_score": 1,
"_source": {
"counter": 19,
"value": 62,
"@timestamp": "2016-11-03T07:40:35.981395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "22",
"_score": 1,
"_source": {
"counter": 22,
"value": 95,
"@timestamp": "2016-11-03T07:40:51.066395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "25",
"_score": 1,
"_source": {
"counter": 25,
"value": 18,
"@timestamp": "2016-11-03T07:41:06.140395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "26",
"_score": 1,
"_source": {
"counter": 26,
"value": 58,
"@timestamp": "2016-11-03T07:41:11.164395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "29",
"_score": 1,
"_source": {
"counter": 29,
"value": 73,
"@timestamp": "2016-11-03T07:41:26.214395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "41",
"_score": 1,
"_source": {
"counter": 41,
"value": 59,
"@timestamp": "2016-11-03T07:42:26.517395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "14",
"_score": 1,
"_source": {
"counter": 14,
"value": 9,
"@timestamp": "2016-11-03T07:40:10.857395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "40",
"_score": 1,
"_source": {
"counter": 40,
"value": 9,
"@timestamp": "2016-11-03T07:42:21.498395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "24",
"_score": 1,
"_source": {
"counter": 24,
"value": 41,
"@timestamp": "2016-11-03T07:41:01.115395"
}
},
{
"_index": "logs_2016-11-03",
"_type": "logs",
"_id": "0",
"_score": 1,
"_source": {
"counter": 0,
"value": 79,
"@timestamp": "2016-11-03T07:39:00.302395"
}
}
]
}
}
Query #2: Get stats for the last 1 hour.
GET /logs_2016-11-03/logs/_search?search_type=count
{
"aggs": {
"time_range": {
"filter": {
"range": {
"@timestamp": {
"from": "now-1h"
}
}
},
"aggs": {
"just_stats": {
"stats": {
"field": "value"
}
}
}
}
}
}
Output:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 366,
"max_score": 0,
"hits": []
},
"aggregations": {
"time_range": {
"doc_count": 366,
"just_stats": {
"count": 366,
"min": 0,
"max": 100,
"avg": 53.17213114754098,
"sum": 19461
}
}
}
}
I get 366 entries, which is correct.
Query #3: Get stats for the last 1 minute.
GET /logs_2016-11-03/logs/_search?search_type=count
{
"aggs": {
"time_range": {
"filter": {
"range": {
"@timestamp": {
"from": "now-1m"
}
}
},
"aggs": {
"just_stats": {
"stats": {
"field": "value"
}
}
}
}
}
}
Output:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 407,
"max_score": 0,
"hits": []
},
"aggregations": {
"time_range": {
"doc_count": 407,
"just_stats": {
"count": 407,
"min": 0,
"max": 100,
"avg": 53.152334152334156,
"sum": 21633
}
}
}
}
This is wrong; there cannot be 407 entries in the last 1 minute. It should have been only 12-13 logs.
Query #4: Get stats for the last 15 seconds.
GET /logs_2016-11-03/logs/_search?search_type=count
{
"aggs": {
"time_range": {
"filter": {
"range": {
"@timestamp": {
"from": "now-15s"
}
}
},
"aggs": {
"just_stats": {
"stats": {
"field": "value"
}
}
}
}
}
}
Output:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 407,
"max_score": 0,
"hits": []
},
"aggregations": {
"time_range": {
"doc_count": 407,
"just_stats": {
"count": 407,
"min": 0,
"max": 100,
"avg": 53.152334152334156,
"sum": 21633
}
}
}
}
This is also wrong; there cannot be 407 entries in the last 15 seconds. It should have been only 3-4 logs.
Answer 0 (score: 2)
Your queries are correct, but ES stores dates in UTC, which is why you are getting everything back. From the documentation:

In JSON documents, dates are represented as strings. Elasticsearch uses a set of preconfigured formats to recognize and parse these strings into a long value representing milliseconds-since-the-epoch, in UTC.

You can use the pytz module and store dates in UTC in ES. See this SO question.
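For instance, here is a minimal sketch (my own, not from the question's script; pytz is a third-party package assumed to be installed) of how the message-building line above could emit a timezone-aware UTC timestamp instead of a naive local one:

import random
from datetime import datetime

import pytz  # third-party package, assumed installed

counter = 0  # stands in for the loop counter from the script above

# datetime.now(pytz.utc) is timezone-aware; its isoformat() string carries
# a "+00:00" offset, so Elasticsearch parses it explicitly as UTC.
message = {
    "counter": counter,
    "@timestamp": datetime.now(pytz.utc).isoformat(),
    "value": random.randint(0, 100),
}
print(message["@timestamp"])  # e.g. 2016-11-03T12:39:00.302395+00:00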
You can also use the time_zone param in the range query. Also, it is better to aggregate over filtered results than to fetch everything and then filter all of it:
GET /logs_2016-11-03/logs/_search
{
"query": {
"bool": {
"filter": {
"range": {
"@timestamp": {
"gte": "2016-11-03T07:15:35", <----- You would need absolute value
"time_zone": "-01:00" <---- timezone setting
}
}
}
}
},
"aggs": {
"just_stats": {
"stats": {
"field": "value"
}
}
},
"size": 0
}
You would have to convert the desired times (now-1m, now-15s) into the format yyyy-MM-dd'T'HH:mm:ss for the time_zone param to work, since now is not affected by time_zone, so the best option is to convert your dates to UTC and store them that way. A rough client-side sketch of that conversion follows.
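As an illustration (my own sketch, not part of the answer above), the absolute value for a "last 1 minute" window can be computed client-side in UTC and plugged into the range filter, so no time_zone param is needed at all:

from datetime import datetime, timedelta

# "now-1m" as an absolute UTC timestamp in yyyy-MM-dd'T'HH:mm:ss form,
# suitable for the "gte" value of the range filter shown above.
one_minute_ago = (datetime.utcnow() - timedelta(minutes=1)).strftime("%Y-%m-%dT%H:%M:%S")

query = {
    "query": {
        "bool": {
            "filter": {
                "range": {
                    "@timestamp": {"gte": one_minute_ago}  # assumes timestamps were stored in UTC
                }
            }
        }
    },
    "aggs": {"just_stats": {"stats": {"field": "value"}}},
    "size": 0,
}
# result = es.search(index="logs_2016-11-03", body=query)  # es: an Elasticsearch() client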