I am accessing Elasticsearch data in bulk through R. For analysis purposes I need to query data over a relatively long period, e.g. one month. One month of data is roughly 4.5 million rows, and R runs out of memory.

Sample code (for 1 day) is below:
library(elastic)

# Build from/to date strings covering a single day
dt <- as.Date("2015-09-01", "%Y-%m-%d")
frmdt <- strftime(dt, "%Y-%m-%d")
todt <- strftime(as.Date(dt + 1), "%Y-%m-%d")

connect(es_base = "http://xx.yy.zzz.kk")

# The "time" field is stored as epoch milliseconds
start_date <- as.integer(as.POSIXct(frmdt)) * 1000
end_date <- as.integer(as.POSIXct(todt)) * 1000

query <- sprintf('{"query":{"range":{"time":{"gte":"%s","lte":"%s"}}}}',
                 start_date, end_date)
s_list <- elastic::Search(index = "organised_2015_09", type = "PROPERTY_SEARCH",
                          body = query, fields = c("trackId", "time"),
                          size = 1000000)$hits$hits
length(s_list)
[1] 144612
The result for 1 day is 144k records, taking about 222 MB in memory. An example list item is below:
> s_list[[1]]
$`_index`
[1] "organised_2015_09"

$`_type`
[1] "PROPERTY_SEARCH"

$`_id`
[1] "1441122918941"

$`_version`
[1] 1

$`_score`
[1] 1

$fields
$fields$time
$fields$time[[1]]
[1] 1441122918941

$fields$trackId
$fields$trackId[[1]]
[1] "fd4b4ce88101e58623ba9e6e31971d1f"
In fact, a summary count of the number of items per "trackId" and "time" (summarized per day) is enough for the analysis. So I tried to convert this into a count query with an aggregation, and built the following query:
query < -'{"size" : 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"time": {
"gte": 1441045800000,
"lte": 1443551400000
}
}
}
}
},
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "time",
"interval": "day",
"time_zone": "+05:30"
},
"aggs": {
"group_by_state": {
"terms": {
"field": "trackId",
"size": 0
}
}
}
}
}
}'
response <- elastic::Search(index = "organised_recent", type = "PROPERTY_SEARCH",
                            body = query, search_type = "count")
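
The per-day, per-trackId counts would then come from the aggregation buckets instead of raw hits; a sketch of parsing them (assuming the usual bucket layout returned for a date_histogram with a nested terms aggregation):

# Pull (day, trackId, doc_count) rows out of the aggregation response
day_buckets <- response$aggregations$articles_over_time$buckets
agg_counts <- do.call(rbind, lapply(day_buckets, function(d) {
  tracks <- d$group_by_state$buckets
  data.frame(
    day = d$key_as_string,
    trackId = sapply(tracks, function(t) t$key),
    doc_count = sapply(tracks, function(t) t$doc_count),
    stringsAsFactors = FALSE
  )
}))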
However, I am not gaining anything in speed or response size. I think I am missing something, but I am not sure what.