elastic: making a lightweight count query (vs. a search query)

Asked: 2015-10-27 15:34:24

Tags: r elasticsearch

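I am pulling bulk data from Elasticsearch into R using the elastic package. For analysis I need to query a fairly long period (for example, one month). One month of data is about 4.5 million rows, and R runs out of memory.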

Sample code below (for 1 day):

library(elastic)

# One-day window: 2015-09-01 to 2015-09-02
dt <- as.Date("2015-09-01", "%Y-%m-%d")
frmdt <- strftime(dt, "%Y-%m-%d")
todt <- strftime(dt + 1, "%Y-%m-%d")

connect(es_base = "http://xx.yy.zzz.kk")

# Bounds as epoch milliseconds (interpreted in the session's local time zone)
start_date <- as.integer(as.POSIXct(frmdt)) * 1000
end_date <- as.integer(as.POSIXct(todt)) * 1000

# %.0f prints the bounds as plain digits, never in scientific notation
query <- sprintf('{"query":{"range":{"time":{"gte":"%.0f","lte":"%.0f"}}}}', start_date, end_date)
s_list <- elastic::Search(index = "organised_2015_09", type = "PROPERTY_SEARCH", body = query,
                          fields = c("trackId", "time"), size = 1000000)$hits$hits
length(s_list)
[1] 144612
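
The conversion above depends on the time zone of the R session. A minimal sketch that makes the zone explicit and uses a half-open range (`gte`/`lt`) so that midnight is not counted in two consecutive days. The helper name `day_range_ms` and the zone `Asia/Kolkata` are assumptions, not part of the original code; the zone is inferred from the `+05:30` offset used in the aggregation query further down.

```r
# Hypothetical helper: epoch-millisecond bounds for one calendar day.
# The default time zone is an assumption based on the "+05:30" offset in the question.
day_range_ms <- function(day, tz = "Asia/Kolkata") {
  start <- as.POSIXct(paste(day, "00:00:00"), tz = tz)
  end <- as.POSIXct(paste(as.Date(day) + 1, "00:00:00"), tz = tz)
  ms <- c(as.numeric(start), as.numeric(end)) * 1000
  # Plain digits only, so the JSON body never contains scientific notation
  format(ms, scientific = FALSE, trim = TRUE)
}

rng <- day_range_ms("2015-09-01")
query <- sprintf('{"query":{"range":{"time":{"gte":%s,"lt":%s}}}}', rng[1], rng[2])
```

The resulting `query` string can be passed as `body` to `elastic::Search()` in the same way as above.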

The result for 1 day has 144k records and takes up 222 MB. A sample list item is shown below:

> s_list[[1]]
$`_index`
[1] "organised_2015_09"

$`_type`
[1] "PROPERTY_SEARCH"

$`_id`
[1] "1441122918941"

$`_version`
[1] 1

$`_score`
[1] 1

$fields
$fields$time
$fields$time[[1]]
[1] 1441122918941


$fields$trackId
$fields$trackId[[1]]
[1] "fd4b4ce88101e58623ba9e6e31971d1f"
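
For reference, a list of hits shaped like the item above can be reduced client-side to counts per day and per trackId. This is only a sketch on mock data: the `hits` values, the `Asia/Kolkata` zone, and the variable names are illustrative assumptions. It still requires downloading every hit, so it does not address the memory problem by itself.

```r
# Mock hits with the same structure as s_list[[1]] (values are made up)
hits <- list(
  list(fields = list(time = list(1441122918941), trackId = list("aaa"))),
  list(fields = list(time = list(1441122919941), trackId = list("aaa"))),
  list(fields = list(time = list(1441132260000), trackId = list("bbb")))
)

# Pull the two fields out of the nested lists
time_ms <- vapply(hits, function(h) h$fields$time[[1]], numeric(1))
track <- vapply(hits, function(h) h$fields$trackId[[1]], character(1))

# Epoch milliseconds -> calendar day in the assumed local zone
day <- as.Date(as.POSIXct(time_ms / 1000, origin = "1970-01-01", tz = "UTC"),
               tz = "Asia/Kolkata")

# Cross-tabulate and keep only the (day, trackId) pairs that occur
counts <- as.data.frame(table(day = day, trackId = track), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
```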

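In fact, summary counts of the number of items per "trackId" and "time" (summarised per day) would be enough for the analysis. So I tried to turn this into a count query with aggregations, and built the following query: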

query <- '{"size" : 0,
"query": {
    "filtered": {
        "query": {
            "match_all": {}
        },
        "filter": {
            "range": {
                "time": {
                    "gte": 1441045800000,
                    "lte": 1443551400000
                }
            }
        }
    }
},
"aggs": {
    "articles_over_time": {
        "date_histogram": {
            "field": "time",
            "interval": "day",
            "time_zone": "+05:30"
        },
        "aggs": {
            "group_by_state": {
                "terms": {
                    "field": "trackId",
                    "size": 0
                }
            }
        }
    }
}
}'

response <- elastic::Search(index="organised_recent",type="PROPERTY_SEARCH",body=query, search_type="count")

However, I am not gaining anything in speed or in response size. I think I am missing something, but I am not sure what.
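
To make the shape of the aggregation result concrete, here is a minimal sketch that flattens the nested buckets into a data frame. The `response` object below is a hand-written mock that assumes the usual Elasticsearch bucket layout (`aggregations$<name>$buckets`, each bucket with `key` and `doc_count`) and the aggregation names from the query above; the values are made up.

```r
# Mock of a parsed response (structure assumed, values illustrative)
response <- list(aggregations = list(articles_over_time = list(buckets = list(
  list(key = 1441045800000, doc_count = 3,
       group_by_state = list(buckets = list(
         list(key = "aaa", doc_count = 2),
         list(key = "bbb", doc_count = 1)))),
  list(key = 1441132200000, doc_count = 1,
       group_by_state = list(buckets = list(
         list(key = "aaa", doc_count = 1))))
))))

days <- response$aggregations$articles_over_time$buckets

# One row per (day, trackId) pair
rows <- lapply(days, function(d) {
  inner <- d$group_by_state$buckets
  data.frame(
    day_ms = rep(d$key, length(inner)),
    trackId = vapply(inner, function(b) b$key, character(1)),
    count = vapply(inner, function(b) as.numeric(b$doc_count), numeric(1)),
    stringsAsFactors = FALSE
  )
})
summary_df <- do.call(rbind, rows)

# Totals per day come straight from the date_histogram buckets
daily_totals <- vapply(days, function(d) as.numeric(d$doc_count), numeric(1))
```

One thing worth checking: in Elasticsearch 1.x, `"size": 0` inside a `terms` aggregation asks for every distinct term. If `trackId` is close to unique per record, the response then carries roughly one bucket per record and may not be much smaller than the raw hits, whereas the per-day totals alone stay tiny.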

0 answers:

No answers yet