Question

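As part of a data analysis, I collect records that need to be stored in Elasticsearch. So far, I gather the records in an intermediate list and then write them with a bulk update.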

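While this works, it reaches its limits when the number of records is too large to fit in memory. I am therefore wondering whether it is possible to use a "streaming" mechanism that would allow me to

keep a connection to Elasticsearch open persistently
continuously update in a bulk-like fashion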

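I understand that I could simply open a connection to Elasticsearch and do a classic update whenever data becomes available, but that is about 10 times slower, so I would like to keep the bulk mechanism: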

import elasticsearch
import elasticsearch.helpers
import elasticsearch.client
import random
import string
import time

index = "testindexyop1"
es = elasticsearch.Elasticsearch(hosts='elk.example.com')
if elasticsearch.client.IndicesClient(es).exists(index=index):
    ret = elasticsearch.client.IndicesClient(es).delete(index=index)

data = list()
for i in range(1, 10000):
    data.append({'hello': ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))})

start = time.time()
# this version takes 25 seconds
# for _ in data:
#     res = es.bulk(index=index, doc_type="document", body=_)

# and this one - 2 seconds
elasticsearch.helpers.bulk(client=es, index=index, actions=data, doc_type="document", raise_on_error=True)

print(time.time()-start)

Answer 1

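You could always simply split the data into n sets of roughly equal size, so that each set fits in memory, and then bulk update each one. This seems to me to be the simplest solution.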
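A minimal sketch of that idea, assuming the records can be produced by a generator rather than collected in a list first. The names generate_records, chunked, index_in_chunks and send_chunk are hypothetical and only serve the illustration; send_chunk stands in for whatever performs the actual bulk call.

```python
import itertools
import random
import string


def generate_records(count):
    """Yield records one at a time instead of building a full list in memory."""
    alphabet = string.ascii_uppercase + string.digits
    for _ in range(count):
        yield {"hello": "".join(random.choice(alphabet) for _ in range(10))}


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def index_in_chunks(records, send_chunk, chunk_size=500):
    """Pass records to `send_chunk` one chunk at a time and return the total sent.

    `send_chunk` is a hypothetical callable. With the setup from the question
    it could be, for example:
        lambda chunk: elasticsearch.helpers.bulk(client=es, index=index, actions=chunk)
    """
    total = 0
    for chunk in chunked(records, chunk_size):
        send_chunk(chunk)  # only this chunk is held in memory right now
        total += len(chunk)
    return total
```

Note that elasticsearch.helpers.bulk also accepts any iterable of actions, including a generator, and splits it into batches internally according to its chunk_size argument, so passing generate_records(...) directly as actions may already be enough. elasticsearch.helpers.streaming_bulk works the same way but yields a result per document as it goes.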
