I have 30,000 instances in Elasticsearch. If I use the scroll API in a while loop with either of these settings:

1. search context alive: 50m, "size": 2
2. search context alive: 5m, "size": 100

then in both cases I cannot read more than one window of `size` documents from the instance at a time. With a single scroll like this it would take up to 20 days to fetch all the documents, and there are many other instances still to parse.
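For reference, the while loop I mean is sketched below, with the HTTP calls abstracted away. `fetch_first` and `fetch_next` are stand-ins for the initial `_search?scroll=...` request and the follow-up `_search/scroll` requests; the names are illustrative, not from my actual code:

```python
def scroll_all(fetch_first, fetch_next):
    """Drain a single scroll cursor page by page and return all hits.

    fetch_first() -> (scroll_id, hits)   # the initial _search?scroll=... call
    fetch_next(scroll_id) -> (scroll_id, hits)  # subsequent _search/scroll calls
    """
    scroll_id, hits = fetch_first()
    collected = list(hits)
    while hits:  # an empty page means the scroll is exhausted
        scroll_id, hits = fetch_next(scroll_id)
        collected.extend(hits)
    return collected
```

The point is that this is strictly sequential: every page waits on the previous one, which is why one scroll over 30,000 documents is so slow.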
I ran into a similar problem in the past while processing a dataset in MongoDB. There, however, it was easy to open separate processes, e.g.:
import multiprocessing

n_cores = 30
collection_size = collectionTolookfor.count()
batch_size = round(collection_size / n_cores + 0.5)
skips = range(0, n_cores * batch_size, batch_size)

# hitting service on this http://130.20.47.179:8012 server
processes = [multiprocessing.Process(target=FTExtraction.entry,
                                     args=(full_Text_articles_English, skip_n, batch_size))
             for skip_n in skips]
for process in processes:
    process.start()
for process in processes:
    process.join()
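To make the partition arithmetic above concrete, here it is with the 30,000-document figure from this question substituted for the MongoDB collection count (`round(x + 0.5)` is intended as a ceiling division, though Python 3's banker's rounding makes it approximate):

```python
# Same skip/limit partition as above, with an assumed collection size of 30,000.
n_cores = 30
collection_size = 30000
batch_size = round(collection_size / n_cores + 0.5)   # intended ceiling division
skips = range(0, n_cores * batch_size, batch_size)    # one start offset per worker
# every document index falls into exactly one [skip_n, skip_n + batch_size) batch
```

Each worker then reads its own contiguous slice of the collection independently, which is exactly the behaviour I want to reproduce against Elasticsearch.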
I tried the same approach in Elasticsearch using the "from" + "size" parameters instead of "scroll", e.g.:
from multiprocessing import Process

fromPtrPool = range(0, 30000 + 1, 6000)
processes = [Process(target=create_es_pickle,
                     args=(datalake_url, elastic_search_url, ack_year_month, fromPtr, 6000))
             for fromPtr in fromPtrPool]
for process in processes:
    process.start()
for process in processes:
    process.join()
Each worker spawned by the multiprocessing code above issues its request with:
import requests

def create_es_pickle(--with all arguments--):
    resp = requests.post(elastic_search_url +
                         '/data/_search?from={}'.format(fromPtr), json=query).json()
which produces the following error in the response object:
{'error': {'root_cause': [{'type': 'query_phase_execution_exception',
      'reason': 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}],
    'type': 'search_phase_execution_exception',
    'reason': 'all shards failed',
    'phase': 'query',
    'grouped': True,
    'failed_shards': [{'shard': 0, 'index': 'data', 'node': 'jt018XEgT6aIjIbXCZfZdg',
      'reason': {'type': 'query_phase_execution_exception',
        'reason': 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}}],
    'caused_by': {'type': 'query_phase_execution_exception',
      'reason': 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}},
  'status': 500}
All 6 processes terminate immediately, except for 1 that keeps running and fetching records.
I am new to ES and need advice on how to open multiple connections/cursors/scrolls that would split my large dataset into chunks along the lines of the scheme above. Please suggest.
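One direction I have been looking at (not yet verified against my cluster) is the scroll API's `slice` parameter, which is supposed to split one logical scroll into independent slices that separate processes can drain in parallel. A rough sketch, with the index name `data` carried over from my code and the HTTP client abstracted into a `post(url, body) -> dict` callable so the paging logic is testable:

```python
def sliced_scroll_body(slice_id, max_slices, size=1000):
    """Search body for one slice of a sliced scroll (size 1000 is my guess)."""
    return {"slice": {"id": slice_id, "max": max_slices},
            "size": size,
            "query": {"match_all": {}}}

def drain_slice(post, elastic_search_url, slice_id, max_slices):
    """Worker: fetch every document belonging to one slice.

    `post` abstracts the HTTP client, e.g.
        lambda url, body: requests.post(url, json=body).json()
    """
    resp = post(elastic_search_url + "/data/_search?scroll=5m",
                sliced_scroll_body(slice_id, max_slices))
    docs = []
    while resp["hits"]["hits"]:
        docs.extend(resp["hits"]["hits"])
        resp = post(elastic_search_url + "/_search/scroll",
                    {"scroll": "5m", "scroll_id": resp["_scroll_id"]})
    return docs
```

The idea would then be to start one `multiprocessing.Process` per slice, with `max` set to the number of workers, instead of one process per `from` offset. Does this sound like the right way to chunk the dataset, or is there a better pattern?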