How do I create parallel cursors over a dataset in Elasticsearch?

Asked: 2018-07-18 19:52:16

Tags: python-3.x elasticsearch multiprocessing

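I have 30,000 instances in Elasticsearch. I tried the scroll API in a while loop with the following settings:

1. "search context alive": 50m, "size": 2
2. "search context alive": 5m, "size": 100

Either way, I can only read one window of "size" documents at a time; there is no way to access multiple windows over the instances concurrently. With this single scroll, getting all the documents out would take up to 20 days, and there are many more instances to parse.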

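I ran into a similar problem in the past when working with a dataset in MongoDB. There, however, it was easy for me to open separate processes. For example: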

    import multiprocessing

    n_cores = 30
    collection_size = collectionTolookfor.count()
    # Roughly the ceiling of collection_size / n_cores, so the batches cover the whole collection
    batch_size = round(collection_size/n_cores+0.5)
    # Start offset of each batch, one per process
    skips = range(0, n_cores*batch_size, batch_size)
    # hitting service on this http://130.20.47.179:8012 server
    processes = [multiprocessing.Process(target=FTExtraction.entry, args=(full_Text_articles_English, skip_n, batch_size)) for skip_n in skips]
    for process in processes:
        process.start()
    for process in processes:
        process.join()

I tried a similar approach in Elasticsearch using the "from + size" clause instead of "scroll". For example:

from multiprocessing import Process

# Six start offsets: 0, 6000, ..., 30000
fromPtrPool = range(0, 30000+1, 6000)
processes = [Process(target=create_es_pickle, args=(datalake_url, elastic_search_url, ack_year_month, fromPtr, 6000)) for fromPtr in fromPtrPool]
for process in processes:
    process.start()
for process in processes:
    process.join()

Each of the processes above sends its request with the following code:

def create_es_pickle(--with all arguments--):
    resp = requests.post(elastic_search_url +
                         '/data/_search?from={}'.format(fromPtr), json=query).json()

This produces the following error in the response object:

{'error': {'root_cause': [{'type': 'query_phase_execution_exception', 'reason': 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}], 'type': 'search_phase_execution_exception', 'reason': 'all shards failed', 'phase': 'query', 'grouped': True, 'failed_shards': [{'shard': 0, 'index': 'data', 'node': 'jt018XEgT6aIjIbXCZfZdg', 'reason': {'type': 'query_phase_execution_exception', 'reason': 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}}], 'caused_by': {'type': 'query_phase_execution_exception', 'reason': 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}}, 'status': 500}

All 6 processes terminate immediately, except for 1 process, which keeps running on a smaller number of records.
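
The arithmetic behind this behaviour can be checked with a small sketch. It assumes the limit of 10000 quoted in the error message and the same start offsets and chunk size of 6000 used above; the helper name window_ok is made up for illustration.

```python
# Limit quoted in the error message (index.max_result_window).
MAX_RESULT_WINDOW = 10000
CHUNK_SIZE = 6000


def window_ok(from_ptr, size, limit=MAX_RESULT_WINDOW):
    """Return True if from + size stays within the result window limit."""
    return from_ptr + size <= limit


# Same start offsets as fromPtrPool in the question.
offsets = list(range(0, 30000 + 1, 6000))
accepted = [f for f in offsets if window_ok(f, CHUNK_SIZE)]
rejected = [f for f in offsets if not window_ok(f, CHUNK_SIZE)]
print("accepted:", accepted)
print("rejected:", rejected)
```

Under these assumptions only the request starting at offset 0 stays within the limit, which would be consistent with a single process continuing while the others fail.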

I am new to ES and need advice on how to open multiple connections/cursors/scrolls that split my large dataset into chunks, as in the scenario above. Please advise.
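
For context on what is being asked: the Elasticsearch scroll API (version 5.0 and later) accepts a "slice" object with "id" and "max" in the search body, which splits one scroll into independent slices that separate processes could consume. Below is a minimal sketch of the request bodies only, not tested against this cluster; the helper name sliced_scroll_body, the match_all query, and the choice of 5 slices are assumptions for illustration.

```python
def sliced_scroll_body(slice_id, max_slices, size=100, query=None):
    """Build the search body for one slice of a sliced scroll."""
    # Elasticsearch requires max to be greater than 1 and id to be below max.
    if max_slices < 2:
        raise ValueError("max_slices must be at least 2")
    if not 0 <= slice_id < max_slices:
        raise ValueError("slice_id must be in [0, max_slices)")
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "size": size,
        # Hypothetical default query; replace with the real one.
        "query": query if query is not None else {"match_all": {}},
    }


# One body per worker process (5 slices is an arbitrary choice).
bodies = [sliced_scroll_body(i, 5) for i in range(5)]
for body in bodies:
    print(body["slice"])
```

Each worker would POST its body to elastic_search_url + '/data/_search?scroll=5m' and then keep requesting '/_search/scroll' with the returned _scroll_id until no hits come back, so that every process walks its own slice of the index.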

0 answers:

No answers yet