Question

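I have a very simple indexing job that reads line-delimited JSON from files, applies a simple transformation to each incoming record, and then indexes the result with the elasticsearch parallel_bulk helper. I am using Python 3, with ES 6.7 on the server (2 nodes, 16 GB each).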

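When I run the code below, memory usage climbs steadily until the machine starts swapping. The data is not large (each line is under 500 characters), so the memory usage surprises me.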
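To put numbers on that growth, here is a minimal sketch that samples Python-level memory while an iterable is being consumed. It relies only on the standard-library tracemalloc module; the helper name consume_with_memory_report and the report_every parameter are hypothetical and not part of my indexing job.

```python
import tracemalloc


def consume_with_memory_report(iterable, report_every=10000):
    """Consume `iterable`, sampling traced memory every `report_every` items.

    Returns (items_consumed, samples), where each sample is a tuple of
    (item_count, current_bytes, peak_bytes) as reported by tracemalloc.
    """
    tracemalloc.start()
    samples = []
    count = 0
    try:
        for count, _ in enumerate(iterable, start=1):
            if count % report_every == 0:
                current, peak = tracemalloc.get_traced_memory()
                samples.append((count, current, peak))
    finally:
        # Always stop tracing, even if the iterable raises.
        tracemalloc.stop()
    return count, samples
```

Wrapping the parallel_bulk(...) call in this helper instead of looping over it directly should show whether the traced memory keeps rising with the item count. Note that tracemalloc only sees allocations made through Python's allocator and slows the run down noticeably.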

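I have experimented with the queue size, thread count, and chunk size, and the results are quite sensitive to all three. I want to optimize indexing speed on the client side (I have already done the server-side work to speed things up) while keeping memory use reasonable.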
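For scale, here is a back-of-the-envelope estimate of how much raw data those three settings should keep buffered. It rests on an assumption about the helper's buffering, not on documented behavior: at most queue_size chunks wait in the queue while each of the thread_count workers holds one more. The function name estimate_buffered_bytes is hypothetical.

```python
def estimate_buffered_bytes(chunk_size, queue_size, thread_count, avg_doc_bytes):
    """Rough upper bound on raw document bytes held client-side at once.

    Assumption (not documented behavior): at most `queue_size` chunks are
    queued and each of the `thread_count` workers holds one chunk in flight.
    """
    chunks_in_flight = queue_size + thread_count
    return chunks_in_flight * chunk_size * avg_doc_bytes


# Settings from this question, with 500 bytes per line as a generous average.
estimate = estimate_buffered_bytes(
    chunk_size=1000, queue_size=4, thread_count=4, avg_doc_bytes=500
)
print(f"{estimate / 1e6:.1f} MB of raw JSON buffered")
```

Under that assumption the settings in this question come to about 4 MB of raw JSON. Parsed Python dicts take several times more space than the raw text, but that still would not explain a machine running into swap.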

Code:

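import gzip
import json
import os

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

# Assumption: the original post did not show how the client was created.
es = Elasticsearch()


def yield_records(idx='sra_bulk_test_biosample'):
    # Stream one bulk action per line from every *.json.gz file in the current directory.
    for fname in (f for f in os.listdir() if f.endswith('.json.gz')):
        print(fname)
        with gzip.open(fname, 'rt', encoding='utf-8') as f:
            for n, line in enumerate(f, start=1):
                if n % 10000 == 0:
                    print(n)                       # progress indicator
                res = json.loads(line)
                yield {
                    "_op_type": "index",           # this is the default
                    "_index": idx,
                    "_type": 'doc',                # no more doc types after es_6
                    "_id": res['accession'],       # extract _id from record
                    "_source": res                 # use entire record as "source"
                }


def main():
    # yield_ok=False: only failed actions are yielded back to this loop.
    for success, info in parallel_bulk(es, actions=yield_records(), chunk_size=1000,
                                       queue_size=4, thread_count=4, yield_ok=False):
        if not success:
            print(info)


if __name__ == '__main__':
    main()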

Why does Elasticsearch Python parallel_bulk use so much memory?

0 answers: