Question

Good day,

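I have a problem with Elasticsearch bulk inserts. In my program, a text file is generated every 15 seconds, and the script below (run as os.popen('python my_script my_text_file')) tries to insert its data into Elasticsearch and renames the file on success.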

Each text file is 1-9 KB in size and has the following format:

{'_type': 'a', '_index': 'b', '_source': {'k0': 'v0'}, '_id': 'c0'}
{'_type': 'a', '_index': 'b', '_source': {'k1': 'v1'}, '_id': 'c1'}
{'_type': 'a', '_index': 'b', '_source': {'k2': 'v2'}, '_id': 'c2'}
...
{'_type': 'a', '_index': 'b', '_source': {'kN': 'vN'}, '_id': 'cN'}
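
The lines in this format are Python dict literals (single-quoted) rather than strict JSON. As a minimal sketch, assuming every non-empty line is a valid Python literal, ast.literal_eval from the standard library can parse them without any quote replacement, which would break on a value that contains an apostrophe. The helper name parse_actions is hypothetical and not part of the original script.

```python
import ast


def parse_actions(lines):
    """Parse non-empty lines holding Python dict literals into a list of dicts."""
    actions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action = ast.literal_eval(line)  # evaluates literals only, never arbitrary code
        if not isinstance(action, dict):
            raise ValueError("expected a dict literal, got: %r" % (line,))
        actions.append(action)
    return actions


sample = [
    "{'_type': 'a', '_index': 'b', '_source': {'k0': 'v0'}, '_id': 'c0'}",
    "{'_type': 'a', '_index': 'b', '_source': {'k1': \"it's\"}, '_id': 'c1'}",
]
parsed = parse_actions(sample)
```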

My script code is below:

import sys
import elasticsearch.helpers
import json
import os

def RetryElastic(data, maxCount=5):
    counter = 0
    while counter < maxCount:
        try:
            res = elasticsearch.helpers.bulk(es, data, max_retries=10, stats_only=False)
            assert len(res[1]) == 0
        except Exception as e:
            print(e.__class__, 'found')  # raise e
            counter += 1
            if counter >= maxCount:
                print(res, '\n', file=open('file.txt', 'a+'))
        else:
            os.rename(sys.argv[1], sys.argv[1] + "_sent")
            break

es = elasticsearch.Elasticsearch(['host'])
fileInfo = open(sys.argv[1]).read().splitlines()
data = (json.loads(i.replace("'", '"')) for i in fileInfo)
RetryElastic(data)
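
One property of the script above is worth keeping in mind: data is a generator expression, and a generator can be consumed only once. The sketch below uses a hypothetical stand-in, flaky_bulk, in place of elasticsearch.helpers.bulk (no Elasticsearch is involved) to show what a retry loop of this shape sees when the first attempt fails. Whether this actually happens in the setup described here is an assumption, not something confirmed in this thread.

```python
def flaky_bulk(actions, state):
    """Hypothetical stand-in for a bulk call: fails on the first call only.

    It consumes the iterable it is given, then returns (success_count, errors).
    """
    items = list(actions)  # a real bulk helper also iterates over its input
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionError("simulated transient failure")
    return len(items), []


def send_with_retry(actions, max_count=5):
    """Retry loop shaped like the one in the script; returns the final result."""
    state = {"calls": 0}
    for _ in range(max_count):
        try:
            return flaky_bulk(actions, state)
        except ConnectionError:
            continue  # try again with the same `actions` object
    return None


docs = [{"_id": "c0"}, {"_id": "c1"}, {"_id": "c2"}]

# Generator: drained by the failed first attempt, so the retry sends nothing
# and still reports no errors.
from_generator = send_with_retry(d for d in docs)

# List: can be iterated again, so the retry resends all three documents.
from_list = send_with_retry(list(docs))
```

In this simulation the generator case ends with zero successes and an empty error list, which a check like len(res[1]) == 0 would accept. Building a list first and comparing the success count with its length is a stricter check.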

Part of the code from the elasticsearch.helpers __init__.py (where bulk is defined):

    for ok, item in streaming_bulk(client, actions, **kwargs):
        # go through request-reponse pairs and detect failures
        if not ok:
            if not stats_only:
                errors.append(item)
            failed += 1
        else:
            success += 1

    return success, failed if stats_only else errors
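
The return statement in the excerpt above parses as (success, failed if stats_only else errors), so with stats_only=False the second element is the list of error items, and an empty list only says that no item was reported as failed. A minimal sketch of the same tallying, using a hypothetical tally function over hand-written (ok, item) pairs rather than a real streaming_bulk call:

```python
def tally(results, stats_only=False):
    """Count (ok, item) pairs the way the excerpt above does."""
    success, failed = 0, 0
    errors = []
    for ok, item in results:
        if not ok:
            if not stats_only:
                errors.append(item)
            failed += 1
        else:
            success += 1
    # Parses as: (success, (failed if stats_only else errors))
    return success, failed if stats_only else errors


pairs = [(True, {"index": {"_id": "c0"}}), (False, {"index": {"_id": "c1"}})]
detailed = tally(pairs)                  # second element: list of error items
counts = tally(pairs, stats_only=True)   # second element: number of failures
nothing_sent = tally([])                 # no actions at all
```

An empty input also yields an empty error list, so the error list alone cannot distinguish "everything was indexed" from "nothing was sent".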

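My res[1] is the list of errors. I have no error file (= no errors), and all of my files are renamed to file + "_sent". But when I check the data in Elasticsearch, I see that not all of it was inserted: most of the data is there, but some is not (the data from some files is completely missing from Elasticsearch). This happens even though I get no errors. Typically I have to insert 100+ files.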

Where is my mistake?

Answer 1

Your IDs may be colliding, in which case Elasticsearch does not keep all of the data. Make sure all of your IDs are unique.
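
To check for collisions before sending, the IDs can be counted across all files. This is a minimal sketch, assuming the actions have already been parsed into dicts with _index, _type and _id keys as in the question; the function name find_duplicate_ids is hypothetical. An index action whose _id already exists replaces the earlier document and is still reported as a success, which is why collisions would not show up in the error list.

```python
from collections import Counter


def find_duplicate_ids(actions):
    """Return {(index, type, id): count} for every key that occurs more than once."""
    counts = Counter(
        (a.get("_index"), a.get("_type"), a.get("_id")) for a in actions
    )
    return {key: n for key, n in counts.items() if n > 1}


actions = [
    {"_type": "a", "_index": "b", "_source": {"k0": "v0"}, "_id": "c0"},
    {"_type": "a", "_index": "b", "_source": {"k1": "v1"}, "_id": "c1"},
    {"_type": "a", "_index": "b", "_source": {"k2": "v2"}, "_id": "c0"},  # same _id as the first
]
duplicates = find_duplicate_ids(actions)
```

Running this over the actions from all pending files, not only one file, shows whether a later file would overwrite documents from an earlier one.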
