I am trying to import a large amount of data using Elasticsearch's parallel_bulk. This is my index structure:
{
    "_index" : "myindex",
    "_type" : domain,
    "_id" : md5(email),
    "_score" : 1.0,
    "_source" : {
        "purchase_date" : purchase_date,
        "amount" : amount
    }
}
And this is my Python code:
def insert(input_file):
    paramL = []
    with open(input_file) as f:
        for line in f:
            line = line.rstrip()
            fields = line.split(',')
            purchase_date = fields[0]
            amount = fields[1]
            email = fields[2]
            id_email = getMD5(email)
            doc = {
                "email": email,
                "purchase_date": purchase_date,
                "amount": amount
            }
            ogg = {
                '_op_type': 'index',
                '_index': index_param,
                '_type': doctype_param,
                '_id': id_email,
                '_source': doc
            }
            paramL.append(ogg)

            if len(paramL) > 500000:
                for success, info in helpers.parallel_bulk(client=es, actions=paramL, thread_count=4):
                    if not success:
                        print "Insert failed: ", info
                # empty paramL after the bulk insert
                del paramL[:]
The file contains 42,644,394 lines, and I expect the data to be inserted each time the "paramL" list reaches about 5,000,000 elements. When I run the script, it inserts roughly 436,226 values and then crashes with the following error:
Traceback (most recent call last):
  File "test-2-0.py", line 133, in <module>
    main()
  File "test-2-0.py", line 131, in main
    insert(args.file)
  File "test-2-0.py", line 82, in insert
    for success, info in helpers.parallel_bulk(client=es, actions=paramL, thread_count=4):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 306, in parallel_bulk
    _chunk_actions(actions, chunk_size, max_chunk_bytes, client.transport.serializer)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'127.0.0.1', port=9200): Read timed out. (read timeout=10))
I also tried to increase the timeout by passing it to the Elasticsearch constructor:
es = Elasticsearch(['127.0.0.1'], request_timeout=30)
But the result is the same.
Answer 0 (score: 1)
Honestly, I have never done a bulk import with that many documents, so I don't know why this error occurs. In your case I would suggest not building a list (paramL) at all, but instead feeding your data through a generator function, as an elastic developer recommends on the Elastic forum as the best practice for very large bulk ingestion: https://discuss.elastic.co/t/helpers-parallel-bulk-in-python-not-working/39498/3. Something like this:
def insert(input_file):
    with open(input_file) as f:
        for line in f:
            line = line.rstrip()
            fields = line.split(',')
            purchase_date = fields[0]
            amount = fields[1]
            email = fields[2]
            id_email = getMD5(email)
            doc = {
                "email": email,
                "purchase_date": purchase_date,
                "amount": amount
            }
            yield {
                '_op_type': 'index',
                '_index': index_param,
                '_type': doctype_param,
                '_id': id_email,
                '_source': doc
            }

for success, info in helpers.parallel_bulk(client=es, actions=insert(input_file), thread_count=4):
    if not success:
        print "Insert failed: ", info
You can also increase the heap space dedicated to Elasticsearch in the Java virtual machine by editing /etc/elasticsearch/jvm.options.
To allocate 2 GB of RAM you should change the lines below. If your machine has 4 GB, keep roughly 1 GB for the operating system, so you can allocate at most 3 GB:
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms2g
-Xmx2g
Then you have to restart the service:
sudo service elasticsearch restart
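After the restart you can check that the node actually picked up the new heap size. This is only a sketch, assuming a single local node on 127.0.0.1:9200 and the standard _nodes JVM info endpoint exposed by elasticsearch-py as es.nodes.info():

# Minimal sketch (assumption: default local node on 127.0.0.1:9200).
# nodes.info(metric='jvm') returns JVM info, including the max heap size.
from elasticsearch import Elasticsearch

es = Elasticsearch(['127.0.0.1'])
for node_id, node in es.nodes.info(metric='jvm')['nodes'].items():
    heap_gb = node['jvm']['mem']['heap_max_in_bytes'] / (1024.0 ** 3)
    print "%s heap: %.1f GB" % (node['name'], heap_gb)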
Try again. Good luck.