Question

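I'm stuck on a problem. I need to export Elasticsearch documents and write them to CSV. The number of documents ranges from 50,000 to 5 million. I'm running into serious performance problems, and I feel like I'm missing something here.

Right now I have a dataset of 400,000 documents that I'm trying to scan and scroll through; eventually they will be formatted and written to CSV. But just outputting them takes 20 minutes! That's crazy.

Here is my script: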

import elasticsearch
import elasticsearch.exceptions
import elasticsearch.helpers as helpers
import time

es = elasticsearch.Elasticsearch(['http://XX.XXX.XX.XXX:9200'], retry_on_timeout=True)

# Scan and scroll through the whole index, 1000 documents per batch
scanResp = helpers.scan(client=es, scroll="5m", index='MyDoc', doc_type='MyDoc', timeout="50m", size=1000)

start_time = time.time()
for resp in scanResp:
    data = resp
    # Print the fourth value of each hit (list() is needed on Python 3)
    print(list(data.values())[3])

print("--- %s seconds ---" % (time.time() - start_time))

I'm using a hosted AWS m3.medium server for Elasticsearch.

Can anyone tell me what I might be doing wrong here?

Answer 1

A simple way to export ES data to CSV is to use Logstash with an elasticsearch input and a csv output, with the following es2csv.conf configuration:

input {
  elasticsearch {
   # Logstash 1.x syntax; newer versions of this plugin use hosts => ["localhost:9200"]
   host => "localhost"
   port => 9200
   index => "MyDoc"
  }
}
filter {
 mutate {
  remove_field => [ "@version", "@timestamp" ]
 }
}
output {
 csv {
   # specify the field names you want
   fields => ["field1", "field2", "field3"]
   path => "/path/to/your/file.csv"
 }
}

You can then export the data easily by running bin/logstash -f es2csv.conf
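If you would rather stay in Python, the same export can be sketched with the standard csv module. This is only a minimal sketch, not a tuned solution: write_hits_to_csv is a hypothetical helper name, and it assumes each hit is a dict with a "_source" key, which is the shape helpers.scan yields. It writes rows straight to a file rather than printing each document to the console.

```python
import csv


def write_hits_to_csv(hits, path, fields):
    """Write the _source of each hit to a CSV file and return the row count.

    hits   -- any iterable of dicts with a "_source" key (assumed shape)
    path   -- output file path
    fields -- column names to export, in order
    """
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops fields that are not listed;
        # restval="" fills in fields that a document does not have.
        writer = csv.DictWriter(
            f, fieldnames=fields, extrasaction="ignore", restval=""
        )
        writer.writeheader()
        for hit in hits:
            writer.writerow(hit.get("_source", {}))
            count += 1
    return count


# Hypothetical usage with the client from the question (not run here):
#   import elasticsearch
#   import elasticsearch.helpers as helpers
#   es = elasticsearch.Elasticsearch(["http://XX.XXX.XX.XXX:9200"])
#   hits = helpers.scan(client=es, index="MyDoc", scroll="5m", size=1000)
#   write_hits_to_csv(hits, "/path/to/your/file.csv",
#                     ["field1", "field2", "field3"])
```

Because the helper accepts any iterable, it consumes the scan generator one hit at a time and never holds the full result set in memory.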

Elasticsearch bulk writes are slow using scan and scroll
