Question

I have an application in which a document field previously did not need to be an array, e.g.

"tags" : "tag1"

Now the application requires that field to be an array, like

"tags" : ["tag1","tag2"]

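There are currently 4.5M documents in ElasticSearch.

So I wrote a bash script to update 1000 documents, but it takes more than 2 minutes, which means it would take about 8 days to run through all 4.5M documents. It seems I am doing something wrong. What is the best way to do this in Elasticsearch? Here is the bash script: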

#!/bin/bash 
echo "Starting"
IDS=$(curl -XGET 'http://elastichost/index/_search?size=1000' -d '{ "query" : {"match_all" : {}},"fields":"[_id]"}' | grep -Po '"_id":.*?[^\\]",'| awk -F':' '{print $2}'| sed -e 's/^"//' -e 's/",$//')
#Create an array out of the IDS
array=($IDS)
#Loop through the IDS and update them
for i in "${!array[@]}"
    do
        echo "$i=>|${array[i]}|"
            curl -XPOST "http://elastichost/index/type/${array[i]}/_update" -d '
              {
                "script" : "ctx._source.tags = [ctx._source.tags]"
              }'
    done
printf '\nFinished\n'

Answer 1

Append "> /dev/null 2>&1 &" to the command so that the process is forked into the background and its output is not logged anywhere.

The equivalent shell command looks like this:

    curl -XPOST "http://elastichost/index/type/${array[i]}/_update" -d '
      {
        "script" : "ctx._source.tags = [ctx._source.tags]"
      }' > /dev/null 2>&1 &

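Forking a process takes a little over 1 ms, and the process then uses about 4k of resident memory. The curl process itself takes the usual 300 ms or so for an SSL request.

On my medium-sized machine I can fork 100 HTTPS curl requests per second without them stacking up in memory. Without SSL it can do a lot more: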

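Forking a process without waiting for its output is fast.
curl takes about the same time to make the request as a raw socket would, but it does so out of band.
Forking curl requires nothing more than ordinary Unix primitives.
Forking sets a single request back by only a few milliseconds, but a large number of concurrent forks will start to slow your server down.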
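
Since a large number of concurrent forks can slow the server down, one option is to cap how many requests are in flight at once. The following is only a minimal sketch of that idea, not part of the original script: the names update_all, ES_URL, MAX_JOBS and CURL are made up for illustration, and it simply waits for each batch of background requests to finish before forking the next.

```shell
#!/bin/bash
# Sketch: fork one update per document ID, at most MAX_JOBS at a time.
# update_all, ES_URL, MAX_JOBS and CURL are hypothetical names (assumptions).
ES_URL="${ES_URL:-http://elastichost/index/type}"
MAX_JOBS="${MAX_JOBS:-50}"
CURL="${CURL:-curl}"

update_all() {
    local id running=0
    for id in "$@"; do
        # Fork the request into the background and discard its output.
        "$CURL" -s -XPOST "$ES_URL/$id/_update" \
            -d '{ "script" : "ctx._source.tags = [ctx._source.tags]" }' \
            > /dev/null 2>&1 &
        running=$((running + 1))
        # Once a full batch has been forked, wait for it before continuing.
        if [ "$running" -ge "$MAX_JOBS" ]; then
            wait
            running=0
        fi
    done
    wait   # let the last, possibly partial, batch finish
}
```

With the array from the question, this would be called as update_all "${array[@]}". A suitable value for MAX_JOBS depends on the machine and the cluster, so treat 50 as a placeholder.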

Do not echo anything to the terminal.

Reference: link

ElasticSearch: updating all documents using the bulk API and a script
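
For the bulk API named in the title, a rough sketch of what a single request could look like is given below. It assumes the same host, index and type as the question's script, assumes that the IDs contain no characters needing JSON escaping, and uses the hypothetical names build_bulk_body and ES_BULK_URL. Each document contributes two newline-terminated lines to the body: an update action line, followed by a line carrying the script.

```shell
#!/bin/bash
# Sketch: build an Elasticsearch _bulk request body from a list of IDs.
# build_bulk_body and ES_BULK_URL are hypothetical names (assumptions).
ES_BULK_URL="${ES_BULK_URL:-http://elastichost/index/type/_bulk}"

build_bulk_body() {
    local id
    for id in "$@"; do
        # Action line: which document to update.
        printf '{ "update" : { "_id" : "%s" } }\n' "$id"
        # Payload line: the same script the per-document loop sends.
        printf '{ "script" : "ctx._source.tags = [ctx._source.tags]" }\n'
    done
}

# Example call (commented out so that loading this file has no side effects):
# build_bulk_body "${array[@]}" | curl -s -XPOST "$ES_BULK_URL" --data-binary @- > /dev/null
```

Sending the 1000 IDs of one page this way replaces 1000 HTTP requests with one. Note that --data-binary is used instead of -d because the bulk format depends on the newlines being preserved.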
