Question

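I want to index 1 billion records. Each record has two attributes (attribute1 and attribute2). All records that have the same value in attribute1 must be merged. For example, I have two records: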

attribute1  attribute2
1   4
1   6

My Elasticsearch document must be:

{
   "attribute1": "1",
   "attribute2": "4,6"
}
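
For reference, the merge rule above can be sketched in Python roughly as follows. The helper name merge_batch and the tuple-based input format are assumptions made for illustration; they are not part of the original setup.

```python
from collections import defaultdict


def merge_batch(records):
    """Merge (attribute1, attribute2) pairs that share the same attribute1.

    Hypothetical helper: returns one dict per distinct attribute1, with the
    attribute2 values joined by commas in the order they were seen.
    """
    grouped = defaultdict(list)
    for attr1, attr2 in records:
        grouped[attr1].append(attr2)
    return [
        {"attribute1": str(attr1), "attribute2": ",".join(str(v) for v in values)}
        for attr1, values in grouped.items()
    ]


merged = merge_batch([(1, 4), (1, 6)])
```

This only covers the in-memory step for a single batch; values already stored in Elasticsearch from earlier batches still have to be accounted for separately.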

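Because of the volume of data, I have to read one batch (about 1,000 records), merge its records in memory according to the rule above, then search Elasticsearch for them, merge them with the search results, and index or reindex them. In short, I have to search and index each batch separately. I implemented this, but in some cases Elasticsearch does not return all results, and some documents are indexed more than once. After each indexing step I refresh Elasticsearch so that it is ready for the next search, but in some cases this does not work. My index settings are as follows: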

{
"test_index": {
    "settings": {
        "index": {
            "refresh_interval": "-1",
            "translog": {
                "flush_threshold_size": "1g"
            },
            "max_result_window": "1000000",
            "creation_date": "1464577964635",
            "store": {
                "throttle": {
                    "type": "merge"
                }
            }
        },
        "number_of_replicas": "0",
        "uuid": "TZOse2tLRqGk-vHRMGc2GQ",
        "version": {
            "created": "2030199"
        },
        "warmer": {
            "enabled": "false"
        },
        "indices": {
            "memory": {
                "index_buffer_size": "40%"
            }
        },
        "number_of_shards": "5",
        "merge": {
           "policy": {
                "max_merge_size": "2g"
            }
        }
    }
}
}

How can I solve this problem?

Are there other settings to handle this situation?

Answer 1

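In your bulk commands, you need to use an index action for the first occurrence and then an update action with a script to append to the attribute2 field: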

{ "index" : { "_index" : "test_index", "_type" : "test_type", "_id" : "1" } }
{ "attribute1" : "1", "attribute2": [4] }
{ "update" : { "_index" : "test_index", "_type" : "test_type", "_id" : "1" } }
{ "script" : { "inline": "ctx._source.attribute2 += attr2", "params" : {"attr2" : 6}}}

After the first index action, your document will look like this:

{
   "attribute1": "1",
   "attribute2": [4]
}

After the second update action, your document will look like this:

{
   "attribute1": "1",
   "attribute2": [4, 6]
}
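
As a rough illustration, the bulk body above could be generated from a batch of records with something like the following Python sketch. The function name build_bulk_body, the tuple input format, and the use of attribute1 as the document _id are assumptions for the example; sending the body to the _bulk endpoint is left out.

```python
import json


def build_bulk_body(records, index="test_index", doc_type="test_type"):
    """Build a newline-delimited bulk body from (attribute1, attribute2) pairs.

    Hypothetical helper: the first occurrence of an attribute1 value in this
    call becomes an index action, later occurrences become scripted updates.
    """
    seen = set()
    lines = []
    for attr1, attr2 in records:
        meta = {"_index": index, "_type": doc_type, "_id": str(attr1)}
        if attr1 not in seen:
            seen.add(attr1)
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps({"attribute1": str(attr1), "attribute2": [attr2]}))
        else:
            script = {
                "inline": "ctx._source.attribute2 += attr2",
                "params": {"attr2": attr2},
            }
            lines.append(json.dumps({"update": meta}))
            lines.append(json.dumps({"script": script}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([(1, 4), (1, 6)])
```

Note that this sketch only tracks first occurrences within a single call. An index action replaces any existing document with the same _id, so a value first seen in an earlier batch would be overwritten unless that case is handled separately.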

Note that it is also possible to use only update actions, with doc_as_upsert and script.
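
One way an update-only bulk body might look is sketched below in Python. This variant pairs the script with an upsert document (used when the document does not exist yet) rather than doc_as_upsert, which is an assumption on my part about how to express the idea; build_upsert_body and the tuple input format are likewise hypothetical.

```python
import json


def build_upsert_body(records, index="test_index", doc_type="test_type"):
    """Build a bulk body that uses only update actions.

    Hypothetical helper: each record becomes a scripted update; the "upsert"
    document is inserted as-is when no document with that _id exists.
    """
    lines = []
    for attr1, attr2 in records:
        meta = {"_index": index, "_type": doc_type, "_id": str(attr1)}
        payload = {
            "script": {
                "inline": "ctx._source.attribute2 += attr2",
                "params": {"attr2": attr2},
            },
            "upsert": {"attribute1": str(attr1), "attribute2": [attr2]},
        }
        lines.append(json.dumps({"update": meta}))
        lines.append(json.dumps(payload))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


upsert_body = build_upsert_body([(1, 4), (1, 6)])
```

With this shape no prior search should be needed to decide between creating and updating a document, since each action creates the document if it is missing and appends to attribute2 otherwise.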
