Question

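My Elasticsearch cluster dropped from 2B documents to 900M records. On AWS it shows

Relocating shards: 4

while showing

Active shards: 35

and

Active primary shards: 34

(might not be relevant, but here are the rest of the stats):

Nodes: 9

Data nodes: 6

Unassigned shards: 17

When running

GET /_cluster/allocation/explain

it returns: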

{
  "index": "datauwu",
  "shard": 6,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2019-10-31T17:02:11.258Z",
    "details": "node_left[removedforsecuritybecimparanoid1]",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "removedforsecuritybecimparanoid2",
      "node_name": "removedforsecuritybecimparanoid2",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid3",
      "node_name": "removedforsecuritybecimparanoid3",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid4",
      "node_name": "removedforsecuritybecimparanoid4",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid5",
      "node_name": "removedforsecuritybecimparanoid5",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid6",
      "node_name": "removedforsecuritybecimparanoid6",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid7",
      "node_name": "removedforsecuritybecimparanoid7",
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}

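I'm a bit confused about what exactly this means. Does it mean that my Elasticsearch cluster did not lose data and is instead relocating it to different shards, or is it unable to find it?

If it can't find the shards, does that mean my data is lost? If so, what could be the reason, and how can I prevent this from happening in the future?

I have not set up replicas, because I was indexing data and replicas slow indexing down.

As a side note, my record count went down to as low as 400M at one point, but then randomly went back up to 900M. I don't know what this means, and any insight would be appreciated.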

Answer 1

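"reason": "NODE_LEFT"

and:

I have not set up replicas, because I was indexing data and replicas slow indexing down.

If the node that held the primary shard has gone away, then yes, your data is gone too. After all, if there are no replicas, where would the cluster retrieve the data from once the primary (and only) shard is no longer part of the cluster? You will either need to bring the node holding those shards back up and rejoin it to the cluster, or the data is gone.

The error message is saying: "You want me to allocate a primary shard for an index that I know existed, but I can't find any copy of that primary shard. I'm not going to allocate it again, in case the previous primary shows up again."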
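To make that reading of the explain output concrete, here is a minimal Python sketch that condenses an allocation-explain response into the two facts that matter: why the shard is unassigned, and whether any node still reports a copy on disk. The function name `summarize_allocation_explain` is made up for illustration, and the sketch assumes the response has the shape shown in the question (a `store` object per node, with `"found": false` when no copy exists). The sample response is an abbreviated version of the one in the question.

```python
def summarize_allocation_explain(explain: dict) -> dict:
    """Condense an allocation-explain response (hypothetical helper)."""
    decisions = explain.get("node_allocation_decisions", [])
    # Assumption: a node may hold a copy only if it has a "store" section
    # that does not explicitly say "found": false.
    nodes_with_copy = [
        d["node_name"]
        for d in decisions
        if "store" in d and d["store"].get("found", True)
    ]
    return {
        "index": explain["index"],
        "shard": explain["shard"],
        "primary": explain["primary"],
        "reason": explain.get("unassigned_info", {}).get("reason"),
        "nodes_with_copy": nodes_with_copy,
        "copy_found_in_cluster": bool(nodes_with_copy),
    }


# Abbreviated response from the question (only two of the six nodes shown).
sample = {
    "index": "datauwu",
    "shard": 6,
    "primary": True,
    "current_state": "unassigned",
    "unassigned_info": {"reason": "NODE_LEFT"},
    "can_allocate": "no_valid_shard_copy",
    "node_allocation_decisions": [
        {"node_name": "removedforsecuritybecimparanoid2",
         "node_decision": "no", "store": {"found": False}},
        {"node_name": "removedforsecuritybecimparanoid3",
         "node_decision": "no", "store": {"found": False}},
    ],
}

summary = summarize_allocation_explain(sample)
print(summary)
```

An empty `nodes_with_copy` list is the situation described above: no node that is currently in the cluster has the shard's data.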

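You can force Elasticsearch to reallocate a primary shard (and explicitly accept that the data in the previous primary shard is gone) with a reroute using the allocate_stale_primary command of the cluster reroute API. Note that allocate_stale_primary needs a stale copy of the shard on the target node; when no node holds any copy, as in your output where every node reports "found": false, the reroute command that creates a fresh, empty primary is allocate_empty_primary, which takes the same parameters. The stale-primary form looks like this: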

curl -XPOST '127.0.0.1:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d '{
    "commands" : [
        {
            "allocate_stale_primary" : {
                "index" : "datauwu",
                "shard" : 6,
                "node" : "target-data-node-id",
                "accept_data_loss" : true
            }
        }
    ]
}'
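As a rough sketch of how the reroute body could be built programmatically, the Python below picks between the two forced-allocation commands of the reroute API: `allocate_stale_primary` when the target node still holds an old copy of the shard, and `allocate_empty_primary` when no copy exists anywhere. The helper name `build_reroute_body` and its `has_stale_copy` flag are assumptions for illustration; the sketch only builds the JSON and does not send it.

```python
import json


def build_reroute_body(index: str, shard: int, node: str,
                       has_stale_copy: bool) -> dict:
    """Build a _cluster/reroute body that forces a primary (hypothetical helper)."""
    # Both commands discard whatever the lost primary held, which is why
    # Elasticsearch requires accept_data_loss to be set explicitly.
    command = ("allocate_stale_primary" if has_stale_copy
               else "allocate_empty_primary")
    return {
        "commands": [
            {
                command: {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# In the question every node reports "found": false, so there is no stale copy.
body = build_reroute_body("datauwu", 6, "target-data-node-id",
                          has_stale_copy=False)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the `-d` payload of a POST to `_cluster/reroute`.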

Turning off replicas for anything other than development with throwaway data is generally not a good idea.
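If replicas are switched off only to speed up a bulk load, they can be switched back on afterwards, because `index.number_of_replicas` is a dynamic setting that the update index settings API accepts on a live index. The sketch below just builds the request; the helper name `replica_settings_request` is made up, and the data is still unprotected for as long as the replica count is 0, so this only makes sense when the source data can be re-indexed.

```python
def replica_settings_request(index: str, replicas: int) -> tuple:
    """Return (method, path, body) for changing the replica count (hypothetical helper)."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    return (
        "PUT",
        f"/{index}/_settings",
        {"index": {"number_of_replicas": replicas}},
    )


# Before the bulk load: no replicas, faster indexing, no redundancy.
before_load = replica_settings_request("datauwu", 0)
# After the bulk load: one replica per primary, so a single node can be lost.
after_load = replica_settings_request("datauwu", 1)
print(before_load)
print(after_load)
```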

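As a side note, my record count went down to as low as 400M at one point, but then randomly went back up to 900M. I don't know what this means, and any insight would be appreciated.

This happens because the shards are not visible in the cluster, which can occur when every copy of a shard is being allocated, relocated, or recovered. It corresponds to a RED cluster state. You can mitigate it by making sure you have at least 1 replica (ideally, enough replicas to withstand the loss of N data nodes in your cluster). That way Elasticsearch can keep one copy serving as the primary while the others are moved around.

If you only have a primary and no replicas, then while that primary is being recovered or relocated, the data in that shard is not visible in the cluster. Once the shard is active again, the documents in it become visible.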
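The fluctuating count can be illustrated with a small Python sketch over rows shaped like the output of `GET /_cat/shards?format=json` (fields `prirep`, `state`, `docs`). The helper name `visible_docs` and the document counts are invented for illustration, and treating only `STARTED` primaries as searchable is a simplifying assumption.

```python
def visible_docs(cat_shards: list) -> int:
    """Sum documents held by started primary shards (hypothetical helper)."""
    total = 0
    for row in cat_shards:
        # Count each document once: primaries only, and only when started.
        if row["prirep"] == "p" and row["state"] == "STARTED":
            # _cat returns numbers as strings; unassigned shards have no count.
            total += int(row["docs"] or 0)
    return total


# Invented sample: one primary has no active copy, so its documents are hidden.
during_outage = [
    {"index": "datauwu", "shard": "0", "prirep": "p", "state": "STARTED", "docs": "150000000"},
    {"index": "datauwu", "shard": "1", "prirep": "p", "state": "STARTED", "docs": "250000000"},
    {"index": "datauwu", "shard": "2", "prirep": "p", "state": "UNASSIGNED", "docs": None},
]
# The same cluster after shard 2 has been recovered and is active again.
after_recovery = [
    dict(row, state="STARTED", docs=row["docs"] or "500000000")
    for row in during_outage
]

print(visible_docs(during_outage))
print(visible_docs(after_recovery))
```

With no replicas, each unavailable primary removes its whole share of documents from the total until it is active again.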
