Is there any way to prevent a Terraform google_container_cluster from being destroyed and recreated when nothing has changed?

Asked: 2019-05-20 19:35:01

Tags: terraform terraform-provider-gcp

I know there are similar GitHub issues against the Terraform Google provider concerning the idempotency of google_container_cluster; however, none of them seem to match my simple example. Every attempt to apply a Terraform plan wants to destroy and recreate my cluster, which takes more than 6 minutes.

There is no apparent change to the cluster, yet the Terraform plan shows that the cluster's current ID is the cluster name while the new ID is <computed>; because of that, the cluster must be recreated. Can I prevent this?

I followed the recommended example for setting up a cluster: define the cluster with remove_default_node_pool = true and initial_node_count = 1, then create an explicit node pool as a dependent resource. I have also tried creating a default cluster with the initial node pool. I am not specifying any of the other attributes associated with the known idempotency issues (such as master_ipv4_cidr_block).

Here is the basic Terraform setup. I am using Terraform v0.11.13 and provider.google v2.6.0.

provider "google" {
  project     = "${var.google_project}"
  region      = "${var.google_region}"
  zone        = "${var.google_zone}"
}

resource "google_container_cluster" "cluster" {
  project                  = "${var.google_project}"
  name                     = "${var.cluster_name}"
  location                 = "${var.google_region}"

  remove_default_node_pool = true
  initial_node_count       = 1

  master_auth {
    username = ""
    password = ""
  }

  timeouts {
    create = "20m"
    update = "15m"
    delete = "15m"
  }

}

resource "google_container_node_pool" "cluster_nodes" {
  name       = "${var.cluster_name}-node-pool"
  cluster    = "${google_container_cluster.cluster.name}"
  node_count = "${var.cluster_node_count}"

  node_config {
    preemptible  = "${var.preemptible}"
    disk_size_gb = "${var.disk_size_gb}"
    disk_type    = "${var.disk_type}"
    machine_type = "${var.machine_type}"
    oauth_scopes = [
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/compute",
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }

  timeouts {
    create = "20m"
    update = "15m"
    delete = "15m"
  }
}

output "cluster_ca_certificate" {
  value = "${google_container_cluster.cluster.master_auth.0.cluster_ca_certificate}"
}

output "host" {
  value = "${google_container_cluster.cluster.endpoint}"
}

provider "kubernetes" {
  host                   = "${google_container_cluster.cluster.endpoint}"
  client_certificate     = "${base64decode(google_container_cluster.cluster.master_auth.0.client_certificate)}"
  client_key             = "${base64decode(google_container_cluster.cluster.master_auth.0.client_key)}"
  cluster_ca_certificate = "${base64decode(google_container_cluster.cluster.master_auth.0.cluster_ca_certificate)}"
}

And so on. Not shown are the service account and cluster role binding used to enable the Helm service account, along with the Helm releases. I don't believe they are relevant here.

If I run terraform apply twice, the second invocation wants to destroy and create a new cluster. Nothing has changed, so this should not happen.
Normally that would be tolerable, except that I tend to see a lot of timeouts from the Terraform provider and have to re-apply, which does not help, because re-applying causes the cluster to be destroyed and recreated.

The output of terraform apply is as follows:

terraform-gke$ terraform apply
data.template_file.gke_values: Refreshing state...
google_container_cluster.cluster: Refreshing state... (ID: test-eric)

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create
-/+ destroy and then create replacement

Terraform will perform the following actions:

-/+ google_container_cluster.cluster (new resource required)
      id:                                              "test-eric" => <computed> (forces new resource)
      additional_zones.#:                              "3" => <computed>
      addons_config.#:                                 "1" => <computed>
      cluster_autoscaling.#:                           "0" => <computed>
      cluster_ipv4_cidr:                               "10.20.0.0/14" => <computed>
      enable_binary_authorization:                     "" => <computed>
      enable_kubernetes_alpha:                         "false" => "false"
      enable_legacy_abac:                              "false" => "false"
      enable_tpu:                                      "" => <computed>
      endpoint:                                        "34.66.113.0" => <computed>
      initial_node_count:                              "1" => "1"
      instance_group_urls.#:                           "0" => <computed>
      ip_allocation_policy.#:                          "0" => <computed>
      location:                                        "us-central1" => "us-central1"
      logging_service:                                 "logging.googleapis.com" => <computed>
      master_auth.#:                                   "1" => "1"
      master_auth.0.client_certificate:                "" => <computed>
      master_auth.0.client_certificate_config.#:       "1" => "0" (forces new resource)
      master_auth.0.client_key:                        <sensitive> => <computed> (attribute changed)

2 Answers:

Answer 0 (score: 0):

It looks as if you have switched from basic (username/password) authorization to TLS authorization, because according to your log you are generating new certificates, which forces a new cluster.
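A minimal sketch of one way to keep that attribute from drifting, assuming the provider 2.x master_auth block supports a client_certificate_config sub-block (this snippet is not part of the original answer): declare the sub-block explicitly so the configuration matches what GKE now reports.

resource "google_container_cluster" "cluster" {
  # ... other settings as in the question ...

  master_auth {
    username = ""
    password = ""

    # Declaring this explicitly keeps the plan from showing
    # master_auth.0.client_certificate_config.# "1" => "0" (forces new resource).
    client_certificate_config {
      issue_client_certificate = false
    }
  }
}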

Answer 1 (score: 0):

So, in the end, this is a provider bug, but it is triggered by a behavior change in the Kubernetes master service between versions 1.11.x and 1.12.x, which Google recently rolled out as the default for GKE nodes. It is captured as issue #3369 against the Terraform Google provider on GitHub.

The workaround is to tell Terraform to ignore changes in master_auth and network:

resource google_container_cluster cluster {
  master_auth {
    username = ""
    password = ""
  }
  # Workaround for issue 3369 (until provider version 3.0.0?)
  # This is necessary when using GKE node version 1.12.x or later.
  # It is possible to make GKE use node version 1.11.x as an
  # alternative workaround.
  lifecycle {
    ignore_changes = [ "master_auth", "network" ]
  }
}
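For the alternative workaround mentioned in the comment above (keeping GKE on a 1.11.x node version), a rough sketch would pin the master and node pool versions. The version strings below are placeholders rather than values from the original post; valid versions can be listed with gcloud container get-server-config.

resource google_container_cluster cluster {
  # Placeholder: substitute a concrete 1.11.x version reported by
  # `gcloud container get-server-config`.
  min_master_version = "1.11"
}

resource google_container_node_pool cluster_nodes {
  # Pin the node pool to a matching 1.11.x release (placeholder value).
  version = "1.11"
}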

NB Perhaps this will help others who run into the same problem... It is hard to search the web and GitHub for relevant answers to this kind of issue, because authors use many different terms to describe the behavior Terraform exhibits. The problem is also sometimes described as an issue with Terraform idempotency or with Terraform detecting spurious changes.