ECS task run in a cluster inside a private subnet stays in PROVISIONING state

Date: 2020-08-27 18:14:31

Tags: amazon-web-services amazon-s3 terraform amazon-ecs

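We want to build an ECS cluster with the following characteristics:

  1. It must run inside a VPC, so we need awsvpc network mode.
  2. It must use GPU instances, so we cannot use Fargate.
  3. It must provision instances dynamically, so we need a capacity provider.
  4. It will run tasks (batch jobs) that are triggered directly through the AWS ECS API, so we do not need a service, only a task definition.
  5. The tasks must have access to S3 (the internet), so according to the AWS documentation (a reference to docs) the instances must be placed in a private subnet.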

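We have already read this post on Stack Overflow, which says that we need a private subnet whose route table points to a NAT gateway configured in a public subnet, and that the public subnet must in turn point to an internet gateway. We already have this configuration. We have also configured an S3 VPC endpoint in the route table.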
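As a rough way to double-check that routing, the sketch below classifies a subnet from the routes of its route table. It assumes the list has the shape of `RouteTables[].Routes` in an EC2 DescribeRouteTables response (for example as returned through boto3). The function name and every ID in the example are made up for illustration.

```python
def classify_default_route(routes):
    """Classify a subnet by where its default route (0.0.0.0/0) points.

    `routes` is assumed to be a list of dicts shaped like the entries of
    RouteTables[].Routes in an EC2 DescribeRouteTables response.
    """
    for route in routes:
        # Only the IPv4 default route decides internet reachability here.
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        gateway_id = route.get("GatewayId") or ""
        if gateway_id.startswith("igw-"):
            # Default route goes straight to an internet gateway.
            return "public"
        if route.get("NatGatewayId"):
            # Default route goes through a NAT gateway.
            return "private-with-nat"
    # No default route towards an internet or NAT gateway was found.
    return "isolated"


# Hypothetical route table of a private subnet: local route,
# S3 gateway endpoint route, and a default route to a NAT gateway.
example_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationPrefixListId": "pl-00000000", "GatewayId": "vpce-00000000"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-00000000"},
]
print(classify_default_route(example_routes))  # prints "private-with-nat"
```

A subnet classified as "public" here is one where an awsvpc task on an EC2 instance would sit behind an internet gateway without a public IP of its own.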

Below you can see some of the relevant Terraform configuration for the cluster (for simplicity, I have included only the relevant parts):


# Launch template
resource "aws_launch_template" "train-launch-template" {
  name_prefix   = "${var.project_name}-launch-template-${var.env}"
  image_id      = "ami-01f62a207c1d180d2"
  instance_type = "m5.large"
  key_name      = "XXXXXX"
  iam_instance_profile {
    name = aws_iam_instance_profile.ecs-instance-profile.name
  }
  user_data = base64encode(data.template_file.user_data.rendered)

  network_interfaces {
    associate_public_ip_address = false
    security_groups = [aws_security_group.ecs_service.id]
  }
}


# Task definition
resource "aws_ecs_task_definition" "task" {
  family                   = "${var.project_name}-${var.env}-train-task"
  execution_role_arn       = data.aws_iam_role.ecs_task_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_train_task_role.arn
  requires_compatibilities = ["EC2"]
  cpu                      = var.ecs_cpu
  network_mode             = "awsvpc"
  memory                   = var.ecs_memory
  container_definitions    = data.template_file.app_definition.rendered

  tags = {
    Stage   = var.env_tag
    Project = var.project_name_tag
  }
}


# Cluster
resource "aws_ecs_cluster" "cluster" {
  name = "${var.project_name}-${var.env}-train-ecs-cluster"
  capacity_providers = [aws_ecs_capacity_provider.train-capacity-provider.name]
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.train-capacity-provider.name
  }
  tags = {
    Project = var.project_name_tag
    Stage   = var.env_tag
  }
}

We have also configured all the roles that the instances and the tasks need to access the required resources (S3, ECR, ECS).

The AMI is an ECS-optimized image (the latest version currently published in eu-west-1).

Following the explanation in this link, we have removed the public IP from the instances in the launch template.

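We have gone through several iterations of the configuration trying to make it work, but we keep hitting the same problem: when a task is triggered, the capacity provider launches an instance, but the task is never placed on the container instance and stays in the PROVISIONING state indefinitely.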

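With the same configuration, but with the instances placed in a public subnet, the task is placed on the container instance. However, as the first link warns, the task then has no access to the internet.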

We need some light or a lead to follow. Thanks in advance.

Update: As requested, I have added the remaining parts related to auto scaling.

resource "aws_autoscaling_group" "train-autoscaling" {
  availability_zones = ["eu-west-1b"]
  desired_capacity   = 0
  max_size           = 10
  min_size           = 0
  protect_from_scale_in = true
  

  launch_template {
    id      = aws_launch_template.train-launch-template.id
    version = "$Latest"
  }

  tags = [
    {
      key = "Project",
      value = var.project_name_tag
      propagate_at_launch = true
    },
    {
      key = "Stage",
      value = var.env_tag
      propagate_at_launch = true
    }
  ]
}

resource "aws_ecs_capacity_provider" "train-capacity-provider" {
  name = "${var.project_name}-${var.env}-train-capacity-provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.train-autoscaling.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 100
      maximum_scaling_step_size = 1
      minimum_scaling_step_size = 1
    }
  }
}

data "template_file" "user_data" {
  template = file("${path.module}/user_data.sh")

  vars = {
    cluster_name = "${var.project_name}-${var.env}-train-ecs-cluster"
  }
}

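Update 2 (AWS console information):

Container instances running: (screenshot not included)

Container instance details: (screenshot not included)

Pending task: (screenshot not included)

Pending task details: (screenshot not included)

Update 3:

After 30 minutes the task stops and a message is shown saying that the task could not start: (screenshot not included)

Update 4:

Logs from the container instance.

ecs-agent.log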

level=info time=2020-08-28T11:09:21Z msg="Loading configuration" module=agent.go
level=info time=2020-08-28T11:09:21Z msg="Amazon ECS agent Version: 1.44.1, Commit: 1f05fbf0" module=agent.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-agent:latest" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Creating root ecs cgroup: /ecs" module=init_linux.go
level=info time=2020-08-28T11:09:21Z msg="Creating cgroup /ecs" module=cgroup_controller_linux.go
level=info time=2020-08-28T11:09:21Z msg="Event stream ContainerChange start listening..." module=eventstream.go
level=info time=2020-08-28T11:09:21Z msg="Loading state!" module=state_manager.go
level=info time=2020-08-28T11:09:23Z msg="Registering Instance with ECS" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Remaining mem: 7680" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Registered container instance with cluster!" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Registration completed successfully. I am running as 'arn:aws:ecs:eu-west-1:XXXXXXXXXXXXXXXX:container-instance/foqum-read-dev-train-ecs-cluster/95559f936f8d44de9373595009fcd588' in cluster 'foqum-read-dev-train-ecs-cluster'" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Beginning Polling for updates" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Initializing stats engine" module=engine.go
level=info time=2020-08-28T11:09:23Z msg="Event stream DeregisterContainerInstance start listening..." module=eventstream.go
level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXXX-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXXXX%3Acontainer-instance%2FXXXXXXXX-cluster%2F95559fXXXXXXde9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
level=info time=2020-08-28T11:09:23Z msg="NO_PROXY set:XXX.254.169.XXXX,XXXX.254.XXX.2,/var/run/docker.sock" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-a-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&clusterArn=XXXXX-ecs-cluster&containerInstanceArn=arn%3Aaws%3Aecs%3Aeu-west-1%XXXXXX%3Acontainer-instance%2FXXXXX-ecs-cluster%2F9XXXXX6f8d44de9373595009fcd588&dockerVersion=DockerVersion%3A+19.03.6-ce&sendCredentials=true&seqNum=1" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Connected to TCS endpoint" module=handler.go
level=info time=2020-08-28T11:09:23Z msg="Connected to ACS endpoint" module=acs_handler.go
level=info time=2020-08-28T11:20:04Z msg="TCS Websocket connection closed for a valid reason" module=handler.go
level=info time=2020-08-28T11:20:04Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXecs-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXX3Acontainer-instance%2FZZZXXXXX-ecs-cluster%2F95XXX936f8d44de9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
level=info time=2020-08-28T11:20:04Z msg="Connected to TCS endpoint" module=handler.go

ecs-init.log

2020-08-28T11:09:19Z [INFO] pre-start
2020-08-28T11:09:20Z [INFO] start
2020-08-28T11:09:20Z [INFO] No existing agent container to remove.
2020-08-28T11:09:20Z [INFO] Starting Amazon Elastic Container Service Agent

1 answer:

Answer 0 (score: 2)

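Finally! Mystery solved!

The problem was not in the cluster configuration. When you call run_task through the ECS API, you need to specify the subnets in which the task should run.

Our code was setting this field to one of the public subnets. That is why the task was placed once we moved the container instances to the availability zone of that public subnet.

Changing this call in our code so that it passes the private subnet made the tasks be placed correctly, with access to the internet.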
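For illustration, here is a minimal sketch of what such a call might look like with boto3. Our actual code is not shown here, so the helper names and every ID in the usage note are placeholders. The sketch assumes the cluster has a default capacity provider strategy, as in the question, so no launch type is passed.

```python
def build_run_task_args(cluster, task_definition, private_subnet_ids, security_group_ids):
    """Build keyword arguments for the ECS RunTask call (boto3: ecs.run_task).

    The helper and its parameter names are hypothetical; only the keys of
    the returned dict follow the RunTask request shape.
    """
    if not private_subnet_ids:
        raise ValueError("at least one private subnet ID is required")
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        # No launchType here: the cluster's default capacity provider
        # strategy is then used to place the task.
        "networkConfiguration": {
            "awsvpcConfiguration": {
                # Subnets for the task's network interface. These should be
                # the private subnets (routed through the NAT gateway), in
                # an availability zone where container instances can run.
                "subnets": list(private_subnet_ids),
                "securityGroups": list(security_group_ids),
            }
        },
    }


def run_task_in_private_subnets(ecs_client, cluster, task_definition,
                                private_subnet_ids, security_group_ids):
    """Call run_task on an ECS client, e.g. boto3.client("ecs")."""
    args = build_run_task_args(
        cluster, task_definition, private_subnet_ids, security_group_ids
    )
    return ecs_client.run_task(**args)
```

With a real client this could be used as `run_task_in_private_subnets(boto3.client("ecs", region_name="eu-west-1"), "my-cluster", "my-task", ["subnet-PRIVATE"], ["sg-XXXXXX"])`, where all names and IDs are placeholders. The `failures` list in the response is worth checking, since placement problems are reported there rather than raised as exceptions.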