Question

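I am using Terraform to create an EMR cluster (emr-5.24.0) and deploy it into a private subnet. The cluster includes Spark, Hive and JupyterHub.

I added an extra configuration JSON to the deployment, which should persist the Jupyter notebooks to S3 (instead of storing them locally on disk).

The overall architecture includes a VPC endpoint for S3, and I am able to access the bucket that I want to write the notebooks to.

Once the cluster is provisioned, the JupyterHub server fails to start.

Logging in to the master node and trying to start/restart the jupyterhub Docker container does not help.

The configuration for this persistence looks like this: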

[
  {
    "Classification": "jupyter-s3-conf",
    "Properties": {
      "s3.persistence.enabled": "true",
      "s3.persistence.bucket": "${project}-${suffix}"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]

It is then referenced in the Terraform EMR resource definition:

configurations         = "${data.template_file.configuration.rendered}"

The rendered value comes from:

data "template_file" "configuration" {
  template = "${file("${path.module}/templates/cluster_configuration.json.tpl")}"

  vars = {
    project  = "${var.project_name}"
    suffix   = "bucket"
  }
}

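When I do not use persistence for the notebooks, everything works fine and I am able to log in to JupyterHub.

I am fairly sure this is not an IAM policy issue, since the "Allow" action in the EMR cluster role policy is defined as "s3:*".

Are there any additional steps needed to make this work?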

/ K

Answer 1

It seems that Jupyter on EMR uses S3ContentsManager to connect to S3.

https://github.com/danielfrg/s3contents

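I dug into S3ContentsManager a bit and found the S3 endpoint it uses, which is the public endpoint (as expected). Because that endpoint is public, Jupyter needs Internet access to reach it, but you are running EMR in a private subnet, so I guess it cannot connect to the endpoint.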

You may need to use a NAT gateway in a public subnet, or create an S3 endpoint for the VPC.
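For the second option, a gateway endpoint for S3 could be declared in Terraform roughly as follows. This is only a minimal sketch: `var.vpc_id`, `var.aws_region` and `var.private_route_table_id` are hypothetical variable names that do not appear in the question, and the route table has to be the one associated with the private subnet the cluster runs in.

```hcl
# Sketch of a gateway VPC endpoint for S3 (all variable names are assumptions).
resource "aws_vpc_endpoint" "s3" {
  # Hypothetical variable: the VPC that contains the private subnet.
  vpc_id = "${var.vpc_id}"

  # S3 service name for the region, e.g. com.amazonaws.eu-west-1.s3
  service_name = "com.amazonaws.${var.aws_region}.s3"

  # Hypothetical variable: route table used by the private subnet,
  # so that S3 traffic from the cluster is routed through the endpoint.
  route_table_ids = ["${var.private_route_table_id}"]
}
```

If the endpoint already exists, as in the question, it may be worth checking that its route table is actually the one attached to the EMR subnet.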

Answer 2

Yes. We ran into this problem as well. Add an S3 VPC endpoint, and then, as advised by AWS Support:

Add a JupyterHub notebook configuration:

{
  "Classification": "jupyter-notebook-conf",
  "Properties": {
    "config.S3ContentsManager.endpoint_url": "\"https://s3.${aws_region}.amazonaws.com\"",
    "config.S3ContentsManager.region_name": "\"${aws_region}\""
  }
},
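If this block goes into the same `cluster_configuration.json.tpl` template as in the question, `${aws_region}` has to be supplied as a template variable, otherwise the template will not render. A minimal sketch, assuming a variable named `var.aws_region` (hypothetical, not part of the original question) that holds the cluster's region:

```hcl
data "template_file" "configuration" {
  template = "${file("${path.module}/templates/cluster_configuration.json.tpl")}"

  vars = {
    project = "${var.project_name}"
    suffix  = "bucket"

    # Assumption: var.aws_region is a hypothetical variable with the
    # region of the cluster; it fills ${aws_region} in the template.
    aws_region = "${var.aws_region}"
  }
}
```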

hth

JupyterHub server fails to start in a Terraformed EMR cluster running in a private subnet
