AWS Glue job fails with a connection timeout error

Time: 2021-04-29 15:42:38

Tags: amazon-web-services aws-glue

I'm new to AWS Glue. I created a job that uses two Data Catalog tables and runs a simple SparkSQL query on top of them. The job fails at the Transform step with this exception:

pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to glue.us-east-1.amazonaws.com:443 [blah] failed: connect timed out;'

The VPC security group of the JDBC source (Redshift) has both inbound and outbound rules configured.

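I saw another post on SO about configuring a VPC endpoint for Glue itself, but I don't quite understand what it should look like. Should it connect to glue.us-east-1.amazonaws.com:443, or to something else? I'm confused.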

UPD: the auto-generated pyspark script

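import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]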
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0")
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1"]
## @return: DataSource1
## @inputs: []
DataSource1 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1")
## @type: SqlCode
## @args: [sqlAliases = {"messages": DataSource1, "conversations": DataSource0}, sqlName = SqlQuery0, transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [dfc = DataSource1,DataSource0]
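# sparkSqlQuery and SqlQuery0 are not defined in this excerpt; the full script must define them before this line.
Transform0 = sparkSqlQuery(glueContext, query = SqlQuery0, mapping = {"messages": DataSource1, "conversations": DataSource0}, transformation_ctx = "Transform0")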
job.commit()

2 answers:

Answer 0 (score: 0)

You need to add a Glue Connection so the job can connect to your Redshift cluster. Make sure this Glue connection is in a private subnet.
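For reference, here is a minimal sketch of what such a connection might look like if created with boto3 instead of the console. Every value passed in (connection name, JDBC URL, credentials, subnet ID, security group IDs, availability zone) is a hypothetical placeholder, not something taken from the question, and the helper names are made up for illustration.

```python
def build_connection_input(name, jdbc_url, username, password,
                           subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput payload for a JDBC Glue connection.

    All arguments are placeholders supplied by the caller.
    """
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # The subnet given here is the one the Glue job's network
        # interfaces are placed in, so it should be the private subnet.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(connection_input, region="us-east-1"):
    """Send the payload to Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue", region_name=region)
    glue.create_connection(ConnectionInput=connection_input)
```

The connection then has to be attached to the job for it to take effect. In a real setup the password would normally come from a secret store rather than being passed in plain text.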

Answer 1 (score: 0)

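I was able to resolve the issue; a VPC endpoint was indeed required. In addition, the connection should use a private subnet with a NAT gateway. My initial subnet did not have NAT.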
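A quick way to check whether a given subnet can reach the Glue API at all is a plain TCP probe, run from something inside that subnet (for example an EC2 instance, which is my assumption about where you would run it). This is only a sketch using the standard library; `can_connect` is a hypothetical helper name, and a successful probe only shows that the port is reachable, not that the job is otherwise configured correctly.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


# Example call, using the host from the error message in the question:
#   can_connect("glue.us-east-1.amazonaws.com")
```

If this returns False from the subnet the Glue connection uses, the job will most likely hit the same "connect timed out" error.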

Example VPC endpoint configuration in Terraform:

resource "aws_vpc_endpoint" "glue" {
  vpc_id            = var.vpc_id
  service_name      = var.glue_vpc_service_name
  vpc_endpoint_type = "Interface"

  security_group_ids = var.security_group_ids
  subnet_ids         = var.subnet_ids

  tags = { mytag = "mytag" }
}
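For anyone not using Terraform, roughly the same endpoint can be sketched with boto3. The IDs are placeholders and the helper names are hypothetical. The service name follows the com.amazonaws.<region>.glue pattern, which is presumably what var.glue_vpc_service_name holds above. Note that PrivateDnsEnabled is my addition and is not set in the Terraform snippet: with private DNS enabled (and DNS support and hostnames turned on in the VPC), glue.us-east-1.amazonaws.com resolves to the endpoint's private addresses, so the job does not need a different URL.

```python
def build_endpoint_request(vpc_id, region, subnet_ids, security_group_ids):
    """Build keyword arguments for ec2.create_vpc_endpoint (Glue interface endpoint)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: not present in the Terraform example above.
        "PrivateDnsEnabled": True,
    }


def create_endpoint(request, region):
    """Create the endpoint. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(**request)
```

The security group attached to the endpoint needs to allow inbound HTTPS (port 443) from the subnet the Glue job runs in.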