I'm new to AWS Glue. I created a job that uses two Data Catalog tables and runs a simple SparkSQL query on top of them. The job fails at the Transform step with the exception:
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to glue.us-east-1.amazonaws.com:443 [blah] failed: connect timed out;'
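The VPC security group of the JDBC source (Redshift) has both inbound and outbound rules configured.
I saw another post on SO about configuring a VPC endpoint for Glue itself, but I don't quite understand what it should look like. Should it connect to glue.us-east-1.amazonaws.com:443, or to something else? I'm confused.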
UPD: the auto-generated PySpark script
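# The imports below were not part of the original paste; they are the
# standard header of a Glue-generated job script.
# sparkSqlQuery and SqlQuery0 (used further down) were also not included in the post.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [TempDir, JOB_NAME]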
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0")
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1"]
## @return: DataSource1
## @inputs: []
DataSource1 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1")
## @type: SqlCode
## @args: [sqlAliases = {"messages": DataSource1, "conversations": DataSource0}, sqlName = SqlQuery0, transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [dfc = DataSource1,DataSource0]
Transform0 = sparkSqlQuery(glueContext, query = SqlQuery0, mapping = {"messages": DataSource1, "conversations": DataSource0}, transformation_ctx = "Transform0")
job.commit()
Answer 0 (score: 0)
You need to add a Glue Connection so the job can connect to your Redshift cluster. You must make sure this Glue connection is in a private subnet.
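Answer 1 (score: 0)
I was able to solve this; a VPC endpoint was indeed required. In addition to the connection, a private subnet with a NAT gateway should be used. My original subnet had no NAT.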
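If you want to check whether the subnet can reach the Glue API at all, a quick TCP probe can help. This is only a minimal sketch: `can_reach` is a helper name I made up, and it tests nothing beyond whether a TCP connection to the host and port from the exception opens before the timeout.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False
```

Calling `can_reach("glue.us-east-1.amazonaws.com")` from something running in the same subnet and security group as the job (for example at the top of the job script itself) should return False while the subnet has neither a NAT route nor a Glue endpoint, and True once one of them is in place.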
Example VPC endpoint configuration in Terraform:
resource "aws_vpc_endpoint" "glue" {
  vpc_id             = var.vpc_id
  service_name       = var.glue_vpc_service_name
  vpc_endpoint_type  = "Interface"
  security_group_ids = var.security_group_ids
  subnet_ids         = var.subnet_ids
  tags               = { mytag = "mytag" }
}
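For anyone not using Terraform, roughly the same endpoint can be sketched with boto3. Treat this as an illustration, not the exact setup from the answer: the helper names and the placeholder IDs are hypothetical, the service name follows the `com.amazonaws.<region>.glue` pattern that the Terraform variable above hides, and `PrivateDnsEnabled` is my own addition (it is not in the Terraform snippet) so that glue.us-east-1.amazonaws.com resolves to the endpoint inside the VPC.

```python
def glue_endpoint_params(vpc_id, subnet_ids, security_group_ids, region="us-east-1"):
    """Build the keyword arguments for EC2 create_vpc_endpoint (Glue interface endpoint)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Glue's endpoint service name for the given region.
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption, not in the Terraform above: lets the default Glue
        # hostname resolve to the endpoint's private IPs.
        "PrivateDnsEnabled": True,
    }


def create_glue_endpoint(params, region="us-east-1"):
    """Create the endpoint; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(**params)
```

Usage would look like `create_glue_endpoint(glue_endpoint_params("vpc-...", ["subnet-..."], ["sg-..."]))` with your own IDs. The security group attached to the endpoint has to allow inbound HTTPS (443) from the subnet the Glue job runs in.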