Question

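I have been working through the AWS Glue tutorial (https://docs.aws.amazon.com/glue/latest/dg/getting-started.html) and am now trying to configure my first job, which is meant to copy all the data from an RDS table into Parquet files on S3.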

I have successfully:

Created an S3 VPC endpoint
Created a Glue RDS connection and crawler
Added the RDS table metadata to the catalog

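To create my job, I:

Selected "Add job" from the Glue dashboard
Gave the job a name, assigned it the same role as the RDS connection above (since that role has the AWSGlueServiceRole policy attached), selected "A proposed script generated by AWS Glue", and left the other fields at their defaults.
Selected the desired RDS table from the catalog as the source. For the output, selected "Create tables in your data target", with S3 as the data store, Parquet as the format, and a newly created S3 output folder as the target: 'aws-glue-test-etl / data'
After clicking "Next", left all field mappings at their defaults.
Saved the job and edited the script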

When I run the job with the defaults, I get the following log output:

--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000 --conf spark.hadoop.fs.defaultFS=hdfs://ip-10-0-1-88.eu-west-1.compute.internal:8020 --conf spark.hadoop.yarn.resourcemanager.address=ip-10-0-1-88.eu-west-1.compute.internal:8032 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=18 --conf spark.executor.memory=5g --conf spark.executor.cores=4 --JOB_ID j_20380e2f5d565a53d8bd397904dd210cbca826f3825ae8ff6b5a23e8f7bca45d --JOB_RUN_ID jr_6d60e2930a43a06edf6b6e8307171e88bd754ac5f9e66f2eaf5373e570b61280 --scriptLocation s3://aws-glue-scripts-558091818291-eu-west-1/MarcFletcher/UpdateAccountsExport-py --job-bookmark-option job-bookmark-disable --job-language python --TempDir s3://aws-glue-temporary-558091818291-eu-west-1/MarcFletcher --JOB_NAME UpdateAccountsExport-py

YARN_RM_DNS=ip-10-0-1-88.eu-west-1.compute.internal

Detected region eu-west-1

JOB_NAME = UpdateAccountsExport-py

Specifying eu-west-1 while copying script.

S3 copy with region specified failed. Falling back to not specifying region.

And the following error output:

fatal error: HTTPSConnectionPool(host='aws-glue-scripts-558091818291-eu-west-1.s3.eu-west-1.amazonaws.com', port=443): Max retries exceeded with url: /MarcFletcher/UpdateAccountsExport-py (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x7f9b11afbf10>, 'Connection to aws-glue-scripts-558091818291-eu-west-1.s3.eu-west-1.amazonaws.com timed out. (connect timeout=60)'))

Error downloading script: fatal error: HTTPSConnectionPool(host='aws-glue-scripts-558091818291-eu-west-1.s3.eu-west-1.amazonaws.com', port=443): Max retries exceeded with url: /MarcFletcher/UpdateAccountsExport-py (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x7fe752548f10>, 'Connection to aws-glue-scripts-558091818291-eu-west-1.s3.eu-west-1.amazonaws.com timed out. (connect timeout=60)'))

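I have gone through the troubleshooting guide (https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html) but did not find any likely solution there. The automatically detected region, eu-west-1, is correct.

If anyone can point out where the job is going wrong, I would be very grateful.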

Answer 1

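This is most likely a security group port-blocking issue.

Check the egress rules of the AWS security group attached to your Glue connection, and allow TCP on port 443 to all destinations.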
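The egress check described above can be sketched in Python. This is only an illustration: the helper names `allows_outbound_https` and `fetch_security_group` are made up for this example, and the check only looks for rules open to 0.0.0.0/0 (it ignores rules that target a prefix list or another security group).

```python
def allows_outbound_https(security_group):
    """Return True if any egress rule permits TCP 443 to 0.0.0.0/0.

    `security_group` is one entry from the "SecurityGroups" list returned by
    the EC2 describe_security_groups call.
    """
    for rule in security_group.get("IpPermissionsEgress", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            port_ok = rule.get("FromPort", 0) <= 443 <= rule.get("ToPort", 65535)
        else:
            port_ok = False
        open_to_all = any(
            ip_range.get("CidrIp") == "0.0.0.0/0"
            for ip_range in rule.get("IpRanges", [])
        )
        if port_ok and open_to_all:
            return True
    return False


def fetch_security_group(group_id, region):
    """Hypothetical helper: fetch one security group (needs AWS credentials)."""
    import boto3  # imported lazily so the check above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_security_groups(GroupIds=[group_id])
    return response["SecurityGroups"][0]
```

For example, `allows_outbound_https(fetch_security_group(group_id, "eu-west-1"))` would report whether the group attached to the Glue connection lets HTTPS traffic out, where `group_id` is the ID of that group.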

Answer 2

It is important to have an S3 endpoint in the subnet's route table.

https://docs.aws.amazon.com/glue/latest/dg/start-development-endpoint.html https://github.com/awsdocs/aws-glue-developer-guide/blob/master/doc_source/vpc-endpoints-s3.md
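One way to confirm that a route table actually contains a gateway endpoint route is sketched below. The helper names are hypothetical. The check relies on gateway endpoint routes having a prefix list as destination and a `GatewayId` beginning with `vpce-`. It does not confirm that the endpoint is for S3 rather than another service, and a subnet with no explicit association uses the VPC's main route table, which this lookup would not return.

```python
def has_vpc_endpoint_route(route_table):
    """Return True if the route table has a route through a gateway VPC endpoint.

    `route_table` is one entry from the "RouteTables" list returned by the
    EC2 describe_route_tables call.
    """
    for route in route_table.get("Routes", []):
        via_endpoint = route.get("GatewayId", "").startswith("vpce-")
        to_prefix_list = "DestinationPrefixListId" in route
        if via_endpoint and to_prefix_list:
            return True
    return False


def route_tables_for_subnet(subnet_id, region):
    """Hypothetical helper: route tables explicitly associated with a subnet."""
    import boto3  # imported lazily so the check above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    return response["RouteTables"]
```

If `route_tables_for_subnet(...)` returns tables and none of them passes `has_vpc_endpoint_route`, the subnet used by the Glue connection probably has no route to the S3 endpoint.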

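In addition, I found that I had to specify the region when setting up the boto3 resource.

I could not find this documented anywhere, nor could I find any related boto.config file.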

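import boto3

# Pass the region explicitly instead of relying on a default being configured
s3 = boto3.resource('s3', region_name='ap-southeast-2')
s3_object = s3.Object('bucket_name', 'file_key.txt')
file_contents = s3_object.get()['Body'].read()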

Answer 3

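Once the VPC endpoint is set up, keep in mind that it can only route traffic within a single AWS region. This means the S3 bucket you are trying to access must be in the same region as the AWS Glue resources involved (in particular the S3 VPC endpoint).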
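A minimal sketch of that region comparison follows. The function names are invented for this example. It relies on the S3 `get_bucket_location` call returning a `LocationConstraint` of `None` for buckets in us-east-1 and the legacy value `EU` for some older eu-west-1 buckets.

```python
def bucket_region(location_constraint):
    """Map an S3 LocationConstraint value to a region name."""
    if location_constraint is None:
        return "us-east-1"  # S3 reports no constraint for us-east-1
    if location_constraint == "EU":
        return "eu-west-1"  # legacy value for older eu-west-1 buckets
    return location_constraint


def same_region(location_constraint, glue_region):
    """Return True if the bucket lives in the region the Glue job runs in."""
    return bucket_region(location_constraint) == glue_region


def fetch_location_constraint(bucket_name):
    """Hypothetical helper: look up the bucket location (needs AWS credentials)."""
    import boto3  # imported lazily so the checks above work without boto3

    response = boto3.client("s3").get_bucket_location(Bucket=bucket_name)
    return response["LocationConstraint"]
```

In the question's setup, the script bucket name already ends in eu-west-1 and the job detected eu-west-1, so `same_region` would be expected to return True there; the check matters more when buckets were created in a different region than the endpoint.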

Answer 4

I used the default security group, which allows all TCP to 0.0.0.0/0 and HTTPS on 443, and the job then failed.

Answer 5

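Check in Redshift whether Enhanced VPC Routing is disabled.

Open the Redshift cluster -> Properties / Network and security settings / Edit, and disable Enhanced VPC Routing.

AWS: "Enabling this option forces network traffic between your cluster and data repositories through a VPC instead of the internet."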
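The same setting can also be inspected and changed through boto3 rather than the console. The sketch below assumes a Redshift cluster is part of the setup, which the original question does not mention, and the helper names are made up for illustration.

```python
def clusters_with_enhanced_vpc_routing(describe_clusters_response):
    """List cluster identifiers that have Enhanced VPC Routing turned on.

    `describe_clusters_response` is the dict returned by the Redshift
    describe_clusters call.
    """
    return [
        cluster["ClusterIdentifier"]
        for cluster in describe_clusters_response.get("Clusters", [])
        if cluster.get("EnhancedVpcRouting")
    ]


def disable_enhanced_vpc_routing(cluster_id, region):
    """Hypothetical helper: turn the option off (needs AWS credentials)."""
    import boto3  # imported lazily so the check above works without boto3

    redshift = boto3.client("redshift", region_name=region)
    redshift.modify_cluster(ClusterIdentifier=cluster_id, EnhancedVpcRouting=False)
```

Passing the result of `boto3.client("redshift").describe_clusters()` to the first function shows which clusters would be affected before anything is modified.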

Glue job fails due to S3 timeout
