AWS Glue job error when partitioning a large file

Date: 2019-10-17 09:23:51

Tags: pyspark aws-glue

I have this Python Glue script:

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import col,year,month,dayofmonth,to_date,from_unixtime
from awsglue.dynamicframe import DynamicFrame

# .... default glue stuff

df = dropnullfields3.toDF()

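# derive year/month/day partition columns from the Unix 'timestamp' column,
# then collapse everything into a single partition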
new_df = df.withColumn('datestamp', to_date(from_unixtime(col('timestamp')))) \
    .withColumn('year', year(col('datestamp'))) \
    .withColumn('month', month(col('datestamp'))) \
    .withColumn('day', dayofmonth(col('datestamp'))) \
    .drop(col('datestamp')) \
    .repartition(1)

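# convert back to a DynamicFrame so it can be written with the Glue S3 sink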
dynamic_frame = DynamicFrame.fromDF(new_df, glueContext, 'enriched')

## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://bucket/dms/folder"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame=dynamic_frame,
                                                         connection_type="s3",
                                                         connection_options={
                                                             "path": "s3://bucket/dms/folder",
                                                             "partitionKeys": ["year", "month", "day"]
                                                         },
                                                         format="parquet",
                                                         transformation_ctx="datasink4")
job.commit()

I am trying to partition 17 GB of parquet files (17 files of at most 1.2 GB each, generated by AWS DMS from a MySQL database) by the year, month, and day extracted from the timestamp column.

When I run it with 5 DPUs, I get the following:

An error occurred while calling o114.pyWriteDynamicFrame. 
Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 142, <REDACTED>.compute.internal, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. 
Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
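The message itself suggests boosting spark.yarn.executor.memoryOverhead. For reference only (this is not something the post reports trying, and how to pass Spark configuration into a Glue job is not covered here), in plain PySpark that setting has to be applied before the SparkContext is created, e.g.:

from pyspark import SparkConf
from pyspark.context import SparkContext

# Sketch, assuming a standalone PySpark job rather than Glue's pre-built context:
# raise the off-heap overhead that YARN's physical-memory check accounts for.
# The legacy key 'spark.yarn.executor.memoryOverhead' takes a value in MiB;
# Spark 2.3+ prefers 'spark.executor.memoryOverhead'.
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "2048")
sc = SparkContext(conf=conf)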

Then I tried raising it to 20 DPUs:

An error occurred while calling o113.pyWriteDynamicFrame. 
Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: 0E....135; S3 Extended Request ID: FhR...Is=)

How can I partition a table this large? This isn't even the largest of the tables we have to partition.
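For reference, a minimal sketch of one commonly suggested alternative (again, not something the post reports trying): repartition by the partition columns instead of into a single partition, so that no single task has to hold the whole ~17 GB and each (year, month, day) folder still ends up with roughly one output file:

new_df = df.withColumn('datestamp', to_date(from_unixtime(col('timestamp')))) \
    .withColumn('year', year(col('datestamp'))) \
    .withColumn('month', month(col('datestamp'))) \
    .withColumn('day', dayofmonth(col('datestamp'))) \
    .drop(col('datestamp')) \
    .repartition('year', 'month', 'day')  # hash-partition by the write keys instead of forcing one partition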

0 Answers

There are no answers yet.