Moving large tables to Redshift with AWS Glue

Date: 2018-09-20 13:33:01

Tags: amazon-web-services connection-timeout aws-glue

I have the script below, which moves all columns from tables of varying sizes (90 million to 250 million records, from an internal Oracle database) into AWS Redshift. The script also appends a few audit columns:

add_metadata1 = custom_spark_df.withColumn('line_number', F.row_number().over(Window.orderBy(lit(1))))
add_metadata2 = add_metadata1.withColumn('source_system', lit(source_system))
add_metadata3 = add_metadata2.withColumn('input_filename', lit(input_filename))
add_metadata4 = add_metadata3.withColumn('received_timestamp', lit(received_timestamp))
add_metadata5 = add_metadata4.withColumn('received_timestamp_unix', lit(received_timestamp_unix))
add_metadata6 = add_metadata5.withColumn('eff_data_date', lit(eff_data_date))

Currently, the job runs so long that it hits a connection timeout after 3-5 hours and never completes:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## Start - Custom block of imports ##
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import datetime 
from pyspark.sql.functions import lit
## End - Custom block of imports ##

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "metadatastore", table_name = "TableName", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("...MAPPINGS OUTLINED...")], transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

## Start - Custom block for creation of metadata columns ##
now = datetime.datetime.now()

##line_number = '1'
## Remember to update source_system (if needed) and input_filename
source_system = 'EDW'
input_filename = 'TableName' 
received_timestamp = datetime.datetime.strptime(now.strftime("%Y-%m-%d %H:%M:%S"), "%Y-%m-%d %H:%M:%S")

received_timestamp_unix = int((now - datetime.datetime(1970,1,1)).total_seconds())

eff_data_date = datetime.datetime.strptime(now.strftime("%Y-%m-%d"), "%Y-%m-%d").date()

## Update to the last dataframe used
## Do not forget to update write_dynamic_frame to use custom_dynamic_frame for the frame name and add schema to the dbtable name
custom_spark_df = dropnullfields3.toDF()
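## Note: the row_number() below runs over a window with an orderBy but no partitionBy,
## which pulls every row into a single partition before numbering them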

add_metadata1 = custom_spark_df.withColumn('line_number', F.row_number().over(Window.orderBy(lit(1))))
add_metadata2 = add_metadata1.withColumn('source_system', lit(source_system))
add_metadata3 = add_metadata2.withColumn('input_filename', lit(input_filename))
add_metadata4 = add_metadata3.withColumn('received_timestamp', lit(received_timestamp))
add_metadata5 = add_metadata4.withColumn('received_timestamp_unix', lit(received_timestamp_unix))
add_metadata6 = add_metadata5.withColumn('eff_data_date', lit(eff_data_date))

custom_dynamic_frame = DynamicFrame.fromDF(add_metadata6, glueContext, "add_metadata6")
## End - Custom block for creation of metadata columns ##

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = custom_dynamic_frame, catalog_connection = "Redshift", connection_options = {"dbtable": "schema_name.TableName", "database": "dev"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

How can I improve this script to reduce the runtime and allow it to run to completion?

1 Answer:

Answer 0 (score: 2):

I agree with Soren. I think you are better off creating a CSV dump, gzipping it, and putting it into S3. Once the files are in S3, you can also use Glue to convert them to Parquet format. For a one-time unload, this approach will be faster.
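As a rough sketch (not from the original answer) of that CSV-to-Parquet step, a Glue job's Spark session could do the conversion along these lines; the bucket and paths are placeholders:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the gzipped CSV dump uploaded to S3 (Spark decompresses .gz files transparently).
csv_df = spark.read.csv("s3://your-bucket/csv-dump/TableName/", header=True, inferSchema=True)

# Rewrite the data as Parquet so downstream loads and queries scan far less data.
csv_df.write.mode("overwrite").parquet("s3://your-bucket/parquet/TableName/")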

To have the AWS Glue code write from the source to S3 instead, you only need to change the second-to-last line of the code, the one that does the writing. Use something like the following, where s3_output is the target S3 path:

datasink4 = glueContext.write_dynamic_frame.from_options(frame = custom_dynamic_frame, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")
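Once the Parquet (or gzipped CSV) files are in S3, the final load into Redshift is usually done with a COPY command, which pulls the files in parallel and scales much better than row-by-row JDBC inserts. Below is a minimal sketch of that step (not part of the original answer), assuming a Redshift-attached IAM role and placeholder connection details:

import psycopg2  # assumes a Postgres/Redshift driver is available

# COPY loads the S3 files in parallel across the Redshift slices.
copy_sql = """
    COPY schema_name.TableName
    FROM 's3://your-bucket/parquet/TableName/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                        dbname="dev", user="awsuser", password="...")
try:
    with conn.cursor() as cur:
        cur.execute(copy_sql)
    conn.commit()
finally:
    conn.close()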