Converting a huge CSV to partitioned Parquet using Glue

Date: 2019-09-04 14:14:26

Tags: csv amazon-s3 parquet amazon-athena aws-glue

I have an Athena CSV table partitioned by month, and I would like to use AWS Glue to convert it to Parquet partitioned by day.

Here is my source table in Athena:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.csv_table (
col1 STRING,
col2 STRING,
event_time_stamp STRING
)
PARTITIONED BY (month string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"", "escapeChar" = "\\" )
LOCATION 's3://mybucket/test/csv/'

S3 has the following structure, partitioned by month:

s3://mybucket/test/csv/month=01/somefile.gz
s3://mybucket/test/csv/month=02/somefile.gz

I want to convert the above CSV to Parquet with the table below, partitioned by day instead of month; the day partition needs to be derived from the event_time_stamp column (an example of the layout I am after is shown after the table definition).

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.parquet_table (
col1 STRING,
col2 STRING
)
PARTITIONED BY (day string)
STORED AS PARQUET
LOCATION 's3://mybucket/test/parquet/'
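
To illustrate, the layout I am hoping for in S3 would be something like this (the dates and file names are only examples; the day values would come from event_time_stamp):

s3://mybucket/test/parquet/day=2019-01-15/part-00000.snappy.parquet
s3://mybucket/test/parquet/day=2019-01-16/part-00000.snappy.parquet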

I tried the following script for the conversion, but it does not seem to work.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="csv_table", 
    transformation_ctx="datasource0"
)
applymapping1 = ApplyMapping.apply(
    frame=datasource0, 
    mappings=[
        ("col1", "string", "col1", "string"), 
        ("col2", "string", "col2", "string"), 
        ("month", "string", "month", "string")
    ], 
    transformation_ctx = "applymapping1"
)
selectfields2 = SelectFields.apply(
    frame=applymapping1, 
    paths=["col1", "col2"],
    transformation_ctx="selectfields2"
)
resolvechoice3 = ResolveChoice.apply(
    frame=selectfields2,
    choice="MATCH_CATALOG",
    database="mydb",
    table_name="parquet_table",
    transformation_ctx="resolvechoice3"
)
resolvechoice4 = ResolveChoice.apply(
    frame=resolvechoice3, 
    choice="make_struct",
    transformation_ctx="resolvechoice4"
)
datasink5 = glueContext.write_dynamic_frame.from_options(
    frame = resolvechoice4, 
    connection_type="s3", 
    connection_options={
        "path":"s3://mybucket/test/csv/", 
        "partitionKeys": ["day"]
    }, 
    format="parquet",  
    transformation_ctx="datasink5"
)
job.commit()
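
From reading the Glue and Spark docs, I suspect the missing step is deriving the day column before writing, roughly like the sketch below. This is just my guess, assuming event_time_stamp holds ISO-style strings such as "2019-01-15 10:23:45", and reusing datasource0 and glueContext from the script above. Is something like this the right direction?

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, to_date, date_format

# Work on the raw source frame, which still has event_time_stamp
df = datasource0.toDF()

# Derive a day partition column (yyyy-MM-dd) from event_time_stamp;
# assumes the timestamps are ISO-like strings such as "2019-01-15 10:23:45"
df = df.withColumn("day", date_format(to_date(col("event_time_stamp")), "yyyy-MM-dd"))

# Keep only the columns of the target table plus the partition key
df = df.select("col1", "col2", "day")

# Convert back to a DynamicFrame so it can be written with write_dynamic_frame
partitioned_dyf = DynamicFrame.fromDF(df, glueContext, "partitioned_dyf")

datasink = glueContext.write_dynamic_frame.from_options(
    frame=partitioned_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://mybucket/test/parquet/",
        "partitionKeys": ["day"]
    },
    format="parquet",
    transformation_ctx="datasink"
)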

I am new to Glue.

0 Answers:

There are no answers yet.