I have a month-partitioned CSV table in Athena, and I want to use AWS Glue to convert the CSV to Parquet partitioned by day.
Here is my source table in Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.csv_table (
  col1 STRING,
  col2 STRING,
  event_time_stamp STRING
)
PARTITIONED BY (month string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"", "escapeChar" = "\\")
LOCATION 's3://mybucket/test/csv/'
S3 has the following month-partitioned structure:
s3://mybucket/test/csv/month=01/somefile.gz
s3://mybucket/test/csv/month=02/somefile.gz
I want to convert the CSV above to Parquet, partitioned by day instead of month, where the day partition value is extracted from the event_time_stamp column (so a row whose event_time_stamp falls on 2021-01-15 would land in day=2021-01-15). The target table would be:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.parquet_table (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (day string)
STORED AS PARQUET
LOCATION 's3://mybucket/test/parquet/'
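For reference, after the conversion I would expect the Parquet output in S3 to look something like this (the day=YYYY-MM-DD format of the partition value is just my assumption of what it should be):
s3://mybucket/test/parquet/day=2021-01-15/somefile.parquet
s3://mybucket/test/parquet/day=2021-01-16/somefile.parquet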
I tried the conversion with the following script, but it doesn't seem to work:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read the month-partitioned CSV table from the Glue Data Catalog.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="csv_table",
    transformation_ctx="datasource0"
)

# Map source columns to target columns; event_time_stamp is not carried over here.
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("col1", "string", "col1", "string"),
        ("col2", "string", "col2", "string"),
        ("month", "string", "month", "string")
    ],
    transformation_ctx="applymapping1"
)

# Keep only the columns the target table needs.
selectfields2 = SelectFields.apply(
    frame=applymapping1,
    paths=["col1", "col2"],
    transformation_ctx="selectfields2"
)

# Resolve the schema against the target catalog table.
resolvechoice3 = ResolveChoice.apply(
    frame=selectfields2,
    choice="MATCH_CATALOG",
    database="mydb",
    table_name="parquet_table",
    transformation_ctx="resolvechoice3"
)

resolvechoice4 = ResolveChoice.apply(
    frame=resolvechoice3,
    choice="make_struct",
    transformation_ctx="resolvechoice4"
)

# Write the output as day-partitioned Parquet.
datasink5 = glueContext.write_dynamic_frame.from_options(
    frame=resolvechoice4,
    connection_type="s3",
    connection_options={
        "path": "s3://mybucket/test/parquet/",
        "partitionKeys": ["day"]
    },
    format="parquet",
    transformation_ctx="datasink5"
)
job.commit()
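My guess is that the job fails because no day column ever exists: ApplyMapping keeps month and drops event_time_stamp, so there is nothing for partitionKeys=["day"] to partition on. Below is my rough sketch of how I think the day value could be derived before writing. The timestamp format 'yyyy-MM-dd HH:mm:ss' is just an assumption on my part, since I am not sure how event_time_stamp is actually formatted:

from pyspark.sql.functions import col, to_date, date_format
from awsglue.dynamicframe import DynamicFrame

# Start from the raw source frame so event_time_stamp is still available.
df = datasource0.toDF()

# Derive the day partition value from event_time_stamp.
# ASSUMPTION: event_time_stamp looks like '2021-01-15 10:20:30';
# the format string would need adjusting if it is epoch seconds or ISO-8601.
df = df.withColumn(
    "day",
    date_format(to_date(col("event_time_stamp"), "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd")
)

# Keep only the target columns plus the new partition column.
df = df.select("col1", "col2", "day")

# Convert back to a DynamicFrame so the Glue sink can partition the output.
partitioned = DynamicFrame.fromDF(df, glueContext, "partitioned")

glueContext.write_dynamic_frame.from_options(
    frame=partitioned,
    connection_type="s3",
    connection_options={
        "path": "s3://mybucket/test/parquet/",
        "partitionKeys": ["day"]
    },
    format="parquet",
    transformation_ctx="datasink_day"
)

Is something like this the right direction, or is there a more idiomatic Glue way to do it?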
I am new to Glue.