As far as I know, there is no UPSERT query that can be executed directly from Glue against Redshift. Is it possible to implement the staging-table approach inside the Glue script itself?
So my expectation is to create a staging table, merge it with the destination table, and finally delete it. Can this be done in a Glue script?
Answer 0 (score: 2)
Yes, this is entirely achievable. All you need to do is import the pg8000 module into your Glue job. pg8000 is a Python library used to establish a connection to Amazon Redshift and execute SQL queries through a cursor.
Python module reference: https://github.com/mfenniak/pg8000
Then connect to the target cluster via pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd').
Load into the staging table using Glue's datasink option, then run the upsert SQL query through the pg8000 cursor:
>>> import pg8000
>>> conn = pg8000.connect(user='user',database='dbname',host='hosturl',port=5439,password='urpasswrd')
>>> cursor = conn.cursor()
>>> cursor.execute("CREATE TEMPORARY TABLE book (id SERIAL, title TEXT)")
>>> cursor.execute("INSERT INTO TABLE final_target"))
>>> conn.commit()
You need to zip the pg8000 package, put it in an S3 bucket, and reference it under the Python library path in the Advanced options / Job parameters section of the Glue job.
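Putting the pieces together, here is a minimal sketch of the merge step run through pg8000 once the Glue datasink has loaded the staging table (the table names, key column, and credentials are all hypothetical):

import pg8000

# Hypothetical names: my_schema.stage_table was loaded by the Glue datasink,
# my_schema.final_target is the destination, and id is the merge key.
conn = pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd')
cursor = conn.cursor()

# Classic Redshift upsert: delete the rows that will be replaced, then insert the staged rows.
cursor.execute("DELETE FROM my_schema.final_target USING my_schema.stage_table s WHERE final_target.id = s.id")
cursor.execute("INSERT INTO my_schema.final_target SELECT * FROM my_schema.stage_table")
cursor.execute("DROP TABLE my_schema.stage_table")

conn.commit()
conn.close()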
Answer 1 (score: 1)
An upsert can be achieved in Redshift using a staging table in Glue by passing the "postactions" option to the JDBC sink:
import com.amazonaws.services.glue.util.JsonOptions

val destinationTable = "upsert_test"
val destination = s"dev_sandbox.${destinationTable}"
val staging = s"dev_sandbox.${destinationTable}_staging"
val fields = datasetDf.toDF().columns.mkString(",")

val postActions =
  s"""
     DELETE FROM $destination USING $staging AS S
       WHERE $destinationTable.id = S.id
         AND $destinationTable.date = S.date;
     INSERT INTO $destination ($fields) SELECT $fields FROM $staging;
     DROP TABLE IF EXISTS $staging
  """

// Write the incoming data to the staging table in Redshift, then run the
// delete/insert/drop statements above as post-actions of the same job.
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "postactions" -> postActions
  )),
  redshiftTmpDir = s"$tempDir/redshift",
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
Make sure the user used for writing to Redshift has sufficient permissions to create and drop tables in the staging schema.
Answer 2 (score: 0)
AWS Glue supports the Spark and Databricks libraries, so you can use the Spark/PySpark Databricks Redshift library to overwrite the table:
df.write\
.format("com.databricks.spark.redshift")\
.option("url", redshift_url)\
.option("dbtable", redshift_table)\
.option("user", user)\
.option("password", readshift_password)\
.option("aws_iam_role", redshift_copy_role)\
.option("tempdir", args["TempDir"])\
.mode("overwrite")\
.save()
Per the Databricks/Spark documentation:
Overwriting an existing table: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.
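Note that the same staging-table upsert pattern also works with this writer, since the Databricks Redshift data source accepts preactions and postactions options as well. A minimal sketch, reusing the hypothetical stage_table/final_target tables and id key from above:

# Hypothetical tables: load into stage_table, then merge into final_target via postactions.
post_actions = """
DELETE FROM final_target USING stage_table WHERE final_target.id = stage_table.id;
INSERT INTO final_target SELECT * FROM stage_table;
DROP TABLE stage_table;
"""

df.write\
    .format("com.databricks.spark.redshift")\
    .option("url", redshift_url)\
    .option("dbtable", "stage_table")\
    .option("user", user)\
    .option("password", redshift_password)\
    .option("aws_iam_role", redshift_copy_role)\
    .option("tempdir", args["TempDir"])\
    .option("postactions", post_actions)\
    .mode("overwrite")\
    .save()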
You can review the Databricks documentation here.
Answer 3 (score: 0)
The connection_options dictionary parameter of the glueContext.write_dynamic_frame.from_jdbc_conf function apparently has two interesting parameters: preactions and postactions:
target_table = "my_schema.my_table"
stage_table = "my_schema.my_table_stage_table"

# Create an empty staging table with the same shape as the target.
pre_query = """
drop table if exists {stage_table};
create table {stage_table} as select * from {target_table} LIMIT 0;""".format(stage_table=stage_table, target_table=target_table)

# Merge the staged rows into the target atomically, then drop the staging table.
post_query = """
begin;
delete from {target_table} using {stage_table} where {stage_table}.id = {target_table}.id;
insert into {target_table} select * from {stage_table};
drop table {stage_table};
end;""".format(stage_table=stage_table, target_table=target_table)

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=datasource0, catalog_connection="test_red",
    connection_options={"preactions": pre_query, "postactions": post_query,
                        "dbtable": stage_table, "database": "redshiftdb"},
    redshift_tmp_dir="s3://s3path", transformation_ctx="datasink4")
Based on https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/