Upsert from AWS Glue to Amazon Redshift

Time: 2018-04-09 14:34:05

Tags: amazon-web-services amazon-redshift aws-glue

As far as I know, there is no UPSERT query that can be executed directly from Glue against Redshift. Is it possible to implement the staging-table concept within the Glue script itself?

My expectation is to create a staging table, merge it into the target table, and finally delete it. Can this be achieved in a Glue script?

4 Answers:

Answer 0 (score: 2):

Yes, this is entirely achievable. All you need to do is import the pg8000 module into your Glue job. pg8000 is a pure-Python library for connecting to Amazon Redshift and executing SQL queries through a cursor. Python module reference: https://github.com/mfenniak/pg8000

Connect to the target cluster with pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd'), load the staging table using Glue's datasink option, and then run the upsert SQL query through a pg8000 cursor.

>>> import pg8000
>>> conn = pg8000.connect(user='user',database='dbname',host='hosturl',port=5439,password='urpasswrd')
>>> cursor = conn.cursor()
>>> cursor.execute("CREATE TEMPORARY TABLE book (id SERIAL, title TEXT)")
>>> cursor.execute("INSERT INTO TABLE final_target"))
>>> conn.commit()

You need to zip the pg8000 package, put it in an S3 bucket, and reference it in the Python library path under the Glue job's Advanced Options / Job Parameters section.
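
Putting the answer together, a minimal sketch of the merge step, run through a pg8000 cursor after Glue's datasink has loaded the staging table (the schema, table, and column names below are placeholders, not from the original answer):

import pg8000

# Connection parameters as in the snippet above.
conn = pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd')
cursor = conn.cursor()

# Merge: remove target rows that are being replaced, insert the staged rows,
# then drop the staging table. my_schema.final_target / my_schema.staging_table
# and the "id" key column are placeholders.
cursor.execute("""
    DELETE FROM my_schema.final_target
    USING my_schema.staging_table
    WHERE my_schema.final_target.id = my_schema.staging_table.id
""")
cursor.execute("INSERT INTO my_schema.final_target SELECT * FROM my_schema.staging_table")
cursor.execute("DROP TABLE IF EXISTS my_schema.staging_table")
conn.commit()
conn.close()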

Answer 1 (score: 1):

An upsert into Redshift can be achieved from Glue using a staging table by passing the "postactions" option to the JDBC sink:

val destinationTable = "upsert_test"
val destination = s"dev_sandbox.${destinationTable}"
val staging = s"dev_sandbox.${destinationTable}_staging"

val fields = datasetDf.toDF().columns.mkString(",")

val postActions =
  s"""
     DELETE FROM $destination USING $staging AS S
        WHERE $destinationTable.id = S.id
          AND $destinationTable.date = S.date;
     INSERT INTO $destination ($fields) SELECT $fields FROM $staging;
     DROP TABLE IF EXISTS $staging
  """

// Write data to staging table in Redshift
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "postactions" -> postActions
  )),
  redshiftTmpDir = s"$tempDir/redshift",
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)

Make sure the user used for writing to Redshift has sufficient permissions to create/drop tables in the staging schema.

Answer 2 (score: 0):

AWS Glue supports Spark and the Databricks libraries, so you can use the Spark/PySpark Databricks spark-redshift library to overwrite the table:

df.write\
  .format("com.databricks.spark.redshift")\
  .option("url", redshift_url)\
  .option("dbtable", redshift_table)\
  .option("user", user)\
  .option("password", readshift_password)\
  .option("aws_iam_role", redshift_copy_role)\
  .option("tempdir", args["TempDir"])\
  .mode("overwrite")\
  .save()

每个Databricks / Spark文档:

  

覆盖现有表:默认情况下,该库使用   事务以执行覆盖,通过删除来实现   目标表,创建一个新的空表并追加行   

您可以在here

中查看databricks文档
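
The mode above overwrites the whole target table. If an upsert is needed instead (as in the question), the spark-redshift connector also documents "preactions" and "postactions" options (semicolon-separated SQL statements run before/after the COPY), so one possible approach is to write to a staging table and merge in a postaction. A rough sketch under that assumption, with hypothetical table names:

# Placeholder names: my_schema.target_table and my_schema.target_table_staging.
post_actions = """
    DELETE FROM my_schema.target_table USING my_schema.target_table_staging AS s
        WHERE my_schema.target_table.id = s.id;
    INSERT INTO my_schema.target_table SELECT * FROM my_schema.target_table_staging;
    DROP TABLE IF EXISTS my_schema.target_table_staging;
"""

# Write to the staging table; the merge SQL runs in Redshift after the COPY completes.
df.write\
  .format("com.databricks.spark.redshift")\
  .option("url", redshift_url)\
  .option("dbtable", "my_schema.target_table_staging")\
  .option("user", user)\
  .option("password", redshift_password)\
  .option("aws_iam_role", redshift_copy_role)\
  .option("tempdir", args["TempDir"])\
  .option("postactions", post_actions)\
  .mode("overwrite")\
  .save()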

Answer 3 (score: 0):

The connection_options dictionary parameter of the glueContext.write_dynamic_frame.from_jdbc_conf function has two interesting parameters: preactions and postactions.

target_table = "my_schema.my_table"
stage_table = "my_schema.#my_table_stage_table"


pre_query = """
    drop table if exists {stage_table};
    create table {stage_table} as select * from {target_table} LIMIT 0;""".format(stage_table=stage_table, target_table=target_table)

post_query = """
    begin;
    delete from {target_table} using {stage_table} where {stage_table}.id = {target_table}.id ; 
    insert into {target_table} select * from {stage_table}; 
    drop table {stage_table}; 
    end;""".format(stage_table=stage_table, target_table=target_table)

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = datasource0, catalog_connection = "test_red", redshift_tmp_dir = 's3://s3path', transformation_ctx = "datasink4",
    connection_options = {"preactions": pre_query, "postactions": post_query, 
                          "dbtable": stage_table, "database": "redshiftdb"})

Based on https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/