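I am using AWS Glue to join two tables. By default, it performs an INNER JOIN. I want to do a LEFT OUTER JOIN. I have looked at the AWS Glue documentation, but I cannot find a way to pass the join type to the Join.apply() method. Is there a way to achieve this in AWS Glue?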
## @type: Join
## @args: [keys1 = id, keys2 = "user_id"]
## @return: cUser
## @inputs: [frame1 = cUser0, frame2 = cUserLogins]
#cUser = Join.apply(frame1 = cUser0, frame2 = cUserLogins, keys1 = "id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")
## @type: Join
## @args: [keys1 = id, keys2 = user_id]
## @return: datasource0
## @inputs: [frame1 = cUser, frame2 = cKKR]
datasource0 = Join.apply(frame1 = cUser0, frame2 = cKKR, keys1 = "id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")
## @type: Join
## @args: [keys1 = branch_id, keys2 = user_id]
## @return: datasource1
## @inputs: [frame1 = datasource0, frame2 = cBranch]
datasource1 = Join.apply(frame1 = datasource0, frame2 = cBranch, keys1 = "branch_id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")
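Answer 0 (score: 3)
Currently, the AWS Glue Join transform does not support LEFT and RIGHT joins. However, you can still achieve this by converting each DynamicFrame to a Spark DataFrame and using the DataFrame join method.
Here is an example: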
cUser0 = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_users", transformation_ctx = "cUser")
cUser0DF = cUser0.toDF()
cKKR = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_karyakartas", redshift_tmp_dir = args["TempDir"], transformation_ctx = "cKKR")
cKKRDF = cKKR.toDF()
dataSource0 = cUser0DF.join(cKKRDF, cUser0DF.id == cKKRDF.user_id, how='left_outer')
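To make the difference between the default inner join and how='left_outer' concrete, here is a minimal plain-Python sketch of the join semantics. It does not use Glue or Spark. The sample rows and the join_rows helper are hypothetical and only for illustration. It is also simplified: it assumes the two sides have distinct column names and that keys are never None.

```python
# Plain-Python illustration only: no Spark or Glue involved.
# The rows below are made-up sample data, not taken from the original tables.
users = [
    {"id": 1, "name": "asha"},
    {"id": 2, "name": "ravi"},
    {"id": 3, "name": "meera"},
]
logins = [
    {"user_id": 1, "last_login": "2019-01-05"},
    {"user_id": 3, "last_login": "2019-02-11"},
]


def join_rows(left, right, left_key, right_key, how="inner"):
    """Join two lists of dicts on left[left_key] == right[right_key].

    how="inner" drops left rows that have no match on the right.
    how="left_outer" keeps them and fills the right-hand columns with None.
    Hypothetical helper; assumes left and right column names do not collide.
    """
    if how not in ("inner", "left_outer"):
        raise ValueError("unsupported join type: " + how)
    right_columns = sorted({col for row in right for col in row})
    result = []
    for l_row in left:
        matches = [r for r in right if r[right_key] == l_row[left_key]]
        if matches:
            # One output row per matching right-hand row.
            for r_row in matches:
                result.append({**l_row, **r_row})
        elif how == "left_outer":
            # No match: keep the left row, pad the right-hand columns.
            padding = {col: None for col in right_columns}
            result.append({**padding, **l_row})
    return result


inner = join_rows(users, logins, "id", "user_id", how="inner")
left_outer = join_rows(users, logins, "id", "user_id", how="left_outer")

print(len(inner))       # 2: only users 1 and 3 have a login row
print(len(left_outer))  # 3: user 2 is kept, with user_id and last_login set to None
```

The Spark DataFrame join behaves the same way in this respect: with a left outer join, every row of the left DataFrame appears in the result, and the columns from the right DataFrame are null where no match exists.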
Answer 1 (score: 0)
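If you import DynamicFrame with from awsglue.dynamicframe import DynamicFrame, you can convert the joined result back into a DynamicFrame. Note that datasource0 and datasource1 must be Spark DataFrames here (for example, obtained with toDF()), because the join call is the DataFrame join, not the DynamicFrame one: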
dataSource2 = DynamicFrame.fromDF(datasource0.join(datasource1, (datasource0['user_id'] == datasource1['user_id']), "left"), glueContext, "dataSource2")