No id in the root table when using Relationalize in Glue

Time: 2018-09-27 12:33:17

Tags: amazon-web-services aws-glue

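I have a DynamicFrame in Glue and I am applying the Relationalize method to it, which creates three new dynamic frames: root_table, root_table_1 and root_table_2.

When I print the schemas of these tables, or after I insert them into the database, I notice that the id is missing from root_table, so I cannot join root_table to the other tables.

I have tried every possible combination.

Am I missing something?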

    datasource1 = Relationalize.apply(frame = renameId, name = "root_ds", transformation_ctx = "datasource1")
    print(datasource1.keys())
    print(datasource1.values())
    for df_name in datasource1.keys():
        m_df = datasource1.select(df_name)
        print("Writing to Redshift table: " + df_name)
        m_df.printSchema()

        glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, catalog_connection = "Redshift", connection_options = {"database" : "redshift", "dbtable" : df_name}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "df_to_db")

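2 answers:

Answer 0: (score: 1)

I ran the code below on your data (imports omitted) and wrote the output to S3. The two resulting files are pasted after the code. I read the data from the Glue catalog after running a crawler over it.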

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "json_aws_glue_relationalize_stackoverflow", transformation_ctx = "datasource0")

dfc = datasource0.relationalize("advertise_root", "s3://aws-glue-temporary-009551040880-ap-southeast-2/")

for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    print("Writing to S3 file: " + df_name)
    datasink2 = glueContext.write_dynamic_frame.from_options(frame = m_df, connection_type = "s3", connection_options = {"path": "s3://aws-glue-relationalize-stackoverflow/" + df_name +"/"}, format = "csv", transformation_ctx = "datasink2")

job.commit()

Main table:

advertiserCountry,advertiserId,amendReason,amended,clickDate,clickDevice,clickRefs.clickRef2,clickRefs.clickRef6,commissionAmount.amount,"commissionAmount.currency","commissionSharingPublisherId",commissionStatus,customParameters,customerCountry,declineReason,id,ipHash,lapseTime,oldSaleAmount,orderRef,originalSaleAmount,paidToPublisher,paymentId,publisherId,publisherUrl,saleAmount.amount,saleAmount.currency,siteName,transactionDate,transactionDevice,transactionParts,transactionQueryId,type,url,validationDate,voucherCode,voucherCodeUsed,partition
AT,123456,,false,2018-09-05T16:31:00,iPhone,"asdsdedrfrgthyjukiloujhrdf45654565423212",www.website.at,1.5,EUR,EUR,pending,AT,321547896,-27670654789123380,68,,,,false,0,654987,,1.0,EUR,https://www.site.at,2018-09-05T16:32:00,iPhone,1,0,Lead,https://www.website.at,,,false,advertise

The other table, for the transaction parts:

id,index,"transactionParts.val.amount","transactionParts.val.commissionAmount","transactionParts.val.commissionGroupCode","transactionParts.val.commissionGroupId","transactionParts.val.commissionGroupName"
1,0,1.0,1.5,LEAD,654654,Lead

Glue generates a key column named "transactionParts" in the base table, and the id column in the transactionParts table is the foreign key that refers to it. As you can see, the base table also keeps its original id column.
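To make the key relationship concrete, here is a minimal pure-Python sketch of the join, using a few columns from the two CSV files above. It does not call Glue at all; the function name `join_root_to_parts` and the output column names `transaction_id` and `part_index` are made up for illustration.

```python
# A few columns from the two CSV files shown above, as plain dictionaries.
root_rows = [
    {"id": "321547896", "type": "Lead", "transactionParts": "1"},
]
part_rows = [
    {
        "id": "1",
        "index": "0",
        "transactionParts.val.amount": "1.0",
        "transactionParts.val.commissionGroupCode": "LEAD",
    },
]


def join_root_to_parts(root_rows, part_rows, key_column="transactionParts"):
    """Attach each child row to its parent through the generated key.

    The parent's key_column value is matched against the child's "id".
    """
    # Index the child rows by their foreign key.
    parts_by_key = {}
    for part in part_rows:
        parts_by_key.setdefault(part["id"], []).append(part)

    joined = []
    for root in root_rows:
        for part in parts_by_key.get(root[key_column], []):
            # Hypothetical output column names, chosen for readability.
            row = {"transaction_id": root["id"], "part_index": part["index"]}
            # Copy the child's value columns, skipping its key columns.
            for name, value in part.items():
                if name not in ("id", "index"):
                    row[name] = value
            joined.append(row)
    return joined


joined_rows = join_root_to_parts(root_rows, part_rows)
for row in joined_rows:
    print(row)
```

The point of the sketch is that the join goes through the generated "transactionParts" column of the base table, not through the original id column, which stays in the base table as ordinary data.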

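Could you try this code on your data and see whether it works (change the source table name to match yours)? Try writing to S3 as CSV first to check that it works. Please let me know what you find.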

Answer 1: (score: 0)

Here is the whole code.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

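# TempDir is resolved here as well, because args["TempDir"] is read when writing to Redshift.
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TempDir'])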

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db1", 
table_name = "ct_5", transformation_ctx = "datasource0")

dropnullfields3 = DropNullFields.apply(frame = datasource0, transformation_ctx = "dropnullfields3")

renameId = RenameField.apply(frame = dropnullfields3, old_name = "id", new_name = "transaction_id", transformation_ctx = "renameId")

datasource1 = Relationalize.apply(frame = renameId, name = "ds", transformation_ctx = "datasource1")

for df_name in datasource1.keys():
    m_df = datasource1.select(df_name)
    print("Writing to Redshift table: " + df_name)
    m_df.printSchema()

    # Use df_name as the target table; table_name was never defined as a variable.
    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, catalog_connection = "Redshift", connection_options = {"database" : "dbr", "dbtable" : df_name}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "df_to_db")

Here is a data record:

{
    "advertiserCountry": "AT",
    "advertiserId": 123456,
    "amendReason": null,
    "amended": false,
    "clickDate": "2018-09-05T16:31:00",
    "clickDevice": "iPhone",
    "clickRefs": {
        "clickRef2": "asdsdedrfrgthyjukiloujhrdf45654565423212",
        "clickRef6": "www.website.at"
    },
    "commissionAmount": {
        "amount": 1.5,
        "currency": "EUR"
    },
    "commissionSharingPublisherId": null,
    "commissionStatus": "pending",
    "customParameters": null,
    "customerCountry": "AT",
    "declineReason": null,
    "id": 321547896,
    "ipHash": "-27670654789123380",
    "lapseTime": 68,
    "oldCommissionAmount": null,
    "oldSaleAmount": null,
    "orderRef": null,
    "originalSaleAmount": null,
    "paidToPublisher": false,
    "paymentId": 0,
    "publisherId": 654987,
    "publisherUrl": "",
    "saleAmount": {
        "amount": 1.0,
        "currency": "EUR"
    },
    "siteName": "https://www.site.at",
    "transactionDate": "2018-09-05T16:32:00",
    "transactionDevice": "iPhone",
    "transactionParts": [
        {
            "amount": 1.0,
            "commissionAmount": 1.5,
            "commissionGroupCode": "LEAD",
            "commissionGroupId": 654654,
            "commissionGroupName": "Lead"
        }
    ],
    "transactionQueryId": 0,
    "type": "Lead",
    "url": "https://www.website.at",
    "validationDate": null,
    "voucherCode": null,
    "voucherCodeUsed": false
}
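For readers who want to see why a record like this ends up in more than one table, here is a rough pure-Python imitation of the flattening. It is only a sketch and not Glue's implementation: the function `relationalize_sketch`, the key numbering starting at 1, and the child-table naming (root name, underscore, column path) are assumptions modelled on the output shown in answer 0.

```python
import itertools


def relationalize_sketch(record, root_name="root"):
    """Flatten one nested record into {table_name: [rows]}.

    Simplified imitation for illustration only; not Glue's implementation.
    """
    tables = {root_name: []}
    key_counter = itertools.count(1)  # assumed: generated keys start at 1

    def flatten(value, prefix, row):
        for name, item in value.items():
            column = prefix + name
            if isinstance(item, dict):
                # Nested objects become dotted column names on the same row.
                flatten(item, column + ".", row)
            elif isinstance(item, list):
                # Arrays move to a child table; the parent keeps only a key.
                key = next(key_counter)
                row[column] = key
                child_rows = tables.setdefault(root_name + "_" + column, [])
                for index, element in enumerate(item):
                    child = {"id": key, "index": index}
                    if isinstance(element, dict):
                        flatten(element, column + ".val.", child)
                    else:
                        child[column + ".val"] = element
                    child_rows.append(child)
            else:
                row[column] = item

    root_row = {}
    flatten(record, "", root_row)
    tables[root_name].append(root_row)
    return tables


# A cut-down version of the record above.
record = {
    "id": 321547896,
    "type": "Lead",
    "saleAmount": {"amount": 1.0, "currency": "EUR"},
    "transactionParts": [{"amount": 1.0, "commissionGroupCode": "LEAD"}],
}
tables = relationalize_sketch(record, "root")
for table_name, rows in tables.items():
    print(table_name, rows)
```

In this sketch the root row keeps its own id and gets a "transactionParts" column holding the generated key, while the child table carries that key in its id column, which matches the layout described in answer 0.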