I have a DynamicFrame in Glue and I am using the Relationalize method, which creates three new dynamic frames for me: root_table, root_table_1 and root_table_2.
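When I print the schema of the tables, or after I insert the tables into the database, I notice that the id is missing from root_table, so I cannot join root_table with the other tables.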
I have tried every combination I could think of. Am I missing something?
datasource1 = Relationalize.apply(frame = renameId, name = "root_ds", transformation_ctx = "datasource1")
print(datasource1.keys())
print(datasource1.values())
# Write every frame produced by Relationalize to its own Redshift table
for df_name in datasource1.keys():
    m_df = datasource1.select(df_name)
    print("Writing to Redshift table: " + df_name)
    m_df.printSchema()
    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, catalog_connection = "Redshift", connection_options = {"database" : "redshift", "dbtable" : df_name}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "df_to_db")
Answer 0 (score: 1)
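I ran the code below on your data (imports omitted) and wrote the output to S3. The two resulting files are pasted after the code. I read the data from the Glue catalog after running a crawler over it.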
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "json_aws_glue_relationalize_stackoverflow", transformation_ctx = "datasource0")
dfc = datasource0.relationalize("advertise_root", "s3://aws-glue-temporary-009551040880-ap-southeast-2/")
# Write every frame in the collection to its own S3 prefix as CSV
for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    print("Writing to S3 file: " + df_name)
    datasink2 = glueContext.write_dynamic_frame.from_options(frame = m_df, connection_type = "s3", connection_options = {"path": "s3://aws-glue-relationalize-stackoverflow/" + df_name + "/"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
Main table:
advertiserCountry,advertiserId,amendReason,amended,clickDate,clickDevice,clickRefs.clickRef2,clickRefs.clickRef6,commissionAmount.amount,"commissionAmount.currency","commissionSharingPublisherId",commissionStatus,customParameters,customerCountry,declineReason,id,ipHash,lapseTime,oldSaleAmount,orderRef,originalSaleAmount,paidToPublisher,paymentId,publisherId,publisherUrl,saleAmount.amount,saleAmount.currency,siteName,transactionDate,transactionDevice,transactionParts,transactionQueryId,type,url,validationDate,voucherCode,voucherCodeUsed,partition
AT,123456,,false,2018-09-05T16:31:00,iPhone,"asdsdedrfrgthyjukiloujhrdf45654565423212",www.website.at,1.5,EUR,,pending,,AT,,321547896,-27670654789123380,68,,,,false,0,654987,,1.0,EUR,https://www.site.at,2018-09-05T16:32:00,iPhone,1,0,Lead,https://www.website.at,,,false,advertise
The other table, for the transaction parts:
id,index,"transactionParts.val.amount","transactionParts.val.commissionAmount","transactionParts.val.commissionGroupCode","transactionParts.val.commissionGroupId","transactionParts.val.commissionGroupName"
1,0,1.0,1.5,LEAD,654654,Lead
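Glue generates a primary key column named "transactionParts" in the base table, and the id column in the transactionParts table is the foreign key that points to it. As you can see, the original id column is preserved in the base table.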
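To make the key relationship concrete, here is a minimal plain-Python sketch that mimics the shape of the two files above. It is not Glue's implementation: relationalize_sketch is a hypothetical helper, the generated key numbering and the child table name (root name + "_" + column name) are assumptions, and array items are assumed to be flat objects. It only shows that the array column in the root table holds a generated key, while the child table's id refers back to it.

```python
# Illustrative only: mimics the shape of Relationalize output in plain Python.
# This is NOT the AWS Glue implementation; names and numbering are assumptions.

def relationalize_sketch(record, root_name):
    """Flatten one nested record into {table_name: [row, ...]}."""
    tables = {root_name: []}
    root_row = {}
    next_key = [1]  # counter for generated array keys (hypothetical numbering)

    def flatten(prefix, value):
        if isinstance(value, dict):
            # Nested objects become dotted column names, e.g. saleAmount.amount
            for k, v in value.items():
                flatten(prefix + "." + k if prefix else k, v)
        elif isinstance(value, list):
            # Arrays move to a child table; the root keeps only a generated key
            key = next_key[0]
            next_key[0] += 1
            root_row[prefix] = key
            child = tables.setdefault(root_name + "_" + prefix, [])
            for index, item in enumerate(value):
                row = {"id": key, "index": index}  # "id" points back to the root key
                for k, v in item.items():          # items assumed to be flat objects
                    row[prefix + ".val." + k] = v
                child.append(row)
        else:
            root_row[prefix] = value

    flatten("", record)
    tables[root_name].append(root_row)
    return tables


# A cut-down version of the record from the question
record = {
    "id": 321547896,
    "saleAmount": {"amount": 1.0, "currency": "EUR"},
    "transactionParts": [
        {"amount": 1.0, "commissionAmount": 1.5, "commissionGroupCode": "LEAD"}
    ],
}

tables = relationalize_sketch(record, "advertise_root")
root = tables["advertise_root"]
parts = tables["advertise_root_transactionParts"]

# Join on root.transactionParts == parts.id, not on the original root.id
joined = []
for r in root:
    for p in parts:
        if r["transactionParts"] == p["id"]:
            joined.append((r["id"], p["index"], p["transactionParts.val.amount"]))

print(joined)
```

In a Glue job the same pair of columns would be the join keys, for example with the Join transform or a Spark DataFrame join. Check the actual column names with printSchema() first, since they depend on your data.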
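Could you try this code on your data and see whether it works (change the source table name to yours)? Try writing to S3 as CSV first to find out whether that works. Please let me know what you find.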
Answer 1 (score: 0)
Here is the entire code.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
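# 'TempDir' must be requested here because args["TempDir"] is used when writing to Redshift
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TempDir'])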
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "ct_5", transformation_ctx = "datasource0")
dropnullfields3 = DropNullFields.apply(frame = datasource0, transformation_ctx = "dropnullfields3")
renameId = RenameField.apply(frame = dropnullfields3, old_name = "id", new_name = "transaction_id", transformation_ctx = "renameId")
datasource1 = Relationalize.apply(frame = renameId, name = "ds", transformation_ctx = "datasource1")
# Write every frame produced by Relationalize to its own Redshift table
for df_name in datasource1.keys():
    m_df = datasource1.select(df_name)
    print("Writing to Redshift table: " + df_name)
    m_df.printSchema()
    # "dbtable" uses df_name; table_name was never defined in this script
    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, catalog_connection = "Redshift", connection_options = {"database" : "dbr", "dbtable" : df_name}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "df_to_db")
Here is the data record:
{
  "advertiserCountry": "AT",
  "advertiserId": 123456,
  "amendReason": null,
  "amended": false,
  "clickDate": "2018-09-05T16:31:00",
  "clickDevice": "iPhone",
  "clickRefs": {
    "clickRef2": "asdsdedrfrgthyjukiloujhrdf45654565423212",
    "clickRef6": "www.website.at"
  },
  "commissionAmount": {
    "amount": 1.5,
    "currency": "EUR"
  },
  "commissionSharingPublisherId": null,
  "commissionStatus": "pending",
  "customParameters": null,
  "customerCountry": "AT",
  "declineReason": null,
  "id": 321547896,
  "ipHash": "-27670654789123380",
  "lapseTime": 68,
  "oldCommissionAmount": null,
  "oldSaleAmount": null,
  "orderRef": null,
  "originalSaleAmount": null,
  "paidToPublisher": false,
  "paymentId": 0,
  "publisherId": 654987,
  "publisherUrl": "",
  "saleAmount": {
    "amount": 1.0,
    "currency": "EUR"
  },
  "siteName": "https://www.site.at",
  "transactionDate": "2018-09-05T16:32:00",
  "transactionDevice": "iPhone",
  "transactionParts": [
    {
      "amount": 1.0,
      "commissionAmount": 1.5,
      "commissionGroupCode": "LEAD",
      "commissionGroupId": 654654,
      "commissionGroupName": "Lead"
    }
  ],
  "transactionQueryId": 0,
  "type": "Lead",
  "url": "https://www.website.at",
  "validationDate": null,
  "voucherCode": null,
  "voucherCodeUsed": false
}