I am trying to use AWS Data Pipeline to transfer data from Amazon S3 to Amazon Redshift. Is it possible to change the data while transferring it, e.g. with a SQL statement, so that only the result of that SQL statement becomes the input to Redshift?
I have only found copy commands like the following:
{
  "id": "S3Input",
  "type": "S3DataNode",
  "schedule": {
    "ref": "MySchedule"
  },
  "filePath": "s3://example-bucket/source/inputfile.csv"
},
Source: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-get-started-copy-data-cli.html
Answer 0 (score: 5)
Yes, it is possible. There are two approaches:
1. transformSQL
transformSQL is useful if the transformation is performed within the scope of the records that get loaded on a timely basis, e.g. every day or every hour. That way the changes are applied only to the batch, not to the whole table.
Here is an excerpt from the documentation:
transformSql: The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads the data there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated into the final target table. transformSql must therefore be run against the table named staging, and the output schema of transformSql must match the final target table's schema.
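In SQL terms, the staging flow in that excerpt looks roughly like the following sketch. This is only an assumption based on the excerpt: the service creates and manages the real staging tables itself, and the users table, its columns, and the IAM role below are hypothetical (the S3 path is the one from the question).

-- Rough equivalent of what RedshiftCopyActivity does when transformSql is set.
-- 'users' is a hypothetical target table; staging mirrors its schema.
CREATE TEMP TABLE staging (LIKE users);
-- The S3 input is bulk-loaded into staging.
COPY staging
FROM 's3://example-bucket/source/inputfile.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV;
-- transformSql runs against staging; its output schema must match 'users'.
CREATE TEMP TABLE staging2 AS
SELECT id, TRIM(email) AS email, first_name, last_name
FROM staging;
-- Rows from staging2 are then merged into the final target table.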
Please find below an example of transformSql usage. Note that the SELECT is from the staging table. It will effectively run CREATE TEMPORARY TABLE staging2 AS SELECT <...> FROM staging;. Also, all fields must be included and must match the existing table in the Redshift database.
{
  "id": "LoadUsersRedshiftCopyActivity",
  "name": "Load Users",
  "insertMode": "OVERWRITE_EXISTING",
  "transformSql": "SELECT u.id, u.email, u.first_name, u.last_name, u.admin, u.guest, CONVERT_TIMEZONE('US/Pacific', u.created_at_pst) AS created_at_pst, CONVERT_TIMEZONE('US/Pacific', u.updated_at_pst) AS updated_at_pst FROM staging u;",
  "type": "RedshiftCopyActivity",
  "runsOn": {
    "ref": "OregonEc2Resource"
  },
  "schedule": {
    "ref": "HourlySchedule"
  },
  "input": {
    "ref": "OregonUsersS3DataNode"
  },
  "output": {
    "ref": "OregonUsersDashboardRedshiftDatabase"
  },
  "onSuccess": {
    "ref": "LoadUsersSuccessSnsAlarm"
  },
  "onFail": {
    "ref": "LoadUsersFailureSnsAlarm"
  },
  "dependsOn": {
    "ref": "BewteenRegionsCopyActivity"
  }
}
2. script
SqlActivity allows operations on the whole dataset, and its runs can be chained after other activities via the dependsOn mechanism:
{
  "name": "Add location ID",
  "id": "AddCardpoolLocationSqlActivity",
  "type": "SqlActivity",
  "script": "INSERT INTO locations (id) SELECT 100000 WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000);",
  "database": {
    "ref": "DashboardRedshiftDatabase"
  },
  "schedule": {
    "ref": "HourlySchedule"
  },
  "output": {
    "ref": "LocationsDashboardRedshiftDatabase"
  },
  "runsOn": {
    "ref": "OregonEc2Resource"
  },
  "dependsOn": {
    "ref": "LoadLocationsRedshiftCopyActivity"
  }
}
Answer 1 (score: 0)
There is an optional field in RedshiftCopyActivity called 'transformSql'.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
I have not used this personally, but from the looks of it, your S3 data would be placed in a staging table, and this SQL statement would return the transformed data for Redshift to insert.
So you would need to list all the fields in the SELECT, whether or not you are transforming those fields.
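For example, a transformSql that changes only one field still has to enumerate every column of the target table. A minimal sketch (the table columns and the LOWER transformation here are hypothetical, not taken from the answers above):

"transformSql": "SELECT id, LOWER(email) AS email, first_name, last_name FROM staging;"

Here only email is transformed, but id, first_name, and last_name must still be listed so that the output schema matches the target table.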
Answer 2 (score: 0)
{
  "id" : "MySqlActivity",
  "type" : "SqlActivity",
  "database" : { "ref": "MyDatabase" },
  "script" : "insert into AnalyticsTable (select (cast(requestEndTime as bigint) - cast(requestBeginTime as bigint)) as requestTime, hostname from StructuredLogs where hostname LIKE '%.domain.sfx');",
  "schedule" : { "ref": "Hour" },
  "queue" : "priority"
}
So basically, in "script" you can put any SQL script/transformation/command from the Amazon Redshift SQL Commands.
transformSql is fine, but it only supports a SQL SELECT expression used to transform the input data. Reference: RedshiftCopyActivity
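To make that distinction concrete, here is a hedged sketch (both fragments are hypothetical, with made-up table and column names):

"transformSql": "SELECT id, name FROM staging;"
"script": "UPDATE locations SET name = 'Headquarters' WHERE id = 100000;"

The first is a pure SELECT over the staging table and is valid for transformSql; the second is a DML statement, which only the script field of SqlActivity will accept.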