Basically I am trying to transfer data from Postgres to Redshift using AWS Data Pipeline, and the flow I am following is:
postgres to s3
s3 to redshift
In my case both pipelines I wrote work perfectly, but the problem is that the data ends up duplicated in the Redshift database.

For example, below is the table named company.

After the s3-to-redshift (RedshiftCopyActivity) pipeline runs successfully, the data is copied, but it ends up duplicated. Here is the pipeline definition:
pipeline_definition = [
    # Output data node: the target table in Redshift.
    {
        "id": "redshift_database_instance_output",
        "name": "redshift_database_instance_output",
        "fields": [
            {
                "key": "database",
                "refValue": "RedshiftDatabaseId_S34X5",
            },
            {
                "key": "primaryKeys",
                "stringValue": "id",
            },
            {
                "key": "type",
                "stringValue": "RedshiftDataNode",
            },
            {
                "key": "tableName",
                "stringValue": "company",
            },
            {
                "key": "schedule",
                "refValue": "DefaultScheduleTime",
            },
            {
                "key": "schemaName",
                "stringValue": RedShiftSchemaName,  # variable holding the schema name
            },
        ],
    },
    # Copy activity: loads the S3 input into the Redshift data node above.
    {
        "id": "CopyS3ToRedshift",
        "name": "CopyS3ToRedshift",
        "fields": [
            {
                "key": "output",
                "refValue": "redshift_database_instance_output",
            },
            {
                "key": "input",
                "refValue": "s3_input_data",
            },
            {
                "key": "runsOn",
                "refValue": "ResourceId_z9RNH",
            },
            {
                "key": "type",
                "stringValue": "RedshiftCopyActivity",
            },
            {
                "key": "insertMode",
                "stringValue": "KEEP_EXISTING",
            },
            {
                "key": "schedule",
                "refValue": "DefaultScheduleTime",
            },
        ],
    },
]
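For completeness, this is roughly how a definition like the one above can be pushed to AWS Data Pipeline with boto3. This is only a sketch: the pipeline id is hypothetical, and the remaining objects referenced by the definition (s3_input_data, DefaultScheduleTime, ResourceId_z9RNH, RedshiftDatabaseId_S34X5) are assumed to be present in the same pipelineObjects list.

import boto3

client = boto3.client("datapipeline")

# put_pipeline_definition validates the objects and reports problems in the
# response instead of raising, so check before activating.
response = client.put_pipeline_definition(
    pipelineId="df-EXAMPLE_ID",           # hypothetical pipeline id
    pipelineObjects=pipeline_definition,  # the list defined above
)

if response["errored"]:
    print(response["validationErrors"], response["validationWarnings"])
else:
    client.activate_pipeline(pipelineId="df-EXAMPLE_ID")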
So, according to the documentation for RedshiftCopyActivity, the insertMode field is what determines how the data is handled (insert/update/delete) when it is copied into the target table:
insertMode : Determines what AWS Data Pipeline does with pre-existing data in the target table that overlaps with rows in the data to be loaded. Valid values are KEEP_EXISTING, OVERWRITE_EXISTING, TRUNCATE and APPEND. KEEP_EXISTING adds new rows to the table, while leaving any existing rows unmodified. KEEP_EXISTING and OVERWRITE_EXISTING use the primary key, sort, and distribution keys to identify which incoming rows to match with existing rows, according to the information provided in Updating and inserting new data in the Amazon Redshift Database Developer Guide. TRUNCATE deletes all the data in the destination table before writing the new data. APPEND will add all records to the end of the Redshift table. APPEND does not require a primary, distribution key, or sort key so items that may be potential duplicates may be appended.
So what do I need to change in my setup to meet my requirement and avoid the duplicates?
Answer 0 (score: 1)
If you want to avoid duplicates, you have to define a primary key on the table in Redshift and set myInsertMode to "OVERWRITE_EXISTING".
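A minimal sketch of that suggestion follows; the connection details, column list and schema name are illustrative, not taken from the question. The Redshift table has to declare the primary key that RedshiftCopyActivity should match on, and the copy activity's insertMode field is switched to OVERWRITE_EXISTING.

import psycopg2  # one way to run the DDL against Redshift

# 1) Declare the primary key on the target table. Redshift does not enforce it,
#    but RedshiftCopyActivity uses it to decide which incoming rows replace
#    existing ones. For an existing table, ALTER TABLE ... ADD PRIMARY KEY (id)
#    works as well.
ddl = """
    CREATE TABLE IF NOT EXISTS myschema.company (
        id   BIGINT PRIMARY KEY,
        name VARCHAR(256)
        -- remaining columns ...
    );
"""
with psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                      port=5439, dbname="mydb",
                      user="admin", password="***") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)

# 2) Replace the KEEP_EXISTING entry in the copy activity's fields with:
overwrite_field = {
    "key": "insertMode",
    "stringValue": "OVERWRITE_EXISTING",
}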
Answer 1 (score: 0)
Take a look at this AWS documentation; maybe you can find a solution there.

Using a pipeline to move data from Postgres to S3 and then from S3 to Redshift can get quite complex and frustrating. It would be much easier to move the data directly from your Postgres database to Redshift, with no risk of duplicating data.

There are many platforms today that can transfer the data without that mess and headache. For the same reason I used a tool called Alooma, which can replicate tables from a Postgres database hosted on Amazon RDS to Redshift in near real time.