AWS Data Pipeline: working with pipelines

Asked: 2017-02-23 12:18:48

Tags: amazon-web-services amazon-s3 amazon-redshift pipeline amazon-data-pipeline

Basically I am trying to transfer data from Postgres to Redshift using AWS Data Pipeline, and the flow I am following is:

  1. Write a pipeline (CopyActivity) that moves data from Postgres to S3
  2. Write a pipeline (RedShiftCopyActivity) that moves data from S3 to Redshift

    In my case both pipelines I wrote run perfectly fine, but the problem is that the data ends up duplicated in the Redshift database.

    For example, below is the data from the Postgres database in a table named company:

    [screenshot: company table in Postgres]

    After the S3-to-Redshift (RedShiftCopyActivity) pipeline runs successfully, the data is copied, but it is duplicated, as shown below:

    [screenshot: duplicated rows in the Redshift company table]

    Below is part of the definition of the RedShiftCopyActivity (S3 to Redshift) pipeline:

      # Output data node (Redshift) and the copy activity for the S3 -> Redshift pipeline.
      # RedShiftSchemaName is a variable defined elsewhere in the script.
      pipeline_definition = [{
          "id": "redshift_database_instance_output",
          "name": "redshift_database_instance_output",
          "fields": [
              {
                  "key": "database",
                  "refValue": "RedshiftDatabaseId_S34X5",
              },
              {
                  # Key(s) used by KEEP_EXISTING / OVERWRITE_EXISTING to match incoming rows
                  "key": "primaryKeys",
                  "stringValue": "id",
              },
              {
                  "key": "type",
                  "stringValue": "RedshiftDataNode",
              },
              {
                  "key": "tableName",
                  "stringValue": "company",
              },
              {
                  "key": "schedule",
                  "refValue": "DefaultScheduleTime",
              },
              {
                  "key": "schemaName",
                  "stringValue": RedShiftSchemaName,
              },
          ]
      },
      {
          "id": "CopyS3ToRedshift",
          "name": "CopyS3ToRedshift",
          "fields": [
              {
                  "key": "output",
                  "refValue": "redshift_database_instance_output",
              },
              {
                  "key": "input",
                  "refValue": "s3_input_data",
              },
              {
                  "key": "runsOn",
                  "refValue": "ResourceId_z9RNH",
              },
              {
                  "key": "type",
                  "stringValue": "RedshiftCopyActivity",
              },
              {
                  # Controls how rows that overlap with existing data are handled
                  "key": "insertMode",
                  "stringValue": "KEEP_EXISTING",
              },
              {
                  "key": "schedule",
                  "refValue": "DefaultScheduleTime",
              },
          ]
      }]
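
    For context, here is a minimal, hedged sketch of how a definition like this might be submitted with boto3. The pipeline name, unique ID, and error handling are assumptions, and since the definition above is only a fragment, a real run would also need the remaining objects (input data node, resource, schedule):

      # Sketch only: submitting a pipeline definition with boto3.
      # Assumes pipeline_definition is the full list of objects and that
      # AWS credentials/region are configured in the environment.
      import boto3

      client = boto3.client("datapipeline")

      # Create an empty pipeline shell; name and uniqueId are hypothetical.
      pipeline = client.create_pipeline(name="s3-to-redshift",
                                        uniqueId="s3-to-redshift-demo")
      pipeline_id = pipeline["pipelineId"]

      # Upload the pipeline objects, then check for validation errors.
      response = client.put_pipeline_definition(
          pipelineId=pipeline_id,
          pipelineObjects=pipeline_definition,
      )
      if response.get("errored"):
          raise RuntimeError(response.get("validationErrors"))

      # Activate the pipeline so the schedule starts running.
      client.activate_pipeline(pipelineId=pipeline_id)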
    

    So, according to the documentation for RedShiftCopyActivity, we need to use insertMode to describe how the data behaves (insert/update/delete) when it is copied into the database table, as follows:

    insertMode : Determines what AWS Data Pipeline does with pre-existing data in the target table that overlaps with rows in the data to be loaded. Valid values are KEEP_EXISTING, OVERWRITE_EXISTING, TRUNCATE and APPEND. KEEP_EXISTING adds new rows to the table, while leaving any existing rows unmodified. KEEP_EXISTING and OVERWRITE_EXISTING use the primary key, sort, and distribution keys to identify which incoming rows to match with existing rows, according to the information provided in Updating and inserting new data in the Amazon Redshift Database Developer Guide. TRUNCATE deletes all the data in the destination table before writing the new data. APPEND will add all records to the end of the Redshift table. APPEND does not require a primary, distribution key, or sort key so items that may be potential duplicates may be appended.
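
    For reference, KEEP_EXISTING and OVERWRITE_EXISTING ultimately rely on the staging-table merge ("upsert") pattern described in the Redshift guide mentioned above. A minimal sketch of that pattern, assuming a company table keyed by id and a hypothetical company_staging table that the S3 files have already been COPYed into (table names, columns, and connection details are placeholders):

      # Sketch only: manual Redshift "upsert" via a staging table, using psycopg2.
      # Assumes company_staging already holds the rows loaded from S3.
      import psycopg2

      MERGE_SQL = """
      -- Overwrite rows that already exist: delete the old versions that match on the key...
      DELETE FROM company USING company_staging
        WHERE company.id = company_staging.id;
      -- ...then insert everything from staging (replacements plus genuinely new rows).
      INSERT INTO company SELECT * FROM company_staging;
      DROP TABLE company_staging;
      """

      # Connection parameters are placeholders, not taken from the question.
      conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                              port=5439, dbname="mydb",
                              user="myuser", password="mypassword")
      with conn.cursor() as cur:
          cur.execute(MERGE_SQL)   # statements run in a single transaction
      conn.commit()
      conn.close()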

    So my requirements are:

    1. When copying from Postgres (the actual data is now in S3) to the Redshift database, if a row already exists, simply update it.
    2. If a new record arrives from S3, create a new record in Redshift.
    3. But for me, even though I have used KEEP_EXISTING or OVERWRITE_EXISTING, the data just keeps getting duplicated over and over, as shown in the Redshift database picture above.

      So finally, how can I achieve my requirements? Is there any tweak or setting I need to add to my configuration?

      EDIT

      Table (company) definition from Redshift:

      [screenshot: company table definition in Redshift]

2 Answers:

Answer 0 (score: 1)

If you want to avoid duplicates, you have to define a primary key in Redshift and set myInsertMode to "OVERWRITE_EXISTING".
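
A minimal sketch of what that could look like, with hypothetical column names for the company table (the real definition is only shown in the question's screenshot):

    # Sketch only: declare a primary key on the Redshift target table so that
    # OVERWRITE_EXISTING can match incoming rows. Columns and connection are placeholders.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS company (
        id   INTEGER NOT NULL,
        name VARCHAR(256),
        PRIMARY KEY (id)  -- informational in Redshift, but used by the pipeline to match rows
    );
    """

    conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                            port=5439, dbname="mydb",
                            user="myuser", password="mypassword")
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()
    conn.close()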

Answer 1 (score: 0)

Please take a look at this AWS blog post; maybe you can find a solution there.

https://aws.amazon.com/blogs/aws/fast-easy-free-sync-rds-to-redshift/

Using pipelines to move data from Postgres to S3 and then from S3 to Redshift looks quite complex and frustrating.

It would be much easier to move the data directly from your Postgres database to Redshift, with no risk of duplicated data.

There are many platforms today that can transfer the data without the "mess and headache".

For the same reason, I used a tool called Alooma, which can replicate tables from a Postgres database hosted on Amazon RDS to Redshift in near real time.