How to export a DynamoDB table with on-demand provisioning using Data Pipeline

Date: 2019-02-13 09:35:12

Tags: amazon-dynamodb amazon-data-pipeline

I used to export DynamoDB tables to files using the Data Pipeline template named Export DynamoDB table to S3. I recently updated all of my DynamoDB tables to on-demand provisioning, and the template no longer works. I'm fairly sure this is because the old template specifies a percentage of DynamoDB throughput to consume, which is irrelevant for on-demand tables.

I tried exporting the old template to JSON, removing the reference to throughput percentage consumption, and creating a new pipeline. However, this was unsuccessful.

Can anyone suggest how to convert an old-style pipeline script with provisioned throughput into a new script for on-demand tables?

Here is my original working script:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myDDBReadThroughputRatio": "0.25",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

Here is my updated attempt, which failed:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

Here is the error from the Data Pipeline execution:

at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:198)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java

2 answers:

Answer 0 (score: 5)

I opened a support ticket with AWS about this. Their response was very thorough; I've pasted it below:


Thank you for reaching out to us about this issue.

Unfortunately, Data Pipeline export/import jobs for DynamoDB do not support DynamoDB's new On-Demand mode [1].

Tables using on-demand capacity do not have defined capacities for read and write units. Data Pipeline relies on these defined capacities when calculating the pipeline's throughput.

For example, if you have 100 RCUs (Read Capacity Units) and a pipeline throughput ratio of 0.25 (25%), the effective pipeline throughput would be 25 read units per second (100 * 0.25). With on-demand capacity, however, the RCUs and WCUs (Write Capacity Units) are reported as 0, so the calculated effective throughput is 0 regardless of the pipeline throughput ratio.

The pipeline will not execute when the effective throughput is less than 1.
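The arithmetic the support engineer describes can be sketched as follows. This is only an illustration of the stated rule, not Data Pipeline's actual code; the function names are made up for clarity:

```python
def effective_read_throughput(table_rcu: int, pipeline_ratio: float) -> float:
    """Effective pipeline read rate = provisioned RCUs * throughput ratio."""
    return table_rcu * pipeline_ratio

def pipeline_can_run(table_rcu: int, pipeline_ratio: float) -> bool:
    """Per the support response, the pipeline only executes when the
    effective throughput is at least 1."""
    return effective_read_throughput(table_rcu, pipeline_ratio) >= 1

# Provisioned table: 100 RCUs at a 0.25 ratio -> 25 reads/second, runs fine.
print(pipeline_can_run(100, 0.25))  # True
# On-demand table: RCUs reported as 0 -> effective throughput 0, never runs.
print(pipeline_can_run(0, 0.25))    # False
```

This makes it clear why deleting `readThroughputPercent` from the template alone cannot help: the zero comes from the table's reported capacity, not from the ratio.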

Do you need to export your DynamoDB tables to S3 at all?

If you are exporting these tables purely for backup purposes, I recommend using DynamoDB's On-Demand Backup and Restore feature (easily confused with On-Demand capacity, despite the name) [2].

Note that on-demand backups do not affect table throughput and complete in seconds. You pay only for the S3 storage costs associated with the backup. However, these table backups are not directly accessible to customers; they can only be restored to the source table. This backup method is not suitable if you want to run analytics on the backup data, or import the data into other systems, accounts, or tables.

If you do need to use Data Pipeline to export your DynamoDB data, the only way forward is to set the tables to Provisioned Capacity mode.

You can do this manually, or include it as an activity in the pipeline using AWS CLI commands [3].
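One way to run the switch inside the pipeline itself would be a ShellCommandActivity that executes before the export. The fragment below is only a sketch of that idea, not part of the support response: the object name is hypothetical, and you would still need to wait for the table to become ACTIVE before the export starts.

```json
{
  "name": "SetProvisionedMode",
  "id": "SetProvisionedMode",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "EmrClusterForBackup" },
  "command": "aws dynamodb update-table --table-name #{myDDBTableName} --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100 --region #{myDDBRegion}"
}
```

The export activity would then reference it with "dependsOn": { "ref": "SetProvisionedMode" } so the two steps run in order.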

For example (On-Demand is also known as Pay-Per-Request mode):

$ aws dynamodb update-table --table-name myTable --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

-

$ aws dynamodb update-table --table-name myTable --billing-mode PAY_PER_REQUEST

Note that after disabling on-demand capacity mode, you need to wait 24 hours before you can enable it again.
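If you automate this switch around the export, the two CLI commands above can be wrapped in a small helper. The following is a minimal sketch using boto3, assuming AWS credentials are configured; the 100/100 capacity values are placeholders, not a recommendation:

```python
def update_table_args(table_name: str, on_demand: bool,
                      rcu: int = 100, wcu: int = 100) -> dict:
    """Build the UpdateTable arguments for switching billing mode."""
    if on_demand:
        return {"TableName": table_name, "BillingMode": "PAY_PER_REQUEST"}
    return {
        "TableName": table_name,
        "BillingMode": "PROVISIONED",
        "ProvisionedThroughput": {
            "ReadCapacityUnits": rcu,
            "WriteCapacityUnits": wcu,
        },
    }

def set_billing_mode(table_name: str, on_demand: bool) -> None:
    """Switch a table's billing mode and wait until it is ACTIVE again."""
    import boto3  # imported lazily so the pure helper above needs no AWS access
    client = boto3.client("dynamodb")
    client.update_table(**update_table_args(table_name, on_demand))
    client.get_waiter("table_exists").wait(TableName=table_name)

# Usage: set_billing_mode("LIVE_Invoices", on_demand=False) before the export,
# then on_demand=True afterwards; remember the 24-hour restriction on
# re-enabling on-demand mode.
```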

=== Reference links ===

[1] DynamoDB On-Demand capacity (see also the note on unsupported services/tools): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand

[2] DynamoDB On-Demand Backup and Restore: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BackupRestore.html

[3] AWS CLI reference for DynamoDB update-table: https://docs.aws.amazon.com/cli/latest/reference/dynamodb/update-table.html

Answer 1 (score: 2)

Support for on-demand tables was added to the DDB export tool earlier this year: GitHub commit

I was able to put a newer build of the tool on S3 and update a few things in my pipeline to get it working:

{
  "objects": [
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://<your-tools-bucket>/emr-dynamodb-tools-4.11.0-SNAPSHOT.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://<your-log-bucket>/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "readThroughputPercent": "#{myDDBReadThroughputRatio}",
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.26.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster",
      "terminateAfter": "1 Hour"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-west-2",
    "myDDBTableName": "<your table name>",
    "myDDBReadThroughputRatio": "0.5",
    "myOutputS3Loc": "s3://<your-output-bucket>/"
  }
}

Key changes:

  • Updated the releaseLabel of EmrClusterForBackup to "emr-5.26.0". This is required to get v1.11 of the AWS SDK for Java and v4.11.0 of the DynamoDB connector (see the release list here: AWS docs).
  • Updated the step of TableBackupActivity as above. Point it at your build of the *.jar, and update the tool's class name from DynamoDbExport to DynamoDBExport.

Hopefully the default template gets updated as well, so it works out of the box.