AWS Glue: how do I create a compound key for job bookmarks?

Time: 2021-02-08 14:37:19

Tags: aws-glue

I have a JDBC source (PostgreSQL) with a table that I want to ingest with Glue.

My table has these columns:

id          (bigint)
name        (string)
updated_at  (timestamp)

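I used a crawler to set up the table in the Glue Data Catalog, created a job, and enabled job bookmarks.

When I run the job, it automatically identifies new rows by their new id values.

But I want to use a compound key -> [ id + updated_at ].

That would let me detect all updates in the source table.

How can I do that?

The AWS documentation says this feature is available (https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html):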

For JDBC sources, the following rules apply:
   * For each table, AWS Glue uses one or more columns as bookmark keys to determine new and processed data. The bookmark keys combine to form a single compound key.
   * You can specify the columns to use as bookmark keys. If you don't specify bookmark keys, AWS Glue by default uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps).

Should I define the table manually (without the crawler)?

Thanks!

1 Answer:

Answer 0: (score: -1)

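# Bookmark keys are passed at read time through additional_options.
# "jobBookmarkKeys" takes a list of column names; listing more than one
# column makes them combine into a single compound key.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "hr", table_name = "emp",
    transformation_ctx = "datasource0",
    additional_options = {
        "jobBookmarkKeys": ["empno"],
        "jobBookmarkKeysSortOrder": "asc"
    }
)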
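
The snippet above uses a single key column from a different table. Applied to the table in the question, the same options might look like the sketch below. It is only an illustration: the helper names `bookmark_options` and `read_with_compound_bookmark` are made up here, and it has not been verified that a compound key of `id` and `updated_at` actually picks up updated rows, since that depends on how Glue orders the combined key.

```python
def bookmark_options(keys, sort_order="asc"):
    """Build the additional_options dict for a bookmarked JDBC read.

    Hypothetical helper, not part of the Glue API; it only assembles
    the two option names used in the answer above.
    """
    if not keys:
        raise ValueError("at least one bookmark key column is required")
    if sort_order not in ("asc", "desc"):
        raise ValueError("sort_order must be 'asc' or 'desc'")
    return {
        "jobBookmarkKeys": list(keys),
        "jobBookmarkKeysSortOrder": sort_order,
    }


def read_with_compound_bookmark(glue_context, database, table_name):
    """Read a catalog table using [id, updated_at] as the bookmark keys.

    glue_context is expected to be an awsglue GlueContext created by
    the job script; it is passed in so this sketch has no Glue imports.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        # transformation_ctx names this read in the stored bookmark state
        transformation_ctx="datasource0",
        additional_options=bookmark_options(["id", "updated_at"]),
    )
```

Because the keys are supplied when the table is read, the catalog table created by the crawler should be usable as it is. The job script still needs its usual `job.init(...)` at the start and `job.commit()` at the end so that the bookmark state is saved between runs.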