Best approach for reading change data capture (CDC) and merging data with Spark Streaming

Asked: 2019-12-23 07:17:27

Tags: spark-streaming spark-structured-streaming change-data-capture

I am building the following pipeline:

AWS RDS (Postgres) -> AWS DMS -> Kinesis -> Spark Streaming / Spark Structured Streaming -> S3

The JSON received from Kinesis looks like this:

{
    "data": {
        "id":   48079,
        "master_user_id":   3032401,
        "is_approved":  "true",
        "reason":   "Your application has been approved",
        "created_at":   "2019-07-10T07:11:56.297559Z",
        "version":  2,
        "features_id":  48844
    },
    "metadata": {
        "timestamp":    "2019-12-19T10:09:30.423859Z",
        "record-type":  "data",
        "operation":    "load",
        "partition-key-type":   "primary-key",
        "schema-name":  "public",
        "table-name":   "assessment_decision"
    }
}

I want to combine all the data records, together with three of the metadata fields [timestamp, operation, table-name], into a single table so I can perform further aggregations. What is the best approach? Should I convert the records to a DataFrame and do the merge there, or use a JSON library to add the extra key:value pairs to the data object?

Thanks.

0 answers:

There are no answers yet.
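The second option the question mentions (using a JSON library to copy selected metadata fields into the data object) can be sketched in plain Python with the standard `json` module, using the sample record above. The helper name `flatten_cdc_record` and the choice to copy the payload before mutating it are assumptions for illustration, not part of any DMS or Spark API:

```python
import json

# The sample DMS record from the question, as it arrives from Kinesis.
record = """{
    "data": {
        "id": 48079,
        "master_user_id": 3032401,
        "is_approved": "true",
        "reason": "Your application has been approved",
        "created_at": "2019-07-10T07:11:56.297559Z",
        "version": 2,
        "features_id": 48844
    },
    "metadata": {
        "timestamp": "2019-12-19T10:09:30.423859Z",
        "record-type": "data",
        "operation": "load",
        "partition-key-type": "primary-key",
        "schema-name": "public",
        "table-name": "assessment_decision"
    }
}"""

def flatten_cdc_record(raw_json,
                       metadata_fields=("timestamp", "operation", "table-name")):
    """Merge the chosen metadata fields into the data payload of one record.

    flatten_cdc_record is a hypothetical helper name; the three default
    fields are the ones named in the question.
    """
    parsed = json.loads(raw_json)
    row = dict(parsed["data"])          # copy so the original payload is untouched
    for field in metadata_fields:
        row[field] = parsed["metadata"][field]
    return row

row = flatten_cdc_record(record)
```

Each flattened dict then has one uniform shape per table, which is straightforward to load into a Spark DataFrame for the downstream aggregation. The DataFrame-first alternative would instead parse the raw JSON with Spark's own `from_json` and select `data.*` plus the metadata columns; which is faster depends on record volume and where the parsing cost is cheapest in the pipeline.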