I am building the following pipeline:
AWS RDS (Postgres) -> AWS DMS -> Kinesis -> Spark Streaming / Spark Structured Streaming -> S3
The JSON received from Kinesis looks like this:
{
    "data": {
        "id": 48079,
        "master_user_id": 3032401,
        "is_approved": "true",
        "reason": "Your application has been approved",
        "created_at": "2019-07-10T07:11:56.297559Z",
        "version": 2,
        "features_id": 48844
    },
    "metadata": {
        "timestamp": "2019-12-19T10:09:30.423859Z",
        "record-type": "data",
        "operation": "load",
        "partition-key-type": "primary-key",
        "schema-name": "public",
        "table-name": "assessment_decision"
    }
}
I want to combine all of the data fields, together with three of the metadata fields [timestamp, operation, table-name], into a single table so I can run further aggregations on it. What is the best way to do this: convert the payload to a DataFrame and merge the fields there, or use a JSON library to add the extra key:value pairs to the data object first?
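For context, this is roughly the DataFrame approach I have in mind, sketched in PySpark. It is only a sketch under a few assumptions: the stream has already been read from Kinesis into a streaming DataFrame called raw (the Kinesis source setup is omitted, and the payload column name "value" is hypothetical and depends on the connector), and the field names and types are taken from the sample record above:

    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   LongType, IntegerType)

    # Schema mirroring the DMS envelope shown above
    envelope_schema = StructType([
        StructField("data", StructType([
            StructField("id", LongType()),
            StructField("master_user_id", LongType()),
            StructField("is_approved", StringType()),
            StructField("reason", StringType()),
            StructField("created_at", StringType()),
            StructField("version", IntegerType()),
            StructField("features_id", LongType()),
        ])),
        StructField("metadata", StructType([
            StructField("timestamp", StringType()),
            StructField("record-type", StringType()),
            StructField("operation", StringType()),
            StructField("partition-key-type", StringType()),
            StructField("schema-name", StringType()),
            StructField("table-name", StringType()),
        ])),
    ])

    # raw: streaming DataFrame from Kinesis; the JSON payload is assumed
    # to be in a binary/string column named "value"
    parsed = raw.select(
        from_json(col("value").cast("string"), envelope_schema).alias("rec")
    )

    # Flatten: every field from "data" plus the three metadata fields
    flat = parsed.select(
        col("rec.data.*"),
        col("rec.metadata.timestamp").alias("meta_timestamp"),
        col("rec.metadata.operation").alias("operation"),
        col("rec.metadata.`table-name`").alias("table_name"),
    )

From there flat could be aggregated and written out to S3 with writeStream, but I am not sure whether this is better than manipulating the JSON before it reaches Spark.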
Thanks.