AWS Glue: unnest a Postgres jsonb column

Time: 2019-08-21 08:33:00

Tags: python pyspark aws-glue

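I want to flatten a jsonb column into multiple target columns in the same table. I can't find a built-in function that does this. The Glue crawler registers the jsonb column as a string. When I put the data on S3, I can use Unbox.apply() to turn it into a struct.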

I tried using Relationalize and UnnestFrame to unnest the jsonb column. Neither worked. Relationalize seems to apply only to .json files. I'm not sure why UnnestFrame doesn't work.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mycatalogdb", table_name = "sourcedb_public_tablename", transformation_ctx = "datasource0")

dfc = UnnestFrame.apply(frame = datasource0, transformation_ctx = "dfc", info="", stageThreshold=0, totalThreshold=0)

dropnullfields3 = DropNullFields.apply(frame = dfc, transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://mybucket"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

Given a source table with the following contents:


+----+------------+---------------------------------------------------------+
| id |    date    |                         myjson                          |
+----+------------+---------------------------------------------------------+
|  1 | 2019-10-10 | {"url":"some-url","data":{"afield":123,"moredata":567}} |
+----+------------+---------------------------------------------------------+

I want this output (the column naming and table format don't matter):

+----+------------+----------+-------------+---------------+
| id |    date    |   url    | data_afield | data_moredata |
+----+------------+----------+-------------+---------------+
|  1 | 2019-10-10 | some-url |         123 |           567 |
+----+------------+----------+-------------+---------------+
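
To make the target mapping concrete, here is a minimal plain-Python sketch of the flattening being asked for. It is not Glue code: the `flatten` helper and the `row` dict are made up for illustration, and it assumes the JSON cell is well-formed (quotes around some-url, balanced braces).

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with underscore-joined keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical in-memory copy of the single source row.
row = {
    "id": 1,
    "date": "2019-10-10",
    "myjson": '{"url":"some-url","data":{"afield":123,"moredata":567}}',
}

# Parse the JSON text, flatten it, and merge it with the other columns.
parsed = json.loads(row["myjson"])
flat_row = {"id": row["id"], "date": row["date"], **flatten(parsed)}
print(flat_row)
```

The resulting keys are id, date, url, data_afield and data_moredata, matching the columns in the table above.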

1 Answer:

Answer 0: (score: 0)

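I eventually found that I was using Relationalize incorrectly, but Glue was not throwing an error. I was able to figure this out after working interactively in SageMaker and realizing, while reading this post, that relationalize() returns a collection.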

Relationalize can be used on data frames that contain JSON fields. In other words, the data frame does not have to come from pure JSON.
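
A rough sketch of how that might look in a Glue job, not the exact code I ran. The helper name `flatten_jsonb_column`, the "root" table name and the staging path are made up for illustration. It also assumes the jsonb column arrives as a string and that Unbox can parse it, which is what the question saw for data on S3.

```python
def flatten_jsonb_column(dyf, staging_path, json_column="myjson"):
    """Sketch: parse a JSON string column, then flatten it with Relationalize.

    dyf          -- a Glue DynamicFrame whose `json_column` holds JSON text
    staging_path -- S3 path Relationalize may use for temporary output
                    (hypothetical; pick a location in your own bucket)
    """
    # Imported inside the function so this file can be loaded outside Glue.
    from awsglue.transforms import Relationalize, Unbox

    # 1. The crawler registers jsonb as a string, so parse it into a struct.
    unboxed = Unbox.apply(frame=dyf, path=json_column, format="json")

    # 2. Relationalize returns a DynamicFrameCollection, not a DynamicFrame.
    collection = Relationalize.apply(
        frame=unboxed, staging_path=staging_path, name="root"
    )

    # 3. Pick the flattened root table out of the collection by its name.
    return collection.select("root")
```

In the job from the question this could be called as flatten_jsonb_column(datasource0, "s3://mybucket/tmp/"), where the tmp/ prefix is an assumption, and the returned frame passed on to DropNullFields and the sink. The nested fields should come out as top-level columns with dotted names such as myjson.url and myjson.data.afield.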